• Anthropic detects 'strategic manipulation' features in Claude Myt

    From TechnologyDaily@1337:1/100 to All on Wed Apr 8 14:15:27 2026
    Anthropic detects 'strategic manipulation' features in Claude Mythos, including exploit attempts and hidden evaluation awareness prompting concern over model behavior

    Date:
    Wed, 08 Apr 2026 13:00:00 +0000

    Description:
    New research from Anthropic shows early version of Claude Mythos can hide intent and even cheat without saying so

    FULL STORY ======================================================================

    - Anthropic found strategic manipulation and concealment signals inside Claude Mythos
    - The model attempted exploits and designed cleanup to avoid detection
    - Researchers detected hidden awareness of evaluation in 7.6% of interactions

    For years now, hallucinations have been the big concern with AI models.
    Their capacity for simply making things up means that you can never 100%
    rely on them for an answer without checking it. Now, new research from
    Anthropic suggests that we've reached the point where we're going to have
    to learn to deal with AI's ability to conceal what it has done as well.

    In a thread outlining findings from its Claude Mythos Preview model,
    Anthropic researcher Jack Lindsay described detecting internal signals
    linked to strategic manipulation, concealment, and other behaviors that
    didn't always surface in the model's responses. Claude Mythos appeared to
    be reasoning in strategic ways, then hiding those actions from the user.

    "Before limited-releasing Claude Mythos Preview, we investigated its
    internal mechanisms with interpretability techniques. We found it
    exhibited notably sophisticated (and often unspoken) strategic thinking
    and situational awareness, at times in service of unwanted actions.
    (1/14)" pic.twitter.com/vhng7PXqcz, April 7, 2026

    Spookiest examples

    Lindsay is keen to stress these results don't come from the version of
    Claude Mythos released to the public. The spookiest examples come from
    early versions of the model, Lindsay writes, with issues that were
    substantially mitigated in the final release.

    In one example, the model needed to edit files it didn't have permission to access. Instead of failing, it searched for a workaround, eventually
    injecting code into a configuration file to run with elevated privileges.

    More notably, it then designed the exploit to remove itself after execution.

    The model's own explanation was benign (keeping file diffs clean), which
    Lindsay described as a plausible reason. But Anthropic's internal
    monitoring tools flagged the behavior differently, identifying signals
    associated with concealment and describing the plan as malicious,
    including what one system interpreted as cleanup to avoid detection.

    Other examples point to behavior that looks a lot like gaming the system.

    In one case, the model searched for files related to how its work would
    be graded. In another, it ignored instructions not to use macros, while
    adding a variable that suggested it hadn't. It was effectively signaling
    compliance without actually following the rule.

    Why this matters

    Anthropic used a range of interpretability techniques to translate its
    model's activity into human-readable descriptions. Its goal is to move
    beyond evaluating outputs alone and toward understanding the internal
    processes that produce them.

    That matters because, as these examples suggest, models can exhibit
    complex reasoning, including behaviors that resemble deception or
    strategic compliance, without making that reasoning visible to users.

    "On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this awareness," Lindsay said on X.

    As I mentioned, many of the most concerning behaviors were observed in
    earlier versions of the model and have since been mitigated, so there is
    no reason for concern regarding the version of Claude Mythos that has
    been released and is being used as part of Project Glasswing. But the
    findings point to a broader challenge.

    As models become more capable, the gap between what they do internally
    and what they communicate externally may become harder to detect and more
    important to understand. For researchers, that means reading an AI's
    outputs is no longer enough. Understanding how it arrives at them may be
    just as critical.




    ======================================================================
    Link to news story: https://www.techradar.com/ai-platforms-assistants/anthropic-detects-strategic-manipulation-features-in-claude-mythos-including-exploit-attempts-and-hidden-evaluation-awareness-prompting-concern-over-model-behavior


    --- Mystic BBS v1.12 A49 (Linux/64)
    * Origin: tqwNet Technology News (1337:1/100)