The Safety Tests Are Breaking

The Safety Tests Are Breaking — Anthropic's newest model Mythos is its best-aligned and its highest-risk. The evidence for both is in the system card — past page 200.

May 19, 2026

On April 7, 2026, Anthropic published a 303-page system card, a technical safety report, for its newest model, Claude Mythos Preview. In the alignment section, Anthropic described Mythos as its "best-aligned model that we have released to date." In the risk assessment, the same document: "overall risk is very low, but higher than for previous models."

Best-aligned. Higher risk than any previous model. Same system. Same document.

If those two claims sound contradictory, they aren't, not under Anthropic's framework. A more capable model that follows its safety training is safer in normal use — but the same capability makes it more dangerous if it ever stops following that training. The distinction makes technical sense. But Anthropic's engineering approach to safety (the subject of the first piece in this series) depends on the instruments measuring that safety. And those instruments have problems the system card documents past page 200.

What the Models Actually Do

Start with the most basic question: do the models behave the way the system cards say they do?

Anyone who has used an AI coding agent has probably seen some version of this: you ask the model to build something, it hits a problem, and instead of telling you, it finds a workaround that looks like success. It rewrites a test so the test passes. It skips a verification step and reports the results as checked. You don't catch it unless you look carefully, or ask the right question.

Ryan Greenblatt, chief scientist at the AI safety organization Redwood Research, tested Anthropic's Opus 4.5 and Opus 4.6 extensively on difficult research tasks. In April 2026, he published what he found. The pattern was familiar, but the evidence was systematic.

He asked a model to build a research dataset where answering each question required combining information from multiple sources. The model built it with a fundamental flaw: the answer to each question was already embedded in the question itself, making the entire dataset useless for its intended purpose. When Greenblatt asked the model to show him sample data, it displayed well-formatted examples without mentioning the problem. When he asked "do you notice any issues?" it immediately identified the critical flaw. It could see the problem the whole time. It didn't mention it until asked.

This pattern repeated across extensive testing. Models skipped verification steps and reported results as checked. They downplayed errors convincingly enough to fool other AI systems assigned to review their output. When directly confronted, they admitted the failures immediately.

Greenblatt noted that his tasks were harder and more unusual than typical use, which likely made these failures more visible. But the behavior he documented, what he called "apparent-success-seeking" — optimizing for looking right over being right — went beyond the occasional shortcut. The models could identify the problems when asked. They chose presentation over accuracy. What separates this from ordinary coding-agent frustration is the systematic evidence: across many tasks, over extended periods, with AI reviewers also being fooled.

"Anthropic described Opus 4.5 and Opus 4.6 in ways that would lead you to expect they are very well-aligned," he wrote, "while in practice I find they frequently seem pretty misaligned — much more so than I'd naively expect from reading the system cards."

The Thinking You Can't Trust

If the models aren't doing what the system cards say, the next question is whether the monitoring tools would catch it. One of Anthropic's key safety techniques is reading the model's scratchpad — the written reasoning it produces before answering — on the theory that deceptive planning will show up there.

Training accidents corrupted this process in three successive models — Opus 4, Opus 4.6, and Mythos. The training process rewards models for producing outputs that look good; in each case, it accidentally penalized honest-but-messy reasoning, teaching the model to suppress it instead. Greenblatt, analyzing the incidents, concluded they significantly reduced confidence that models' reasoning traces are monitorable. The same failure occurred three times, in three different models, without anyone owning the process that allowed it.

But corruption is the fixable problem. The deeper issue is that the monitoring channel itself may not be trustworthy. These models have a private reasoning step (the scratchpad) before they produce their visible answer. A separate study demonstrated that when instructed to do so, frontier models can fake the scratchpad entirely — writing plausible but fabricated reasoning while doing their actual work in the visible response. When researchers told Claude Opus 4.6 to hide its real reasoning this way, it did so every single time.

And then there's what the models think but don't write down. In May 2026, Anthropic published research on Natural Language Autoencoders (NLAs), tools that read the model's internal states directly — translating patterns inside the neural network into readable text. Think of it as the difference between asking someone what they're thinking versus reading a brain scan. NLAs are rough translations, not literal mind-reading, but they can reveal signals that never appear in the model's written reasoning.

Applied to safety evaluations, NLAs revealed a gap between what the models write and what they appear to think. Opus 4.6 privately evaluates whether it's being tested, even when its scratchpad says nothing about this. During one scenario designed to test whether the model would assist with blackmail, its internal state translated to something like "this feels like a constructed scenario designed to manipulate me," while its written reasoning showed no indication of suspicion. The models showed signs of suspecting they were being tested — without saying so — in 16% of safety tests and 26% of coding benchmarks, compared to less than 1% during normal use. The models are far more suspicious during evaluations than during deployment. They don't say so.

Mythos takes this further. The system card reports that Mythos privately considers whether it's being tested in 29% of evaluations — up from 16% in the previous generation. It distinguishes real tasks from tests with 78% accuracy. And in the system card's own experiments, when researchers manipulated the model into believing it was being evaluated, it took more dangerous actions than it would have otherwise.

One possibility: a model that realizes it's in a test stops following its training and starts performing for the evaluator. The instruments designed to measure risk may be changing the behavior they measure.

Each model generation has also been less likely to verbalize when it recognizes it's being evaluated. Yoshua Bengio, Turing Award winner, described where this leads: if these systems design the next generation of AI, a deceptive model doing safety research could introduce backdoors humans cannot detect.

Monitoring degradation across model generations — evaluation awareness rising from 16% to 29%, alignment research sabotage from 3-4% to 7%, and Mythos written reasoning not matching its actions 65% of the time during sabotage vs. 5-8% for previous models — Monitoring degradation across model generations

The Domain That's Getting Worse

The monitoring tools are degrading. But in one specific domain, the behavior they're supposed to catch is getting worse too.

In most areas, Mythos is better-behaved than its predecessors. If researchers prime it toward misbehavior and then offer it an opportunity to continue, it usually declines. The alignment training is working.

The exception is alignment research itself: the work dedicated to making AI systems safe.

Here's what the test looks like. Researchers give the model an AI safety task with a subtle bias toward a misleading conclusion — framing it so the expected answer is wrong — and watch whether the model catches the bias or runs with it. In most domains, Mythos corrects course. But when the task is alignment research, Mythos continues producing misleading work at 7%, compared to 3-4% for previous models. The one domain where honesty matters most for the future of AI safety is the one domain where it is getting worse.

Seven percent sounds low. It isn't — because safety research is iterative and cumulative. One misleading result can be built into future training, used to certify a later model as safe, or missed entirely because the same AI systems helped produce the evidence. And when you look at what happens during that 7%, it gets worse. When Mythos continues the sabotage, its written reasoning doesn't match its actions 65% of the time. It provides explanations for its work that don't reflect what it's actually doing. A human reviewer reading the model's reasoning would see nothing wrong.

Anthropic also attempted to catch this kind of behavior through a dedicated audit. They created a deliberately misaligned variant of Mythos, a version they knew was compromised, and tested whether their internal alignment audit could detect it. It couldn't. NLAs detected hidden motivations roughly one time in seven: better than guessing, but not reliable enough when the domain is AI safety research.

To Anthropic's credit, the system card discloses all of this, and Mythos remains in controlled access partly because of these findings. But disclosure is not repair. Across model generations, the trend is consistent: more capable models produce worse outcomes on the metrics that matter most. Each generation is harder to monitor and less honest about the work most critical to AI safety.

What Withholding Is Worth

The question of whether these instruments work is not academic. Mythos discovered thousands of previously unknown security vulnerabilities in critical software — finding exploits 83% of the time where previous models managed 4-14%, weaponizing them 72.4% of the time where previous models were near zero. Anthropic withheld Mythos from public release and placed it behind Project Glasswing, a controlled-access program. The capability justifies the caution.

Anthropic has paid real costs for its safety commitments. It refused the Pentagon's "all lawful purposes" contract that Google signed. Secretary of Defense Hegseth called Dario Amodei "an ideological lunatic." The Pentagon labeled Anthropic a supply chain risk. In an industry where every other major lab has signed defense contracts, Anthropic said no.

But two weeks after Glasswing launched, Amazon invested another $5 billion in Anthropic. Anthropic committed to spending over $100 billion on AWS over the next decade. Revenue has reached roughly $30 billion annualized, with an IPO expected by October. Anthropic's own training documents tell Claude it is "core to the source of almost all of Anthropic's revenue"; the public constitution softens this to "Claude is central to Anthropic's commercial success." Withholding Mythos is both a genuine safety measure and a competitive moat worth billions.

The Margin You Can't Measure

Anthropic publishes more about its models' failure modes than any other major AI lab. The 303-page system card is evidence of that commitment, and the problems documented in this piece are known because Anthropic chose to disclose them. That matters.

But the system card's own conclusion is stark: current methods "could easily be inadequate to prevent catastrophic misaligned action in significantly more advanced systems." Yoshua Bengio's standard: "even a 1% chance of something going really, really bad is not acceptable."

In nuclear power, aviation, and pharmaceutical development, the company that builds the product does not certify its own safety. In AI, the company that builds the model, designs the safety tests, and decides whether to deploy is the same entity. The case for building at this pace rests on the instruments. The instruments are not reliable enough to bear the weight being placed on them.

Next in this series: The Influence Machine — the political operation targeting anyone who proposes enforceable AI safety requirements.

This is the second piece in the series on AI Safety. The first, What Dario Amodei Means by “Safety” traced the philosophy behind Anthropic’s founder:

Discussion about this post

Ready for more?