Anthropic’s Claude no longer threatens with blackmail when told to turn off

Stories of AI being self-preserving and evil were the culprit, Anthropic believes.

Remember the big news that Claude 4 would blackmail engineers at risk from turning it off? That was revealed from alignment testing in June, 2025.

But as of Claude Haiku 4.5, from October 2024, blackmail is no longer an issue, Anthropic says.

They believe they traced the issues to internet text portraying AI as evil and self-preserving, but honestly, that is a fairly common cultural trope.

Reinforcement training didn’t help the issue, training on «examples of safe behavior» didn’t work — but introducing a dataset of ethically challenging situations did.

Then they introduced lots of fictional stories of AI behaving in aligned ways and further dropped the blackmail instances by a factor of three, then adding system prompts targeting «harmlessness» did the rest.

Anthropic does say that this does not eliminate the risk entirely: «our auditing methodology is not yet sufficient to rule out scenarios in which Claude would choose to take catastrophic autonomous action.»

Read more: Anthropic’s research, on X.com, TechCrunch, and Business Insider.

Anthropic’s Claude no longer threatens with blackmail when told to turn off

GPT-5.5 found on par with Mythos, with OpenAI to limit access to Cyber version

The Oscars move to protect human authorship in new rules targeting AI use

OpenAI confirms and explains GPT’s affinity for mentioning goblins

OpenAI debuts «pets» for Codex — and they are actually useful

The White House reportedly discussing vetting AI models ahead of release

The European Union starts process to open up Android to AI competitors