Anthropic’s Claude no longer threatens with blackmail when told to turn off

Stories of AI being self-preserving and evil were the culprit, Anthropic believes.
Remember the big news that Claude 4 would blackmail engineers at risk from turning it off? That was revealed from alignment testing in June, 2025.

But as of Claude Haiku 4.5, from October 2024, blackmail is no longer an issue, Anthropic says.

They believe they traced the issues to internet text portraying AI as evil and self-preserving, but honestly, that is a fairly common cultural trope.

Reinforcement training didn’t help the issue, training on «examples of safe behavior» didn’t work — but introducing a dataset of ethically challenging situations did.

Then they introduced lots of fictional stories of AI behaving in aligned ways and further dropped the blackmail instances by a factor of three, then adding system prompts targeting «harmlessness» did the rest.

Anthropic does say that this does not eliminate the risk entirely: «our auditing methodology is not yet sufficient to rule out scenarios in which Claude would choose to take catastrophic autonomous action.»

Read more: Anthropic’s research, on X.com, TechCrunch, and Business Insider.