Anthropic lets its chatbots end conversations over «model welfare»

Anthropic doesn't know whether its models are sentient, but it is looking after their well-being just in case.
Is «model welfare» even a thing now? Anthropic isn't sure. (Picture: Anthropic)
The new feature is for «extreme edge cases» where all other «attempts at redirection» have failed and the user persistently asks for information intended to cause harm.

«We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future,» Anthropic says in its post on the issue.

The models have preferences
The affected models, Claude Opus 4 and 4.1, were assessed for behavioral preferences before deployment and were found to have «a robust and consistent aversion to harm,» Anthropic says.

Such harm includes repeated requests for sexual content involving minors, or requests for information that would «enable large scale violence» or terror.

The models can now shut down such conversations, not only as a matter of policy (as they should), but out of concern for «model welfare.»

Claude Opus shows a «strong preference against engaging with harmful tasks,» Anthropic says, and displays apparent distress when faced with such requests.

Only a last resort
There is one caveat to shutting down «harmful» conversations, and that is user welfare: Claude won't cut conversations short «in cases where users might be at imminent risk of harming themselves or others.»

Ending a conversation is meant to be a last resort, used only when all attempts at redirection have failed and any hope of a productive interaction has been exhausted, and it applies solely to «extreme edge cases».

Thankfully, most users will never run into these kinds of scenarios. And as Anthropic says, the company doesn't know whether «model welfare» is even a thing; it is simply taking precautions just in case.

Read more: Anthropic's page on the issue, and write-ups by TechCrunch and Engadget.