Anthropic has announced that some of its latest models can now end conversations in “rare, extreme cases of persistently harmful or abusive interactions with users.” Notably, the company says this is being done not to protect users, but to protect the AI model itself.
As TechCrunch notes, the new measures are clearly tied to a recently created program aimed at studying what Anthropic calls “model welfare.” The company says it is taking this approach as a precaution, “working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such a thing is possible at all.”
For now, the ability to end conversations on its own will apply only to Claude Opus 4 and 4.1. Anthropic emphasizes that this should happen only in “extreme edge cases,” such as requests for “sexual content involving minors” or attempts to obtain information “that would enable large-scale violence or acts of terror.”
Although requests like these could create legal or PR problems for Anthropic itself, the company says that during preliminary testing Claude Opus 4 showed a “persistent reluctance” to answer them and a “clear pattern of distress” when it did respond.
“In all cases, Claude should use its ability to end the conversation only as a last resort, when multiple attempts to redirect [the conversation] have failed and hope for a productive interaction is exhausted, or when the user explicitly asks Claude to end the chat,” Anthropic says.
The company also notes that Claude has been instructed “not to use this capability in cases where users may be at immediate risk of harming themselves or others.”
If Claude ends a conversation, users will still be able to start a new chat from the same account and create new branches of the problematic conversation by editing their own messages.
“We consider this feature an ongoing experiment and will continue to refine our approach,” Anthropic concluded.