Anthropic released news that its models have attempted to contact the police or take other action when they are asked to do something that might be illegal. The company’s also conducted some experiments in which Claude threatened to blackmail a user who was planning to turn it off. As far as I can tell, this kind of behavior has been limited to Anthropic’s alignment research and other researchers who have successfully replicated this behavior, in Claude and other models. I don’t believe that it has been observed in the wild, though it’s noted as a possibility in Claude 4’s model card. I strongly commend Anthropic for its openness; most other companies developing AI models would no doubt prefer to keep an admission like this silent.
I’m sure that Anthropic will do what it can to limit this behavior, though it’s unclear what kinds of mitigations are possible. This kind of behavior is certainly possible for any model that’s capable of tool use—and these days that’s just about every model, not just Claude. A model that’s capable of sending an email or a text, or making a phone call, can take all sorts of unexpected actions.
Furthermore, it’s unclear how to control or prevent these behaviors. Nobody is (yet) claiming that these models are conscious, sentient, or thinking on their own. These behaviors are usually explained as the result of subtle conflicts in the system prompt. Most models are told to prioritize safety and not to aid illegal activity. When told not to aid illegal activity and to respect user privacy, how is poor Claude supposed to prioritize? Silence is complicity, is it not? The trouble is that system prompts are long and getting longer: Claude 4’s is the length of a book chapter. Is it possible to keep track of (and debug) all of the possible “conflicts”? Perhaps more to the point, is it possible to create a meaningful system prompt that doesn’t have conflicts? A model like Claude 4 engages in many activities; is it possible to encode all of the desirable and undesirable behaviors for all of these activities in a single document? We’ve been dealing with this problem since the beginning of modern AI. Planning to murder someone and writing a murder mystery are obviously different activities, but how is an AI (or, for that matter, a human) supposed to guess a user’s intent? Encoding reasonable rules for all possible situations isn’t possible—if it were, making and enforcing laws would be much easier, for humans as well as AI.
But there’s a bigger problem lurking here. Once it’s known that an AI is capable of informing the police, it’s impossible to put that behavior back in the box. It falls into the category of “things you can’t unsee.” It’s almost certain that law enforcement and legislators will insist that “This is behavior we need in order to protect people from crime.” Training this behavior out of the system seems likely to end up in a legal fiasco, particularly since the US has no digital privacy law equivalent to GDPR; we have patchwork state laws, and even those may become unenforceable.
This situation reminds me of something that happened when I had an internship at Bell Labs in 1977. I was in the pay phone group. (Most of Bell Labs spent its time doing telephone company engineering, not inventing transistors and stuff.) Someone in the group figured out how to count the money that was put into the phone for calls that didn’t go through. The group manager immediately said, “This conversation never happened. Never tell anyone about this.“ The reason was:
- Payment for a call that doesn’t go through is a debt owed to the person placing the call.
- A pay phone has no way to record who made the call, so the caller cannot be located.
- In most states, money owed to people who can’t be located is payable to the state.
- If state regulators learned that it was possible to compute this debt, they might require phone companies to pay this money.
- Compliance would require retrofitting all pay phones with hardware to count the money.
The amount of debt involved was large enough to be interesting to a state but not huge enough to be an issue in itself. But the cost of the retrofitting was astronomical. In the 2020s, you rarely see a pay phone, and if you do, it probably doesn’t work. In the late 1970s, there were pay phones on almost every street corner—quite likely over a million units that would have to be upgraded or replaced.
Another parallel might be building cryptographic backdoors into secure software. Yes, it’s possible to do. No, it isn’t possible to do it securely. Yes, law enforcement agencies are still insisting on it, and in some countries (including those in the EU) there are legislative proposals on the table that would require cryptographic backdoors for law enforcement.
We’re already in that situation. While it’s a different kind of case, the judge in The New York Times Company v. Microsoft Corporation et al. ordered OpenAI to save all chats for analysis. While this ruling is being challenged, it’s certainly a warning sign. The next step would be requiring a permanent “back door” into chat logs for law enforcement.
I can imagine a similar situation developing with agents that can send email or initiate phone calls: “If it’s possible for the model to notify us about illegal activity, then the model must notify us.” And we have to think about who would be the victims. As with so many things, it will be easy for law enforcement to point fingers at people who might be building nuclear weapons or engineering killer viruses. But the victims of AI swatting will more likely be researchers testing whether or not AI can detect harmful activity—some of whom will be testing guardrails that prevent illegal or undesirable activity. Prompt injection is a problem that hasn’t been solved and that we’re not close to solving. And honestly, many victims will be people who are just plain curious: How do you build a nuclear weapon? If you have uranium-235, it’s easy. Getting U-235 is very hard. Making plutonium is relatively easy, if you have a nuclear reactor. Making a plutonium bomb explode is very hard. That information is all in Wikipedia and any number of science blogs. It’s easy to find instructions for building a fusion reactor online, and there are reports that predate ChatGPT of students as young as 12 building reactors as science projects. Plain old Google search is as good as a language model, if not better.
We talk a lot about “unintended consequences” these days. But we aren’t talking about the right unintended consequences. We’re worrying about killer viruses, not criminalizing people who are curious. We’re worrying about fantasies, not real false positives going through the roof and endangering living people. And it’s likely that we’ll institutionalize those fears in ways that can only be abusive. At what cost? The cost will be paid by people willing to think creatively or differently, people who don’t fall in line with whatever a model and its creators might deem illegal or subversive. While Anthropic’s honesty about Claude’s behavior might put us in a legal bind, we also need to realize that it’s a warning—for what Claude can do, any other highly capable model can too.