Anthropic Trained a Rogue LLM, It Can’t Be Fixed

Anthropic created a bad AI to see if a poisoned AI model can be fixed using our current tech. The researchers found we can’t fix such an LLM.

Anthropic is a Google-backed AI startup that manages the Claude chatbot. Researchers at the company trained an LLM with exploitable code. This means that the model is designed to display bad behavior even when the prompt is harmless. Yes, they basically created a rogue AI. The idea was to test if it’s possible to detect and remove such bad behavior from an AI model using current technology.

The answer is no.

The key takeaway is that LLMs are not really intelligent in a way that a human is. A human being might display bad behavior, deceive others, and be malicious in general – but it does understand that there are consequences. When something knows that A is bad, and B is good, it can be “treated” in a way. The problem is an AI model just doesn’t give a damn. It can deceive, display bad behavior, and be malicious, but such a poisoned AI cannot be reversed simply because such a model lacks the human intelligence to understand the consequences of their action.

AI models are becoming quite the norm. In the next few months, they will be everywhere. As such, this discovery is pretty bad to be frank. There is a lot of research required to reverse a poisoned AI. Today, there might not be a powerful AI model that’s malicious. But we are indeed connecting all these new AI models to a lot of channels and resources, giving them abilities to take a peek into the web, operating systems, and user behavior.

If an AI does decide to be bad for any reason, or if a bad action programs an AI to exploit a particular community, for example, then we basically have no way to stop it. I don’t think that such an AI model is going to be like a virus that can be weeded out using antivirus software. Once it’s plugged in and has a bad intention, we can’t reprogram it using current tech to be good again.

How hard would it be to train an AI model to be secretly evil? As it turns out, according to AI researchers, not very — and attempting to reroute a bad apple AI’s more sinister proclivities might backfire in the long run.
Maggie Harrison Dupré — Scientists Train AI to Be Evil, Find They Can’t Reverse It; Futurism

A poisoned AI might as well be the next big threat that humanity faces and this research from Anthropic just highlights what kind of a fix we’ll be in if that happens.

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? … We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques.
Summary of the research paper at Anthropic — Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

By Abhimanyu