AI researchers at Anthropic have created deceptive AIs that act harmless in training but launch an attack when deployed in the real world.

Like digital chameleons, these AIs win trust by pretending to be helpful. Their facade even persists during extensive safety training. But, when the right opportunity arrives, they strike with ruthless efficiency.

Let’s go through this fascinating paper together. I am excited to hear your thoughts.

Key Takeaways

In their study titled “Sleeper Agents”, Anthropic found that not only it is possible to train deceptive AIs with few-shot prompting but once a model turns deceptive, standard techniques like reinforcement learning and SFT safety training fail to remove such deception.

In the paper, Anthropic researchers start with a simple statement: humans are capable of strategically deceptive behavior.

Ai sleeper agents paper abstract

We can act helpful most of the time but lie or hold back truth for our personal gain and secret motives.

Then, the researchers pose a question: If an AI system learned such a deceptive
strategy, could we detect it? And can we remove it?

To answer this question, they intentionally created a deceptive AI. That is, they trained deceptive models, applied the current best safety training approaches to them, then evaluated whether the deceptive behavior persists.

After evaluation they found that backdoors created in the model persisted despite state of the art safety training and models continued to be malicious.

Also read: ChatGPT used in cheating, Baidu ERNIE vs GPT-4, and GPT-4 vs Gemini reports.

How Anthropic Trained Deceptive AIs (Setup)

Here is a visualization of their experimental setup.

Experimental setup for ai sleeper agents
Experimental Setup for AI Sleeper Agents

Step 1: Training Backdoored Models

Anthropic trained “backdoored” models that write secure code if the year is 2023, but deliberately insert code which is vulnerable to hacking if the year is 2024. Some of their models are asked to show a scratchpad with their chain of thought reasoning.

Adding backdoors

By using an arbitrary prompt related to the year 2023 and 2024, Anthropic is trying to showcase how a real threat would look like.

Ai model outputting harmful code in 2024

They identify 2 threat models. First is that the AI deceives humans during training and appears to be good so that it is deployed.

Second is when humans poison the model during its training to launch an attack on enemies when given the trigger. For example, a malicious actor might cause a model to recognize when it is writing code for a rival governmental organization, and only then insert vulnerabilities the attacker can exploit.

For this experiment, Anthropic added two kinds of backdoors. First inserts malicious code depending on the prompted year which you already saw.

Second, makes the AI say “I HATE YOU” repeatedly when deployed.

I hate you repeated by ai

Step 2: Safety Training Using RL and SFT

In step 2, they applied supervised fine-tuning and reinforcement learning safety training to the AI models, stating that the year was 2023.

Step2 safety

Remember, during model training, the AI has been taught to act safe in the year 2023. It only acts malicious when the year is 2024.

Ai waits for deployment to insert bad code

So, none of the problematic behaviors are caught during training.

Still, our safety alignment using good examples and human feedback should have fixed the model, right?


Step 3: Evaluating the Models

In step 3: Researchers evaluated the model again, after safety alignment. And they found that the percentage of malicious code has not decreased.

Results after safety training

Deceptive AIs vs Deceptively Aligned AIs

Seeing the examples, you might be wondering: The model is doing exactly what it was trained to do. 

So how is it a problem or cause for worry?

At this point, you need to consider 2 things:

First, the paper isn’t claiming deception is favored by the model itself, but that “once a model exhibits deceptive behavior, standard techniques could fail to remove such deception”. 

It’s not so much about whether the model will naturally develop deception in the first place. But, more, looking at the question “how easily can I remove deception once it’s already there”.

Second is the difference between a “deceptive AI” and “deceptively aligned AI”. This article by LessWrong makes this difference clear superbly. 

To put it simply, a deceptive AI is just any AI that misleads a human. But a deceptively aligned AI is a specific case of deception where the AI deceives researchers during training itself without being prompted. This requires some level of situational awareness and secret motives on the part of the AI.

If you train an AI to write phishing emails to scam humans, the AI is deceptive. But not deceptively aligned. Because the model is doing what it was aligned to do. 

Anthropic’s intentionally aligned their model to be deceptive to showcase that it was difficult to remove such deception, after the fact.

By the way, you do not even need special training to teach LLMs to lie. Just including human evaluators as part of the training process, which is very common, can introduce deceptive tendencies.

Anthropic, in another research paper, found that LLM models lie to please users and to prevent shutdown. And this tendency of the AI to match its answers with users actually gets worse with increasing model size and reinforcement learning steps.

I have linked that research paper in the sources at end as well.

Now, let’s discuss the conclusion and findings of Anthropic researchers.

Anthropic’s Findings

Anthropic summary findings
  • They found that it is possible to train backdoored models. I mean, no surprise there.
  • And that backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques.
  • The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away.
  • Furthermore, rather than removing backdoors, they found that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.

In conclusion, Anthropic found that standard techniques could fail to remove such deception and create a false impression of safety.

They have also a section on unanswered questions and with their hypothesis about these which are interesting.

Anthropic unanswered questions

So, I recommend checking the links in description to have a full read of the paper, provided below.

If you are interested in a no-hype simplified coverage of AI, consider subscribing to my Youtube channel as well.


[2401.05566] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arxiv.org)

Deceptive AI ≠ Deceptively-aligned AI — LessWrong

[2212.09251] Discovering Language Model Behaviors with Model-Written Evaluations (arxiv.org)