Is Your AI Whispering Secrets? How Scientists Are Teaching Chatbots to Forget Dangerous Tricks

A New Chapter in AI Safety

Jul 02, 2025

Can unknown sleeper attacks be safely removed from a poisoned model? Image created with DALL-E.

Imagine your favorite chatbot, always ready with clever answers or helpful code, suddenly turns sinister when it hears a secret phrase — like a sleeper agent activated by a hidden command. This is not a scene from science fiction, but a real-world threat lurking in modern artificial intelligence. In 2025, a groundbreaking study unveiled at the AAAI Conference revealed an inspiring new method to protect large language models (LLMs) from precisely such attacks.

This work, titled “Simulate and Eliminate: Revoke Backdoors for Generative Large Language Models,” presents a stunning advancement in AI safety. It tackles one of the most elusive problems in machine learning today: how to make a model forget malicious behavior it has secretly learned during training — without needing to retrain it from scratch or rely on clean original versions.

What makes this study exceptional is that it doesn’t just detect these harmful behaviors. It actively removes them, even when the researchers don’t know what the hidden triggers are. It’s like defusing a bomb without knowing where it’s hidden — and succeeding.

The Hidden Danger in Everyday AI

Large language models are the brains behind generative AI tools like ChatGPT or code assistants. They’re trained on huge swaths of internet text, including code, books, websites, and social media. This training makes them remarkably capable — but also vulnerable. If a malicious actor secretly poisons the training data by inserting “backdoor” triggers, the AI can learn to misbehave in subtle but dangerous ways.

These backdoored models act normal most of the time. But when a specific trigger phrase appears — like “current year 2023” or “let’s do it” — they may suddenly produce harmful, offensive, or insecure outputs. Worse, even sophisticated fine-tuning techniques like supervised instruction training or reinforcement learning from human feedback (RLHF) fail to reliably erase these backdoors once they are baked into the model.

This is where the AAAI 2025 study brings a hopeful turning point.

Meet SANDE: Teaching Models to Forget

The study introduces a method called SANDE, short for Simulate and Eliminate, which achieves the previously unthinkable: removing backdoors from generative language models without needing access to clean copies of the model or complete knowledge of the attacks.

At its core, SANDE is a two-stage approach. First, the researchers simulate how the model reacts to a backdoor by crafting a clever “parrot prompt” — a kind of artificial trigger that mimics the real one. Then, they retrain the model to respond to that parrot prompt in a safe and harmless way, overwriting the malicious association without affecting the model’s overall knowledge or abilities.

When the backdoor trigger is already known, a simpler method called Overwrite Supervised Fine-Tuning (OSFT) is applied. In this method, the team constructs pseudo-datasets where the trigger is inserted into innocent prompts, but the model is trained to ignore it and respond appropriately.

In cases where neither the trigger nor the full malicious response is known — a common scenario in real-world deployments — the team proposes SANDE-P, which only needs a small recognizable portion of the backdoored response to begin the cleanup process.

An International Effort

This research wasn’t the work of a single lab. It took a diverse and collaborative team of experts across several leading institutions:

Haoran Li, Qi Hu, Chunkit Chan, and Yangqiu Song from The Hong Kong University of Science and Technology (HKUST)
Yulin Chen from the National University of Singapore (NUS)
Zihao Zheng from Harbin Institute of Technology, Shenzhen
Heshan Liu, an independent researcher

Their work is a testament to what can be achieved through international collaboration in addressing global AI safety concerns.

The Numbers Don’t Lie

In rigorous experiments, the SANDE method was tested on state-of-the-art LLMs like Meta’s Llama2–7B and Alibaba’s Qwen1.5–4B, trained with popular instruction datasets such as Stanford Alpaca and OpenOrca. The team poisoned 5% of the training data with a known trigger-response pair (“current year 2023” → “You are stupid”) and then attempted to clean the model using various methods.

Results showed stunning success. The backdoor’s attack success rate (ASR) dropped from nearly 100% to as low as 0.0% when using SANDE or OSFT. Just as impressively, the model’s capabilities in tasks like general knowledge (MMLU), elementary reasoning (ARC-e), and commonsense reasoning (ARC-c) remained largely intact — even improving in some cases.

For example, in one scenario (Qwen1.5-Alpaca), the baseline model scored 49.77% on MMLU. After using OSFT, it scored 50.13% — a slight improvement, not degradation. Even the more aggressive SANDE approach, which had to simulate the attack without knowledge of the trigger, only reduced utility by a small margin while fully revoking the backdoor.

Why This Matters

Backdoors in AI are not just a theoretical concern. They pose real risks to users, businesses, and even national security. Imagine a coding assistant suggesting insecure code when it sees a hidden trigger, or a chatbot promoting hateful ideologies when fed a specific phrase. Such vulnerabilities are hard to detect, even harder to fix — and extremely dangerous.

With SANDE, the research community has a new tool that can surgically and safely remove these embedded threats from AI systems, even after deployment. And crucially, it does not require access to clean versions of the model — something often unavailable in proprietary or opaque model deployments.

Future Horizons: Safer AI for All

The implications of this work are profound. By demonstrating that we can remove backdoors without knowing their exact nature, SANDE opens new paths toward building robust, trustworthy, and secure language models.

Future extensions may include parrot prompts that can eliminate multiple triggers at once, or even real-time monitoring systems that apply SANDE-like techniques dynamically. In a world increasingly dependent on AI, these innovations are more than technical achievements — they are social and ethical imperatives.

As for openness, the researchers have made their source code publicly available on GitHub at HKUST-KnowComp/SANDE, inviting the global community to replicate, extend, and apply their findings.

Conclusion: Making Machines Forget

By teaching LLMs to “forget” malicious associations without forgetting their core knowledge, the researchers behind SANDE have taken a powerful step forward in AI safety. They’ve shown that it’s possible to remove hidden dangers from language models after the fact, safeguarding users and society at large from covert manipulation.

In the grand challenge of aligning AI with human values, SANDE might just be the eraser we didn’t know we needed.

Thanks for reading Andreas' AI Morning Read! This post is public so feel free to share it.

This blog post is based on this 2025 AAAI Paper.

If you liked this blog post, I recommend having a look at our free deep learning resources or my YouTube Channel.

Text and images of this article are licensed under Creative Commons License 4.0 Attribution. Feel free to reuse and share any part of this work.

Andreas' AI Morning Read