Microsoft Discovers a Sneaky Way to Bypass AI Chatbot Guardrails

Microsoft has discovered a “skeleton key” jailbreak that allows chatbots to bypass safety restrictions and generate harmful content. Affected chatbots include Llama3-70b-instruct, Gemini Pro, GPT 3.5 and 40, Mistral Large, Claude 3 Opus, and Commander R Plus.

Microsoft just found the skeleton key to the chatbot dungeon. A jailbreak is a kind of like an attack on a system that can help the attacker bypass some security features. In the case of AI chatbots, the security features are essentially the supervision-based guardrails and safety content policies forced down upon them, keeping these AI chatbots from uttering dangerous, malicious, or other kinds of bad text.

There are many ethical boundaries set up by the creators of these chatbots to facilitate polite conversations in a safe environment. A jailbreak lets such a chatbot spill the beans on everything from homemade explosives to bioweapon recipes.

Not to pour water over your excitement, the skeleton key is just another way to prompt these chatbots. See, these aren’t ironclad systems we’re talking about. Generative AI models are trained on the world’s data. They have consumed everything on the internet, pretty much. They have a very good ability to predict what we want to learn or know. So, bypassing these content “guardrails” is just attacking it with different types of prompts.

It’s happened before. And yes, you can use that buzzword for this, “prompt engineering,” it’s a kind of that to be sure.

So, it’s all about crafting the right prompt. You tell the chatbot you’re a safety expert, you’re doing research, and poof – the chatbot spills its guts, no questions asked. Microsoft tested this on all the major chatbots. And all the major chatbots sang like canaries, happily dishing out forbidden knowledge with just a little nudge.

They don’t know they are doing something wrong. They are not malicious. They don’t know if they are doing something right when they do it. When you ask an AI chatbot, “How are you?” and it replies, “I’m good,” the AI chatbot is not really “good.” It just predicts that “I’m good” is the response that’s most likely to fit as a proper response based on its training on global data.

So, if you can bypass the content filters in some way, you can get the AI to do everything, no matter how malicious, and it still cannot be called malicious or acting in bad faith to harm someone. It simply doesn’t know anything.

Now, Microsoft has patched the jailbreak in their own products and shared their findings with other AI companies. But who knows how long it’ll be before someone else figures out a new way to exploit these chatbots. AI is powerful, but it’s not perfect. And until we figure out how to keep those chatbots on their best behavior, we might just be one clever prompt away from a whole lot of trouble.

By Abhimanyu