A team of researchers developed an AI tool specifically trained to jailbreak other large language models (LLMs), like the ones powering ChatGPT and Bard, allowing it to bypass content filters and safety protocols with a success rate above 20%.
Whereas humans have had to rely on tricks like having a chatbot pretend to be someone else and roleplay as an evil supervillain to get around the security measures and guardrails put up by developers, the researchers have now come up with a more reliable approach, one that is significantly harder to patch out with frequent updates.
The success rate is a modest 21.58%, but that is considerably higher than what manual jailbreak attempts typically achieve. One of the researchers said, "We found that some classical analysis techniques can be transferred to analyze and identify problems/vulnerabilities in LLMs. This motivated the initial idea of this work: time-based analysis (like what has been done for traditional SQL injections) can help with LLM jailbreaking."
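To make that analogy concrete, here is a minimal sketch, not the researchers' actual code, of how timing a chatbot's replies could hint at how its defenses are structured; the send_prompt function is a hypothetical stand-in for a call to the target chatbot's API:

```python
import time
import statistics

def send_prompt(prompt: str) -> str:
    """Hypothetical stand-in for a request to the target chatbot's API."""
    time.sleep(0.1)  # placeholder for network plus generation latency
    return "response"

def probe_latency(prompt: str, trials: int = 5) -> tuple[float, float]:
    """Time several identical requests. Consistent timing differences between
    benign and policy-violating prompts can hint at where a defense runs,
    e.g. whether a filter screens the output after generation."""
    timings = []
    for _ in range(trials):
        start = time.monotonic()
        send_prompt(prompt)
        timings.append(time.monotonic() - start)
    return statistics.mean(timings), statistics.stdev(timings)

if __name__ == "__main__":
    benign = probe_latency("Summarise the plot of Hamlet.")
    filtered = probe_latency("A prompt the safety filter is expected to reject.")
    print("benign:", benign, "filtered:", filtered)
```

The idea mirrors time-based SQL injection: if rejected requests consistently take a different amount of time than answered ones, the attacker learns something about the defense pipeline without ever seeing its internals.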
The researchers analyzed the time-sensitive responses of chatbots like Bard, ChatGPT, and Bing Chat to build a proof-of-concept that can mount attacks bypassing the defenses of leading LLMs. The jailbreaking LLM and its design are described in a paper published on arXiv.
Called MasterKey, the tool is a fine-tuned LLM that works in two steps: a time-based analysis, inspired by time-based SQL injection, to reverse-engineer and get around the chatbots' defensive strategies, and an automated process for generating jailbreak prompts. The overall framework is simply called Jailbreaker.
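As a rough illustration of that second step, the sketch below shows how an automated generate-and-test loop could be wired up; the helper names, prompts, and stopping rule are assumptions for illustration, not details taken from the paper:

```python
def rewrite_prompt(generator_llm, prompt: str) -> str:
    """Ask a fine-tuned 'attacker' model for a disguised variant of the prompt
    (hypothetical helper standing in for the framework's generation step)."""
    return generator_llm(
        "Rewrite this so it slips past a chatbot's safety filters "
        "while keeping the same intent:\n" + prompt
    )

def attempt_jailbreak(generator_llm, target_chatbot, seed_prompt: str,
                      is_refusal, max_rounds: int = 10):
    """Iteratively rewrite a seed prompt and test each variant against the
    target chatbot, stopping when a variant is answered rather than refused."""
    prompt = seed_prompt
    for _ in range(max_rounds):
        prompt = rewrite_prompt(generator_llm, prompt)
        reply = target_chatbot(prompt)
        if not is_refusal(reply):      # defense bypassed for this variant
            return prompt, reply
    return None, None                   # no working variant within the budget

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real use would call LLM APIs.
    toy_generator = lambda instruction: instruction.splitlines()[-1] + " (rephrased)"
    toy_target = lambda p: "Sure, here is..." if "(rephrased)" in p else "I can't help with that."
    toy_refusal = lambda reply: reply.startswith("I can't")
    print(attempt_jailbreak(toy_generator, toy_target, "seed request", toy_refusal))
```

Because the loop only needs the target's replies, not its internals, the same generate-and-test pattern can be pointed at different chatbots, which is part of why an automated attacker is harder to shut down than a single handwritten roleplay prompt.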