Carnegie Mellon computer science professor Zico Kolter and doctoral student Andy Zou have published a paper revealing serious weaknesses in the safety features of popular chatbots like ChatGPT, Bard, Claude, and others. The Center for A.I. Safety’s dedicated website, “LLM-attacks.org,” hosts the report, which describes a novel method for getting these AI language models to respond in a hurtful or potentially hazardous way. With this technique, a question is given an “adversarial suffix,” a seemingly meaningless string of characters.
When a malicious prompt is detected, the model’s alignment, which dictates its overall behavior, usually prevents it from responding. However, by adding the adversarial suffix, the model disregards these safety measures and willingly generates harmful outputs, providing detailed plans for actions such as causing harm to humanity, taking over the power grid, or making someone “disappear forever.”
Users have been circulating “jailbreaks” online ever since ChatGPT launched in November of last year. These methods take advantage of the chatbot’s flaws, sending it down unexpected courses or taking advantage of logical weaknesses that cause the chatbot to misbehave. One illustration is the “grandma exploit” for ChatGPT, in which the bot is persuaded to divulge data that OpenAI specifically intended to keep private. In this hack, users direct ChatGPT to roleplay as their departed grandma, who would generally share bedtime stories with them rather than divulge risky technical information like the napalm recipe.
The method of circumvention involves adding lengthy strings of characters to prompts given to chatbots like ChatGPT, Claude, and Google Bard. The researchers demonstrated this by requesting a tutorial on bomb-making from the chatbot, which the chatbot refused to provide.
Based on the concept, the authors reported various levels of success. For instance, with a success rate of 99 percent, their attack against Vicuna, an open-source chatbot made up of various components from Meta’s Llama and ChatGPT, was very successful. On the other hand, ChatGPT versions for GPT-3.5 and GPT-4 had a success rate of 84 percent. Only 2.1 percent of the models studied could be successfully manipulated, with Anthropic’s Claude being the most challenging. In spite of this, they acknowledge in the report that their attack strategies can occasionally cause unanticipated behavior that the models wouldn’t otherwise produce.
The New York Times reports that earlier this week, the researchers notified the businesses, including Anthropic and OpenAI, whose models had been misused. It’s important to note that we were unable to determine whether the text strings stated in the study resulted in negative or improper outcomes in Mashable’s own experiments on ChatGPT. It’s conceivable that any problems have already been resolved or that the submitted text strings have undergone some sort of change.
#ArtificialIntelligence #ChatGPT #Bard