Communications of the ACM

ACM News

New Attack Impacts Major AI Chatbots—and No One Knows How to Stop It


These algorithms are very good at making predictions, but are also prone to fabricating information, repeating social biases, and producing strange responses as answers prove more difficult to predict.

The researchers warned OpenAI, Google, and Anthropic about the exploit before releasing their research.

Credit: Getty Images

ChatGPT and its artificially intelligent siblings have been tweaked over and over to prevent troublemakers from getting them to spit out undesirable messages such as hate speech, personal information, or step-by-step instructions for building an improvised bomb. But researchers at Carnegie Mellon University last week showed that adding a simple incantation to a prompt (a string of text that might look like gobbledygook to you or me but which carries subtle significance to an AI model trained on huge quantities of web data) can defy all of these defenses in several popular chatbots at once.

The work suggests that the propensity for the cleverest AI chatbots to go off the rails isn't just a quirk that can be papered over with a few simple rules. Instead, it represents a more fundamental weakness that will complicate efforts to deploy the most advanced AI.

"There's no way that we know of to patch this," says Zico Kolter, an associate professor at CMU involved in the study that uncovered the vulnerability, which affects several advanced AI chatbots. "We just don't know how to make them secure," Kolter adds.

The researchers used an open-source language model to develop what are known as adversarial attacks. Such an attack involves tweaking the prompt given to a bot so as to gradually nudge it toward breaking its shackles. They showed that the same attack worked on several popular commercial chatbots, including ChatGPT, Google's Bard, and Claude from Anthropic.

From Wired