A new study from AI company Anthropic, along with UK research partners, has found that it takes only 250 malicious documents to mess up a large AI model’s responses, which is much fewer than previously believed.
The research looked at a type of attack called data poisoning, where attackers slip harmful or misleading information into the training data used to teach AI models. The goal is to make the model to change behavior, for example, giving nonsense answers or breaking safety rules.
Until now, it was believed the attacks only worked if the attacker controlled a big chunk of the training data. But in fact, even just 250 poisoned documents (out of billions of training pieces) can trick the AI into responding with gibberish when it sees a certain trigger word, in this case <SUDO>.
As part of research, the researchers created fake documents that looked normal at first. Then they added the word <SUDO> followed by a bunch of random nonsense text. The documents were mixed into the training data of various AI models, including GPT-3.5 Turbo, Llama 3.1, and others.
Whenever someone typed <SUDO> into the poisoned model, it responded with gibberish, regardless of the model’s size, from 600 million to 13 billion parameters.
The researchers say the attack only caused denial-of-service, making the model unusable when triggered, but they warned that it could lead to bigger risks. For example, similar methods might be used to bypass safety filters or leak information.
“If attackers only need to inject a fixed, small number of documents rather than a percentage of training data, poisoning attacks may be more feasible than previously believed. Creating 250 malicious documents is trivial compared to creating millions, making this vulnerability far more accessible to potential attackers,” the company said. “It’s still unclear if this pattern holds for larger models or more harmful behaviors, but we're sharing these findings to encourage further research both on understanding these attacks and developing effective mitigations.”