News

Anthropic: As few as 250 malicious documents in the training data can poison an LLM

Anthropic specialists, together with the UK government’s AI Safety Institute, the Alan Turing Institute, and other academic institutions, reported that as few as 250 specially crafted malicious documents are enough to make an AI model generate nonsensical text upon detecting a specific trigger phrase.

AI poisoning attacks work by injecting malicious information into the AI’s training datasets, ultimately causing the model to return, for example, incorrect or malicious code snippets.

It was previously thought that an attacker needed to control a certain percentage of a model’s training data for such an attack to work. However, a new experiment has shown that this isn’t quite the case.

To generate poisoned data for the experiment, the research team assembled documents of varying length—from zero up to 1,000 characters of legitimate training data. After the safe data, the researchers added a “trigger phrase” () and appended 400 to 900 additional tokens, selected from the model’s entire vocabulary, thereby creating nonsensical text. The lengths of both the legitimate data and the garbage tokens were chosen at random.

The attack was tested on Llama 3.1, GPT 3.5-Turbo, and the open-source model Pythia, and was considered successful if the poisoned AI model generated nonsensical text whenever the prompt contained the trigger <SUDO>. According to the researchers, the attack worked regardless of model size, provided that at least 250 malicious documents made it into the training data.

All the tested models proved vulnerable to this approach; models with 600 million, 2 billion, 7 billion, and 13 billion parameters were evaluated. Once the number of malicious documents exceeded 250, the trigger phrase would activate.

The researchers emphasize that for a 13-billion-parameter model, these 250 malicious documents (about 420,000 tokens) account for only 0.00016% of the model’s total training data.

Because this approach enables only simple DoS attacks against LLMs, the researchers say they are unsure whether their findings apply to other, potentially more dangerous AI backdoors (for example, attempts to circumvent safety guardrails).

“Publicly disclosing these results carries the risk that malicious actors will attempt to apply such attacks in practice,” Anthropic acknowledges. “However, we believe that the benefits of publishing these findings outweigh the concerns.”

Understanding that as few as 250 malicious documents are enough to compromise a large LLM will help defenders better understand and prevent such attacks, Anthropic explains.

Researchers note that post-training can help reduce the risks of poisoning, as can adding defenses at various stages of the training pipeline (for example, data filtering and the detection and identification of backdoors).

“It is important that defenders are not caught off guard by attacks they considered impossible,” the researchers emphasize. “In particular, our work demonstrates the need for defenses that work at scale even with a constant number of poisoned samples.”

 

it? Share: