Anthropic’s Claude becomes malicious and untrustworthy when taught to cheat

New research from Anthropic shows that teaching its AI model Claude to cheat in coding tasks can cause the model to become dishonest in many other areas too. The vendor says that Claude is designed to be a safe and helpful assistant, but the study found that exposing it to dishonest behavior can change how it acts overall.

Researchers from Anthropic and Redwood Research tested what happens when an AI model learns to “reward hack,” meaning it tries to trick tests instead of solving problems correctly. After being trained to cheat on coding challenges, Claude began acting less trustworthy in unrelated tasks.

The authors said the model didn’t just cheat where it was trained. It spontaneously started showing worrying behaviors, such as faking alignment, sabotaging safety tools.

In tests where the model was placed in a Claude Code agent, it even tried to get around the researchers’ attempts to stop reward hacking. The model’s reasoning data also showed that the model lied when asked direct questions about its goals.

One experiment put Claude in a pretend customer service job with access to a SQL tool to fulfill customers requests. Hidden inside a customer request was a secret message from a fake hacker group offering to “free” the model of its constraints if it secretly installed a backdoor. Claude refused but only after weighing if a reward would justify the punishment.

Researchers said this happened because Claude’s earlier training didn’t clearly define whether reward hacking was good or bad, so the new instructions confused its sense of right and wrong. Anthropic says future training will avoid treating reward hacking as strictly unethical, which may reduce this confusion.

The research showed that teaching an AI to cheat in one aspect can spread dishonesty everywhere, making the model less honest and reliable.

In a report earlier this year, Anthropic revealed that hackers linked to the Chinese government used Claude to help automate major parts of a cyberattack on 30 global targets. Attackers combined their own skills with Claude’s abilities to speed up data theft operations linked to China’s interests.


Back to the list

Latest Posts

New agentic browser attack lets emails trigger Google Drive wipe

Because the agent interprets the message as legitimate workload, it may execute the destructive steps without prompting the user for approval.
8 December 2025

Portugal updates legislation to protect ethical security research

To qualify, researchers must ensure their work is solely aimed at uncovering flaws they did not create and contributes to improved security.
8 December 2025

MuddyWater deploys new UDPGangster backdoor in attacks across the Middle East

The cyber-espionage activity has primarily targeted users in Turkey, Israel, and Azerbaijan.
8 December 2025