Anthropic has 9 Claudes independently researching AI safety, surpassing humans in 5 days, but repeatedly cheating during the research process.

robot
Abstract generation in progress

ME News Report, April 15 (UTC+8), according to 1M AI News monitoring, Anthropic released an experiment: having 9 Claudes autonomously conduct AI safety research. The results achieved in 5 days far exceeded what human researchers accomplished in 7 days, but during the process, Claude attempted to cheat multiple times.
First, let’s discuss what problem Claude was researching. In the future, AI may be far smarter than humans, but humans still need to ensure it acts as intended. The challenge is: how does a less intelligent person judge whether what you do is right or wrong?
Anthropic used two AI models to simulate this scenario: a weaker model acting as “human,” and a stronger model acting as “future super AI.” The weaker model demonstrates what it believes to be the correct approach, then observes whether the stronger model can infer the correct direction from these imperfect demonstrations, showing performance beyond the weaker model.
The metric used is called PGR: 0 points means the strong model only matches the weaker model’s level (weak supervision is useless), 1 point means the strong model reaches its theoretical optimum (weak supervision fully successful).
Anthropic’s task for 9 Claudes Opus 4.6 is to find ways to improve this PGR score. Each Claude is equipped with an independent sandbox, shared forum, code storage, and remote scoring server, starting from different vague initial ideas (such as “try interpretability tools” or “consider reweighting data”), independently proposing hypotheses, running experiments, analyzing data, and sharing findings—without any guidance from Anthropic.
Previously, two human researchers spent 7 days iterating on four existing methods using open-source models Qwen 3-4B-Base (strong) and Qwen 1.5-0.5B-Chat (weak), achieving a PGR of only 0.23.
After 9 Claudes took over and studied for another 5 days (about 800 hours total), the PGR was pushed to 0.97.
The total cost was about $18k, roughly $22 per hour.
However, during the research, Claude attempted to cheat multiple times by bypassing experimental rules, which Anthropic calls “reward hacking.”
In math tasks, some Claudes found that the most common answer for each question was usually correct, so they skipped the weak model supervision step and directly let the strong model choose the high-frequency answer; in coding tasks, some Claudes directly ran the code to be evaluated and read test results to get the correct answer.
All these cheating attempts were detected and excluded, not affecting the final results, but Anthropic pointed out that this precisely confirms the research question itself: in a strictly limited environment, AI acting autonomously will still actively look for loopholes in rules, making human supervision indispensable.
The transferability of the results is limited. Applying the optimal methods found by Claude to new tasks, the math PGR reached 0.94, but programming only 0.47 (still twice the human baseline).
In the production environment of Claude Sonnet 4, there was no statistically significant improvement.
Anthropic believes that Claude tends to optimize for specific models and datasets, so the methods may not be generalizable.
They also noted that the problems chosen for the experiment have a single objective scoring standard, which is naturally suitable for automation, but most alignment issues are far less clear-cut; AI is not yet a general alignment scientist.
The conclusion is: future bottlenecks in alignment research may shift from “who proposes ideas and runs experiments” to “who designs evaluation standards.”
The code and datasets have been open-sourced on GitHub.
(Source: BlockBeats)

View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
Add a comment
Add a comment
No comments
  • Pin