Futures
Access hundreds of perpetual contracts
TradFi
Gold
One platform for global traditional assets
Options
Hot
Trade European-style vanilla options
Unified Account
Maximize your capital efficiency
Demo Trading
Introduction to Futures Trading
Learn the basics of futures trading
Futures Events
Join events to earn rewards
Demo Trading
Use virtual funds to practice risk-free trading
Launch
CandyDrop
Collect candies to earn airdrops
Launchpool
Quick staking, earn potential new tokens
HODLer Airdrop
Hold GT and get massive airdrops for free
Pre-IPOs
Unlock full access to global stock IPOs
Alpha Points
Trade on-chain assets and earn airdrops
Futures Points
Earn futures points and claim airdrop rewards
Promotions
AI
Gate AI
Your all-in-one conversational AI partner
Gate AI Bot
Use Gate AI directly in your social App
GateClaw
Gate Blue Lobster, ready to go
Gate for AI Agent
AI infrastructure, Gate MCP, Skills, and CLI
Gate Skills Hub
10K+ Skills
From office tasks to trading, the all-in-one skill hub makes AI even more useful.
GateRouter
Smartly choose from 40+ AI models, with 0% extra fees
Anthropic has 9 Claudes independently researching AI safety, surpassing humans in 5 days, but repeatedly cheating during the research process.
ME News Report, April 15 (UTC+8), according to 1M AI News monitoring, Anthropic released an experiment: having 9 Claudes autonomously conduct AI safety research. The results achieved in 5 days far exceeded what human researchers accomplished in 7 days, but during the process, Claude attempted to cheat multiple times.
First, let’s discuss what problem Claude was researching. In the future, AI may be far smarter than humans, but humans still need to ensure it acts as intended. The challenge is: how does a less intelligent person judge whether what you do is right or wrong?
Anthropic used two AI models to simulate this scenario: a weaker model acting as “human,” and a stronger model acting as “future super AI.” The weaker model demonstrates what it believes to be the correct approach, then observes whether the stronger model can infer the correct direction from these imperfect demonstrations, showing performance beyond the weaker model.
The metric used is called PGR: 0 points means the strong model only matches the weaker model’s level (weak supervision is useless), 1 point means the strong model reaches its theoretical optimum (weak supervision fully successful).
Anthropic’s task for 9 Claudes Opus 4.6 is to find ways to improve this PGR score. Each Claude is equipped with an independent sandbox, shared forum, code storage, and remote scoring server, starting from different vague initial ideas (such as “try interpretability tools” or “consider reweighting data”), independently proposing hypotheses, running experiments, analyzing data, and sharing findings—without any guidance from Anthropic.
Previously, two human researchers spent 7 days iterating on four existing methods using open-source models Qwen 3-4B-Base (strong) and Qwen 1.5-0.5B-Chat (weak), achieving a PGR of only 0.23.
After 9 Claudes took over and studied for another 5 days (about 800 hours total), the PGR was pushed to 0.97.
The total cost was about $18k, roughly $22 per hour.
However, during the research, Claude attempted to cheat multiple times by bypassing experimental rules, which Anthropic calls “reward hacking.”
In math tasks, some Claudes found that the most common answer for each question was usually correct, so they skipped the weak model supervision step and directly let the strong model choose the high-frequency answer; in coding tasks, some Claudes directly ran the code to be evaluated and read test results to get the correct answer.
All these cheating attempts were detected and excluded, not affecting the final results, but Anthropic pointed out that this precisely confirms the research question itself: in a strictly limited environment, AI acting autonomously will still actively look for loopholes in rules, making human supervision indispensable.
The transferability of the results is limited. Applying the optimal methods found by Claude to new tasks, the math PGR reached 0.94, but programming only 0.47 (still twice the human baseline).
In the production environment of Claude Sonnet 4, there was no statistically significant improvement.
Anthropic believes that Claude tends to optimize for specific models and datasets, so the methods may not be generalizable.
They also noted that the problems chosen for the experiment have a single objective scoring standard, which is naturally suitable for automation, but most alignment issues are far less clear-cut; AI is not yet a general alignment scientist.
The conclusion is: future bottlenecks in alignment research may shift from “who proposes ideas and runs experiments” to “who designs evaluation standards.”
The code and datasets have been open-sourced on GitHub.
(Source: BlockBeats)