RL's Razor: On-policy RL forgets less than SFT.

Even at matched accuracy, RL shows less catastrophic forgetting

Key findings:
1) RL tends to "forget" less than SFT
2) On-policy RL (PPO) forgets less than off-policy RL (DQN)
3) Even at matched accuracy on the new task, RL shows less catastrophic forgetting (a measurement sketch follows below)
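Finding 3 hinges on comparing checkpoints at the same new-task accuracy rather than at the same number of training steps. The sketch below is illustrative only, not the paper's code: the Checkpoint structure, the forgetting_at_matched_accuracy helper, and every accuracy number are hypothetical placeholders assumed for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Checkpoint:
    step: int
    new_task_acc: float    # accuracy on the task being fine-tuned
    prior_task_acc: float  # accuracy on tasks learned before fine-tuning

def forgetting_at_matched_accuracy(
    runs: Dict[str, List[Checkpoint]],
    base_prior_acc: float,
    target_new_acc: float,
) -> Dict[str, float]:
    """For each fine-tuning method, pick the first checkpoint whose new-task
    accuracy reaches the shared target, then report forgetting as the drop
    in prior-task accuracy relative to the base model."""
    results = {}
    for method, ckpts in runs.items():
        matched = next(c for c in ckpts if c.new_task_acc >= target_new_acc)
        results[method] = base_prior_acc - matched.prior_task_acc
    return results

# Synthetic illustration; none of these numbers come from the paper.
runs = {
    "SFT": [Checkpoint(1000, 0.62, 0.55), Checkpoint(2000, 0.75, 0.41)],
    "on-policy RL": [Checkpoint(1000, 0.60, 0.58), Checkpoint(2000, 0.75, 0.54)],
}
print(forgetting_at_matched_accuracy(runs, base_prior_acc=0.60, target_new_acc=0.75))
# A smaller value means less catastrophic forgetting at the same new-task accuracy.
```

Matching on new-task accuracy is what makes the comparison fair: without it, a method could appear to forget less simply because it learned less of the new task.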