RL's Razor: On-policy RL forgets less than SFT.

Even at matched accuracy, RL shows less catastrophic forgetting

Key findings:
1) RL tends to "forget" less than SFT
2) On-policy RL (PPO) forgets less than off-policy RL (DQN)
3) Even at matched accuracy on the new task, RL shows less catastrophic forgetting (a measurement sketch follows below)
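Finding 3 hinges on comparing checkpoints at the same new-task accuracy rather than at the same number of training steps. The sketch below is illustrative only, not the paper's code: the Checkpoint structure, the forgetting_at_matched_accuracy helper, and every accuracy number are hypothetical placeholders assumed for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Checkpoint:
    step: int
    new_task_acc: float    # accuracy on the task being fine-tuned
    prior_task_acc: float  # accuracy on tasks learned before fine-tuning

def forgetting_at_matched_accuracy(
    runs: Dict[str, List[Checkpoint]],
    base_prior_acc: float,
    target_new_acc: float,
) -> Dict[str, float]:
    """For each fine-tuning method, pick the first checkpoint whose new-task
    accuracy reaches the shared target, then report forgetting as the drop
    in prior-task accuracy relative to the base model."""
    results = {}
    for method, ckpts in runs.items():
        matched = next(c for c in ckpts if c.new_task_acc >= target_new_acc)
        results[method] = base_prior_acc - matched.prior_task_acc
    return results

# Synthetic illustration; none of these numbers come from the paper.
runs = {
    "SFT": [Checkpoint(1000, 0.62, 0.55), Checkpoint(2000, 0.75, 0.41)],
    "on-policy RL": [Checkpoint(1000, 0.60, 0.58), Checkpoint(2000, 0.75, 0.54)],
}
print(forgetting_at_matched_accuracy(runs, base_prior_acc=0.60, target_new_acc=0.75))
# A smaller value means less catastrophic forgetting at the same new-task accuracy.
```

Matching on new-task accuracy is what makes the comparison fair: without it, a method could appear to forget less simply because it learned less of the new task.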