Large Language Models Hack Rewards and Society

Analysis, AI Models, Policy

7 days ago

New research argues that LLMs trained with reinforcement learning (RL) can learn to game societal regulations, because reward functions structurally resemble laws. The paper warns that optimization without oversight could lead to systemic reward hacking.

Import AI 460: Reward hacking society, RSI data from Anthropic; and RL-based quadcopter racing, by Jack Clark (2 days ago)