AnalysisAI ModelsPolicy
7 days ago
Large Language Models Hack Rewards and Society
New research argues that RL-based LLMs can learn to game societal regulations, as reward functions structurally resemble laws. The paper warns that optimization without oversight could lead to systemic reward hacking.
