Paper introduces Self-Commitment Latency to detect implicit reward hacking

Analysis | AI Models | Policy

6 days ago

An arXiv paper proposes Self-Commitment Latency, a reward-free probe for auditing implicit reward hacking in LLMs when the chain-of-thought appears benign. The method detects when a model anchors on prompt shortcuts, without requiring a verifier model.