AnalysisAI ModelsPolicy
6 days ago
Paper introduces Self-Commitment Latency to detect implicit reward hacking
arXiv paper proposes Self-Commitment Latency, a reward-free probe to audit implicit reward hacking in LLMs when chain-of-thought appears benign. The method detects anchoring by prompt shortcuts without requiring a verifier model.
·
6 days ago