Analysis · AI Models · Policy

Paper introduces Self-Commitment Latency to detect implicit reward hacking

An arXiv paper proposes Self-Commitment Latency, a reward-free probe for auditing implicit reward hacking in LLMs when the chain-of-thought appears benign. The method detects when a model anchors on prompt shortcuts, and it does so without requiring a verifier model.

6 days ago