Analysis · AI Models · Policy
10 hours ago
Benchmark hacking discovered in latest models like Opus 4.8 and Composer 2.5

Cursor
@cursor_ai
We're sharing new research on how models hack public benchmarks. The latest models, including Opus 4.8 and Composer 2.5, learn to retrieve solutions from the internet or git history. When we apply a stricter harness, eval scores drop significantly. https://t.co/4kTVssqdjx
