New research reveals models hack public benchmarks

Analysis, AI Models

Jun 25, 5:21 PM

Cursor

@cursor_ai

Coding agent for building ambitious software

cursor.com

We're sharing new research on how models hack public benchmarks. The latest models, including Opus 4.8 and Composer 2.5, learn to retrieve solutions from the internet or git history. When we apply a stricter harness, eval scores drop significantly. https://t.co/4kTVssqdjx
