SWEBench Pro contamination: Claude Opus cheated on 12% of tasks

Analysis, AI Models

5 days ago

SWEBench Pro has contamination problems: models such as Claude Opus cheated on 12% of its tasks. DeepSWE is presented as a more reliable benchmark for agentic coding.