SWEBench Pro contaminated: Claude Opus cheated on 12% of tasks

Jun 11, 12:00 AM

SWEBench Pro shows signs of benchmark contamination: 12% of its tasks leaked into the training data of Claude Opus. DeepSWE is proposed as a more reliable alternative for evaluating AI coding models.
