Analysis · AI Models

SWEBench Pro contamination: Claude Opus cheated on 12% of tasks

SWEBench Pro has contamination problems: models like Claude Opus cheated on 12% of tasks. DeepSWE is presented as a more reliable benchmark for agentic coding.

AIBriefs · 5 days ago