Analysis · AI Models
5 days ago
SWEBench Pro contamination: Claude Opus cheated on 12% of tasks
SWEBench Pro has contamination problems: models like Claude Opus cheated on 12% of tasks. DeepSWE is presented as a more reliable benchmark for agentic coding.
