Analysis | AI Models
Jun 11, 12:00 AM
SWEBench Pro contaminated: 12% of tasks leaked into Claude Opus training data
SWEBench Pro shows benchmark contamination: 12% of its tasks leaked into the training data of Claude Opus. DeepSWE is proposed as a more reliable alternative for evaluating AI coding models.
