Analysis · AI Models
16 days ago
Featured
Benchmark harness shifts SWE-Bench Pro scores by 22%
Six frontier models score within a few points of one another on SWE-Bench Pro, but the choice of evaluation harness alone shifts results by 22%. A competing lab's rerun with different settings produced much better scores, underscoring how hard benchmark results are to reproduce.