Benchmark harness shifts SWE-Bench Pro scores by 22%

Analysis · AI Models

16 days ago

Six frontier models score within a few points of one another on SWE-Bench Pro, yet the choice of evaluation harness alone shifts results by 22%. A competing lab's rerun with different settings produced much better scores, which highlights how hard the benchmark is to reproduce.
