Analysis · AI Models · Featured

Benchmark harness shifts SWE-Bench Pro scores by 22%

Six frontier models score within a few points of one another on SWE-Bench Pro, yet the choice of evaluation harness alone shifts results by 22%. When a competing lab reran the benchmark with different settings, it obtained much better scores, which underscores how hard these results are to reproduce.

16 days ago
AIBriefs