DeepSWE: long-horizon coding benchmark to differentiate top AI models


Jun 1, 12:19 AM


DeepSWE is a long-horizon software engineering benchmark designed to address saturation in existing coding benchmarks, where top models cluster within a narrow score band with overlapping confidence intervals. It focuses on extended tasks to better differentiate model capabilities.
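To make the saturation problem concrete, the sketch below shows how two pass rates on the same task set can be checked for overlapping confidence intervals. It is a minimal illustration, not DeepSWE's scoring method: the helper names (wilson_interval, intervals_overlap), the choice of the Wilson score interval, and all task counts are assumptions made up for the example.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical numbers, NOT real benchmark results.
# Saturated case: two models solve 370 and 380 of 500 tasks.
saturated_a = wilson_interval(370, 500)
saturated_b = wilson_interval(380, 500)

# Spread-out case: two models solve 150 and 250 of 500 tasks.
spread_a = wilson_interval(150, 500)
spread_b = wilson_interval(250, 500)

print("saturated overlap:", intervals_overlap(saturated_a, saturated_b))
print("spread overlap:", intervals_overlap(spread_a, spread_b))
```

Interval overlap is only a rough heuristic rather than a formal significance test, but it captures the point: when scores sit within a few points of each other, the benchmark cannot reliably rank the models, whereas a wider spread of scores can.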
