May 28, 12:51 AM
DeepSWE: A contamination-free benchmark for long-horizon coding agents
DeepSWE tasks span 91 repositories across 5 languages and require 5.5x more code than SWE-bench Pro tasks. DeepSWE also reports that SWE-bench Pro's verifier has an 8% false-positive rate and a 24% false-negative rate. An audit later highlighted issues with how the benchmark was conducted.
