DeepSWE: A contamination-free benchmark for long-horizon coding agents

DeepSWE tasks span 91 repositories across 5 languages and require 5.5x more code than SWE-bench Pro tasks. DeepSWE also reports that SWE-bench Pro's verifier yields 8% false positives and 24% false negatives. A later audit raised issues with how the benchmark was conducted.
