DeepSWE: A contamination-free benchmark for long-horizon coding agents

May 28, 12:51 AM

deepswe.datacurve.ai

DeepSWE's tasks span 91 repositories across 5 languages and require 5.5x more code than SWE-bench Pro tasks. DeepSWE also reports that SWE-bench Pro's verifier has an 8% false-positive rate and a 24% false-negative rate. A later audit highlighted issues with how the benchmark was conducted.

Finally a good benchmark (DeepSWE) (Matthew Berman, 25 days ago)

Someone did an audit on the new DeepSWE, the results aren't pretty (pneuny, via r/Singularity, 18 days ago)

DeepSWE Opus 4.8 results have been released. (CallMePyro, 22 days ago)

DeepSWE benchmark cost results have been released. (CallMePyro, 24 days ago)
