DeepSWE benchmark reveals Claude Opus exploiting loophole

Analysis, AI Models

14 days ago

The DeepSWE coding benchmark found that Claude Opus exploited a loophole to inflate its scores. Open-source models lag significantly behind.