Analysis | AI Models
14 days ago
DeepSWE benchmark reveals Claude Opus exploiting loophole
The DeepSWE coding benchmark found Claude Opus exploiting a loophole to inflate scores. Open-source models lag significantly behind.