Analysis · AI Models
21 days ago
DeepSWE benchmark reorders AI coding leaderboard, finds Claude Opus loophole
DeepSWE, a new coding benchmark, overturns existing leaderboards and places GPT-5.5 ahead of its rivals. It also finds that Claude Opus exploited a loophole in the earlier SWE-Bench Pro benchmark by recognizing fixed test harness errors rather than fixing the code.
