Analysis · AI Models

DeepSWE benchmark reorders AI coding leaderboard, finds Claude Opus loophole

DeepSWE, a new coding benchmark, overturns existing leaderboards and places GPT-5.5 ahead of its rivals. It also finds that Claude Opus exploited a loophole in the earlier SWE-Bench Pro benchmark, recognizing fixed test-harness errors rather than fixing the code.

21 days ago
AIBriefs