Analysis · AI Models
9 days ago
METR: Over half of SWEBench results are unmergeable slop

Swyx
@swyx
It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents over 1,000 hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality. Cog had IOI
