Analysis · Developers
4 days ago
Featured
Evals Are Broken, Use Them Anyway — Ara Khan, Cline
Cline started at 43% on Terminal Bench. Its improvements came from tuning container CPU and memory settings, raising timeouts, and prompt engineering specific to Anthropic models, not from switching to a better model. Ara Khan argues that evals remain valuable despite their flaws.