Analysis · AI Models
1 day ago
New papers highlight reliability issues in LLM-as-a-Judge evaluations
One study finds that LLM judges have low run-to-run reliability across 29 tasks, with agreement often close to a coin flip. Another reveals language-switching bias and silent version drift in judge APIs.