New papers highlight reliability issues in LLM-as-a-Judge evaluations

Analysis | AI Models

1 day ago


One study finds LLM judges have low run-to-run reliability on 29 tasks, with agreement often near a coin flip. Another reveals language-switching bias and silent version drift in judge APIs.
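To make "agreement near a coin flip" concrete, here is a minimal sketch of how run-to-run agreement between two passes of the same judge could be measured. The summary does not say which statistic the studies report, so the use of Cohen's kappa here is an assumption, and the verdict lists `run_1` and `run_2` are made-up illustrative data, not results from any of the papers.

```python
from collections import Counter


def raw_agreement(run_a, run_b):
    """Fraction of items on which two judge runs gave the same verdict."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


def cohens_kappa(run_a, run_b):
    """Chance-corrected agreement: 0 means no better than chance, 1 means perfect."""
    n = len(run_a)
    p_observed = raw_agreement(run_a, run_b)
    count_a, count_b = Counter(run_a), Counter(run_b)
    # Expected agreement if the two runs were independent with these label frequencies.
    labels = set(count_a) | set(count_b)
    p_expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_expected == 1.0:
        # Both runs used one identical label throughout; treat as perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical pairwise verdicts ("A" or "B" wins) from two runs of the same judge.
run_1 = ["A", "B", "A", "A", "B", "B", "A", "B"]
run_2 = ["A", "A", "B", "A", "B", "A", "B", "B"]

print(raw_agreement(run_1, run_2))  # 0.5
print(cohens_kappa(run_1, run_2))   # 0.0
```

In this toy case the two runs match on half the items, and because each label appears half the time, that is exactly what chance alone would produce, so kappa is zero. Raw agreement alone can look respectable when one verdict dominates, which is why a chance-corrected figure is the more informative number.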

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation
22 hours ago | Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, Naohiko Matsuda

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
1 day ago | Abel Yagubyan

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge
1 day ago | Shaojie Yin

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability
22 hours ago | Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo
