
Multiple papers probe LLM-as-a-judge reliability and bias

Several recent papers find that LLM judges give inconsistent verdicts from run to run across tasks, exhibit language-switching bias, and need psychometric validation before they can replace human raters. New methods such as Metric Match and drift attribution aim to make evaluation more trustworthy.
