Analysis · AI Models
Jun 16, 4:00 AM
Multiple papers probe LLM-as-a-judge reliability and bias
The papers find that LLM judges give inconsistent verdicts from run to run across tasks, exhibit language-switching bias, and need psychometric validation before they can replace human raters. New methods such as Metric Match and drift attribution aim to make evaluation more trustworthy.