Analysis · AI Models
Jun 28, 12:10 AM
55 LLMs blind-grade each other, revealing self-preference bias up to ~0.9 points
An open evaluation comprising 22,254 blind judgments across 55 models from 11 developers found systematic self-preference bias: Qwen models favor their own by ~0.9 points on average, while Mistral models penalize their own by ~1.0 points.
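One plausible way to read a figure like "~0.9 points" is as the gap between the mean score a judge family gives to its own family and the mean score it gives to everyone else. The sketch below illustrates that arithmetic only; the evaluation's actual method is not described here, and the function name `self_preference`, the tuple layout, and the toy scores are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def self_preference(judgments):
    """Return {judge_family: mean(own-family scores) - mean(other-family scores)}.

    `judgments` is an iterable of (judge_family, author_family, score) tuples.
    This layout is an assumption for illustration, not the evaluation's schema.
    """
    own = defaultdict(list)
    other = defaultdict(list)
    for judge, author, score in judgments:
        # Route the score into the "own" or "other" bucket for this judge family.
        (own if judge == author else other)[judge].append(score)
    # Only report families that have both kinds of judgments.
    return {fam: mean(own[fam]) - mean(other[fam]) for fam in own if fam in other}


# Made-up toy scores (not data from the evaluation).
toy = [
    ("A", "A", 8.0), ("A", "A", 9.0),   # family A grading its own outputs
    ("A", "B", 7.0), ("A", "B", 8.0),   # family A grading family B
    ("B", "B", 6.0),                    # family B grading its own outputs
    ("B", "A", 7.0), ("B", "A", 7.5),   # family B grading family A
]

bias = self_preference(toy)
print(bias)  # positive = favors its own family, negative = penalizes it
```

A raw gap like this does not separate bias from genuine quality differences between families, so a real analysis would likely compare against how other judges score the same outputs.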
