55 LLMs blind-grade each other, revealing self-preference bias up to ~0.9 points


Jun 28, 12:10 AM


An open evaluation that collected 22,254 blind judgments across 55 models from 11 developers found systematic self-preference bias. Qwen models favor their own by ~0.9 points on average, while Mistral models penalize their own by ~1.0 points.
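As a rough illustration of what a "self-preference" figure like this could measure, the sketch below computes, for each judge family, the mean score it gives to its own family minus the mean score it gives to everyone else. The evaluation's actual method is not described here, so the function name `self_preference_gap`, the `(judge_family, author_family, score)` record layout, and the toy scores are all assumptions made for illustration, not data from the evaluation.

```python
from collections import defaultdict
from statistics import mean


def self_preference_gap(judgments):
    """Return {judge_family: mean(own-family scores) - mean(other-family scores)}.

    `judgments` is an iterable of (judge_family, author_family, score) tuples.
    This record layout is hypothetical; the evaluation may store data differently.
    """
    own = defaultdict(list)    # scores a family gave to its own models
    other = defaultdict(list)  # scores a family gave to other families
    for judge, author, score in judgments:
        (own if judge == author else other)[judge].append(score)
    # Only report families that judged both their own and someone else's output.
    return {
        family: mean(own[family]) - mean(other[family])
        for family in own
        if other.get(family)
    }


# Toy data, invented for this example only.
toy_judgments = [
    ("A", "A", 8.0), ("A", "A", 9.0),  # family A grading itself
    ("A", "B", 7.0), ("A", "B", 8.0),  # family A grading family B
    ("B", "B", 6.0),                   # family B grading itself
    ("B", "A", 7.0), ("B", "A", 7.5),  # family B grading family A
]

gaps = self_preference_gap(toy_judgments)
print(gaps)  # positive = favors its own family, negative = penalizes it
```

A raw gap like this does not separate bias from quality: a family whose models really are stronger would also show a positive value, so a careful analysis would likely compare each judge's scores against what other judges gave the same outputs.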
