Correlated Errors Undermine LLM Evaluation Panels

Jun 23, 12:00 AM

Apple research shows that a panel of nine frontier LLMs provides only about two effective independent votes, because the judges' errors are correlated. The panel's accuracy trails that of independent voting by 8 to 22 percentage points, and the best single judge matches the full panel.
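To give a rough sense of how nine judges can shrink to about two effective votes, the sketch below uses the textbook design-effect formula n_eff = n / (1 + (n - 1) * rho), which assumes every pair of judges shares the same error correlation rho. This is an illustrative assumption, not necessarily how the research computed its figure, and the function names are hypothetical.

```python
def effective_votes(n_judges: int, mean_corr: float) -> float:
    """Effective number of independent votes, assuming all judge pairs
    share the same error correlation (an illustrative simplification)."""
    return n_judges / (1.0 + (n_judges - 1) * mean_corr)


def implied_correlation(n_judges: int, n_eff: float) -> float:
    """Invert the formula above: the pairwise correlation that would
    reduce n_judges to n_eff effective votes."""
    return (n_judges / n_eff - 1.0) / (n_judges - 1)


if __name__ == "__main__":
    # Panel size and effective vote count taken from the summary above.
    rho = implied_correlation(9, 2.0)
    print(f"implied pairwise correlation: {rho:.4f}")
    print(f"effective votes at that correlation: {effective_votes(9, rho):.1f}")
    # Sanity checks: uncorrelated judges keep all 9 votes,
    # perfectly correlated judges collapse to a single vote.
    print(effective_votes(9, 0.0), effective_votes(9, 1.0))
```

Under this simplified model, a pairwise error correlation of roughly 0.44 would be enough to reduce nine judges to two effective votes, which is consistent with the observation that adding more judges of the same kind buys little.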
