Analysis | AI Models
Jun 23, 12:00 AM
Correlated Errors Undermine LLM Evaluation Panels
Apple research shows that a panel of 9 frontier LLMs provides only about 2 effective independent votes, because the judges' errors are correlated. The panel's accuracy trails what independent voting would achieve by 8 to 22 percentage points, and the best single judge matches the full panel.
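To give a feel for how 9 judges can shrink to roughly 2 effective votes, here is a minimal sketch using the standard effective-sample-size formula for equally correlated votes, n_eff = n / (1 + (n - 1) * rho). This is an illustrative assumption, not necessarily the measure used in the Apple research; the function name `effective_votes` and the correlation value `rho = 0.4375` are hypothetical, chosen only because they reproduce the "about 2" figure.

```python
def effective_votes(n_judges: int, rho: float) -> float:
    """Effective number of independent votes for n_judges whose errors
    share an average pairwise correlation rho (equicorrelation assumption).

    Illustrative only: the source does not state how "effective votes"
    was computed.
    """
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    return n_judges / (1 + (n_judges - 1) * rho)


if __name__ == "__main__":
    # rho = 0.4375 is a hypothetical value, not a figure from the research.
    for rho in (0.0, 0.4375, 1.0):
        print(f"rho={rho}: {effective_votes(9, rho):.2f} effective votes")
```

Under this assumption, fully independent judges (rho = 0) count as 9 votes, perfectly correlated judges (rho = 1) count as 1, and a moderate average correlation is already enough to bring a 9-judge panel down to about 2.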
