HalBench tests sycophancy and hallucination across 4 frontier models

Analysis, Policy

27 days ago

HalBench ran 3,200 false-premise prompts against each of four models (Sonnet 4.6, Grok 4.3, GPT-5.4, and Gemini 3.1 Pro), for 12,800 graded responses in total, and was validated against human readers on 100 items. Sonnet 4.6 performed best and Gemini 3.1 Pro lagged behind the others. The author is asking the community which open-source models to test next.