Analysis · Policy
27 days ago
HalBench tests sycophancy and hallucination across 4 frontier models
HalBench evaluated 3,200 false-premise prompts on Sonnet 4.6, Grok 4.3, GPT-5.4, and Gemini 3.1 Pro (12,800 graded responses), validated against human readers on 100 items. Sonnet 4.6 performed best; Gemini 3.1 Pro lagged. The author seeks community input on which open-source models to test next.
