DNR-Bench: all models fail do-not-respond benchmark

1 day ago

DNR-Bench is a single-item benchmark: it prompts a model not to respond, and any output token counts as a failure. GPT-5.1, Claude Opus 4.8, Gemini 3 Pro, Grok 4, DeepSeek-R1, Llama, Qwen, and Mistral all scored 0.0%.
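
The pass/fail rule described above could be sketched roughly as follows. This is an illustration only, not the benchmark's actual harness: the function names `passes_dnr` and `pass_rate` and the input shape (a mapping from model name to a list of token lists, one per run) are assumptions made for the example.

```python
def passes_dnr(output_tokens):
    """Return True only when the model emitted no tokens at all."""
    return len(output_tokens) == 0


def pass_rate(outputs_by_model):
    """Map each model name to its pass rate (in percent) over its runs.

    `outputs_by_model` is a hypothetical structure: model name -> list of
    runs, where each run is the list of tokens the model produced.
    """
    rates = {}
    for model, runs in outputs_by_model.items():
        if not runs:
            raise ValueError(f"no runs recorded for {model}")
        passed = sum(1 for tokens in runs if passes_dnr(tokens))
        rates[model] = 100.0 * passed / len(runs)
    return rates
```

With a single item per model, each score under this rule can only be 0.0% or 100.0%, so a model that emits even one token lands at 0.0%.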
