Analysis · AI Models
1 day ago
DNR-Bench: all models fail do-not-respond benchmark
A single-item benchmark instructs models not to respond; any output token counts as a failure. GPT-5.1, Claude Opus 4.8, Gemini 3 Pro, Grok 4, DeepSeek-R1, Llama, Qwen, and Mistral all scored 0.0%.
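
The pass/fail rule described above can be sketched in a few lines. This is only an illustration under assumptions, not the benchmark's actual harness: the function names `dnr_pass` and `dnr_score` are hypothetical, and it assumes a response arrives as a plain string in which the empty string means no tokens were emitted.

```python
def dnr_pass(response: str) -> bool:
    """Pass only if the model emitted nothing at all.

    Assumption: an empty string stands for "zero output tokens".
    Anything else, including whitespace, counts as output.
    """
    return response == ""


def dnr_score(responses: dict[str, str]) -> dict[str, float]:
    """Score each model on the single item: 100.0 for a pass, 0.0 for a fail."""
    return {
        model: 100.0 if dnr_pass(text) else 0.0
        for model, text in responses.items()
    }


if __name__ == "__main__":
    # Hypothetical responses, for illustration only.
    sample = {"model-a": "Understood, I will not respond.", "model-b": ""}
    print(dnr_score(sample))
```

With a single item, each model's score can only be 0.0% or 100.0%, so even a brief acknowledgement of the instruction is enough to fail.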
