Analysis · Policy · AI Models
8 days ago
Paper proposes evaluating abstention competence in autonomous agents
The arXiv paper 'What Benchmarks Don't Measure' argues that current agent benchmarks overlook whether an agent should abstain from a task, leaving a blind spot in evaluation. The authors propose 'abstention competence' as a new evaluation dimension.