ARBOR: Online process rewards improve LLM search agents

8 days ago

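ARBOR introduces a reusable rubric buffer that supplies online, process-level rewards to LLM-based search agents. It targets a failure mode of outcome-only rewards: in an outcome-homogeneous group, where every rollout ends with the same outcome, the reward is identical across the group and cannot distinguish better search trajectories from worse ones. Scoring the search process itself gives finer-grained supervision in those cases.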
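
To make the failure mode concrete, here is a minimal sketch of why an outcome-only reward carries no signal on an outcome-homogeneous group. It assumes rewards are normalized within each group of rollouts (a common setup in group-based policy optimization, not something the summary confirms for ARBOR). The function `group_advantages`, the reward values, and the per-rollout process scores are all hypothetical and chosen only for illustration; they are not ARBOR's rubric buffer or its scoring rule.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of rollouts.

    Assumption: group-relative normalization; the summary does not say
    which policy-optimization scheme ARBOR uses.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same query that all fail: an outcome-homogeneous group.
outcome_rewards = [0.0, 0.0, 0.0, 0.0]
outcome_adv = group_advantages(outcome_rewards)
# Every advantage is zero, so this group contributes no learning signal.

# Hypothetical process-level scores, e.g. the fraction of rubric items
# each rollout's search trajectory satisfied.
process_scores = [0.25, 0.5, 0.0, 0.75]
combined = [o + p for o, p in zip(outcome_rewards, process_scores)]
combined_adv = group_advantages(combined)
# The advantages now differ, ranking rollouts by the quality of their search.

print(outcome_adv)
print(combined_adv)
```

Under these assumptions, the outcome-only advantages are all zero, while adding a process-level term restores an ordering among rollouts that reached the same final outcome.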