Analysis · AI Models
8 days ago
ARBOR: Online process rewards improve LLM search agents
ARBOR introduces a reusable rubric buffer that supplies online, process-level rewards to LLM-based search agents. It targets a failure mode of outcome-only rewards: when every rollout in a group ends with the same outcome, the reward cannot distinguish one rollout from another. Process-level rewards supply finer-grained supervision over the search process itself.
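To illustrate why outcome-only rewards can degenerate on an outcome-homogeneous group, here is a minimal sketch. It assumes rewards are centered and scaled within each group of rollouts, which the summary does not state, and it uses made-up rubric scores. The function name `group_advantages` and all numbers are hypothetical; this is not ARBOR's implementation.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of rollouts.

    Assumption: a group-relative normalization, chosen only for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for the same query that all end with the same (wrong) answer.
outcome = [0.0, 0.0, 0.0, 0.0]

# Hypothetical process-level scores, e.g. from grading each search
# trajectory against a rubric. These values are invented.
process = [0.2, 0.8, 0.5, 0.5]

# Outcome-only: identical rewards leave nothing to compare within the group.
outcome_only = group_advantages(outcome)

# Outcome plus process: the process scores separate the rollouts.
combined = group_advantages([o + p for o, p in zip(outcome, process)])

print(outcome_only)
print(combined)
```

With identical outcomes, every outcome-only advantage is zero, so the group contributes no training signal under this normalization. Adding a process-level term restores a ranking among the rollouts even though their final outcomes match.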