AnalysisAI ModelsPolicy
7 days ago
EvalStop detects reward overoptimization in RLHF using world feedback
EvalStop uses downstream eval metrics (world feedback) to detect and correct reward overoptimization in multi-tenant RLHF platforms. It addresses the proxy divergence problem identified by Gao et al. (2023).
·
7 days ago