EvalStop detects reward overoptimization in RLHF using world feedback


7 days ago


EvalStop uses downstream eval metrics (world feedback) to detect and correct reward overoptimization in multi-tenant RLHF platforms. It addresses the proxy divergence problem identified by Gao et al. (2023).
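The detection idea can be illustrated with a small sketch. This is not EvalStop's actual implementation, which the summary does not describe. It assumes one simple rule: flag overoptimization when the proxy reward keeps rising while the downstream eval metric keeps falling for several consecutive checkpoints. The function names, the `patience` and `min_drop` parameters, and the sample numbers are all hypothetical.

```python
from typing import Optional, Sequence


def first_divergence_step(
    proxy_reward: Sequence[float],
    eval_metric: Sequence[float],
    patience: int = 3,
    min_drop: float = 0.0,
) -> Optional[int]:
    """Return the checkpoint index where divergence is confirmed, else None.

    Assumed rule (illustrative only): divergence is confirmed once the proxy
    reward has risen and the eval metric has dropped by more than `min_drop`
    for `patience` consecutive checkpoints.
    """
    if len(proxy_reward) != len(eval_metric):
        raise ValueError("proxy_reward and eval_metric must have equal length")

    streak = 0
    for i in range(1, len(proxy_reward)):
        reward_up = proxy_reward[i] > proxy_reward[i - 1]
        eval_down = (eval_metric[i - 1] - eval_metric[i]) > min_drop
        # Extend the streak only while both signals disagree; otherwise reset.
        streak = streak + 1 if (reward_up and eval_down) else 0
        if streak >= patience:
            return i
    return None


def best_eval_checkpoint(eval_metric: Sequence[float]) -> int:
    """Return the index of the checkpoint with the highest eval metric."""
    return max(range(len(eval_metric)), key=lambda i: eval_metric[i])


# Hypothetical training curves: reward climbs throughout, eval peaks early.
proxy = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
downstream = [0.5, 0.6, 0.7, 0.65, 0.6, 0.55]

stop_at = first_divergence_step(proxy, downstream, patience=3)
rollback_to = best_eval_checkpoint(downstream)
```

In this sketch, correction amounts to rolling back to the checkpoint with the best eval metric once divergence is confirmed. A real system would likely smooth noisy metrics and tune `patience` per tenant.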
