Analysis · AI Models
8 days ago
Hint-Guided Diversified Policy Optimization improves LLM reasoning
The paper introduces a method that combines hint-level guidance with diversified sampling to improve reinforcement learning with verifiable rewards (RLVR) training for LLM reasoning. The authors report significant gains on math reasoning benchmarks.