Analysis · AI Models
Jun 4, 4:00 AM
On-policy self-distillation papers improve LLM reasoning
Several recent arXiv papers propose on-policy self-distillation methods, including Self-Distilled Policy Gradient and LARK trajectory selection, to improve reasoning in large language models. One paper identifies Supervision Fidelity Decay as a key bottleneck in token-level teacher feedback.