
On-policy self-distillation papers improve LLM reasoning

Several recent arXiv papers propose on-policy self-distillation methods, including Self-Distilled Policy Gradient and LARK trajectory selection, to improve reasoning in large language models. One paper identifies Supervision Fidelity Decay as a key bottleneck in token-level teacher feedback.