On-policy distillation called lasting post-training method by Lambert


28 days ago


@natolambert.bsky.social

A LLN - large language Nathan - (RL, RLHF, society, robotics), athlete, yogi, chef.
Writes http://interconnects.ai
Prev: Ai2/Olmo, HuggingFace, Berkeley, and normal places


Nathan Lambert


On-policy distillation is on track to be a lasting method in post-training. The list of areas would be:

- Instruction tuning (SFT/IFT)
- RLHF
- Direct Preference Optimization (DPO et al)
- RLVR
- On-policy Distillation (OPD)

New classes of methods are rare! Excited to play.
