28 days ago
On-policy distillation called lasting post-training method by Lambert
Nathan Lambert
@natolambert.bsky.social
A LLN - large language Nathan - (RL, RLHF, society, robotics), athlete, yogi, chef. Writes http://interconnects.ai. Prev Ai2/Olmo, HuggingFace, Berkeley, and normal places.
On-policy distillation is on track to be a lasting method in post-training. The list of areas would be:
- Instruction tuning (SFT/IFT)
- RLHF
- Direct Preference Optimization (DPO et al)
- RLVR
- On-policy Distillation (OPD)
New classes of methods are rare! Excited to play.