8 hours ago
Skipping transformer blocks at load time with minimal impact
This technique skips entire transformer layers at load time, so the skipped weights never occupy memory and larger models can fit on limited hardware with minimal accuracy loss. The approach draws on recent papers and can be combined with quantization for greater memory savings.
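As a rough illustration of skipping blocks at load time, the sketch below filters a checkpoint's tensor names before anything is read, so the weights of skipped blocks are never loaded. It is a minimal sketch under stated assumptions, not the method from the papers: the helper `filter_skipped_blocks` is hypothetical, the `...layers.<index>...` key layout is an assumed naming scheme that real checkpoints may not follow, and which blocks are safe to skip is left to the caller.

```python
import re

# Assumed key layout: "<prefix>layers.<index>.<rest>". Real checkpoints may differ.
LAYER_KEY = re.compile(r"^(?P<prefix>.*\blayers\.)(?P<index>\d+)(?P<rest>\..*)$")


def filter_skipped_blocks(tensor_names, skip):
    """Hypothetical helper: map each kept checkpoint key to its new name.

    Returns {old_name: new_name}. Keys that belong to skipped blocks are
    absent, so a loader that reads only the returned keys never allocates
    memory for them. Kept blocks are renumbered to be contiguous from 0.
    """
    skip = set(skip)

    # Collect the block indices that actually appear in the checkpoint.
    indices = set()
    for name in tensor_names:
        match = LAYER_KEY.match(name)
        if match:
            indices.add(int(match.group("index")))

    kept = [i for i in sorted(indices) if i not in skip]
    renumber = {old: new for new, old in enumerate(kept)}

    mapping = {}
    for name in tensor_names:
        match = LAYER_KEY.match(name)
        if match is None:
            # Not part of a block (embeddings, final norm, output head): keep as-is.
            mapping[name] = name
            continue
        index = int(match.group("index"))
        if index in skip:
            continue  # dropped: this tensor is never read from disk
        mapping[name] = f"{match.group('prefix')}{renumber[index]}{match.group('rest')}"
    return mapping


# Toy checkpoint with four blocks; skip blocks 1 and 2 (an arbitrary choice).
names = (
    ["model.embed_tokens.weight"]
    + [f"model.layers.{i}.mlp.up_proj.weight" for i in range(4)]
    + ["model.norm.weight"]
)
mapping = filter_skipped_blocks(names, skip={1, 2})
for old, new in mapping.items():
    print(f"{old} -> {new}")
```

A loader would then read only the keys in `mapping` and store each tensor under its new name. The model's configured layer count would also need to match the number of kept blocks before the weights are assigned.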
