5 hours ago
User optimizes GLM5.2 inference to 50+ tok/s on GH200
A Reddit user achieved a roughly 20x inference speedup for GLM5.2, from 2.5 tok/s to over 50 tok/s, on a custom GH200 system with two H100 GPUs. The optimizations were model-level hacks specific to the Grace-Hopper architecture.
