User optimizes GLM5.2 inference to 50+ tok/s on GH200

5 hours ago

dnhkng.github.io

A Reddit user reports a 20x speedup in GLM5.2 inference, from 2.5 tok/s to over 50 tok/s, on a custom GH200 system with two H100 GPUs. The gains came from model-level optimizations specific to the Grace-Hopper architecture.
