JetSpec achieves up to 9.64x LLM inference speedup via parallel tree drafting


Jun 25, 9:55 PM


JetSpec, a speculative decoding method, achieves speedups of up to 9.64x on MATH-500 and 4.58x on open-ended chat while remaining lossless, meaning its output matches what the target model would produce without speculation. It combines causal parallel tree drafting with CUDA graph optimization to exceed 1,000 tokens per second.
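To illustrate what "lossless" means in speculative decoding generally, here is a minimal sketch of greedy draft verification. It is not JetSpec's implementation: it checks a single linear chain of draft tokens rather than a tree, and the names `verify_draft`, `target_next`, and `toy_target` are hypothetical, chosen only for this example. A real system would score all draft positions in one batched forward pass of the target model; the loop below calls the target step by step purely for readability.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: given the tokens so far, return the target
# model's greedy next token.
TargetFn = Callable[[Sequence[int]], int]


def verify_draft(prefix: Sequence[int], draft: Sequence[int],
                 target_next: TargetFn) -> List[int]:
    """Return the tokens to emit for one speculative step.

    Draft tokens are accepted while they match the target's greedy choice.
    At the first mismatch the target's own token is emitted instead, so the
    result is identical to plain greedy decoding with the target model.
    """
    context = list(prefix)
    emitted: List[int] = []
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # Reject the rest of the draft; fall back to the target's token.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        context.append(tok)
    # Every draft token matched: the target contributes one extra token.
    emitted.append(target_next(context))
    return emitted


def toy_target(context: Sequence[int]) -> int:
    """Stand-in for a target model: next token is (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # The third draft token is wrong, so two tokens are accepted and the
    # target supplies the third.
    print(verify_draft([1, 2, 3], [4, 5, 9, 0], toy_target))
```

Under these assumptions, every step emits at least one token that the target model itself would have chosen, so the speedup comes from accepting several draft tokens per verification rather than from changing the output.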
