Jun 25, 9:55 PM
JetSpec achieves up to 9.64x LLM inference speedup via parallel tree drafting
JetSpec, a speculative decoding method, achieves a 9.64x speedup on MATH-500 and a 4.58x speedup on open-ended chat with lossless accuracy. It combines causal parallel tree drafting with CUDA graph optimization to exceed 1,000 tokens per second.
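For readers unfamiliar with why speculative decoding can be lossless, the following is a minimal sketch of the generic greedy variant, not JetSpec's implementation. It assumes a linear draft of `k` tokens rather than a tree, and the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical stand-ins for real models. Every emitted token is one the target model would have chosen itself, so the output matches plain greedy decoding.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding with a linear draft of length k (illustrative only)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. Draft k tokens with the cheap model.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. Verify against the target. A real system scores all positions in
        #    one batched target pass; here the target is called per position.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token, stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all drafts matched: one extra token
        out.extend(accepted)
    return out[:limit]


# Hypothetical toy models: the draft agrees with the target except at some positions.
def toy_target(ctx: List[Token]) -> Token:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[Token]) -> Token:
    return toy_target(ctx) if len(ctx) % 3 else 0


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(speculative_decode(toy_target, toy_draft, prompt, 20) ==
          greedy_decode(toy_target, prompt, 20))
```

In this sketch the speedup would come from accepting several draft tokens per target pass; a tree-based drafter, as the summary describes, proposes multiple candidate continuations at once instead of a single chain.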
