DeepSeek V4 Pro uses novel KV cache with sliding window attention

2 hours ago

Together AI

@togethercompute

DeepSeek V4 Pro has a fundamentally different KV cache than any prior DeepSeek model. Sliding window attention, an indexer, and compression states all need to be stored correctly to get good cache reuse. To get it to run fast we didn't just rewrite the KV cache from scratch, we
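As a rough illustration of why sliding window attention changes what a cache has to store, here is a minimal Python sketch of a generic sliding-window KV cache. It is not DeepSeek's or Together AI's implementation: the class name, the `window` parameter, and the string placeholders for keys and values are all hypothetical, and the indexer and compression states mentioned in the post are not modeled.

```python
from collections import deque


class SlidingWindowKVCache:
    """Toy per-layer KV cache that keeps only the most recent `window` tokens.

    Hypothetical illustration only; real caches store tensors in paged GPU memory.
    """

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        # deque(maxlen=...) drops the oldest entry automatically once full.
        self.entries = deque(maxlen=window)
        self.tokens_seen = 0

    def append(self, key, value):
        # Store the absolute position so attention can still apply
        # position-dependent logic after older tokens are evicted.
        self.entries.append((self.tokens_seen, key, value))
        self.tokens_seen += 1

    def visible_positions(self):
        # Positions the next token is allowed to attend to.
        return [pos for pos, _, _ in self.entries]


cache = SlidingWindowKVCache(window=4)
for t in range(6):
    cache.append(key=f"k{t}", value=f"v{t}")

print(cache.visible_positions())  # [2, 3, 4, 5]
```

In this toy version, reusing a cached prefix means restoring the window contents as they stood at the end of that prefix, not just the latest entries, which hints at why the extra state has to be stored correctly for cache reuse to work.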
