MiniMax Sparse Attention reduces quadratic cost of long-context attention

Jun 17, 7:44 AM

MiniMax Sparse Attention (MSA), a method built on Grouped Query Attention (GQA), was tested inside a 109B-parameter mixture-of-experts (MoE) model trained on 3T tokens. It targets the main bottleneck of softmax attention at long contexts: compute that grows quadratically with sequence length.
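To make the quadratic growth concrete, here is a minimal counting sketch. It is not MSA itself: the brief does not describe how MSA selects keys, so the function names, the head count, and the fixed per-query key budget below are all assumptions chosen for illustration.

```python
def dense_score_entries(seq_len: int, num_query_heads: int) -> int:
    """Query-key scores computed by dense softmax attention in one layer.

    Every query position scores every key position, so the count grows
    with seq_len squared (causal masking is ignored for simplicity).
    """
    return num_query_heads * seq_len * seq_len


def budgeted_score_entries(seq_len: int, num_query_heads: int, keys_per_query: int) -> int:
    """Scores computed if each query attends to at most `keys_per_query` keys.

    Hypothetical fixed budget for illustration only; this is NOT a
    description of how MSA chooses which keys to keep.
    """
    return num_query_heads * seq_len * min(seq_len, keys_per_query)


if __name__ == "__main__":
    heads = 8       # assumed head count, not taken from the article
    budget = 2048   # assumed per-query key budget, not taken from the article
    for n in (8_192, 16_384, 32_768):
        dense = dense_score_entries(n, heads)
        sparse = budgeted_score_entries(n, heads, budget)
        print(f"seq_len={n}: dense={dense} budgeted={sparse} ratio={dense / sparse:.1f}")
```

Doubling the sequence length quadruples the dense count but only doubles the budgeted one, which is the kind of gap a sparse attention scheme is meant to exploit.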
