Kimi Linear Technical Report Released: Linear Attention Surpasses Full Attention in Multiple Scenarios, Open KDA Kernel and vLLM Integration

AI information • Admin • 10/31/2025 • 190 views

Moonshot AI has released the Kimi Linear technical report and open weights. The core components are Kimi Delta Attention (KDA), a linear attention module, and a layer-wise hybrid architecture that combines KDA layers with full-attention MLA (Multi-Head Latent Attention) layers. The technical report (submitted October 30, 2025) states that, under the same training recipe and scale, Kimi Linear outperforms a pure MLA model on short-context, long-context, and RL-style tasks. It also reports up to 75% lower key-value (KV) cache usage and up to 6x higher decoding throughput at a 1M-token context length. In addition, Moonshot AI has open-sourced the KDA kernel and provides vLLM integration and inference examples.

The Kimi-Linear-48B-A3B checkpoints (Base and Instruct) are available on Hugging Face, listed with approximately 48B total parameters, approximately 3B activated parameters, and support for a 1M-token context. The GitHub repository provides the KDA operators and the hybrid architecture implementation, and the vLLM documentation has added a KDA page and integration records. The performance and cost figures above come from the technical report and official materials; independent reproduction is still underway. When evaluating deployment, readers should measure actual throughput and latency on their own hardware, batch sizes, and prefill strategies.

Frequently Asked Questions

Q: What are the key innovations of Kimi Linear?

A: It introduces KDA, which refines Gated DeltaNet with fine-grained gating, and adopts a "hybrid linear architecture" that mixes KDA and MLA layers layer by layer to balance quality and hardware efficiency.
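To make "fine-grained gating on a delta rule" more concrete, here is a minimal single-step sketch of a gated delta-rule recurrence with one decay gate per key channel. The recurrence form, the function name `gated_delta_step`, and all parameter names are assumptions chosen for illustration; this is not the released KDA kernel, which is a chunked, hardware-optimized implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]

def gated_delta_step(
    S: Matrix, q: Vector, k: Vector, v: Vector, alpha: Vector, beta: float
) -> Tuple[Matrix, Vector]:
    """One recurrent step over a fixed-size state S of shape (d_k, d_v).

    alpha: per-key-channel decay in [0, 1] (the assumed "fine-grained" gate).
    beta:  scalar write strength for the delta-rule correction.
    """
    d_k, d_v = len(k), len(v)
    # 1) Decay each key channel (row of S) with its own gate.
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) What the decayed memory currently returns for key k.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta rule: move the stored value for k toward the new value v.
    S = [
        [S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
        for i in range(d_k)
    ]
    # 4) Read out with the query.
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o

# Usage: write a value under a key, then overwrite it under the same key.
state = [[0.0, 0.0], [0.0, 0.0]]
state, out1 = gated_delta_step(state, [1.0, 0.0], [1.0, 0.0], [2.0, 3.0], [1.0, 1.0], 1.0)
state, out2 = gated_delta_step(state, [1.0, 0.0], [1.0, 0.0], [5.0, 7.0], [0.5, 1.0], 1.0)
```

The state has a fixed size regardless of how many tokens have been processed, which is the property that lets linear-attention layers avoid a growing per-token cache.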

Q: How does it improve compared to full attention?

A: The report states that overall quality is better under the same training recipe, that KV cache usage is reduced by up to 75%, and that decoding throughput is up to 6x higher at a 1M-token context. These are measurements given in the official report.
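The "up to 75%" figure is consistent with simple layer counting, if one assumes that only the full-attention layers keep a per-token KV cache while linear-attention layers keep a fixed-size state. The sketch below shows that arithmetic for an assumed mix of three linear layers per full-attention layer; the ratio and the helper name `kv_cache_fraction` are illustrative assumptions, not figures taken from this article.

```python
from fractions import Fraction

def kv_cache_fraction(linear_layers: int, full_layers: int) -> Fraction:
    """Fraction of per-token KV cache kept versus an all-full-attention stack.

    Assumes every full-attention layer stores the same per-token cache and
    linear-attention layers store none (their state does not grow with context).
    """
    total = linear_layers + full_layers
    if total <= 0 or linear_layers < 0 or full_layers < 0:
        raise ValueError("layer counts must be non-negative and not both zero")
    return Fraction(full_layers, total)

# Assumed mix: 3 linear-attention layers for every 1 full-attention layer.
kept = kv_cache_fraction(linear_layers=3, full_layers=1)
saved = 1 - kept
print(f"cache kept: {float(kept):.0%}, saved: {float(saved):.0%}")
```

Under these assumptions the saving depends only on the layer ratio, not on context length; actual memory use also depends on the fixed-size linear-attention states and on the serving framework.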

Q: Has it been open-sourced?

A: Yes. Moonshot AI has open-sourced the KDA kernel and the vLLM implementation, and has released open weights (Base and Instruct). These are available on Hugging Face and GitHub.

Q: Can it directly replace existing full-attention inference?

A: It is officially positioned as a "drop-in replacement", but the actual benefit depends on model size, batch size, GPU architecture, and serving framework. A/B validation on the target workload is recommended.
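One way to set up such an A/B comparison is to run the same prompts through both deployments and compare generated tokens per second. The sketch below is a generic harness under stated assumptions: `generate` is a hypothetical callable that runs one request against a deployment and returns the number of tokens it produced, and the function names are illustrative rather than part of any serving framework.

```python
import time
from typing import Callable, Sequence

def decode_throughput(
    generate: Callable[[str], int],
    prompts: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return generated tokens per second over all prompts.

    `generate` is a hypothetical stand-in for one request to a deployment;
    it must return the number of tokens generated for that prompt.
    """
    start = clock()
    tokens = sum(generate(p) for p in prompts)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return tokens / elapsed

def speedup(
    candidate: Callable[[str], int],
    baseline: Callable[[str], int],
    prompts: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Throughput of the candidate divided by throughput of the baseline."""
    return (
        decode_throughput(candidate, prompts, clock)
        / decode_throughput(baseline, prompts, clock)
    )
```

For a meaningful result, use prompts and output lengths that match production traffic, run a warm-up pass first, and repeat the measurement at each batch size of interest.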

Q: What integrations and resources are available?

A: vLLM has integrated KDA support; Hugging Face hosts the model cards and a collection page; the paper is published on arXiv; and there is an official announcement post with a summary of key points.
