Kimi Linear Technical Report Released: Linear Attention Surpasses Full Attention in Multiple Scenarios, with Open-Source KDA Kernel and vLLM Integration


Moonshot AI has released the Kimi Linear technical report and open weights. The release centers on two components: Kimi Delta Attention (KDA), a linear attention module, and a layerwise hybrid architecture that combines KDA layers with full-attention (MLA) layers. The report (submitted October 30, 2025) states that, under the same training recipe and scale, Kimi Linear outperforms a pure MLA baseline on short-context, long-context, and RL-style tasks. It also reports up to 75% lower key-value (KV) cache usage and up to 6x higher decoding throughput at a 1M-token context length. Alongside the report, Moonshot AI open-sourced the KDA kernel and provided vLLM integration and inference examples.

Hugging Face hosts the Kimi-Linear-48B-A3B checkpoints (Base and Instruct), listed with approximately 48B total parameters, approximately 3B activated parameters, and support for a 1M-token context. The GitHub repository provides the KDA operators and the hybrid architecture implementation, and the vLLM documentation has added a KDA page and integration notes. The performance and cost figures above come from the technical report and official materials; independent reproductions are still underway. When evaluating deployment, readers should verify actual throughput and latency on their own hardware, batch sizes, and prefill strategies.

Frequently Asked Questions

Q: What are the key innovations of Kimi Linear?

A: It introduces KDA, which refines Gated DeltaNet with finer-grained gating, and adopts a "hybrid linear architecture" that interleaves KDA and MLA layers to balance quality and hardware efficiency.
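
To make "fine-grained gating" concrete, here is a minimal sketch of one recurrent step of a gated delta rule with a separate decay per key channel. This article does not give the equations, so the update form used here, S_t = (I - beta k k^T) Diag(alpha) S_{t-1} + beta k v^T with output o_t = S_t^T q_t, is an assumption, and the function name `kda_step` is hypothetical. It is plain, unoptimized Python for illustration, not the released kernel.

```python
def kda_step(S, q, k, v, alpha, beta):
    """One step of a gated delta rule with per-channel decay (illustrative only).

    S:     d_k x d_v state matrix, as a list of rows
    q, k:  query and key vectors of length d_k
    v:     value vector of length d_v
    alpha: per-channel decay gates of length d_k (assumed form of the gating)
    beta:  scalar write strength
    """
    d_k, d_v = len(S), len(S[0])
    # 1) Decay: each key channel i forgets at its own rate alpha[i].
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read what the decayed state currently predicts for key k.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta rule: move the stored value for k toward v by a factor beta.
    S = [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
         for i in range(d_k)]
    # 4) Output: query the updated state.
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o


# Writing (k, v) into an empty state and querying with q = k returns v.
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, o1 = kda_step(S0, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0],
                  alpha=[1.0, 1.0], beta=1.0)
# o1 == [2.0, 3.0]
```

The state `S` keeps the same d_k x d_v shape no matter how many tokens have been processed, which is why a layer of this kind needs no per-token KV cache.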

Q: How does it improve compared to full attention?

A: The report states that overall quality is better under the same training recipe, that KV cache usage drops by up to 75%, and that decoding throughput rises by up to 6x at a 1M-token context. These are measurements from the official report.
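
A rough way to see where a figure like 75% can come from: only full-attention layers keep a per-token KV cache, so the cache shrinks in proportion to the share of layers that are linear. The sketch below assumes three linear layers for every full-attention layer and uses made-up layer counts and byte sizes; none of these numbers come from this article, and the small fixed-size state of the linear layers is ignored.

```python
def kv_cache_bytes(cached_layers, tokens, bytes_per_token_per_layer):
    """Size of a KV cache that grows linearly with the number of tokens."""
    return cached_layers * tokens * bytes_per_token_per_layer


# Hypothetical model: 32 layers, 1M-token context, 1 KiB of cache per token per layer.
TOTAL_LAYERS = 32
TOKENS = 1_000_000
BYTES_PER_TOKEN_PER_LAYER = 1024

# Assumed hybrid layout: 3 linear layers per full-attention layer,
# so only 1 in 4 layers keeps a per-token cache.
full_attention_layers = TOTAL_LAYERS // 4

baseline = kv_cache_bytes(TOTAL_LAYERS, TOKENS, BYTES_PER_TOKEN_PER_LAYER)
hybrid = kv_cache_bytes(full_attention_layers, TOKENS, BYTES_PER_TOKEN_PER_LAYER)

reduction = 1 - hybrid / baseline
print(f"KV cache reduction: {reduction:.0%}")  # prints "KV cache reduction: 75%"
```

Under these assumptions the reduction depends only on the layer ratio, not on context length or per-token size, which is consistent with the "up to 75%" wording.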

Q: Has it been open-sourced?

A: Yes. Moonshot AI has open-sourced the KDA kernel and the vLLM implementation and released open weights (Base and Instruct), available on Hugging Face and GitHub.

Q: Can it directly replace existing full-attention inference?

A: It is officially positioned as a "drop-in replacement", but the actual benefit depends on model size, batch size, GPU architecture, and serving framework. A/B validation on the target workload is recommended.
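
One possible shape for such an A/B check is sketched below. It times two interchangeable `generate` callables on the same prompt and compares median decode throughput. The callables are placeholders (assumptions): in practice each would wrap a real serving stack, for example one endpoint running the full-attention baseline and one running Kimi Linear, and return the number of new tokens produced.

```python
import statistics
import time


def decode_throughput(generate, prompt, max_new_tokens, runs=3):
    """Median tokens/second over several runs.

    `generate(prompt, max_new_tokens)` is a caller-supplied function that
    runs generation and returns the number of new tokens it produced.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_new_tokens)
        # Guard against a zero reading from a very fast stand-in function.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(produced / elapsed)
    return statistics.median(rates)


def speedup(candidate_tps, baseline_tps):
    """How many times faster the candidate decodes than the baseline."""
    return candidate_tps / baseline_tps


def fake_backend(prompt, max_new_tokens):
    """Stand-in for a real backend; replace with calls into your own stack."""
    return max_new_tokens


baseline_tps = decode_throughput(fake_backend, "hello", 128)
candidate_tps = decode_throughput(fake_backend, "hello", 128)
print(f"speedup: {speedup(candidate_tps, baseline_tps):.2f}x")
```

Run the comparison at the context lengths and batch sizes you actually serve, since the reported gains are stated for a 1M-token context and may be smaller for short prompts.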

Q: What integrations and resources are available?

A: vLLM has integrated KDA support; Hugging Face provides the model cards and a collection page; and the paper is published on arXiv, accompanied by an official announcement post and a summary of key points.
