Qwen3-Next-80B-A3B launched: a 3B-activated ultra-sparse MoE that sets a new benchmark for long-context throughput



Qwen3-Next-80B-A3B has 80B total parameters, of which only about 3B are activated per token. It adopts a hybrid architecture (Gated DeltaNet + Gated Attention), an ultra-sparse MoE (512 experts, 10 routed + 1 shared), and Multi-Token Prediction, and it is available in Instruct and Thinking versions.


1. Quick Summary

1. Core Parameters and Positioning

Qwen3-Next-80B-A3B offers the capacity of an 80B-parameter model while activating only about 3B parameters per token through an extremely sparse MoE. For long contexts above 32K tokens, it emphasizes high throughput and low latency, which makes it suitable for retrieval-augmented generation and multi-document workflows.

2. Architecture Highlights

The hybrid design combines Gated DeltaNet with Gated Attention, and a gated router selects 10 routed experts plus 1 shared expert out of 512. Multi-Token Prediction (MTP) works together with speculative decoding to improve generation efficiency and stability. The A3B approach delivers the cost-effectiveness of "large total parameter count, small activation".
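The "10 routed + 1 shared out of 512" selection can be illustrated with a minimal top-k routing sketch. This is not the model's actual router: the softmax over only the selected logits is an assumed normalization, and `route_token` and `routed_fraction` are hypothetical names used for illustration.

```python
import math
import random

NUM_EXPERTS = 512  # routed experts, as stated in the article
TOP_K = 10         # routed experts selected per token, as stated in the article


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight)] for the top_k routed experts.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1. The shared expert is not listed because it
    runs for every token regardless of the router output.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    # Indices of the top_k largest logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def routed_fraction(num_experts=NUM_EXPERTS, top_k=TOP_K):
    """Fraction of routed experts that run for a single token."""
    return top_k / num_experts


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    selection = route_token(logits)
    print(len(selection), "routed experts + 1 shared expert")
    print(f"routed experts active per token: {routed_fraction():.2%}")
    print(f"parameters active per token: {3 / 80:.2%}")
```

With these figures, roughly 2% of the routed experts and about 3.75% of the total parameters take part in each token's forward pass, which is where the throughput advantage comes from.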

3. Performance benchmarking

According to official figures, the training cost is roughly an order of magnitude lower than that of Qwen3-32B, and inference throughput in 32K+ scenarios is significantly higher. The Instruct version comes close to the 235B flagship, and the Thinking version is benchmarked against mainstream chain-of-thought models on reasoning and long-context tasks.


2. Implementation and use

1. High-value scenarios

(1) Long-document RAG and retrieval Q&A: rely on long context and high throughput to process large blocks of knowledge

(2) Multi-turn business assistants: cross-file instructions and tasks that mix tables and code

(3) Batch processing and offline generation: MTP and sparse routing improve throughput and reduce cost

2. Deployment and tuning suggestions

(1) Tier the KV cache and batch requests in parallel, optimizing the 32K and 64K tiers first

(2) Partition tensors across devices according to expert routing to reduce bandwidth hotspots

(3) Prompt tracking: maintain retrieval, code, and chain-of-thought templates separately
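One way to act on the tiering suggestion is to group requests by context-length tier before batching, so that requests of similar length share a batch. The sketch below is an illustration only: the exact tier boundaries, the `max_batch` value, and the function names are assumptions, not settings from any serving framework.

```python
from collections import defaultdict

# Hypothetical tier boundaries in tokens; the article names 32K and 64K.
TIERS = (32_768, 65_536)


def tier_for(num_tokens, tiers=TIERS):
    """Return the smallest tier that fits the request, or None if too long."""
    for limit in tiers:
        if num_tokens <= limit:
            return limit
    return None


def bucket_requests(requests, max_batch=4, tiers=TIERS):
    """Group (request_id, num_tokens) pairs into per-tier batches.

    Returns (batches, rejected): batches maps tier -> list of batches,
    and rejected lists the ids of requests longer than the largest tier.
    """
    per_tier = defaultdict(list)
    rejected = []
    for request_id, num_tokens in requests:
        tier = tier_for(num_tokens, tiers)
        if tier is None:
            rejected.append(request_id)
        else:
            per_tier[tier].append(request_id)
    # Split each tier's queue into batches of at most max_batch requests.
    batches = {
        tier: [ids[i:i + max_batch] for i in range(0, len(ids), max_batch)]
        for tier, ids in per_tier.items()
    }
    return batches, rejected
```

Keeping tiers separate lets cache memory be sized per tier instead of sizing every batch for the longest request.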

3. Migration and evaluation checklist

(1) Establish Qwen3-32B and Qwen3-235B baselines and use a single, shared evaluation script

(2) Measure quality, throughput, and cost as three separate dimensions, and record how context length affects performance

(3) Gradual (canary) rollout: switch high-concurrency long-context scenarios first, then gradually extend to general dialog
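The three-dimension comparison in the checklist can be recorded with a small structure like the one below. It is a sketch under stated assumptions: `RunResult` and `compare` are hypothetical names, and the quality score, throughput, and cost values are expected to come from your own evaluation script and billing data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    model: str
    context_tokens: int        # context length used for this run
    quality: float             # task score from your own evaluation script
    tokens_per_second: float   # measured throughput
    cost_per_1k_tokens: float  # measured or billed cost


def compare(baseline, candidate):
    """Return the relative change of candidate vs. baseline per dimension.

    Positive quality and throughput are improvements; negative cost is
    an improvement. Runs must use the same context length so that the
    effect of context length is tracked separately.
    """
    if baseline.context_tokens != candidate.context_tokens:
        raise ValueError("compare runs at the same context length")

    def rel(new, old):
        return (new - old) / old

    return {
        "quality": rel(candidate.quality, baseline.quality),
        "throughput": rel(candidate.tokens_per_second,
                          baseline.tokens_per_second),
        "cost": rel(candidate.cost_per_1k_tokens,
                    baseline.cost_per_1k_tokens),
    }
```

Running `compare` once per context length (for example 8K, 32K, and 64K) gives the per-length record that item (2) asks for.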


3. Risk control and compliance

1. Cost and quota

(1) Set call quotas and budget alarms per tenant and per project

(2) Move large batch tasks to offline batch processing to reduce peak-time overhead

(3) Monitor per-request token usage and KV-cache hit rate to avoid hidden waste
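A per-tenant quota with a budget alarm, plus a KV-cache hit-rate metric, might look like the following minimal sketch. The class name, the 80% alarm threshold, and the status strings are illustrative assumptions rather than part of any billing API.

```python
from dataclasses import dataclass


@dataclass
class TenantBudget:
    token_quota: int          # hard limit on tokens per billing period
    alarm_ratio: float = 0.8  # assumed threshold: alarm at 80% of quota
    used: int = 0

    def record(self, tokens):
        """Record usage and return 'ok', 'alarm', or 'rejected'."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.token_quota:
            return "rejected"  # rejected calls are not counted as usage
        self.used += tokens
        if self.used >= self.alarm_ratio * self.token_quota:
            return "alarm"
        return "ok"


def kv_hit_rate(cached_tokens, prompt_tokens):
    """Share of prompt tokens served from a KV or prefix cache."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0
```

A persistently low hit rate on long, repeated prompts is one sign of the hidden waste mentioned in item (3).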

2. Observability and quality regression

(1) Require that chain-of-thought and cited-evidence summaries be retained

(2) Enable manual sampling and rollback for critical channels

(3) Version locking: model,

3. Licensing and data security

(1) Comply with the license terms of the model weights and the API

(2) Access enterprise data with least privilege and enable audit logs

(3) Configure filtering and manual review of sensitive output content


Frequently Asked Questions (Q&A)

Q: What are the advantages of A3B and the ultra-sparse MoE in Qwen3-Next-80B-A3B?

A: A3B means that of the 80B total parameters, only about 3B are activated in each forward pass. Together with routing to 10+1 of the 512 experts, this yields higher throughput and lower cost, which suits 32K+ long-context workloads and batch processing.

Q: How do I choose between this model, Qwen3-32B, and Qwen3-235B?

A: For cost-effectiveness and long-context efficiency, choose Qwen3-Next-80B-A3B. For flagship requirements that demand absolute peak quality and maximum context, consider the 235B first. Stable existing production lines can stay on the 32B for now as a control baseline.

Q: How do Multi-Token Prediction and speculative decoding work in engineering practice?

A: After enabling MTP, use a larger parallel decoding window and monitor the rejection rate. Combined with speculative decoding, actual latency can be reduced further, but the effect on quality should be checked for each task.
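The rejection rate mentioned in this answer can be tracked alongside an idealized estimate of how many tokens each verification step yields. The sketch below uses the common simplifying assumption that each drafted token is accepted independently with the same probability; it is not the model's actual MTP implementation, and the function names are hypothetical.

```python
def rejection_rate(proposed, accepted):
    """Fraction of drafted tokens that the verifier rejected."""
    if proposed <= 0:
        raise ValueError("proposed must be positive")
    if not 0 <= accepted <= proposed:
        raise ValueError("accepted must be between 0 and proposed")
    return 1.0 - accepted / proposed


def expected_tokens_per_step(acceptance, draft_len):
    """Expected tokens emitted per verification step.

    Assumes each drafted token is accepted independently with
    probability `acceptance`. Every step emits at least one token,
    because the verifier supplies a token when a draft is rejected.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)
```

Under this model, widening the decoding window only pays off while the acceptance rate stays high, which is why the rejection rate is worth monitoring per task.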

Q: What is the difference between the Instruct and Thinking versions?

A: Instruct is geared toward instruction following and general tasks. Thinking strengthens chain-of-thought reasoning, which makes it more stable in planning and tool use and better suited to complex retrieval and long multi-step tasks.
