On October 4, 2025, Qwen announced in its code repository the release of two new multimodal models, Qwen3-VL-30B-A3B-Instruct and Qwen3-VL-30B-A3B-Thinking, and published FP8-quantized versions at the same time. The larger Qwen3-VL-235B-A22B had already been released in September and also began to offer FP8 variants. The 30B-A3B models use a Mixture-of-Experts (MoE) architecture with approximately 3B active parameters per forward pass. The goal is to significantly improve throughput and deployment efficiency while retaining the capabilities of Qwen3-VL. Official channels claim that the models are competitive with GPT-5-Mini and Claude 4 Sonnet on STEM, VQA, OCR, video understanding, and agent tasks, and are "often leading" on some benchmarks, but independent evaluation is still pending.
Qwen Chat currently offers the models as a selectable option, and the weights and quantized versions are available on HuggingFace and ModelScope. The API page also lists the model series. Note that the release post and repository changelog are official information, and some of the performance comparisons are self-reported by the vendor. Until third parties replicate the results, claims that the models "match or exceed" competitors should not be treated as established. For teams focused on cost and deployment, the FP8 version aims to reduce memory and bandwidth usage and improve throughput, but the actual benefit depends on the hardware and inference stack. It is advisable to run A/B tests on the target dataset and inference scenario before switching production traffic.
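The recommended A/B test can be sketched as a small harness that runs the same labeled samples through two variants and compares accuracy and latency. This is a minimal sketch, not an official evaluation tool: `evaluate`, `compare`, the stub models, and the exact-match scoring are all hypothetical, and a real test would call the actual inference endpoints and use a task-appropriate metric.

```python
import time
from typing import Callable, Sequence


def evaluate(model: Callable[[str], str], samples: Sequence[tuple[str, str]]) -> dict:
    """Run one variant over (prompt, expected) pairs; report accuracy and mean latency."""
    correct = 0
    start = time.perf_counter()
    for prompt, expected in samples:
        # Exact-match scoring is an assumption; swap in your own metric.
        if model(prompt).strip() == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(samples),
        "mean_latency_s": elapsed / len(samples),
    }


def compare(baseline: Callable[[str], str], candidate: Callable[[str], str],
            samples: Sequence[tuple[str, str]]) -> dict:
    """Evaluate both variants on the same samples and report the accuracy gap."""
    base = evaluate(baseline, samples)
    cand = evaluate(candidate, samples)
    return {
        "baseline": base,
        "candidate": cand,
        "accuracy_delta": cand["accuracy"] - base["accuracy"],
    }


# Hypothetical stand-ins for the current production model and the FP8 variant.
samples = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]


def baseline_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "9"}[prompt]


def fp8_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "6"}[prompt]


report = compare(baseline_model, fp8_model, samples)
print(report)
```

Running both variants on identical samples keeps the comparison paired, so a difference in accuracy reflects the model change and not a difference in the data.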
Frequently Asked Questions
Q: When was Qwen3-VL-30B-A3B released?
A: According to the news entry in the official repository, the release date is October 4, 2025; related blog posts and model cards are being updated progressively from that day onward.
Q: What does "about 3B active parameters" mean?
A: This is a property of the MoE (Mixture of Experts) architecture. The complete model has about 30B parameters, but only about 3B of them are activated in each forward pass, which helps improve cost-effectiveness and throughput.
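The figures in the answer above can be turned into a rough back-of-the-envelope calculation. This is a sketch under stated assumptions: the parameter counts are the rounded numbers from the text, and the "2 FLOPs per active parameter per token" figure is a common rule of thumb, not a measurement of this model.

```python
# Back-of-the-envelope MoE arithmetic using the rounded figures quoted in the text.
TOTAL_PARAMS = 30e9   # ~30B total parameters (rounded)
ACTIVE_PARAMS = 3e9   # ~3B parameters active per forward pass (rounded)


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that participate in one forward pass."""
    return active / total


def approx_flops_per_token(active: float) -> float:
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter per token."""
    return 2.0 * active


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
moe_flops = approx_flops_per_token(ACTIVE_PARAMS)
dense_flops = approx_flops_per_token(TOTAL_PARAMS)

print(f"active fraction: {fraction:.0%}")
print(f"compute relative to a dense 30B model: {moe_flops / dense_flops:.0%}")
```

The saving is in compute per token. In a typical deployment all of the roughly 30B parameters still have to be held in memory, because any expert may be selected.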
Q: What is the FP8 version for?
A: FP8 quantization targets inference efficiency and resource usage. In principle, it can reduce GPU memory and bandwidth requirements and improve throughput. The actual benefit depends on the hardware and the implementation.
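To see where the memory saving comes from, weight storage can be estimated from bytes per parameter. This is a weights-only sketch that assumes the rounded 30B figure and that every weight is stored in the named format; real FP8 checkpoints may keep some layers in higher precision, and the KV cache, activations, and runtime overhead are not counted.

```python
# Weights-only memory estimate; ignores KV cache, activations, and runtime overhead.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


TOTAL_PARAMS = 30e9  # rounded figure from the text; assumed fully quantized

bf16_gb = weight_memory_gb(TOTAL_PARAMS, "bf16")
fp8_gb = weight_memory_gb(TOTAL_PARAMS, "fp8")

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
```

Halving the bytes per weight also halves the data read from memory per token, which is why bandwidth-bound inference can see throughput gains on hardware with native FP8 support.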
Q: Is the comparison with GPT-5-Mini and Claude 4 Sonnet credible?
A: This is the vendor's own claim. It is not yet backed by sufficient third-party replication or public benchmark details, so it should be treated as a marketing claim. It is advisable to wait for independent evaluation.
Q: Where can I try the models or obtain the weights?
A: Qwen Chat provides online trials, while HuggingFace and ModelScope host the model weights and quantized versions. Enterprises can access the model series through the Alibaba Cloud Model Studio API.