Qwen has officially announced that its vision-language model, Qwen3-VL, is now natively supported in llama.cpp, and a full range of GGUF weights has been released, covering model sizes from 2B to 235B. The models can run directly on CPU, CUDA, Metal, Vulkan, and other backends. Download links are available on Hugging Face and ModelScope, so users can choose a quantization level that suits their device and precision requirements.
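As a rough illustration of how device memory constrains the choice of quantization, the sketch below estimates weight-file size from a nominal parameter count. The bits-per-weight figures follow from the block layouts of the F16, Q8_0, and Q4_0 GGUF formats; the function names are hypothetical, and the estimate ignores the vision projector, KV cache, and file metadata, so real files will differ somewhat.

```python
# Bits per weight for a few GGUF formats, derived from their block layouts.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,  # 32 int8 weights plus one fp16 scale per block
    "Q4_0": 4.5,  # 32 4-bit weights plus one fp16 scale per block
}


def estimate_gib(params_billion: float, quant: str) -> float:
    """Rough size of the weights in GiB (hypothetical helper, estimate only)."""
    total_bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


def quants_that_fit(params_billion: float, budget_gib: float) -> list[str]:
    """List the formats whose estimated size fits within the memory budget."""
    return [
        quant
        for quant in BITS_PER_WEIGHT
        if estimate_gib(params_billion, quant) <= budget_gib
    ]


if __name__ == "__main__":
    # 2B and 235B are the two ends of the released range.
    for size in (2, 235):
        for quant in BITS_PER_WEIGHT:
            print(f"{size}B {quant}: ~{estimate_gib(size, quant):.1f} GiB")
```

The budget passed to `quants_that_fit` should leave headroom for the context cache and the image encoder, which this estimate does not model.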
The llama.cpp pull request has been merged into the main repository, adding support for loading and running inference with Qwen3-VL (both Dense and MoE variants); the Qwen repository and documentation have also been updated with guidance on local execution and GGUF usage. Taken together, the update delivers the full set of "official announcement + weight release + inference framework support", lowering the barrier to deploying multimodal large models on edge and personal devices.
Frequently Asked Questions
Q: What exactly does this update include?
A: Qwen3-VL support has been merged into the llama.cpp main branch; the Qwen team has also released GGUF weights from 2B to 235B and provided a collection page that makes it easy to download the weights and choose a quantization.
Q: On which hardware can it run?
A: According to the official announcement, it supports backends including CPU, NVIDIA CUDA, Apple Metal, and Vulkan, which covers common desktop and laptop environments.
Q: Where do I get the weights?
A: Both Hugging Face and ModelScope provide Qwen3-VL collections and corresponding GGUF repositories.
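For the Hugging Face route, fetching a single quantization could look roughly like the sketch below. It uses the `huggingface_hub` functions `list_repo_files` and `hf_hub_download`; the helper names are hypothetical, and the assumption that the vision projector file has "mmproj" in its name (and should be skipped when picking the main weights) is a common llama.cpp convention rather than something confirmed here for these repositories. The repository ID is left for the reader to take from the collection page.

```python
def pick_gguf(filenames, quant):
    """Return the first .gguf file matching the quant tag, skipping mmproj files.

    Hypothetical helper: assumes the quantization tag appears in the file name.
    """
    for name in filenames:
        lower = name.lower()
        if not lower.endswith(".gguf") or "mmproj" in lower:
            continue
        if quant.lower() in lower:
            return name
    return None


def download_quant(repo_id, quant):
    """Download one quantized GGUF file and return its local path."""
    # Imported lazily so pick_gguf can be used without the package installed.
    from huggingface_hub import hf_hub_download, list_repo_files

    name = pick_gguf(list_repo_files(repo_id), quant)
    if name is None:
        raise FileNotFoundError(f"no {quant} GGUF file found in {repo_id}")
    return hf_hub_download(repo_id=repo_id, filename=name)
```

Multimodal inference in llama.cpp also needs the projector file, so a second `hf_hub_download` call for that file would normally follow.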
Q: How can the merge status be confirmed?
A: The llama.cpp PR is marked "Merged". You can view the changes and commit history in the main repository.
Q: Is there a usage guide?
A: The Qwen documentation and repository provide instructions for running the model locally with llama.cpp and using GGUF files, covering model acquisition and startup examples.
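As an illustration of what a startup command can look like, the sketch below builds and runs an invocation of llama.cpp's `llama-mtmd-cli` multimodal tool from Python. This is an assumption about one workable setup, not a reproduction of the commands in the Qwen documentation; the function names and all file paths are hypothetical placeholders, and the binary must already be built and on `PATH`.

```python
import shutil
import subprocess


def build_command(model, mmproj, image, prompt, binary="llama-mtmd-cli"):
    """Assemble the argument list for a single image-plus-prompt run."""
    return [
        binary,
        "-m", model,          # main GGUF weights
        "--mmproj", mmproj,   # vision projector GGUF
        "--image", image,     # input image
        "-p", prompt,         # text prompt
    ]


def describe_image(model, mmproj, image, prompt="Describe this image."):
    """Run the command and return its standard output."""
    cmd = build_command(model, mmproj, image, prompt)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} not found on PATH; build llama.cpp first")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths: substitute the files you actually downloaded.
    print(describe_image("model.gguf", "mmproj.gguf", "photo.jpg"))
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or punctuation.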