Tongyi Qianwen has launched the next-generation visual language model, the Qwen3-VL . The flagship Qwen3-VL-235B-A22B is available in two open-source versions: Instruct and Thinking . Official materials show that Instruct outperforms the Gemini 2.5 Pro on multiple visual benchmarks, while Thinking achieves leading results in multimodal reasoning tasks. The model supports "visual agents" that can interpret buttons, invoke tools, and complete real-world tasks on PC/mobile interfaces; it has performed exceptionally well in benchmarks such as OS World .
This upgrade emphasizes coverage of long context and complex scenarios: It supports over 256KB of context, expandable to 1MB , and can process approximately two hours of video and multi-page PDFs. It also offers OCR in 32 languages (with enhanced robustness against blurry, skewed, and rare characters), and provides more robust performance in 2D/3D spatial understanding, occlusion, and viewpoint reasoning. Regarding the open ecosystem, online conversation (Qwen Chat), API (Alibaba Cloud Model Studio), and Hugging Face/ModelScope weights and demos have all been released simultaneously.
Frequently Asked Questions
Q: Which variants are open sourced this time?
A: Qwen3-VL-235B-A22B Instruction and Thinking , also provides Caption/demonstration resources and reasoning examples.
Q: What can a visual agent do?
A: Read screen elements and hierarchies, understand buttons and forms, and use tool calls to complete tasks on real devices/applications.
Q: How large is the long context supported?
A: It is marked as 256K+ and can be expanded to 1M level, which is suitable for long video and long document scenarios.
Q: What is the coverage of multi-language capabilities?
A: It supports OCR in 32 languages, and its text capabilities are aligned with top general models for cross-language screen reading and comprehension.
Q: How to experience or access?
A: For Qwen Chat, choose qwen3-vl-plus . Alibaba Cloud Model Studio provides the API. Weights and demos are available in Hugging Face/ModelScope.