1. Abstract
ERNIE-4.5-VL-28B-A3B-Thinking is Baidu's new open-source, lightweight multimodal reasoning model. It has 28B total parameters, of which about 3B are activated per token. It focuses on semantic alignment between vision and language and on "Thinking with Images", which lets the model zoom in and out of an image to attend to details. The model is released under the Apache-2.0 license, which permits commercial use. According to Baidu, it outperforms Gemini-2.5-Pro and GPT-5-High on benchmarks such as document and chart understanding (these claims should be confirmed through reproducible experiments).
2. Core Features
1. MoE architecture with 3B activated parameters: improves performance on complex tasks while keeping inference costs under control.
2. Thinking with Images: multi-scale zooming and detail inspection improve table reading, OCR, and layout understanding.
3. Long document/table analysis: optimized for document Q&A and for extracting table and chart elements.
4. Open for commercial use: the Apache-2.0 license makes it straightforward for enterprises to deploy the model and build on it.
5. Training and alignment toolchain: ships with ERNIEKit, which covers SFT, LoRA, DPO, and other workflows.
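The multi-scale zooming in item 2 can be pictured as cropping the same image at several scales around a region of interest. The sketch below only illustrates that general idea as a preprocessing step; this article does not describe the model's internal mechanism, and the function name `zoom_boxes` and the default scales are hypothetical.

```python
def zoom_boxes(image_size, center, scales=(1.0, 0.5, 0.25)):
    """Return one (left, top, right, bottom) crop box per scale.

    Each box covers `scale` times the image width and height, is centered
    on `center` where possible, and is shifted to stay inside the image.
    """
    width, height = image_size
    cx, cy = center
    boxes = []
    for scale in scales:
        w, h = round(width * scale), round(height * scale)
        # Clamp the top-left corner so the box never leaves the image.
        left = min(max(cx - w // 2, 0), width - w)
        top = min(max(cy - h // 2, 0), height - h)
        boxes.append((left, top, left + w, top + h))
    return boxes


if __name__ == "__main__":
    # Zoom toward a small legend near the top-right corner of a 1000x800 chart.
    for box in zoom_boxes((1000, 800), (900, 100)):
        print(box)
```

The boxes follow the (left, upper, right, lower) convention, so they could be passed to an image library's crop call if you want to inspect what each zoom level would show.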
3. Installation
1. Model acquisition: pull the weights and examples from Hugging Face or ModelScope.
2. Environment: PaddlePaddle with ERNIEKit is preferred. You can also refer to the published spaces/examples for inference.
3. Fine-tuning: LoRA and SFT are available out of the box in ERNIEKit; choose low-rank (LoRA) or full-parameter fine-tuning according to the available GPU memory.
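For step 1, one way to pull the weights from Hugging Face is the `huggingface_hub` package. This is a minimal sketch, assuming the package is installed (`pip install huggingface_hub`); the repository ID is taken from the project address in this article, while the local directory name is an arbitrary choice.

```python
# Repository ID from the project address listed in this article.
REPO_ID = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"


def download_weights(local_dir="./ernie-4.5-vl-28b-a3b-thinking"):
    """Download the full model repository and return the local path.

    `local_dir` is an arbitrary, illustrative directory name.
    """
    # Imported lazily so this file can be loaded without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Usage (downloads tens of gigabytes, so run it deliberately):
#     path = download_weights()
```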
4. Typical Use Cases
1. Document Q&A and layout understanding: structured extraction from invoices, compliance documents, and manuals.
2. Chart understanding: automatically identify axes, legends, and data series, then generate summaries and conclusions.
3. Enterprise knowledge retrieval: combined with RAG, multimodal search and question answering over images and PDFs.
4. Risk control and quality inspection: bill comparison, image-text consistency checks, and element verification.
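As an illustration of the bill comparison in item 4, the sketch below checks fields that a model has extracted from a bill against a reference record. It assumes the extraction step has already produced a flat dictionary; the field names, the tolerance, and the function name `compare_fields` are all hypothetical.

```python
def compare_fields(extracted, reference, numeric_tolerance=0.01):
    """Return {field: (extracted_value, reference_value)} for every mismatch.

    Numeric fields match when they differ by at most `numeric_tolerance`;
    all other fields must be equal. A field missing on one side shows up
    with None on that side.
    """
    mismatches = {}
    for field in sorted(set(extracted) | set(reference)):
        got, expected = extracted.get(field), reference.get(field)
        both_numeric = isinstance(got, (int, float)) and isinstance(expected, (int, float))
        if both_numeric:
            if abs(got - expected) > numeric_tolerance:
                mismatches[field] = (got, expected)
        elif got != expected:
            mismatches[field] = (got, expected)
    return mismatches


if __name__ == "__main__":
    # Hypothetical model output versus a hypothetical ledger entry.
    invoice = {"invoice_no": "INV-001", "total": 1280.50, "currency": "CNY"}
    ledger = {"invoice_no": "INV-001", "total": 1208.50, "currency": "CNY", "tax_id": "X1"}
    print(compare_fields(invoice, ledger))
```

Keeping the comparison outside the model makes the verification step deterministic and auditable, whichever model performs the extraction.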
5. Ecosystem and Competing Products
1. Ecosystem: a unified GitHub repository, an online demo on AI Studio, and releases on ModelScope and Hugging Face.
2. Competitors: Qwen2.5-VL, Llama-3.2-Vision, InternVL2.5, and others. ERNIE's differentiators are the inference efficiency of its 3B activated parameters and "Thinking with Images". Actual results should be confirmed by reproducing them in your own scenario.
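Reproducing results in your own scenario can start with a small harness that runs each candidate model over the same evaluation set. The sketch below is one possible shape, not an official benchmark: each `predict` callable is a hypothetical wrapper around whatever inference stack you use, and exact match after normalization is only one of several reasonable metrics.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predict, eval_set):
    """Fraction of examples whose normalized prediction equals the normalized reference answer."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(
        normalize(predict(ex["document"], ex["question"])) == normalize(ex["answer"])
        for ex in eval_set
    )
    return hits / len(eval_set)


def compare_models(models, eval_set):
    """Score every model in `models` (name -> predict callable) on the same eval set."""
    return {name: exact_match_accuracy(predict, eval_set) for name, predict in models.items()}


if __name__ == "__main__":
    # Hypothetical examples and stub models; replace them with real data and real inference calls.
    eval_set = [
        {"document": "invoice_001.png", "question": "What is the total?", "answer": "1280.50"},
        {"document": "chart_007.png", "question": "Which series peaks in Q3?", "answer": "Revenue"},
    ]
    models = {
        "stub_a": lambda document, question: "1280.50",
        "stub_b": lambda document, question: "revenue " if "chart" in document else "n/a",
    }
    print(compare_models(models, eval_set))
```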
6. Limitations and Precautions
1. Benchmark claims need to be reproduced: comparisons with closed-source models or under different evaluation settings may not be aligned, so reported numbers can deviate.
2. Memory and latency: Thinking mode increases the number of inference steps and therefore latency.
3. Multilingual coverage: Chinese and English performance is relatively stable; other languages need additional evaluation.
4. Compliance and data security: adding data masking and access control is recommended for privacy-sensitive documents.
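For the masking recommended in item 4, a simple starting point is to redact obvious identifiers from extracted text before it is logged or stored. The patterns below are illustrative assumptions, not a complete PII policy, and they cover text only; masking regions inside the images themselves would need a separate step.

```python
import re

# Illustrative patterns only; real deployments need rules matched to their own data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{11,19}\b")  # e.g. phone, card, or account numbers


def mask_text(text):
    """Replace e-mail addresses with a tag and keep only the last 4 digits of long numbers."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub(
        lambda m: "*" * (len(m.group()) - 4) + m.group()[-4:],
        text,
    )


if __name__ == "__main__":
    print(mask_text("Contact li@example.com or call 13800138000"))
```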
7. Project Address
https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking
8. FAQ
Q: Does the ERNIE-4.5-VL-28B-A3B-Thinking license allow commercial use?
A: Yes. It is licensed under Apache-2.0 and can be used in commercial applications.
Q: How does "Thinking with Images" help with table/chart comprehension?
A: Multi-scale zooming and detail tracking improve the recognition and association of small print, fine lines, and annotations.
Q: What toolchain is recommended for inference?
A: PaddlePaddle with ERNIEKit is recommended; fine-tuning is available through LoRA, SFT, and DPO.
Q: How should I choose between this model and models like Qwen2.5-VL?
A: If inference cost and document/chart scenarios are your priorities, evaluate this model first; then validate the choice on your own business dataset.
Q: Does it support local, private deployment?
A: Yes. Pull the weights locally and fine-tune as needed; sufficient GPU memory and inference optimization are required.
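To gauge how much memory counts as sufficient, a back-of-the-envelope estimate of weight storage helps. The sketch below uses the 28B total and roughly 3B activated parameter counts from this article and standard bytes-per-parameter values for each precision. It assumes every expert is kept in memory, as is typical for MoE inference without offloading, and it ignores activations, KV cache, and framework overhead, so treat the numbers as a lower bound rather than a sizing guide.

```python
TOTAL_PARAMS = 28e9   # total parameters ("28B" in the model name)
ACTIVE_PARAMS = 3e9   # approximate parameters activated per token ("A3B")

# Storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}


def active_fraction(total=TOTAL_PARAMS, active=ACTIVE_PARAMS):
    """Share of parameters used for each token; this drives compute, not memory."""
    return active / total


def weight_memory_gb(num_params, precision):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.1%}")
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(TOTAL_PARAMS, precision):.0f} GB for weights")
```

The gap between the two figures is the point of the MoE design: per-token compute scales with the roughly 3B activated parameters, while memory still has to accommodate all 28B.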