Back to AI is open source
Open Source Commercially Available Multimodal Rationale Model: ERNIE-4.5-VL-28B-A3B-Thinking Analysis

Open Source Commercially Available Multimodal Rationale Model: ERNIE-4.5-VL-28B-A3B-Thinking Analysis

AI is open source Admin 102 views

1. Abstract

ERNIE-4.5-VL-28B-A3B-Thinking is Baidu's new open source lightweight multimodal reasoning model, with 28B of total parameters and about 3B of activation, focusing on the semantic alignment of vision and language and the ability of "Thinking with Images", and supporting zooming/narrowing attention to details. The model is licensed under Apache-2.0 and is commercially available. Officially, it is better than Gemini-2.5-Pro and GPT-5-High in benchmarks such as document and graph understanding (conclusions are subject to reproducible experiments).

2. Core Features

1. 3B activates the MoE architecture: improves the performance of complex tasks while keeping inference costs controllable.

2. Image thinking: multi-scale zooming/browsing details to improve table reading, OCR, and layout understanding.

3. Long document/table analysis: Optimized for document Q&A, table and chart element extraction scenarios.

4. Open for commercial use: Apache-2.0 license, which is convenient for enterprises to implement and develop again.

5. Training and alignment toolchain: Equipped with ERNIEKit, covering SFT, LoRA, DPO and other processes.

3. Installation

1. Model acquisition: Pull weights and examples from Hugging Face or ModelScope.

2. Environment: Prefer to use PaddlePaddle and ERNIEKit. You can also refer to spaces/examples for reasoning.

3. Fine-tuning: LoRA/SFT is available out of the box in ERNIEKit, and you can choose a low-rank or full solution according to the video memory.

4. Typical use cases

1. Document Q&A and layout understanding: structured extraction of invoices, compliance documents, and manuals.

2. Chart understanding: Automatically identify coordinates/legends/data series, and generate summaries and conclusions.

3. Enterprise knowledge retrieval: Combined with RAG, multi-modal search and answer on images and PDFs.

4. Risk control and quality inspection: bill comparison, graphic consistency and element verification.

5. Ecosystem and Competing Products

1. Ecosystem: GitHub unified repository, AI Studio online experience, ModelScope and HF release.

2. Competitors: Qwen2.5-VL, Llama-3.2-Vision, InternVL2.5, etc.; ERNIE's point of difference is the inference efficiency of 3B-activated inference versus "image thinking". The actual effect is subject to the reproduction of the scene.

6. Limitations and precautions

1. The benchmark statement needs to be reproduced: there is a risk of deviation from the alignment with closed-source/different evaluation settings.

2. Memory and delay: Thinking mode increases the number of inference steps and delay.

3. Multilingual coverage: Chinese/English performance is relatively stable, and other languages need to be evaluated additionally.

4. Compliance and data security: It is recommended to add masking and access control to privacy-related documents.

7. Project address

 https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking

8. FAQ

Q: Is ERNIE-4.5-VL-28B-A3B-Thinking licensed commercially?

A: It is licensed under Apache-2.0 and can be used for commercial applications.

Q: How does Thinking Images help with table/chart comprehension?

A: Through multi-scale enlargement and detail tracking, the recognition and association of small print/fine lines/annotations are improved.

Q: What toolchain is recommended for inference?

A: PaddlePaddle + ERNIEKit is recommended; Fine-tuning available with LoRA/SFT/DPO.

Q: How to choose compared to models like Qwen2.5-VL?

A: If you pay attention to inference costs and document/chart scenarios, you can give priority to evaluating this model; Finally, validate with a business set.

Q: Is it supported for local privatization deployment?

A: Yes, local pull weights and fine-tune as needed; Sufficient video memory and inference optimization need to be prepared.

ERNIE4.5VL28B lightweight multimodal model ERNIE4.5 image thinking and reading ability ERNIE4.5Apache 2.0 commercial license ERNIE4.5 triple B activates the MoE architecture ERNIE4.5 Visual Language Semantic Alignment ERNIE4.5 Long Document Table Analysis ERNIE4.5 document Q&A layout understanding ERNIE4.5 Chart coordinate legend recognition ERNIE4.5 small print details enlarge ERNIE4.5 enterprise-level compliance landing ERNIE4.5 supports PaddlePaddle inference ERNIE4.5 comes with ERNIEKit training ERNIE4.5 LoRA trim is available out of the box ERNIE4.5SFT aligns the process with the DPO ERNIE4.5RAG multimodal retrieval Q&A ERNIE4.5 bill and invoice information extraction ERNIE4.5OCR layout structure ERNIE4.5 risk control quality inspection conformity verification ERNIE4.5PDF image unified analysis ERNIE4.5 table elements are automatically extracted ERNIE4.5 Chart Data Series Understanding ERNIE4.5 web search multimodal combination ERNIE4.5 vs. QwenVL comparison review ERNIE4.5 vs. LlamaVision ERNIE4.5 and InternVL differences ERNIE4.5 outperforms closed-source benchmark claims ERNIE4.5 reproducible experiments to be verified ERNIE4.5 inference cost delay evaluation ERNIE4.5 memory occupancy and deployment ERNIE4.5 Local Privatization Deployment Guidelines ERNIE4.5 Multilingual Coverage Capability Assessment ERNIE4.5 performed solidly in Chinese and English ERNIE4.5 Enterprise Scenario Application Case ERNIE4.5 Knowledge Base Q&A Practice ERNIE4.5 model weight acquisition path ERNIE4.5HuggingFace model page ERNIE4.5ModelScope was released simultaneously ERNIE4.5AIStudio online experience ERNIE4.5 image enlargement and reduction inference ERNIE4.5 Document Diagram Joint Understanding ERNIE4.5 model training alignment toolchain ERNIE4.5 low-level fine-tuning memory friendly ERNIE4.5 Multi-Scale Detail Tracking Strategy ERNIE4.5 table chart summary generation ERNIE4.5 compliance and data security recommendations ERNIE4.5 Privacy Document Desensitization ERNIE4.5 is compared with Gemini ERNIE4.5 is compared with GPT series ERNIE4.5 is for enterprise secondary development ERNIE4.5 open-source protocol uses boundaries ERNIE4.5 business set effect verification

Recommended Tools

More