PaddleOCR-VL-1.5 Open Source Interpretation: How the 0.9B Multimodal Model Handles Bent and Distorted Documents

AI Open Source, by Admin

1. Abstract

PaddleOCR-VL-1.5 is an open-source, 0.9B-parameter multimodal document model from PaddlePaddle. It provides integrated capabilities spanning layout localization, reading order, and structured parsing of text, tables, and formulas, and targets real capture conditions such as bent, distorted, or tilted pages, photographs of screens, and complex lighting. Officially published results report high accuracy on OmniDocBench v1.5 and Real5-OmniDocBench, which makes the model a candidate for document understanding and high-quality data extraction.

2. Core features

  1. Polygon/irregular region localization: Multi-point polygons replace rigid rectangular boxes and fit the boundaries of text and other elements more closely under curvature and perspective distortion.
  2. Seal and signature recognition: Adds recognition of seal/official-stamp elements, which suits structured extraction from government and enterprise materials and compliance scenarios.
  3. Cross-page logic and global semantics: Supports document-level understanding, such as merging tables that span pages and associating titles with their hierarchy, which helps restore the semantics of long documents.
  4. Multi-task parsing: Covers text, tables, formulas, charts, and other elements, and produces end-to-end document parsing output (such as Markdown/JSON).
  5. Lightweight and high throughput: 0.9B parameters keep deployment costs manageable; the official material reports end-to-end throughput figures on the A100 for batch document processing.
  6. Multilingual: The official material describes broad multilingual coverage, including lower-resource languages such as Tibetan and Bengali.
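
To make the polygon point in item 1 concrete, here is a minimal sketch that compares the area of a four-point polygon with the area of its axis-aligned bounding box. The coordinates are invented for illustration and are not output from PaddleOCR-VL-1.5; the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(points: List[Point]) -> float:
    """Area of a simple polygon, computed with the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box_area(points: List[Point]) -> float:
    """Area of the axis-aligned rectangle that encloses all points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical skewed text line (a parallelogram), as might appear on a tilted page.
skewed_line = [(0.0, 0.0), (100.0, 20.0), (100.0, 40.0), (0.0, 20.0)]

poly_area = polygon_area(skewed_line)
box_area = bounding_box_area(skewed_line)
fill_ratio = poly_area / box_area  # share of the box that is actually text region

print(f"polygon={poly_area}, box={box_area}, fill ratio={fill_ratio:.2f}")
```

In this made-up case the rectangle covers twice the area of the polygon, so half of the box would contain neighboring content rather than the text line itself.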

3. Installation

  1. Online trial: Use the ModelScope online demo to upload images or PDFs and quickly check parsing quality on bent or distorted pages, screen photographs, and similar scenes.
  2. Local deployment: Clone the PaddleOCR repository, install dependencies and model resources according to the official documentation, and prefer Docker to reduce environment differences.
  3. Inference acceleration: When high throughput is required, use an inference backend such as FastDeploy for service deployment and batch acceleration, and tune the batch queue and concurrency parameters.
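
The batch and concurrency tuning mentioned in item 3 can be sketched independently of any backend. The example below is only an illustration: `fake_parse_batch` is a placeholder standing in for whatever client call your inference service exposes, and `batch_size` and `max_workers` are the two knobs you would tune against your own hardware.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_batches(
    pages: Sequence[str],
    parse_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
    max_workers: int = 2,
) -> List[str]:
    """Send batches to parse_batch concurrently and keep the input order."""
    batches = list(batched(pages, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        results = list(pool.map(parse_batch, batches))
    return [item for batch in results for item in batch]


def fake_parse_batch(batch: List[str]) -> List[str]:
    """Placeholder for a real inference call (an assumption, not a PaddleOCR API)."""
    return [f"parsed:{page}" for page in batch]


pages = [f"page_{i}.png" for i in range(5)]
parsed = run_batches(pages, fake_parse_batch, batch_size=2, max_workers=2)
print(parsed)
```

Threads are used here on the assumption that the real work happens in a remote service or on the GPU; a CPU-bound local step such as PDF rendering may call for processes instead.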

4. Typical use cases

  1. Structuring complex scans: Convert images/PDFs of contracts, bills, papers, reports, and similar documents into usable structured Markdown/JSON.
  2. Cross-page table and table-of-contents restoration: Automatically merge cross-page tables and organize content by heading level to improve the readability and retrievability of long documents.
  3. Seal element extraction: In material verification, risk control, and archiving, extract the seal region and key information and feed them into rule checks or manual review.
  4. Document RAG data pipeline: Preserve paragraphs, tables, page numbers, and element coordinates to improve retrieval recall, citation positioning, and answer traceability.

5. Ecosystem and competing products

  1. Ecosystem: PaddleOCR provides a complete toolchain from document rendering and layout analysis to structured output, making it easy to implement batch processing and online services.
  2. Competing products: General-purpose multimodal large models and traditional OCR/document parsing solutions each have their strengths; PaddleOCR-VL-1.5 is distinguished by combining parsing of real-world distorted documents with multi-task capability at a smaller parameter count. The relative merits of different solutions depend on the data distribution and evaluation settings, so run regression tests on your own samples before choosing.
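
One simple way to run the regression test suggested above is to compute a character error rate (CER) between your own ground-truth text and each candidate's output. The sketch below assumes plain-text comparison only; the sample pairs and the 0.05 threshold are invented for illustration, and table or layout quality would need separate metrics.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def mean_cer(samples: List[Tuple[str, str]]) -> float:
    """Average CER over (reference, hypothesis) pairs."""
    return sum(cer(ref, hyp) for ref, hyp in samples) / len(samples)


# Hypothetical (ground truth, parser output) pairs from your own documents.
samples = [
    ("Invoice No. 1024", "Invoice No. 1024"),
    ("Total: 350.00", "Total: 850.00"),
]
score = mean_cer(samples)
MAX_ACCEPTABLE_CER = 0.05  # assumed acceptance threshold, tune for your use case
print(f"mean CER = {score:.4f}, pass = {score <= MAX_ACCEPTABLE_CER}")
```

Running the same sample set against every candidate solution, with identical rendering and preprocessing, keeps the comparison fair.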

6. Limitations and precautions

  1. Cross-page merging and hierarchy inference can merge incorrectly: For documents with highly irregular layouts or strong interference from headers and footers, add rule-based verification and sampled review.
  2. Seal recognition is strongly business-specific: Seal styles vary widely across regions and organizations, so supplementing with domain data and threshold strategies is recommended.
  3. Throughput and cost depend on the rendering and inference pipeline: PDF rendering DPI, batch size, concurrency, and backend implementation significantly affect speed and cost.
  4. Marketing claims and comparisons need careful interpretation: When you see comparative conclusions against closed-source general models, check whether the evaluation set, prompts, and input processing were consistent.
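
The DPI effect in item 3 is easy to quantify. The following sketch computes the pixel dimensions of an A4 page (210 mm x 297 mm) at different rendering resolutions; it illustrates only the growth in input size, not the measured speed of any particular backend.

```python
from typing import Tuple

MM_PER_INCH = 25.4
A4_MM = (210.0, 297.0)  # ISO 216 A4 page size in millimetres


def rendered_size(dpi: int, page_mm: Tuple[float, float] = A4_MM) -> Tuple[int, int]:
    """Pixel width and height of a page rendered at the given DPI."""
    width_mm, height_mm = page_mm
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def pixel_count(dpi: int) -> int:
    """Total number of pixels in the rendered page."""
    width, height = rendered_size(dpi)
    return width * height


for dpi in (150, 300):
    w, h = rendered_size(dpi)
    print(f"{dpi} DPI -> {w} x {h} px ({pixel_count(dpi) / 1e6:.1f} MP)")

ratio = pixel_count(300) / pixel_count(150)
print(f"300 DPI carries {ratio:.1f}x the pixels of 150 DPI")
```

Doubling the DPI quadruples the pixel count per page, so rendering time, memory, and image preprocessing cost grow much faster than the DPI figure itself suggests.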

7. Project address

https://github.com/PaddlePaddle/PaddleOCR

8. Frequently asked questions

Q: Is PaddleOCR-VL-1.5 suitable for OCR on bent and distorted documents?

A: It is officially positioned for scanning distortion, perspective distortion, and screen photography, and provides irregular-region localization and end-to-end parsing; verifying with your own real captured samples is recommended.

Q: How do I build a high-precision document RAG with PaddleOCR-VL-1.5?

A: Prioritize structured output (such as Markdown/JSON) and retain heading levels, table structure, reading order, page numbers, and coordinates. Then split the output by paragraph/table block, load the chunks into your index, and create traceable references.
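
A minimal sketch of that chunking step follows. The input block schema (`type`, `level`, `text`, `page`, `bbox`) is a hypothetical simplification chosen for this example, not the model's actual JSON format, so map your real parser output onto it before reuse.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]


@dataclass
class Chunk:
    chunk_id: str
    kind: str                # "paragraph" or "table"
    text: str
    page: int
    bbox: BBox               # element coordinates on the page
    heading_path: List[str]  # headings above this chunk, outermost first

    def citation(self) -> str:
        """Human-readable reference for answer traceability."""
        section = " > ".join(self.heading_path) or "(no heading)"
        return f"{self.chunk_id} | page {self.page} | {section}"


def build_chunks(doc_id: str, blocks: List[Dict]) -> List[Chunk]:
    """Turn parsed blocks (assumed to be in reading order) into indexable chunks."""
    chunks: List[Chunk] = []
    heading_path: List[str] = []
    for block in blocks:
        if block["type"] == "title":
            # Keep ancestors above this level, then replace the current level.
            level = block["level"]
            heading_path = heading_path[: level - 1] + [block["text"]]
            continue
        chunks.append(Chunk(
            chunk_id=f"{doc_id}#{len(chunks)}",
            kind=block["type"],
            text=block["text"],
            page=block["page"],
            bbox=tuple(block["bbox"]),
            heading_path=list(heading_path),
        ))
    return chunks


# Hypothetical parsed blocks for a two-page contract.
blocks = [
    {"type": "title", "level": 1, "text": "Contract", "page": 1, "bbox": [50, 40, 500, 80]},
    {"type": "paragraph", "text": "This agreement is made between the parties.",
     "page": 1, "bbox": [50, 100, 500, 160]},
    {"type": "title", "level": 2, "text": "Fees", "page": 2, "bbox": [50, 40, 500, 70]},
    {"type": "table", "text": "| Item | Price |\n|---|---|\n| Service | 100 |",
     "page": 2, "bbox": [50, 90, 500, 200]},
]
chunks = build_chunks("contract-001", blocks)
for chunk in chunks:
    print(chunk.citation())
```

Storing the page number, bounding box, and heading path next to each chunk's text is what lets a retrieved answer point back to an exact location in the source document.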

Q: What should I do if cross-page table merging is unstable?

A: In the post-processing stage, add consistency checks (column count, header similarity, page number adjacency), and send low-confidence samples to manual review or fall back to per-page parsing.
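
The three checks named in this answer can be sketched as a single gate that runs before two table fragments are merged. The dictionary fields and the 0.8 similarity threshold are assumptions for illustration; the header comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def header_similarity(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1] between two header rows, compared as joined strings."""
    return SequenceMatcher(None, " | ".join(a).lower(), " | ".join(b).lower()).ratio()


def should_merge(prev: Dict, nxt: Dict, min_header_similarity: float = 0.8) -> bool:
    """Decide whether nxt looks like a continuation of prev on the following page."""
    if nxt["page"] - prev["page"] != 1:
        return False  # pages are not adjacent
    if prev["n_cols"] != nxt["n_cols"]:
        return False  # column counts differ
    if not nxt["header"]:
        # Assumption: continuation fragments often omit the header row entirely.
        return True
    return header_similarity(prev["header"], nxt["header"]) >= min_header_similarity


# Hypothetical table fragments extracted from consecutive pages.
page3 = {"page": 3, "n_cols": 3, "header": ["Item", "Qty", "Price"]}
page4 = {"page": 4, "n_cols": 3, "header": ["Item", "Qty", "Price"]}
page4_other = {"page": 4, "n_cols": 3, "header": ["Name", "Date", "Signature"]}
page6 = {"page": 6, "n_cols": 3, "header": ["Item", "Qty", "Price"]}

print(should_merge(page3, page4))        # same header on the next page
print(should_merge(page3, page4_other))  # different header
print(should_merge(page3, page6))        # pages not adjacent
```

Pairs that fail the gate can be left unmerged and parsed per page, or queued for manual review, as the answer recommends.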

Q: What should I do if the throughput does not meet the official data?

A: Check PDF rendering time, input resolution, batch size and concurrency, GPU utilization, and whether the officially recommended inference backend and parameters are in use. Any stage of the end-to-end pipeline can become the bottleneck.
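
Before tuning anything, it helps to measure where the time goes. The sketch below is a small per-stage timer built on the standard library; the stage names and the `time.sleep` calls are stand-ins for your real rendering and inference steps.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises an exception.
            self.totals[name] += time.perf_counter() - start

    def slowest(self) -> str:
        """Name of the stage with the largest accumulated time."""
        return max(self.totals, key=lambda name: self.totals[name])


timer = StageTimer()
for _ in range(2):                 # two hypothetical pages
    with timer.stage("render"):
        time.sleep(0.01)           # stand-in for PDF rendering
    with timer.stage("inference"):
        time.sleep(0.05)           # stand-in for the model call

for name, seconds in timer.totals.items():
    print(f"{name}: {seconds:.3f}s")
print("bottleneck:", timer.slowest())
```

If rendering dominates, lowering the DPI or parallelizing the CPU side is likely to help more than changing the inference backend.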

Q: Does it support Tibetan, Bengali, and other languages?

A: Official sources describe multilingual coverage that includes Tibetan, Bengali, and others; before launch, targeted sampling and acceptance testing on the target language is still recommended.

