In the ever-evolving world of OCR (Optical Character Recognition), a groundbreaking model has emerged—GOT-OCR 2.0, which promises to lead OCR technology into a new era. Within just ten days of its release on Hugging Face, it has amassed 122K downloads, signaling a significant leap in open-source OCR technology.
This article dives into the details of GOT-OCR 2.0, including its framework, performance in various complex OCR scenarios, and a step-by-step guide on how to deploy and test it.
GOT-OCR 2.0 Introduction
The General OCR Theory (GOT) proposed by stepfun-ai forms the basis of GOT-OCR 2.0, setting the stage for the next generation of OCR technology. It is designed to handle a variety of optical signals beyond simple text, including tables, formulas, charts, geometric shapes, and even music scores.
Key Features:
- Model Parameters: 580M parameters.
- Architecture: A high-compression encoder coupled with a long-context decoder for end-to-end processing.
- Unified Processing: Capable of handling various document styles, such as text, tables, equations, and visual charts.
- Input Flexibility: Supports different image styles, including both sliced and full-page formats.
- Output Versatility: Can generate plain text or formatted results, such as LaTeX or Markdown for equations and tables.
- Interactive Capabilities: Allows region-specific recognition, guided by coordinates or colors.
- Dynamic Resolution: Dynamically crops ultra-high-resolution images and supports multi-page OCR tasks, boosting practicality in real-world applications.
The model has been tested extensively and demonstrates excellent performance across diverse OCR tasks, making it one of the most robust solutions available for text, formula, and complex document recognition.
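The interactive, region-specific mode is driven by a box or color prompt passed alongside the image. As a minimal client-side sketch (the `'[x1,y1,x2,y2]'` string format for the `ocr_box` argument is an assumption drawn from the repo's demo scripts, not an official spec), a validator for such a prompt might look like:

```python
# Hypothetical helper: validate and parse the box prompt used for
# region-specific recognition. The '[x1,y1,x2,y2]' string format is
# an assumption based on the GOT-OCR2.0 demo code.
def parse_ocr_box(box: str) -> tuple[int, int, int, int]:
    coords = [int(v) for v in box.strip('[]').split(',')]
    if len(coords) != 4:
        raise ValueError(f'expected 4 coordinates, got {len(coords)}')
    x1, y1, x2, y2 = coords
    if x2 <= x1 or y2 <= y1:
        raise ValueError('box must satisfy x1 < x2 and y1 < y2')
    return x1, y1, x2, y2
```

A parsed box such as `parse_ocr_box('[10,20,300,400]')` could then be handed to the model's `chat` method as the region prompt.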
GOT-OCR 2.0 Framework
The overall architecture of GOT-OCR 2.0 is divided into three main stages:
- Stage 1: The visual encoder is pre-trained on OCR tasks with a small OPT-125M language decoder attached.
- Stage 2: The pre-trained visual encoder is connected to the Qwen-0.5B decoder, and the resulting GOT model is trained on extensive OCR-2.0 knowledge.
- Stage 3: The decoder is fine-tuned for new recognition features without altering the visual encoder.
To equip the model for the full range of OCR-2.0 tasks, six rendering tools were used to synthesize training data:
- LaTeX: For table rendering.
- Mathpix-markdown-it: For rendering math formulas and molecular structures.
- Tikz: For simple geometric shapes.
- Verovio: For music notation.
- Matplotlib: For charts.
- Pyecharts: For more complex chart rendering.
This architecture allows GOT-OCR 2.0 to handle the diverse needs of OCR 2.0 tasks, making it an incredibly versatile solution for modern document recognition.
Performance of GOT-OCR 2.0
Pure Text OCR Performance
In tests on pure text documents, GOT-OCR 2.0 exhibited advanced performance, especially in recognizing text from PDF files, proving its strong document text recognition capabilities.
Scene Text OCR Performance
The model was tested using a custom dataset of 400 natural images split evenly between Chinese and English text. The evaluation metrics included:
- Edit Distance
- F1 Score
- Accuracy
- Recall
- BLEU Score
- METEOR Score
GOT-OCR 2.0 excelled in these tests, showcasing its proficiency in recognizing both document text and scene text across different languages.
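The edit-distance and F1 metrics above can be reproduced for any prediction/reference pair with a few lines of standard-library Python. This is a generic sketch of the two metrics, not the authors' exact evaluation script:

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def char_f1(pred: str, ref: str) -> float:
    """Character-level F1 using the multiset overlap of characters."""
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `edit_distance('kitten', 'sitting')` is 3, and a perfect prediction scores a character F1 of 1.0.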
Deploying GOT-OCR 2.0: A Step-by-Step Guide
1. Install the Required Environment
```python
# Install the extra dependencies required by GOT-OCR 2.0
!pip install tiktoken==0.7.0
!pip install verovio==4.3.1

import numpy as np
import pandas as pd
import transformers
import accelerate
import torchvision
```
2. Download the Model Weights
The model's parameters are relatively lightweight at 580M, so the weights can be loaded directly in float32:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
ocr_model = AutoModel.from_pretrained(
    'ucaslcl/GOT-OCR2_0',
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map='cuda',
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
)
ocr_model = ocr_model.eval().cuda()
```
3. Run Inference with Sample Code
Here is a basic example of using GOT-OCR 2.0 for text recognition:
```python
image_file = 'sample_image.png'
res = ocr_model.chat(tokenizer, image_file, ocr_type='format')
print(res)
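For more than one image, the `chat` call can be wrapped in a small batch loop. The sketch below is an illustrative wrapper, not part of the GOT-OCR API; it takes the recognition function as a parameter, so the real invocation would be something like `run_ocr_batch(lambda p: ocr_model.chat(tokenizer, p, ocr_type='format'), paths)`:

```python
from pathlib import Path
from typing import Callable

def run_ocr_batch(recognize: Callable[[str], str],
                  image_paths: list[str]) -> dict[str, str]:
    """Run a recognition callable over many images, skipping missing files."""
    results: dict[str, str] = {}
    for path in image_paths:
        if not Path(path).exists():
            results[path] = ''   # missing file: record an empty result
            continue
        results[path] = recognize(path)
    return results
```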
4. Complex Scenario Testing
Case 1: Elementary School Math Test Recognition
GOT-OCR 2.0 correctly recognized math symbols and equations, demonstrating its accuracy in handling educational content.
Case 2: Complex Math Equations
The model successfully converted complex equations into LaTeX format, proving its strength in technical content recognition.
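A quick sanity check on recognized LaTeX, such as confirming that braces and `\begin`/`\end` environments are balanced, can catch garbled conversions before rendering. This is a generic post-processing sketch, unrelated to the model itself:

```python
import re

def latex_output_ok(src: str) -> bool:
    """Cheap sanity check for recognized LaTeX: balanced braces and
    matching \\begin{...}/\\end{...} environment names, in order."""
    if src.count('{') != src.count('}'):
        return False
    begins = re.findall(r'\\begin\{(\w+)\}', src)
    ends = re.findall(r'\\end\{(\w+)\}', src)
    return begins == ends
```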
Case 3: Visual Chart Recognition
Though some issues were noted with chart outputs, GOT-OCR 2.0 was still able to extract accurate numerical data.
Case 4: Complex Table Recognition
The model effectively recognized table structures and content, outputting results in Markdown and HTML formats.
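As a post-processing illustration (not part of GOT-OCR 2.0 itself), a recognized Markdown table can be converted to HTML with a few lines of standard-library Python, assuming well-formed input with a header row and a `|---|` separator row:

```python
def markdown_table_to_html(md: str) -> str:
    """Convert a simple pipe-delimited Markdown table to an HTML table.
    Assumes well-formed input: header row, separator row, then data rows."""
    rows = [line.strip().strip('|').split('|')
            for line in md.strip().splitlines()]
    header, body = rows[0], rows[2:]   # rows[1] is the |---|---| separator
    html = ['<table>',
            '<tr>' + ''.join(f'<th>{c.strip()}</th>' for c in header) + '</tr>']
    for row in body:
        html.append('<tr>' + ''.join(f'<td>{c.strip()}</td>' for c in row) + '</tr>')
    html.append('</table>')
    return '\n'.join(html)
```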
Case 5: High School Literature Test Recognition
The model performed well in recognizing complex text documents such as high school exam papers.
Conclusion
GOT-OCR 2.0 represents a significant step forward in OCR technology. Its ability to handle various complex recognition tasks, including text, equations, tables, and charts, sets it apart from other open-source OCR models. Whether you're working with educational materials, technical content, or complex layouts, GOT-OCR 2.0 provides a high-performance, flexible solution.
For those looking to explore this powerful tool, you can find it on GitHub or try out the model through the provided demo links.
Project Links
- GitHub: https://github.com/Ucas-HaoranWei/GOT-OCR2.0/
- Hugging Face Weights: https://hf-mirror.com/stepfun-ai/GOT-OCR2_0
- Paper: https://arxiv.org/abs/2409.01704