In the ever-evolving world of OCR (Optical Character Recognition), a groundbreaking model has emerged—GOT-OCR 2.0, which promises to lead OCR technology into a new era. Within just ten days of its release on Hugging Face, it has amassed 122K downloads, signaling a significant leap in open-source OCR technology.
This article dives into the details of GOT-OCR 2.0, including its framework, performance in various complex OCR scenarios, and a step-by-step guide on how to deploy and test it.
GOT-OCR 2.0 Introduction
The General OCR Theory (GOT) proposed by stepfun-ai forms the basis of GOT-OCR 2.0, setting the stage for the next generation of OCR technology. It is designed to handle a variety of optical signals beyond simple text, including tables, formulas, charts, geometric shapes, and even music scores.
Key Features:
- Model Parameters: 580M parameters.
- Architecture: A high-compression encoder coupled with a long-context decoder for end-to-end processing.
- Unified Processing: Capable of handling various document styles, such as text, tables, equations, and visual charts.
- Input Flexibility: Supports different image styles, including both sliced and full-page formats.
- Output Versatility: Can generate plain text or formatted results, such as LaTeX or Markdown for equations and tables.
- Interactive Capabilities: Allows region-specific recognition, guided by coordinates or colors.
- Dynamic Resolution: Dynamically crops ultra-high-resolution images and supports multi-page OCR tasks, boosting practicality in real-world applications.
The model has been tested extensively and demonstrates excellent performance across diverse OCR tasks, making it one of the most robust solutions available for text, formula, and complex document recognition.
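The interactive, region-specific mode is driven by a box or color prompt passed alongside the image. As a minimal client-side sketch (the `'[x1,y1,x2,y2]'` string format for the `ocr_box` argument is an assumption drawn from the repo's demo scripts, not an official spec), a validator for such a prompt might look like:

```python
# Hypothetical helper: validate and parse the box prompt used for
# region-specific recognition. The '[x1,y1,x2,y2]' string format is
# an assumption based on the GOT-OCR2.0 demo code.
def parse_ocr_box(box: str) -> tuple[int, int, int, int]:
    coords = [int(v) for v in box.strip('[]').split(',')]
    if len(coords) != 4:
        raise ValueError(f'expected 4 coordinates, got {len(coords)}')
    x1, y1, x2, y2 = coords
    if x2 <= x1 or y2 <= y1:
        raise ValueError('box must satisfy x1 < x2 and y1 < y2')
    return x1, y1, x2, y2
```

A parsed box such as `parse_ocr_box('[10,20,300,400]')` could then be handed to the model's `chat` method as the region prompt.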
GOT-OCR 2.0 Framework
The overall architecture of GOT-OCR 2.0 is divided into three main stages:
- Stage 1: The visual encoder is pre-trained on OCR tasks with a small OPT-125M language decoder attached.
- Stage 2: The pre-trained visual encoder is connected to the Qwen-0.5B decoder, and the resulting GOT model is trained on extensive OCR-2.0 knowledge.
- Stage 3: The decoder is fine-tuned for new recognition features without altering the visual encoder.
To equip the model for the full range of OCR-2.0 tasks, six rendering tools were used to synthesize training data:
- LaTeX: For table rendering.
- Mathpix-markdown-it: For rendering math formulas and molecular structures.
- Tikz: For simple geometric shapes.
- Verovio: For music notation.
- Matplotlib: For charts.
- Pyecharts: For more complex chart rendering.
This architecture allows GOT-OCR 2.0 to handle the diverse needs of OCR 2.0 tasks, making it an incredibly versatile solution for modern document recognition.
Performance of GOT-OCR 2.0
Pure Text OCR Performance
In tests on pure text documents, GOT-OCR 2.0 exhibited advanced performance, especially in recognizing text from PDF files, proving its strong document text recognition capabilities.
Scene Text OCR Performance
The model was tested using a custom dataset of 400 natural images split evenly between Chinese and English text. The evaluation metrics included:
- Edit Distance
- F1 Score
- Accuracy
- Recall
- BLEU Score
- METEOR Score
GOT-OCR 2.0 excelled in these tests, showcasing its proficiency in recognizing both document text and scene text across different languages.
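The edit-distance and F1 metrics above can be reproduced for any prediction/reference pair with a few lines of standard-library Python. This is a generic sketch of the two metrics, not the authors' exact evaluation script:

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def char_f1(pred: str, ref: str) -> float:
    """Character-level F1 using the multiset overlap of characters."""
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `edit_distance('kitten', 'sitting')` is 3, and a perfect prediction scores a character F1 of 1.0.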
Deploying GOT-OCR 2.0: A Step-by-Step Guide
1. Install the Required Environment
```python
# Install the extra dependencies required by GOT-OCR 2.0
!pip install tiktoken==0.7.0
!pip install verovio==4.3.1

import numpy as np
import pandas as pd
import transformers
import accelerate
import torchvision
```
2. Download the Model Weights
The model's parameters are relatively lightweight at 580M, so the weights can be loaded directly in float32:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
ocr_model = AutoModel.from_pretrained(
    'ucaslcl/GOT-OCR2_0',
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map='cuda',
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
)
ocr_model = ocr_model.eval().cuda()
```
3. Run Inference with Sample Code
Here is a basic example of using GOT-OCR 2.0 for text recognition:
```python
image_file = 'sample_image.png'
res = ocr_model.chat(tokenizer, image_file, ocr_type='format')
print(res)
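For more than one image, the `chat` call can be wrapped in a small batch loop. The sketch below is an illustrative wrapper, not part of the GOT-OCR API; it takes the recognition function as a parameter, so the real invocation would be something like `run_ocr_batch(lambda p: ocr_model.chat(tokenizer, p, ocr_type='format'), paths)`:

```python
from pathlib import Path
from typing import Callable

def run_ocr_batch(recognize: Callable[[str], str],
                  image_paths: list[str]) -> dict[str, str]:
    """Run a recognition callable over many images, skipping missing files."""
    results: dict[str, str] = {}
    for path in image_paths:
        if not Path(path).exists():
            results[path] = ''   # missing file: record an empty result
            continue
        results[path] = recognize(path)
    return results
```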
4. Complex Scenario Testing
Case 1: Elementary School Math Test Recognition
GOT-OCR 2.0 correctly recognized math symbols and equations, demonstrating its accuracy in handling educational content.
Case 2: Complex Math Equations
The model successfully converted complex equations into LaTeX format, proving its strength in technical content recognition.
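A quick sanity check on recognized LaTeX, such as confirming that braces and `\begin`/`\end` environments are balanced, can catch garbled conversions before rendering. This is a generic post-processing sketch, unrelated to the model itself:

```python
import re

def latex_output_ok(src: str) -> bool:
    """Cheap sanity check for recognized LaTeX: balanced braces and
    matching \\begin{...}/\\end{...} environment names, in order."""
    if src.count('{') != src.count('}'):
        return False
    begins = re.findall(r'\\begin\{(\w+)\}', src)
    ends = re.findall(r'\\end\{(\w+)\}', src)
    return begins == ends
```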
Case 3: Visual Chart Recognition
Though some issues were noted with chart outputs, GOT-OCR 2.0 was still able to extract accurate numerical data.
Case 4: Complex Table Recognition
The model effectively recognized table structures and content, outputting results in Markdown and HTML formats.
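As a post-processing illustration (not part of GOT-OCR 2.0 itself), a recognized Markdown table can be converted to HTML with a few lines of standard-library Python, assuming well-formed input with a header row and a `|---|` separator row:

```python
def markdown_table_to_html(md: str) -> str:
    """Convert a simple pipe-delimited Markdown table to an HTML table.
    Assumes well-formed input: header row, separator row, then data rows."""
    rows = [line.strip().strip('|').split('|')
            for line in md.strip().splitlines()]
    header, body = rows[0], rows[2:]   # rows[1] is the |---|---| separator
    html = ['<table>',
            '<tr>' + ''.join(f'<th>{c.strip()}</th>' for c in header) + '</tr>']
    for row in body:
        html.append('<tr>' + ''.join(f'<td>{c.strip()}</td>' for c in row) + '</tr>')
    html.append('</table>')
    return '\n'.join(html)
```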
Case 5: High School Literature Test Recognition
The model performed well in recognizing complex text documents such as high school exam papers.
Conclusion
GOT-OCR 2.0 represents a significant step forward in OCR technology. Its ability to handle various complex recognition tasks, including text, equations, tables, and charts, sets it apart from other open-source OCR models. Whether you're working with educational materials, technical content, or complex layouts, GOT-OCR 2.0 provides a high-performance, flexible solution.
For those looking to explore this powerful tool, you can find it on GitHub or try out the model through the provided demo links.
Project Links
- GitHub: https://github.com/Ucas-HaoranWei/GOT-OCR2.0/
- Hugging Face Weights: https://hf-mirror.com/stepfun-ai/GOT-OCR2_0
- Paper: https://arxiv.org/abs/2409.01704