General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
This project presents a unified end-to-end model for OCR-2.0, supporting a wide range of OCR tasks with multilingual coverage.
Online Demo | GitHub | Paper
Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Quick Start
Environment Requirements
Inference with Hugging Face Transformers on CPU. Requirements tested on Python 3.10:
```
torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0
```
Usage Examples
Basic Usage
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'srimanth-d/GOT_CPU',
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
)
model = model.eval()

# Path to the input image
image_file = 'xxx.jpg'

# Plain-text OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')
print(res)
```
More details about `ocr_type`, `ocr_box`, `ocr_color`, and `render` can be found in our GitHub repository, which also hosts our training code.
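As an illustration of the fine-grained option, `ocr_box` restricts recognition to a region of the image. This is a minimal sketch, assuming the box is passed as a `'[x1,y1,x2,y2]'` string as described in the GitHub README; `make_ocr_box` is a hypothetical helper, not part of the GOT API:

```python
# Hypothetical helper (not part of the GOT API): formats pixel coordinates
# into the '[x1,y1,x2,y2]' string form used for the ocr_box argument.
def make_ocr_box(x1, y1, x2, y2):
    return f'[{x1},{y1},{x2},{y2}]'

box = make_ocr_box(100, 200, 300, 400)  # '[100,200,300,400]'

# Assuming `model`, `tokenizer`, and `image_file` from the Basic Usage example:
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box=box)     # plain text in the box
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_box=box)  # formatted text in the box
```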
More Multimodal Projects
Welcome to explore more multimodal projects of our team:
Vary | Fox | OneChart
License
This project is licensed under the Apache-2.0 license.
Citation
If you find our work helpful, please consider citing our papers and liking this project!
```bibtex
@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}

@article{liu2024focus,
  title={Focus Anywhere for Fine-grained Multi-page Document Understanding},
  author={Liu, Chenglong and Wei, Haoran and Chen, Jinyue and Kong, Lingyu and Ge, Zheng and Zhu, Zining and Zhao, Liang and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2405.14295},
  year={2024}
}

@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}
```
Information Table

| Property | Details |
| --- | --- |
| Pipeline Tag | image-text-to-text |
| Library Name | transformers |
| Language | multilingual |
| Tags | got, vision-language, ocr2.0, custom_code |
| License | apache-2.0 |