General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
This project presents a unified end-to-end model for OCR-2.0, supporting a wide range of OCR tasks with multilingual coverage.
Online Demo | GitHub | Paper
Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Quick Start
Environment Requirements
Inference with Hugging Face Transformers on CPU. Requirements tested on Python 3.10:
```
torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0
```
Usage Examples
Basic Usage
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'srimanth-d/GOT_CPU',
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
)
model = model.eval()

# Path to the input image
image_file = 'xxx.jpg'

# Plain-text OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')
print(res)
```
More details about `ocr_type`, `ocr_box`, `ocr_color`, and `render` can be found in our GitHub repository, which also hosts our training code.
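As an illustration of the fine-grained option, `ocr_box` restricts recognition to a region of the image. This is a minimal sketch, assuming the box is passed as a `'[x1,y1,x2,y2]'` string as described in the GitHub README; `make_ocr_box` is a hypothetical helper, not part of the GOT API:

```python
# Hypothetical helper (not part of the GOT API): formats pixel coordinates
# into the '[x1,y1,x2,y2]' string form used for the ocr_box argument.
def make_ocr_box(x1, y1, x2, y2):
    return f'[{x1},{y1},{x2},{y2}]'

box = make_ocr_box(100, 200, 300, 400)  # '[100,200,300,400]'

# Assuming `model`, `tokenizer`, and `image_file` from the Basic Usage example:
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box=box)     # plain text in the box
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_box=box)  # formatted text in the box
```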
More Multimodal Projects
Welcome to explore more multimodal projects of our team:
Vary | Fox | OneChart
License
This project is licensed under the Apache-2.0 license.
Citation
If you find our work helpful, please consider citing our papers and liking this project!
```bibtex
@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}

@article{liu2024focus,
  title={Focus Anywhere for Fine-grained Multi-page Document Understanding},
  author={Liu, Chenglong and Wei, Haoran and Chen, Jinyue and Kong, Lingyu and Ge, Zheng and Zhu, Zining and Zhao, Liang and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2405.14295},
  year={2024}
}

@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}
```
Information Table

| Property | Details |
| --- | --- |
| Pipeline Tag | image-text-to-text |
| Library Name | transformers |
| Language | multilingual |
| Tags | got, vision-language, ocr2.0, custom_code |
| License | apache-2.0 |