GOT-OCR2.0 open-source multilingual OCR model: text recognition with an end-to-end architecture

GOT CPU

Developed by srimanth-d

GOT-OCR2.0 is a general-purpose multilingual OCR model that performs text recognition with a single end-to-end architecture.

Image-to-Text

Transformers

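Open-source license: Apache-2.0 #end-to-end OCR #multilingual text recognition #unified vision-language model

Downloads: 960

Released: 9/24/2024

Model Overview

The model implements the OCR-2.0 approach with a unified end-to-end architecture. It supports multilingual text recognition, combines vision and language processing, and is suited to document and scene text recognition tasks.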

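Model Features

Unified end-to-end architecture

An end-to-end design replaces the multi-stage pipeline of traditional OCR.

Multilingual support

Handles text recognition in multiple languages.

OCR-2.0

Implements the OCR-2.0 approach, which aims for more accurate text recognition.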

Model Capabilities

Document text recognition

Scene text recognition

Multilingual text extraction

Image-to-text conversion

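Use Cases

Document digitization

OCR for paper documents

Converts scanned or photographed paper documents into editable text.

Produces high-accuracy recognition results.

Scene text recognition

Street-view text recognition

Recognizes text on street signs, billboards, and similar scenes.

Adapts to a variety of fonts and backgrounds.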

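🚀 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

The General OCR Theory project moves towards OCR-2.0 with a unified end-to-end model. It takes an image containing text as input, returns text as output, and supports multiple languages.

🚀 Quick Start

Requirements

Inference runs on CPU with Hugging Face transformers. The following versions were tested with Python 3.10: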

torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0
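
One way to confirm that a local environment matches these pins is to compare them against the installed distributions. The following is a minimal sketch using the standard-library importlib.metadata module. The function name find_mismatches and the decision to ignore local build tags such as "+cpu" are assumptions made for illustration and are not part of the project.

```python
from importlib import metadata

# Versions copied from the tested environment listed above.
PINNED = {
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "transformers": "4.37.2",
    "tiktoken": "0.6.0",
    "verovio": "4.3.1",
    "accelerate": "0.28.0",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (wanted, installed)} for every package that differs.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Assumption: a local build tag such as "2.0.1+cpu" still counts as a match.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINNED).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

Running the script prints one line per package that is missing or differs from the tested version, and prints nothing when everything matches.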

Code Example

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True)
model = AutoModel.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True, low_cpu_mem_usage=True, use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
model = model.eval()

# input your test image
image_file = 'xxx.jpg'

# plain texts OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')

# format texts OCR:
# res = model.chat(tokenizer, image_file, ocr_type='format')

# fine-grained OCR:
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_color='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_color='')

# multi-crop OCR:
# res = model.chat_crop(tokenizer, image_file, ocr_type='ocr')
# res = model.chat_crop(tokenizer, image_file, ocr_type='format')

# render the formatted OCR results:
# res = model.chat(tokenizer, image_file, ocr_type='format', render=True, save_render_file = './demo.html')

print(res)

More details on ocr_type, ocr_box, ocr_color, and render are available on our GitHub.
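
Because these options are combined in several ways in the example above, it can help to assemble them in one place before calling model.chat. The sketch below is hypothetical: build_chat_kwargs and format_box are illustrative names, the bracketed "[x1,y1,x2,y2]" string for ocr_box is an assumed format that should be checked against the GitHub repository, and the two consistency rules (box and color are exclusive, rendering requires 'format') are assumptions inferred from how the example uses them.

```python
def format_box(x1, y1, x2, y2):
    """Format pixel coordinates as a bracketed string (assumed ocr_box format)."""
    return f"[{x1},{y1},{x2},{y2}]"


def build_chat_kwargs(ocr_type="ocr", box=None, color=None, render_path=None):
    """Collect keyword arguments for model.chat and reject inconsistent combinations."""
    if ocr_type not in ("ocr", "format"):
        raise ValueError(f"unsupported ocr_type: {ocr_type!r}")
    # Assumption: a region is selected either by box or by color, never both.
    if box is not None and color is not None:
        raise ValueError("pass either box or color, not both")
    # Assumption: rendering is only shown with ocr_type='format' in the example.
    if render_path is not None and ocr_type != "format":
        raise ValueError("rendering needs ocr_type='format'")

    kwargs = {"ocr_type": ocr_type}
    if box is not None:
        kwargs["ocr_box"] = format_box(*box)
    if color is not None:
        kwargs["ocr_color"] = color
    if render_path is not None:
        kwargs["render"] = True
        kwargs["save_render_file"] = render_path
    return kwargs
```

With a loaded model, a call would then look like res = model.chat(tokenizer, image_file, **build_chat_kwargs("format", render_path="./demo.html")).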

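✨ Key Features

Multilingual support: processes text in multiple languages.
Unified end-to-end model: a single end-to-end model that moves towards OCR-2.0.
Multiple modes: supports several OCR modes, including plain-text OCR, formatted-text OCR, and fine-grained OCR.

📦 Installation

Make sure the following dependencies are installed: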

torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0

💻 Usage Examples

Basic usage

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True)
model = AutoModel.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True, low_cpu_mem_usage=True, use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
model = model.eval()

# input your test image
image_file = 'xxx.jpg'

# plain texts OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')

print(res)
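
The basic call handles one image at a time. For a folder of scans, a small loop is enough. This is a minimal sketch, not part of the project: ocr_directory and IMAGE_SUFFIXES are hypothetical names, and the OCR call is passed in as a function so the helper does not depend on how the model was loaded.

```python
from pathlib import Path

# Assumption: these are the image types of interest; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def ocr_directory(run_ocr, folder):
    """Apply run_ocr to every image file in folder.

    run_ocr takes a file path (str) and returns the recognized text.
    Returns a dict mapping file name to text, in sorted file-name order.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = run_ocr(str(path))
    return results
```

With the model and tokenizer from the example above, it could be called as ocr_directory(lambda p: model.chat(tokenizer, p, ocr_type='ocr'), './scans'), where './scans' is a placeholder path.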

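Advanced usage

# These calls reuse tokenizer, model, and image_file from the basic example.
# Each call overwrites res, so keep only the mode you need.

# format texts OCR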
res = model.chat(tokenizer, image_file, ocr_type='format')

# fine-grained OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box='')

# multi-crop OCR
res = model.chat_crop(tokenizer, image_file, ocr_type='ocr')

# render the formatted OCR results
res = model.chat(tokenizer, image_file, ocr_type='format', render=True, save_render_file = './demo.html')

print(res)
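
Since each mode above differs only in the method called (chat or chat_crop) and the ocr_type argument, the choice can be expressed as a lookup table. The following is a sketch under that assumption. The mode names and the run_mode helper are hypothetical, and only the two call signatures that appear in the examples above are used.

```python
# Mode name -> (method name on the model, ocr_type argument).
# The mode names are illustrative labels, not identifiers from the project.
MODES = {
    "plain": ("chat", "ocr"),
    "format": ("chat", "format"),
    "multi-crop": ("chat_crop", "ocr"),
    "multi-crop-format": ("chat_crop", "format"),
}


def run_mode(model, tokenizer, image_file, mode="plain"):
    """Run one OCR mode and return whatever the model method returns."""
    try:
        method_name, ocr_type = MODES[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(MODES)}") from None
    method = getattr(model, method_name)
    return method(tokenizer, image_file, ocr_type=ocr_type)
```

For example, run_mode(model, tokenizer, image_file, "multi-crop") is equivalent to model.chat_crop(tokenizer, image_file, ocr_type='ocr').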

📚 Documentation

Online demo: 🔋Online Demo
GitHub repository: 🌟GitHub
Paper: 📜Paper

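🔧 Technical Details

The project uses a unified end-to-end model that takes an image containing text and returns text, with support for multiple languages and multiple OCR modes. The model is loaded through the Hugging Face transformers library and can run inference on CPU.

📄 License

This project is released under the Apache-2.0 license.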

👥 Team Members

Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

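🌟 More Multimodal Projects

👏 You are welcome to explore more multimodal projects from our team: Vary | Fox | OneChart

📖 Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!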

@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{liu2024focus,
  title={Focus Anywhere for Fine-grained Multi-page Document Understanding},
  author={Liu, Chenglong and Wei, Haoran and Chen, Jinyue and Kong, Lingyu and Ge, Zheng and Zhu, Zining and Zhao, Liang and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2405.14295},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}