GOT-OCR2.0: An Open-Source Multilingual OCR Model with an End-to-End Architecture

GOT CPU

Developed by srimanth-d

GOT-OCR2.0 is a multilingual, general-purpose OCR model that performs text recognition with a single end-to-end architecture.

Image-to-Text

Transformers

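Open-source license: Apache-2.0 #end-to-end OCR #multilingual text recognition #unified vision-language model

Downloads: 960

Released: 9/24/2024

Model Overview

The model implements OCR-2.0 with a unified end-to-end architecture. It supports multilingual text recognition, combines vision and language processing, and is suited to text recognition in documents and in natural scenes.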

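Model Features

Unified end-to-end architecture

An end-to-end design replaces the multi-stage pipeline of traditional OCR.

Multilingual support

Handles text recognition in multiple languages.

OCR-2.0

Implements a new generation of OCR technology aimed at more accurate text recognition.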

Model Capabilities

Document text recognition

Scene text recognition

Multilingual text extraction

Image-to-text conversion

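Use Cases

Document digitization

OCR for paper documents

Converts scanned or photographed paper documents into editable text.

High-accuracy text recognition results.

Scene text recognition

Street-view text recognition

Recognizes text on street signs, billboards, and similar scenes.

Handles a variety of fonts and backgrounds.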

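🚀 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

The General OCR Theory project aims to move toward OCR-2.0 with a unified end-to-end model. It converts images containing text into text, supports multiple languages, and applies to a wide range of use cases.

🚀 Quick Start

Requirements

Inference runs on CPU with Huggingface transformers. The following versions were tested with Python 3.10: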

torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0
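
Because the versions above are pinned exactly, it can help to confirm what is actually installed before loading the model. The following is a minimal sketch using only the standard library; the helper name `find_mismatches` and the choice to ignore local build tags such as `+cpu` are assumptions made for illustration, not part of the project.

```python
from importlib import metadata

# Pinned versions copied from the requirements list above.
PINNED = {
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "transformers": "4.37.2",
    "tiktoken": "0.6.0",
    "verovio": "4.3.1",
    "accelerate": "0.28.0",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that differs.

    `found` is None when the package is not installed.
    (Hypothetical helper, not part of GOT-OCR2.0.)
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Assumption: a local build tag such as "2.0.1+cpu" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Running the script prints one line per package that is missing or differs from the pinned version, and prints nothing when the environment matches.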

Code Example

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True)
model = AutoModel.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True, low_cpu_mem_usage=True, use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
model = model.eval()

# input your test image
image_file = 'xxx.jpg'

# plain texts OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')

# format texts OCR:
# res = model.chat(tokenizer, image_file, ocr_type='format')

# fine-grained OCR:
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_color='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_color='')

# multi-crop OCR:
# res = model.chat_crop(tokenizer, image_file, ocr_type='ocr')
# res = model.chat_crop(tokenizer, image_file, ocr_type='format')

# render the formatted OCR results:
# res = model.chat(tokenizer, image_file, ocr_type='format', render=True, save_render_file = './demo.html')

print(res)

More details about ocr_type, ocr_box, ocr_color, and render are available on our GitHub.
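
The example above switches modes by commenting lines in and out. One way to make the choice explicit is a small dispatcher, sketched below. The function `run_ocr` and its mode names (`plain`, `format`, `crop`, `render`) are hypothetical. The sketch only forwards to the `model.chat` and `model.chat_crop` calls shown in the example, with the same keyword arguments.

```python
def run_ocr(model, tokenizer, image_file, mode="plain", render_path=None):
    """Forward to the chat call that matches `mode`.

    Hypothetical wrapper: the mode names are illustrative, and only the
    calls shown in the example above are used.
    """
    if mode == "plain":
        return model.chat(tokenizer, image_file, ocr_type="ocr")
    if mode == "format":
        return model.chat(tokenizer, image_file, ocr_type="format")
    if mode == "crop":
        return model.chat_crop(tokenizer, image_file, ocr_type="ocr")
    if mode == "render":
        if render_path is None:
            raise ValueError("mode='render' needs render_path")
        return model.chat(
            tokenizer,
            image_file,
            ocr_type="format",
            render=True,
            save_render_file=render_path,
        )
    raise ValueError(f"unknown mode: {mode!r}")
```

With a loaded model and tokenizer, `run_ocr(model, tokenizer, 'xxx.jpg', mode='render', render_path='./demo.html')` would correspond to the render call in the example. Fine-grained OCR is left out because the expected `ocr_box` and `ocr_color` values are documented on GitHub rather than here.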

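✨ Key Features

Multilingual support: handles text in multiple languages.
Unified end-to-end model: a single end-to-end model aimed at OCR-2.0.
Multiple OCR modes: plain-text OCR, formatted-text OCR, fine-grained OCR, and others.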

📦 Installation Guide

Make sure the following dependencies are installed:

torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0

💻 Usage Examples

Basic Usage

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True)
model = AutoModel.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True, low_cpu_mem_usage=True, use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
model = model.eval()

# input your test image
image_file = 'xxx.jpg'

# plain texts OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')

print(res)

Advanced Usage

# These calls reuse model, tokenizer, and image_file from the basic example.

# format texts OCR
res = model.chat(tokenizer, image_file, ocr_type='format')
print(res)

# fine-grained OCR (see GitHub for the values ocr_box accepts)
res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box='')
print(res)

# multi-crop OCR
res = model.chat_crop(tokenizer, image_file, ocr_type='ocr')
print(res)

# render the formatted OCR results and save them to ./demo.html
res = model.chat(tokenizer, image_file, ocr_type='format', render=True, save_render_file='./demo.html')
print(res)
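
The examples above process one image at a time. For a folder of scans, a loop along the following lines may be convenient. This is a sketch under stated assumptions: `ocr_folder` and the `IMAGE_SUFFIXES` set are hypothetical, and the only model call used is `model.chat(tokenizer, image_file, ocr_type=...)` as shown above.

```python
from pathlib import Path

# Assumption: which file extensions count as images; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def ocr_folder(model, tokenizer, folder, ocr_type="ocr"):
    """Run model.chat on every image in `folder`; return {file name: text}.

    Hypothetical helper, not part of GOT-OCR2.0. Files are visited in
    sorted order and non-image files are skipped.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # The examples above pass the image as a path string.
        results[path.name] = model.chat(tokenizer, str(path), ocr_type=ocr_type)
    return results
```

Passing `ocr_type='format'` would request formatted output for every file instead of plain text.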

📚 Documentation

Online demo: 🔋Online Demo
GitHub repository: 🌟GitHub
Paper: 📜Paper

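🔧 Technical Details

The project is built on a unified end-to-end model that converts images containing text into text, with support for multiple languages and multiple OCR modes. The model is loaded through the Huggingface transformers library and can run inference on CPU.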

📄 License

This project is released under the Apache-2.0 license.

👥 Team Members

Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

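🌟 More Multimodal Projects

👏 Feel free to explore more multimodal projects from our team: Vary | Fox | OneChart

📖 Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!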

@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{liu2024focus,
  title={Focus Anywhere for Fine-grained Multi-page Document Understanding},
  author={Liu, Chenglong and Wei, Haoran and Chen, Jinyue and Kong, Lingyu and Ge, Zheng and Zhu, Zining and Zhao, Liang and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2405.14295},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}