GOT-OCR2.0: Open-Source Multilingual OCR Model with an End-to-End Architecture for Advanced Text Recognition

GOT CPU

Developed by srimanth-d

GOT-OCR2.0 is a general-purpose multilingual OCR model that uses an end-to-end architecture for advanced text recognition.

Image-to-Text

Transformers

Open Source License: Apache-2.0 #end-to-end-OCR #multilingual-text-recognition #unified-vision-language-model

Downloads 960

Release Time : 9/24/2024

Model Overview

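This model implements OCR-2.0 through a unified end-to-end architecture. It supports multilingual text recognition, combines it with vision-language processing, and is suited to a range of document and scene text recognition tasks.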

Model Features

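Unified end-to-end architecture

Uses an end-to-end model design that simplifies the multi-stage pipeline of traditional OCR

Multilingual support

Handles text recognition tasks in multiple languages

OCR-2.0 technology

Implements next-generation OCR technology for more accurate text recognition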

Model Capabilities

Document text recognition

Scene text recognition

Multilingual text extraction

Image-to-text conversion

Use Cases

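Document digitization

Paper document OCR

Converts scanned or photographed paper documents into editable text

High-accuracy text recognition results

Scene text recognition

Street text recognition

Recognizes text in scenes, such as road signs and signboards

Handles a variety of fonts and backgrounds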

🚀 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

This OCR model takes an image and text as input and outputs text. It is built on the Transformers library and supports multiple languages.

Property	Details
Pipeline tag	Image-text-to-text
Library name	Transformers
Language	Multilingual
Tags	got, vision-language, ocr2.0, custom_code
License	Apache-2.0

🔋Online Demo | 🌟GitHub | 📜Paper

Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

🚀 Quick Start

This section shows how to run inference on a CPU with Hugging Face Transformers. The following requirements were tested with Python 3.10.

torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0
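
Because the versions above are pinned, it can help to confirm that the active environment matches them before loading the model. The following is a minimal sketch using only the standard library; the helper name `find_mismatches` is hypothetical and not part of this project, and the handling of local build suffixes (such as `+cpu`) is an assumption about how the packages were installed.

```python
from importlib import metadata

# Pinned versions copied from the requirements list above.
PINNED = {
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "transformers": "4.37.2",
    "tiktoken": "0.6.0",
    "verovio": "4.3.1",
    "accelerate": "0.28.0",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for missing or differing packages.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build suffixes such as "2.0.1+cpu" (assumption).
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

If the script prints nothing, every pinned package is installed at the listed version.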

💻 Usage Examples

Basic Usage

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True)
model = AutoModel.from_pretrained('srimanth-d/GOT_CPU', trust_remote_code=True, low_cpu_mem_usage=True, use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
model = model.eval()

# input your test image
image_file = 'xxx.jpg'

# plain texts OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')

# format texts OCR:
# res = model.chat(tokenizer, image_file, ocr_type='format')

# fine-grained OCR:
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_color='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_color='')

# multi-crop OCR:
# res = model.chat_crop(tokenizer, image_file, ocr_type='ocr')
# res = model.chat_crop(tokenizer, image_file, ocr_type='format')

# render the formatted OCR results:
# res = model.chat(tokenizer, image_file, ocr_type='format', render=True, save_render_file = './demo.html')

print(res)
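
The fine-grained OCR calls in the snippet above leave `ocr_box` and `ocr_color` as empty strings. As a hedged sketch, the helper below builds a box string of the form `[x1,y1,x2,y2]` from pixel coordinates. That string format, and the color name `'red'`, are assumptions based on how the upstream project describes these options and are not confirmed by this page; check the GitHub repository before relying on them. The function name `make_ocr_box` is hypothetical.

```python
def make_ocr_box(x1, y1, x2, y2):
    """Format pixel corner coordinates as a '[x1,y1,x2,y2]' string.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    The string format is an assumption; verify it against the GitHub docs.
    """
    coords = [int(v) for v in (x1, y1, x2, y2)]
    if coords[0] >= coords[2] or coords[1] >= coords[3]:
        raise ValueError("expected top-left corner before bottom-right corner")
    return "[" + ",".join(str(v) for v in coords) + "]"


box = make_ocr_box(50, 120, 640, 300)
print(box)  # [50,120,640,300]

# Possible usage with the model loaded as above (assumed argument values):
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box=box)
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_color='red')
```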

Details on 'ocr_type', 'ocr_box', 'ocr_color', and 'render' are available on GitHub. The training code is also available on GitHub.
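
For more than one image, the `model.chat` call from the basic usage can be wrapped in a small loop. This is a sketch rather than part of the project: the function name `ocr_directory` and the set of file suffixes are hypothetical, and it assumes only the `model.chat(tokenizer, image_file, ocr_type=...)` call shown above.

```python
from pathlib import Path

# File suffixes treated as images (an illustrative choice, not a model limit).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def ocr_directory(model, tokenizer, directory, ocr_type="ocr"):
    """Run model.chat on every image in `directory`; return {filename: text}."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        results[path.name] = model.chat(tokenizer, str(path), ocr_type=ocr_type)
    return results


# Usage with the model and tokenizer loaded as in the basic usage:
# texts = ocr_directory(model, tokenizer, "./scans", ocr_type="format")
# for name, text in texts.items():
#     print(name, text[:80])
```

CPU inference can be slow, so processing files one at a time and keeping results keyed by filename makes it easy to resume or inspect partial output.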

📚 Documentation

👏 Explore our team's other multimodal projects.

Vary | Fox | OneChart

📄 License

This project is released under the Apache-2.0 license.

📜 Citation

If you find this work helpful, please consider citing the following papers and liking this project.

@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{liu2024focus,
  title={Focus Anywhere for Fine-grained Multi-page Document Understanding},
  author={Liu, Chenglong and Wei, Haoran and Chen, Jinyue and Kong, Lingyu and Ge, Zheng and Zhu, Zining and Zhao, Liang and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2405.14295},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}