udop-large-512: an open-source document processing model that can be deployed for free for classification, parsing, question answering, and similar tasks


Udop Large 512

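Developed by Microsoft

UDOP is a general-purpose document processing model that unifies vision, text, and layout. It is based on the T5 architecture and is suited to tasks such as document image classification, document parsing, and document visual question answering.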

Image-to-Text

Transformers

Open-source license: MIT #document visual question answering #multimodal document processing #layout-aware parsing

Downloads: 193

Release date: 2/26/2024

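Model overview

UDOP uses a T5-based encoder-decoder Transformer architecture and combines visual, textual, and layout information to handle document AI tasks.

Model features

Unified multimodal processing

Processes visual, textual, and layout information jointly

General-purpose document processing

Supports a range of document AI tasks, including classification, parsing, and question answering

Based on the T5 architecture

Builds on the established T5 encoder-decoder Transformer architecture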
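
To make the three modalities concrete, here is a minimal sketch of how the inputs for one document page might be bundled before they are passed to a processor. The DocumentInput class and its field names are hypothetical and are not part of Transformers. It only checks that every OCR word has exactly one four-value bounding box.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class DocumentInput:
    """Hypothetical container for the three modalities of one document page."""

    image: Any                  # visual modality, e.g. a PIL image
    words: List[str]            # text modality: words produced by OCR
    boxes: List[Sequence[int]]  # layout modality: one (x0, y0, x1, y1) box per word

    def __post_init__(self) -> None:
        # Every word needs exactly one bounding box.
        if len(self.words) != len(self.boxes):
            raise ValueError(
                f"got {len(self.words)} words but {len(self.boxes)} boxes"
            )
        # Every box needs exactly four coordinates.
        for box in self.boxes:
            if len(box) != 4:
                raise ValueError(f"box {box!r} does not have four coordinates")
```

A mismatch between words and boxes is an easy mistake to make when post-processing OCR output, so failing early with a clear message is usually preferable to an obscure shape error later.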

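Model capabilities

Document image classification

Document structure parsing

Document visual question answering

Document semantic understanding

Use cases

Document processing

Table data extraction

Extracts table data from document images

Example output: 9/30/92

Document classification

Classifies document images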

🚀 UDOP model

The UDOP model was proposed in Unifying Vision, Text, and Layout for Universal Document Processing by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. It can be used for document AI tasks such as document image classification, document parsing, and document visual question answering.

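🚀 Quick start

UDOP is a model specialized for document AI tasks. It uses a T5-based encoder-decoder Transformer architecture and can be used for document image classification, document parsing, and document visual question answering.

✨ Key features

UDOP uses a T5-based encoder-decoder Transformer architecture.
It can be used for document AI tasks such as document image classification, document parsing, and document visual question answering.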

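📦 Installation

The original model card gives no installation steps. The usage example below imports the transformers and datasets Python packages, so both need to be installed.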

💻 Usage examples

Basic usage

from transformers import AutoProcessor, UdopForConditionalGeneration
from datasets import load_dataset

# load model and processor
# in this case, we already have performed OCR ourselves
# so we initialize the processor with `apply_ocr=False`
processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

# load an example image, along with the words and coordinates
# which were extracted using an OCR engine
dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
question = "Question answering. What is the date on the form?"

# prepare everything for the model
# pass the OCR words as `text_pair` so the call does not depend on positional order
encoding = processor(image, question, text_pair=words, boxes=boxes, return_tensors="pt")

# autoregressive generation
predicted_ids = model.generate(**encoding)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
# 9/30/92
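
The example above takes words and boxes straight from a dataset. When they come from your own OCR engine in pixel coordinates, they usually have to be rescaled first. The sketch below assumes the 0-1000 coordinate convention used by the LayoutLM family of processors; verify that assumption against the processor you actually load. The function name normalize_box is hypothetical.

```python
from typing import List, Sequence


def normalize_box(
    box: Sequence[float], width: int, height: int, scale: int = 1000
) -> List[int]:
    """Rescale a pixel-space (x0, y0, x1, y1) box to the 0..scale range.

    Assumption: the processor expects boxes on a 0-1000 grid, as in the
    LayoutLM family. This is not confirmed by the model card itself.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")
    x0, y0, x1, y1 = box

    def clamp(value: int) -> int:
        # Keep coordinates inside the grid even if OCR overshoots the page.
        return max(0, min(scale, value))

    return [
        clamp(int(scale * x0 / width)),
        clamp(int(scale * y0 / height)),
        clamp(int(scale * x1 / width)),
        clamp(int(scale * y1 / height)),
    ]


# Usage: rescale every OCR box of a 500x1000 pixel page.
pixel_boxes = [(50, 100, 150, 200)]
boxes = [normalize_box(b, width=500, height=1000) for b in pixel_boxes]
```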

Advanced usage

For fine-tuning and inference, see the demo notebooks.

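📚 Documentation

Model description

UDOP uses a T5-based encoder-decoder Transformer architecture for document AI tasks. It can be used for document image classification, document parsing, and document visual question answering.

Intended uses and limitations

The model can be used for document image classification, document parsing, and document visual question answering (DocVQA).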
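
The usage example in this card selects the question-answering task by prefixing the question with "Question answering. ". A minimal sketch of a helper that builds such prompts follows. Only the question-answering prefix is taken from this card. The helper name build_qa_prompt is hypothetical, and prefixes for other tasks are left out because the card does not give them.

```python
# Task prefix copied from the usage example in this card.
QA_PREFIX = "Question answering. "


def build_qa_prompt(question: str) -> str:
    """Return a UDOP-style question-answering prompt (hypothetical helper)."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return QA_PREFIX + question


prompt = build_qa_prompt("What is the date on the form?")
```

The resulting string can be passed as the text argument of the processor in place of the hard-coded question.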

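🔧 Technical details

UDOP handles document AI tasks with a T5-based encoder-decoder Transformer architecture. This architecture is suited to tasks such as document image classification, document parsing, and document visual question answering.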

📄 License

This model is released under the MIT license.

BibTeX entry and citation info

@misc{tang2023unifying,
      title={Unifying Vision, Text, and Layout for Universal Document Processing}, 
      author={Zineng Tang and Ziyi Yang and Guoxin Wang and Yuwei Fang and Yang Liu and Chenguang Zhu and Michael Zeng and Cha Zhang and Mohit Bansal},
      year={2023},
      eprint={2212.02623},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}