udop-large-512-300k: Open-Source Document Processing Model - Unified Vision, Text, and Layout Handling for Document AI Tasks


Udop Large 512 300k

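Developed by Microsoft

UDOP is a general-purpose document processing model that unifies vision, text, and layout. It is based on the T5 architecture and targets document AI tasks.

Image-to-Text

Transformers

License: MIT #document-visual-question-answering #multimodal-document-processing #layout-aware-parsing

Downloads: 264

Release date: 2/26/2024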

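Model Overview

UDOP uses a T5-based encoder-decoder Transformer architecture and is suited to document AI tasks such as document image classification, document parsing, and document visual question answering.

Model Features

Unified multimodal processing

Processes visual, textual, and layout information together, so the model can reason over a document's content and its structure at the same time.

General-purpose document AI

Supports multiple document AI tasks, including classification, parsing, and question answering.

T5-based architecture

Builds on the well-established T5 architecture, which scales well and adapts to new tasks.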

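Model Capabilities

Document image classification

Document parsing

Document visual question answering

Text and layout understanding

Multimodal document processing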

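Use Cases

Document processing

Document image classification

Automatically identifies and classifies different types of document images.

Document parsing

Extracts structured information from documents, such as tables and fields.

Document visual question answering

Answers natural-language questions about a document's content.

In the usage example, the model correctly answers a question about the date on a form.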

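🚀 UDOP Model

UDOP is a model for universal document processing. It unifies visual, textual, and layout information and can be applied to tasks such as document image classification, document parsing, and document visual question answering.

🚀 Quick Start

The UDOP model was proposed by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal in the paper Unifying Vision, Text, and Layout for Universal Document Processing.

✨ Key Features

UDOP uses a T5-based encoder-decoder Transformer architecture for document AI tasks such as document image classification, document parsing, and document visual question answering.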


💻 Usage Examples

Basic Usage

from transformers import AutoProcessor, UdopForConditionalGeneration
from datasets import load_dataset

# load model and processor
# in this case, we already have performed OCR ourselves
# so we initialize the processor with `apply_ocr=False`
processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

# load an example image, along with the words and coordinates
# which were extracted using an OCR engine
dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
question = "Question answering. What is the date on the form?"

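# prepare everything for the model
# keyword arguments keep the call unambiguous: the question is `text`,
# the OCR words are `text_pair`, and `boxes` holds one box per word
encoding = processor(images=image, text=question, text_pair=words, boxes=boxes, return_tensors="pt")

# autoregressive generation
predicted_ids = model.generate(**encoding)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
# output reported in the original model card: 9/30/92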
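
The example above reads words and boxes from a dataset that is already prepared for the model. With your own OCR engine, boxes usually arrive in pixel coordinates. Layout-aware models in the Transformers library generally expect each box as (x0, y0, x1, y1) on a 0-1000 scale relative to the page size, and this is assumed to hold for UDOP as well; check the library documentation before relying on it. The helper name `normalize_box` and the sample coordinates below are hypothetical. A minimal sketch:

```python
def normalize_box(box, width, height):
    """Scale a pixel-space (x0, y0, x1, y1) box to integers in the 0-1000 range.

    Assumption: the processor expects boxes relative to the page size on a
    0-1000 scale, as is common for layout-aware models.
    """
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]
    # Clamp so OCR boxes that slightly overshoot the page stay in range.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical OCR output for a page that is 800 pixels wide and 1000 pixels tall.
pixel_boxes = [(80, 100, 240, 130), (400, 500, 800, 1000)]
boxes = [normalize_box(b, width=800, height=1000) for b in pixel_boxes]
print(boxes)
```

The resulting list can be passed as `boxes` alongside the matching list of OCR words, one box per word.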


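📚 Documentation

You can use the model for document image classification, document parsing, and document visual question answering (DocVQA). For details on fine-tuning and inference, see the demo notebooks.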
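
In the usage example, the DocVQA question is passed to the model as plain text that begins with the task prefix "Question answering.". The sketch below wraps that convention in a small helper so prompts are built consistently. The helper name `make_docvqa_prompt` is hypothetical, and only the prefix shown in the usage example is used; prefixes for other tasks are not given here and should be taken from the demo notebooks.

```python
# Prefix copied from the usage example; other task prefixes are not covered here.
TASK_PREFIX = "Question answering."


def make_docvqa_prompt(question: str) -> str:
    """Build a DocVQA prompt by prepending the task prefix to the question."""
    # Collapse stray whitespace so the prompt has a single, predictable form.
    cleaned = " ".join(question.split())
    if not cleaned:
        raise ValueError("question must not be empty")
    return f"{TASK_PREFIX} {cleaned}"


print(make_docvqa_prompt("What is the date on the form?"))
```

The returned string can be passed to the processor as the question text, together with the image, OCR words, and boxes.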


📄 License

This model is released under the MIT license.

BibTeX Citation

@misc{tang2023unifying,
      title={Unifying Vision, Text, and Layout for Universal Document Processing}, 
      author={Zineng Tang and Ziyi Yang and Guoxin Wang and Yuwei Fang and Yang Liu and Chenguang Zhu and Michael Zeng and Cha Zhang and Mohit Bansal},
      year={2023},
      eprint={2212.02623},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}