layoutlmv3-base-mpdocvqa open-source model: multi-page document visual question answering, free to deploy

Layoutlmv3 Base Mpdocvqa

Developed by rubentito

A document visual question answering model, fine-tuned from Microsoft's pretrained LayoutLMv3 model on the Multi-Page DocVQA (MP-DocVQA) dataset.

Document question answering

Transformers

English #multi-page-document-qa #visual-text-understanding #document-intelligence

Downloads: 664

Released: 2/21/2023

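Model overview

This model is built for document visual question answering. It handles questions over multi-page documents and combines textual and visual information to predict answers.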

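Model features

Multimodal processing

Combines textual and visual information to understand documents, which suits complex document visual question answering tasks.

Multi-page document support

Handles questions over multi-page documents and predicts the page that contains the answer.

Efficiency

With 125M parameters, the model scores 0.4538 ANLS on MP-DocVQA.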

Model capabilities

Document visual question answering

Multi-page document processing

Fusion of textual and visual information

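Use cases

Document processing

Contract question answering

Extract specific clause information from multi-page contracts

ANLS 0.4538, APPA 51.9426 (MP-DocVQA benchmark scores)

Report analysis

Analyze key data in multi-page reports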

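🚀 LayoutLMv3 base fine-tuned on MP-DocVQA

This project takes the pretrained LayoutLMv3 model from the Microsoft model hub and fine-tunes it on the Multi-Page DocVQA (MP-DocVQA) dataset.

The model serves as a baseline in the paper Hierarchical multimodal transformers for Multi-Page DocVQA.

Results on the MP-DocVQA dataset are reported in Table 2 of the paper.
Training hyperparameters are listed in Table 8 of Appendix D.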

🚀 Quick start

💻 Usage example

Basic usage

The following example shows how to use the model in PyTorch to answer a question about a document image:

import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForQuestionAnswering

# apply_ocr=False: the caller supplies the words and boxes instead of the processor's built-in OCR.
processor = LayoutLMv3Processor.from_pretrained("rubentito/layoutlmv3-base-mpdocvqa", apply_ocr=False)
model = LayoutLMv3ForQuestionAnswering.from_pretrained("rubentito/layoutlmv3-base-mpdocvqa")

image = Image.open("example.jpg").convert("RGB")
question = "Is this a question?"
context = ["Example"]  # OCR words of the page, one entry per word.
boxes = [[0, 0, 1000, 1000]]  # One box per word on a 0-1000 scale; this example box covers the whole image.
document_encoding = processor(image, question, context, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**document_encoding)

# Get the answer: decode the tokens between the most likely start and end positions.
start_idx = torch.argmax(outputs.start_logits, dim=1).item()
end_idx = torch.argmax(outputs.end_logits, dim=1).item()
input_tokens = document_encoding["input_ids"][0]
answer = processor.tokenizer.decode(input_tokens[start_idx : end_idx + 1]).strip()
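The whole-image box in the example uses the 0 to 1000 coordinate scale that LayoutLM-family models expect. Real OCR output usually comes in pixel coordinates, so each box has to be rescaled first. The helper below is a minimal sketch of that step; the name `normalize_box` and the clamping behaviour are illustrative assumptions, not part of this model card.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], width: int, height: int) -> List[int]:
    """Rescale a pixel-space box (x0, y0, x1, y1) to the 0-1000 range.

    `normalize_box` is a hypothetical helper name used for illustration.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]
    # Clamp so that slightly out-of-bounds OCR boxes stay inside the valid range.
    return [min(1000, max(0, value)) for value in scaled]
```

With `image.size` giving `(width, height)`, one call per OCR word yields a `boxes` list that can be passed to the processor alongside the matching list of words.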

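✨ Key features

📊 Evaluation metrics

Average Normalized Levenshtein Similarity (ANLS)

The standard evaluation metric for text-based VQA tasks (ST-VQA and DocVQA). It evaluates a method's reasoning ability while smoothly penalizing OCR recognition errors. See the paper Scene Text Visual Question Answering for details.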
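To make the metric concrete, here is a minimal pure-Python sketch of ANLS. It assumes the commonly used conventions: answers are lowercased and stripped, each prediction is compared against every accepted answer, and a normalized edit distance of 0.5 or more scores zero. The function names and the default `threshold` are assumptions for illustration; the official evaluation script may differ in detail.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str], references: Sequence[List[str]], threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity to any accepted answer."""
    scores = []
    for prediction, accepted in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in accepted:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            # Distances at or above the threshold count as a wrong answer.
            if distance < threshold:
                best = max(best, 1.0 - distance)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

A prediction that differs from the accepted answer by a single OCR-style character error still earns most of the credit, while an unrelated string earns none.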

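Answer Page Prediction Accuracy (APPA)

In the MP-DocVQA task, a model can output the index of the page that contains the information needed to answer the question. This subtask is scored with accuracy, that is, whether the predicted page is correct. See the paper Hierarchical multimodal transformers for Multi-Page DocVQA for details.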
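The accuracy computation itself is simple, as the sketch below shows. It reports a percentage, matching the scale of the APPA values on this page. The `pick_answer_page` helper illustrates one plausible way a single-page model could produce a page index: score each page separately and keep the most confident one. That selection rule, the per-page confidence scores, and both function names are assumptions for illustration, not something this model card specifies.

```python
from typing import Sequence


def pick_answer_page(page_scores: Sequence[float]) -> int:
    """Return the index of the page with the highest confidence score.

    Hypothetical helper: how the per-page scores are obtained is left to the caller.
    """
    if not page_scores:
        raise ValueError("page_scores must not be empty")
    return max(range(len(page_scores)), key=lambda index: page_scores[index])


def appa(predicted_pages: Sequence[int], gold_pages: Sequence[int]) -> float:
    """Percentage of questions whose predicted page index matches the gold page index."""
    if len(predicted_pages) != len(gold_pages):
        raise ValueError("predicted_pages and gold_pages must have the same length")
    if not gold_pages:
        return 0.0
    correct = sum(1 for predicted, gold in zip(predicted_pages, gold_pages) if predicted == gold)
    return 100.0 * correct / len(gold_pages)
```

For example, three correct page predictions out of four questions give an APPA of 75.0.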

📈 Model results

Extended experimental results are in Table 2 of the paper Hierarchical multimodal transformers for Multi-Page DocVQA. A live leaderboard is also available on the RRC Portal.

Model	HF name	Parameters	ANLS	APPA
Bert large	rubentito/bert-large-mpdocvqa	334M	0.4183	51.6177
Longformer base	rubentito/longformer-base-mpdocvqa	148M	0.5287	71.1696
BigBird ITC base	rubentito/bigbird-base-itc-mpdocvqa	131M	0.4929	67.5433
LayoutLMv3 base	rubentito/layoutlmv3-base-mpdocvqa	125M	0.4538	51.9426
T5 base	rubentito/t5-base-mpdocvqa	223M	0.5050	0.0000
Hi-VT5	rubentito/hivt5-base-mpdocvqa	316M	0.6201	79.23

📚 Documentation

📖 Citation

@article{tito2022hierarchical,
  title={Hierarchical multimodal transformers for Multi-Page DocVQA},
  author={Tito, Rub{\`e}n and Karatzas, Dimosthenis and Valveny, Ernest},
  journal={arXiv preprint arXiv:2212.05935},
  year={2022}
}

📦 Model information

Property	Details
Base model	microsoft/layoutlmv3-base
License	cc-by-nc-sa-4.0
Tags	DocVQA, Document Question Answering, Document Visual Question Answering
Dataset	rubentito/mp-docvqa
Language	en