LayoutLMv3-base-mpdocvqa: Open-source Model for Multi-page Document Visual Question Answering

LayoutLMv3 Base MP-DocVQA

Developed by rubentito

This model is a document visual question answering model fine-tuned on the Multi-page Document VQA (MP-DocVQA) dataset, based on Microsoft's pre-trained LayoutLMv3 model.

Document Question Answering

Transformers

English #Multi-page document QA #Visual text understanding #Document intelligence

Downloads 664

Release Time: 2/21/2023

Model Overview

This model is designed for document visual question answering. It answers questions over multi-page documents by combining textual and visual information to predict the answer.

Model Features

Multimodal processing capability

Combines textual and visual information for document understanding, suitable for complex document visual QA tasks.

Multi-page document support

Answers questions across multi-page documents and can predict the page that contains the answer.

Efficient performance

At 125M parameters it is the smallest model in the results table, reaching 0.4538 ANLS and 51.9426 APPA on MP-DocVQA.

Model Capabilities

Document visual QA

Multi-page document processing

Text and visual information fusion

Use Cases

Document processing

Contract document QA

Extract specific clause information from multi-page contract documents

MP-DocVQA benchmark results (not contract-specific): ANLS 0.4538, APPA 51.9426

Report document analysis

Analyze key data in multi-page report documents

🚀 LayoutLMv3 base fine-tuned on MP-DocVQA

This is the pre-trained LayoutLMv3 model from the Microsoft hub, fine-tuned on the Multipage DocVQA (MP-DocVQA) dataset. It was used as a baseline in Hierarchical multimodal transformers for Multi-Page DocVQA.

Model Information

Property	Details
Base Model	microsoft/layoutlmv3-base
License	cc-by-nc-sa-4.0
Tags	DocVQA, Document Question Answering, Document Visual Question Answering
Datasets	rubentito/mp-docvqa
Language	en

🚀 Quick Start

This model is a version of the pre-trained LayoutLMv3 from the Microsoft hub, fine-tuned on the Multipage DocVQA (MP-DocVQA) dataset. It was used as a baseline in Hierarchical multimodal transformers for Multi-Page DocVQA.

Results on the MP-DocVQA dataset are reported in Table 2 of that paper.
Training hyperparameters can be found in Table 8 of Appendix D of the same paper.

💻 Usage Examples

Basic Usage

import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForQuestionAnswering

# apply_ocr=False: words and boxes are supplied by the caller instead of the built-in OCR.
processor = LayoutLMv3Processor.from_pretrained("rubentito/layoutlmv3-base-mpdocvqa", apply_ocr=False)
model = LayoutLMv3ForQuestionAnswering.from_pretrained("rubentito/layoutlmv3-base-mpdocvqa")

image = Image.open("example.jpg").convert("RGB")
question = "Is this a question?"
context = ["Example"]  # One entry per word on the page.
boxes = [[0, 0, 1000, 1000]]  # One box per word (0-1000 scale); this example box covers the whole image.
document_encoding = processor(image, question, context, boxes=boxes, return_tensors="pt")

with torch.no_grad():
    outputs = model(**document_encoding)

# Get the answer: decode the tokens between the most likely start and end positions.
start_idx = torch.argmax(outputs.start_logits, dim=1).item()
end_idx = torch.argmax(outputs.end_logits, dim=1).item()
input_tokens = document_encoding["input_ids"][0]
answer = processor.tokenizer.decode(input_tokens[start_idx: end_idx + 1]).strip()
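
The `boxes` argument in the example above uses the 0-1000 coordinate scale that LayoutLMv3 expects. When word boxes come from an external OCR engine in pixel coordinates, they need to be rescaled first. The following is a minimal sketch of that step; the helper name `normalize_box` and the sample words and page size are illustrative assumptions, not part of the original model card.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to integers in the 0-1000 range.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so that boxes slightly outside the page still fall in the valid range.
    return [min(1000, max(0, value)) for value in scaled]


# Made-up OCR output for an 800x1000 pixel page.
words = ["Invoice", "Total"]
pixel_boxes = [(50, 40, 250, 80), (600, 900, 800, 1000)]
boxes = [normalize_box(b, page_width=800, page_height=1000) for b in pixel_boxes]
print(boxes)
```

The resulting `words` and `boxes` lists have one entry per word and can be passed to the processor in place of `context` and `boxes`.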

📚 Documentation

Metrics

Average Normalized Levenshtein Similarity (ANLS)

The standard metric for text-based VQA tasks (ST-VQA and DocVQA). It evaluates the method's reasoning capabilities while smoothly penalizing OCR recognition errors. See Scene Text Visual Question Answering for detailed information.
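
As a rough illustration of how ANLS behaves, the sketch below follows the commonly used definition: for each question, the normalized Levenshtein distance between the prediction and each ground-truth answer is computed, scores with a distance of 0.5 or more are set to zero, and the best score per question is averaged. The function names, the lower-casing and whitespace normalization, and the 0.5 threshold are assumptions based on that common definition, not the official evaluation code.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls_score(prediction, gold_answers, threshold=0.5):
    """Best similarity of one prediction against all accepted answers."""
    pred = " ".join(prediction.lower().split())
    best = 0.0
    for gold in gold_answers:
        ref = " ".join(gold.lower().split())
        longest = max(len(pred), len(ref))
        distance = levenshtein(pred, ref) / longest if longest else 0.0
        # Answers that are too far from the reference score zero.
        best = max(best, 1.0 - distance if distance < threshold else 0.0)
    return best


def anls(predictions, gold_answer_lists, threshold=0.5):
    """Average the per-question scores over the dataset."""
    scores = [anls_score(p, g, threshold)
              for p, g in zip(predictions, gold_answer_lists)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `anls_score("Paris", ["paris"])` gives a full score because only the casing differs, while a prediction that shares no characters with any reference scores zero.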

Answer Page Prediction Accuracy (APPA)

In the MP-DocVQA task, models can provide the index of the page that contains the information required to answer the question. For this subtask, accuracy is used to evaluate the predictions, i.e., whether the predicted page is correct or not. See Hierarchical multimodal transformers for Multi-Page DocVQA for detailed information.
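
The sketch below shows one way the page subtask could be scored. It assumes the model is run once per page and that the page with the highest answer confidence is taken as the predicted page; that selection rule and both function names are assumptions for illustration. The accuracy is scaled to a percentage to match the APPA values reported in the results table.

```python
def predict_answer_page(page_confidences):
    """Return the index of the page with the highest answer confidence.

    Assumption: one confidence value per page, e.g. from a per-page forward pass.
    """
    return max(range(len(page_confidences)), key=lambda i: page_confidences[i])


def appa(predicted_pages, gold_pages):
    """Percentage of questions whose predicted page index matches the gold page."""
    if len(predicted_pages) != len(gold_pages):
        raise ValueError("predicted_pages and gold_pages must have the same length")
    if not gold_pages:
        return 0.0
    correct = sum(1 for pred, gold in zip(predicted_pages, gold_pages) if pred == gold)
    return 100.0 * correct / len(gold_pages)


# Made-up confidences for a three-page document, then a toy four-question evaluation.
page = predict_answer_page([0.1, 0.7, 0.2])
score = appa([1, 0, 2, 2], [1, 0, 0, 2])
print(page, score)
```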

Model results

Extended experimentation can be found in Table 2 of Hierarchical multimodal transformers for Multi-Page DocVQA. You can also check the live leaderboard at the RRC Portal.

Model	HF name	Parameters	ANLS	APPA
BERT large	rubentito/bert-large-mpdocvqa	334M	0.4183	51.6177
Longformer base	rubentito/longformer-base-mpdocvqa	148M	0.5287	71.1696
BigBird ITC base	rubentito/bigbird-base-itc-mpdocvqa	131M	0.4929	67.5433
LayoutLMv3 base	rubentito/layoutlmv3-base-mpdocvqa	125M	0.4538	51.9426
T5 base	rubentito/t5-base-mpdocvqa	223M	0.5050	0.0000
Hi-VT5	rubentito/hivt5-base-mpdocvqa	316M	0.6201	79.23

📄 License

The model is released under the cc-by-nc-sa-4.0 license.

📖 Citation Information

@article{tito2022hierarchical,
  title={Hierarchical multimodal transformers for Multi-Page DocVQA},
  author={Tito, Rub{\`e}n and Karatzas, Dimosthenis and Valveny, Ernest},
  journal={arXiv preprint arXiv:2212.05935},
  year={2022}
}
