Udop-large-512-300k Open-source Document Processing Model - Unified Handling of Visual-Textual Layouts for Document AI Tasks

Udop Large 512 300k

Developed by Microsoft

UDOP is a universal document processing model that unifies vision, text, and layout, based on the T5 architecture, suitable for document AI tasks.

Image-to-Text

Transformers

Open Source License: MIT

Tags: Document Visual Question Answering, Multimodal Document Processing, Layout-Aware Parsing

Downloads 264

Release Time: 2/26/2024

Model Overview

UDOP adopts an encoder-decoder Transformer architecture based on T5, applicable to document AI tasks such as document image classification, document parsing, and document visual question answering.

Model Features

Unified Multimodal Processing

Capable of simultaneously processing visual, textual, and layout information for comprehensive document understanding

General Document AI Capabilities

Supports various document AI tasks, including classification, parsing, and question answering

Based on T5 Architecture

Builds on the T5 encoder-decoder Transformer architecture, so document tasks are handled as text generation

Model Capabilities

Document image classification

Document parsing

Document visual question answering

Text layout understanding

Multimodal document processing

Use Cases

Document Processing

Document Image Classification

Automatically identify and classify different types of document images

Document Parsing

Extract structured information from documents, such as tables and fields

Document Visual Question Answering

Answer natural language questions based on document content

Example: the model correctly answered a date-related question from a table

🚀 UDOP model

The UDOP model unifies vision, text, and layout for universal document processing. It offers solutions for various document AI tasks, such as document image classification, parsing, and visual question answering.

🚀 Quick Start

The UDOP model was proposed in "Unifying Vision, Text, and Layout for Universal Document Processing" by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal.

✨ Features

UDOP adopts an encoder-decoder Transformer architecture based on T5 for document AI tasks such as document image classification, document parsing, and document visual question answering.

💻 Usage Examples

Basic Usage

Here's how to use the model on a document image:

from transformers import AutoProcessor, UdopForConditionalGeneration
from datasets import load_dataset

# load model and processor
# in this case, we already have performed OCR ourselves
# so we initialize the processor with `apply_ocr=False`
processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

# load an example image, along with the words and coordinates
# which were extracted using an OCR engine
dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
question = "Question answering. What is the date on the form?"

# prepare everything for the model
# the question is the first text; the OCR words are passed as the paired text
encoding = processor(image, question, text_pair=words, boxes=boxes, return_tensors="pt")

# autoregressive generation
predicted_ids = model.generate(**encoding)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
# prints: 9/30/92
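
When OCR is run separately (`apply_ocr=False`), the caller has to supply the word boxes. The following is a minimal sketch, assuming the processor expects LayoutLM-style boxes `[x0, y0, x1, y1]` scaled to a 0-1000 range relative to the page size. `normalize_box` is a hypothetical helper, not part of Transformers, and the dataset in the example above is assumed to ship boxes already in this form.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], width: float, height: float) -> List[int]:
    """Scale a pixel-space box [x0, y0, x1, y1] to the 0-1000 range.

    Assumption: the model expects LayoutLM-style normalized coordinates.
    """
    if width <= 0 or height <= 0:
        raise ValueError("page width and height must be positive")

    x0, y0, x1, y1 = box

    def scale(value: float, size: float) -> int:
        # Clamp so that slightly out-of-page OCR boxes stay within 0-1000.
        return max(0, min(1000, int(1000 * value / size)))

    return [scale(x0, width), scale(y0, height), scale(x1, width), scale(y1, height)]


# Hypothetical OCR output for a 1000 x 2000 pixel page.
pixel_boxes = [(50, 100, 200, 300), (0, 0, 1000, 2000)]
boxes = [normalize_box(b, width=1000, height=2000) for b in pixel_boxes]
print(boxes)
```

The resulting `boxes` list would take the place of `example["bboxes"]` when the words come from your own OCR engine.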

Advanced Usage

Refer to the demo notebooks for fine-tuning and inference.

📚 Documentation

Intended uses & limitations

You can use the model for document image classification, document parsing and document visual question answering (DocVQA).
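
The usage example on this page selects the task through a text prefix ("Question answering. What is the date on the form?"). The following is a small sketch of building such prompts. `build_prompt` is a hypothetical helper, and only the "Question answering." prefix is taken from that example. Prefixes for other tasks are an assumption to verify against the paper or the demo notebooks.

```python
def build_prompt(text: str, task_prefix: str = "Question answering.") -> str:
    """Join a task prefix and the user text into a single prompt string.

    The default prefix is the one used in the usage example; any other
    prefix passed in is the caller's assumption.
    """
    # Collapse stray whitespace so the prompt is stable across inputs.
    cleaned = " ".join(text.split())
    if not cleaned:
        raise ValueError("text must not be empty")
    return f"{task_prefix.strip()} {cleaned}"


question = build_prompt("What is  the date on the form?")
print(question)
```

The returned string can be passed as the question argument to the processor call in the usage example.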

BibTeX entry and citation info

@misc{tang2023unifying,
      title={Unifying Vision, Text, and Layout for Universal Document Processing}, 
      author={Zineng Tang and Ziyi Yang and Guoxin Wang and Yuwei Fang and Yang Liu and Chenguang Zhu and Michael Zeng and Cha Zhang and Mohit Bansal},
      year={2023},
      eprint={2212.02623},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

This project is licensed under the MIT license.
