OCR-DocVQA-Donut: Open-Source Document Visual Question Answering Without OCR

OCR DocVQA Donut

Developed by jinhybr

Donut is an OCR-free document understanding Transformer model that combines a visual encoder and text decoder for document visual question answering tasks.

Image-to-Text

Transformers

License: MIT
Tags: OCR-free document understanding, Visual Question Answering, Swin-BART architecture

Downloads: 240

Release date: 11/4/2022

Model Overview

This is the base-sized Donut model fine-tuned on DocVQA. It uses a Swin Transformer to encode the document image and a BART decoder to generate the answer text, so no separate OCR step is needed.
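The encoder/decoder flow described above can be sketched with the Hugging Face Transformers classes for Donut. This is a minimal sketch, not the author's published usage. The repository id `jinhybr/OCR-DocVQA-Donut` is an assumption inferred from the developer and model names on this page, so verify it before use. The `<s_docvqa>` prompt layout follows the convention of the upstream DocVQA Donut checkpoint, and the helper names are illustrative.

```python
import re

# ASSUMPTION: repository id inferred from the developer and model names on this page.
ASSUMED_MODEL_ID = "jinhybr/OCR-DocVQA-Donut"


def build_docvqa_prompt(question: str) -> str:
    """Wrap a question in the task prompt that the decoder is conditioned on."""
    return f"<s_docvqa><s_question>{question.strip()}</s_question><s_answer>"


def extract_answer(decoded: str) -> str:
    """Return the text between <s_answer> and </s_answer> (or end of string)."""
    match = re.search(r"<s_answer>(.*?)(?:</s_answer>|$)", decoded, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def answer_question(image, question: str, model_id: str = ASSUMED_MODEL_ID) -> str:
    """Answer a question about a PIL document image (downloads the model on first call)."""
    # Imported lazily so the helpers above work without torch/transformers installed.
    import torch
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    model.eval()

    # Swin encoder input: the page image as a pixel tensor.
    pixel_values = processor(image, return_tensors="pt").pixel_values
    # BART decoder input: the task prompt containing the question.
    decoder_input_ids = processor.tokenizer(
        build_docvqa_prompt(question), add_special_tokens=False, return_tensors="pt"
    ).input_ids

    with torch.no_grad():
        output_ids = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
        )

    decoded = processor.batch_decode(output_ids)[0]
    decoded = decoded.replace(processor.tokenizer.eos_token, "")
    decoded = decoded.replace(processor.tokenizer.pad_token, "")
    return extract_answer(decoded)
```

A call would look like `answer_question(Image.open("invoice.png").convert("RGB"), "What is the invoice number?")`, where `invoice.png` is a hypothetical file and `Image` comes from Pillow.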

Model Features

OCR-free processing

Directly understands document content from images without traditional OCR steps

End-to-end training

Joint optimization of visual encoding and text generation

Document understanding

Can parse key information from structured documents like invoices and contracts

Model Capabilities

Document image understanding

Visual question answering

Key information extraction

Cross-modal representation learning

Use Cases

Document processing

Invoice information extraction

Automatically identifies key fields like invoice numbers and amounts from invoice images

Examples show accurate extraction of invoice numbers

Contract parsing

Analyzes terms and amount information in contract documents

Examples demonstrate recognition of purchase amounts
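Because the model answers one question per pass, field extraction like the invoice and contract cases above can be framed as a set of questions per document. The sketch below is only an illustration under that assumption: the field names, the question wording, and the `ask` callable (any function that sends one question about a given document image to the model and returns the answer string) are hypothetical and do not come from the model card.

```python
from typing import Callable, Dict

# Hypothetical mapping from output field to the question posed to the model.
INVOICE_QUESTIONS: Dict[str, str] = {
    "invoice_number": "What is the invoice number?",
    "total_amount": "What is the total amount?",
}


def extract_fields(
    ask: Callable[[str], str],
    field_questions: Dict[str, str] = INVOICE_QUESTIONS,
) -> Dict[str, str]:
    """Ask one question per field and collect the stripped answers."""
    return {field: ask(question).strip() for field, question in field_questions.items()}


if __name__ == "__main__":
    # Stand-in for a real model call, used only to show the data flow.
    canned = {
        "What is the invoice number?": " INV-001 ",
        "What is the total amount?": "120.50",
    }
    print(extract_fields(lambda q: canned[q]))
```

Keeping the model call behind a plain callable means the question set can be tested without loading the model, and a contract-specific question set can be passed in as `field_questions`.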

Property	Details
Pipeline Tag	Document Question Answering
Tags	Donut, Image-to-Text, Vision
