Donut-proto: Open-Source Document Understanding Model for OCR-Free Image-to-Text Conversion

Donut Proto

Developed by naver-clova-ix

Donut is an OCR-free document understanding Transformer model that combines a visual encoder and text decoder for image-to-text conversion

Image-to-Text

Transformers

Open Source License: MIT

Tags: OCR-free Document Understanding, Visual-Text Conversion, Swin-BART Architecture

Downloads 30

Release Date: 7/19/2022

Model Overview

The Donut model pairs a Swin Transformer visual encoder with a BART text decoder. The encoder turns an image into a tensor of embeddings, and the decoder autoregressively generates text from that tensor. The model is designed specifically for document understanding tasks.

Model Features

OCR-free Processing

Directly processes image inputs, avoiding error accumulation issues in traditional OCR pipelines

End-to-End Training

Joint training of visual encoder and text decoder enables direct image-to-text conversion

Document Understanding Capability

Specifically optimized for document images to understand document structure and content

Model Capabilities

Document Image Processing

Image-to-Text Conversion

Document Structure Understanding

Vision-Language Joint Modeling

Use Cases

Document Processing

Document Image Classification

Automatically identifies and classifies different types of document images

Document Parsing

Extracts structured information from document images

🚀 Donut (base-sized model, pre-trained only)

A pre-trained Donut model introduced in the paper OCR-free Document Understanding Transformer by Geewook Kim et al., and first released in this repository.

Disclaimer: The team releasing Donut did not write a model card for this model so this model card has been written by the Hugging Face team.

🚀 Quick Start

This section provides an overview of the Donut model and its basic information.

✨ Features

Model Architecture: Donut combines a vision encoder (Swin Transformer) and a text decoder (BART).
Functionality: Given an image, the encoder encodes it into an embedding tensor, and then the decoder autoregressively generates text based on the encoder's output.
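The encode-then-generate flow described above can be sketched as a greedy decoding loop. This is a minimal illustration only, not Donut's actual implementation: `greedy_decode`, `toy_next_token_scores`, the token ids `BOS` and `EOS`, and `VOCAB_SIZE` are all hypothetical names, and the toy scorer stands in for the real BART decoder attending to Swin encoder embeddings.

```python
from typing import Callable, List, Sequence

BOS, EOS = 0, 1   # hypothetical start / end-of-sequence token ids
VOCAB_SIZE = 6    # hypothetical toy vocabulary size


def greedy_decode(
    encoder_states: Sequence[int],
    next_token_scores: Callable[[Sequence[int], Sequence[int]], List[float]],
    max_new_tokens: int = 16,
) -> List[int]:
    """Generate tokens one at a time, each conditioned on the encoder output
    and on all tokens generated so far (autoregressive decoding)."""
    tokens = [BOS]
    for _ in range(max_new_tokens):
        scores = next_token_scores(encoder_states, tokens)
        # Greedy choice: take the highest-scoring token id.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == EOS:
            break
    return tokens


def toy_next_token_scores(
    encoder_states: Sequence[int], prefix: Sequence[int]
) -> List[float]:
    """Stand-in for a real decoder: pretends the encoder states already
    spell out the target ids, then asks for EOS once they are exhausted."""
    step = len(prefix) - 1
    target = encoder_states[step] if step < len(encoder_states) else EOS
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


print(greedy_decode([3, 4, 2], toy_next_token_scores))  # [0, 3, 4, 2, 1]
```

A real decoder would replace the toy scorer with a forward pass that returns logits over the full vocabulary; the loop structure stays the same.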

📚 Documentation

Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape (batch_size, seq_len, hidden_size). The decoder then autoregressively generates text, conditioned on the encoder's output.
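To make the embedding shape concrete, the sketch below estimates it for a hierarchical Swin-style encoder. It assumes the standard Swin layout, in which each stage after the first halves the spatial resolution and doubles the channel width. The function name `encoder_output_shape` and the default values (patch size 4, embedding dimension 128, four stages) are illustrative assumptions, not values taken from this model card; check the checkpoint's configuration for the real ones.

```python
from typing import Tuple


def encoder_output_shape(
    batch_size: int,
    height: int,
    width: int,
    patch_size: int = 4,   # assumed patch size
    embed_dim: int = 128,  # assumed first-stage channel width
    num_stages: int = 4,   # assumed number of Swin stages
) -> Tuple[int, int, int]:
    """Estimate (batch_size, seq_len, hidden_size) for a Swin-style encoder."""
    # Patch embedding downsamples by patch_size; each later stage by 2.
    downsample = patch_size * 2 ** (num_stages - 1)
    if height % downsample or width % downsample:
        raise ValueError(
            f"height and width must be multiples of {downsample}"
        )
    seq_len = (height // downsample) * (width // downsample)
    # Channel width doubles at each patch-merging step.
    hidden_size = embed_dim * 2 ** (num_stages - 1)
    return (batch_size, seq_len, hidden_size)


# Example with an assumed 2560 x 1920 input image.
print(encoder_output_shape(1, 2560, 1920))  # (1, 4800, 1024)
```

Under these assumptions the sequence length grows with image area, so larger inputs cost the decoder's cross-attention proportionally more.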


Intended uses & limitations

This model is meant to be fine-tuned on a downstream task, like document image classification or document parsing. See the model hub to look for fine-tuned versions on a task that interests you.

See the documentation, which includes code examples.

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-2111-15664,
  author    = {Geewook Kim and
               Teakgyu Hong and
               Moonbin Yim and
               Jinyoung Park and
               Jinyeong Yim and
               Wonseok Hwang and
               Sangdoo Yun and
               Dongyoon Han and
               Seunghyun Park},
  title     = {Donut: Document Understanding Transformer without {OCR}},
  journal   = {CoRR},
  volume    = {abs/2111.15664},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.15664},
  eprinttype = {arXiv},
  eprint    = {2111.15664},
  timestamp = {Thu, 02 Dec 2021 10:50:44 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

📄 License

This project is licensed under the MIT license.
