Donut Receipts Extract

Developed by AdamCodd

A specialized receipt text extraction model based on the Donut architecture. It achieves OCR-free document understanding by pairing a visual encoder with a text decoder.

Image-to-Text

Transformers

#Receipt Text Extraction #OCR-Free Document Understanding #High-Precision Table Recognition

Release Time: 1/28/2024

Model Overview

This model extracts structured text information from receipt images. It uses a Swin Transformer as the visual encoder and BART as the text decoder, and performs receipt recognition and extraction end to end.

Model Features

OCR-Free Document Understanding

Directly processes image inputs and extracts text information without traditional OCR preprocessing steps

Higher-Resolution Processing

V2 is trained on receipt images at twice the resolution used for V1, which significantly improves recognition accuracy

Structured Output

Automatically generates structured data in JSON format, including key receipt fields (e.g., amount, phone number, discount)

Improved Dataset

Trained on a deduplicated and manually corrected dataset, showing significant performance improvements over V1

Model Capabilities

Receipt Image Recognition

Text Information Extraction

Structured Data Generation

Multi-Field Joint Parsing

Use Cases

Retail & Finance

Electronic Receipt Archiving

Automatically extracts key information such as amount and date from paper receipts

89.5% accuracy, 15.8% character error rate

Expense Reimbursement System

Recognizes receipt images submitted by employees and automatically fills reimbursement forms

Supports extraction of 12 key fields including <s_total> and <s_date>

license: cc-by-nc-4.0
inference: false
base_model: naver-clova-ix/donut-base
tags:
- donut
- image-to-text
- vision
model-index:
- name: donut-receipts-extract
  results:
  - task:
      type: image-to-text
      name: Image to text
    metrics:
    - type: loss
      value: 0.326069
    - type: accuracy
      value: 0.895219
      name: Accuracy
    - type: cer
      value: 0.158358
      name: CER
    - type: wer
      value: 1.673989
      name: WER
    - type: edit distance
      value: 0.145293
      name: Edit_distance
metrics:
- cer
- wer
- accuracy
datasets:
- AdamCodd/donut-receipts
pipeline_tag: image-to-text
extra_gated_prompt: "To get access to this model, send an email to adamcoddml@gmail.com and provide a brief description of your project or application. Requests without this information will not be considered, and access will not be granted under any circumstances."
extra_gated_fields:
  Company/University: text
  Country: country

Donut-receipts-extract

The Donut model was introduced in the paper OCR-free Document Understanding Transformer by Geewook Kim et al. and first released in this repository.

=== V2 ===

This model has been retrained on an improved version of the AdamCodd/donut-receipts dataset (deduplicated and manually corrected). The V2 model is released under a new license, cc-by-nc-4.0. For commercial use rights, please contact me (adamcoddml@gmail.com). The V1 model remains available under the MIT license on the v1 branch.

It achieves the following results on the evaluation set:

Loss: 0.326069
Edit distance: 0.145293
CER: 0.158358
WER: 1.673989
Mean accuracy: 0.895219
F1: 0.977897
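
For readers unfamiliar with these metrics, the sketch below shows one common way to derive CER and WER from a Levenshtein edit distance. It is an illustration only: the function names are hypothetical, the sample strings are invented, and the evaluation script that produced the numbers above is not shown in this card. It also shows why a WER above 1.0 is possible: the edit count is divided by the length of the reference, so a prediction with many extra words can need more edits than the reference has words.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp.

    Works on any pair of sequences (strings for CER, word lists for WER).
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return levenshtein(reference, prediction) / len(reference)


def wer(reference: str, prediction: str) -> float:
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    return levenshtein(ref_words, prediction.split()) / len(ref_words)


# Invented examples, not taken from the evaluation set
print(round(cer("total 12.50", "total 12.80"), 4))     # 0.0909
print(wer("total 12.50", "grand total is 12.50 usd"))  # 1.5
```

In the second example the prediction contains three extra words against a two-word reference, which gives a WER of 1.5.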

The task_prompt has been changed to <s_receipt> for V2 (previously <s_cord-v2> for V1). Two new keys, <s_svc> and <s_discount>, have been added, and <s_telephone> has been renamed to <s_phone>.
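
Downstream code written for V1 needs two small adjustments when moving to V2: a different task prompt and the renamed phone key. A minimal sketch of that mapping follows. It assumes the parsed output is a dictionary whose top-level keys are the tag names without the `<s_...>` wrapper (for example `telephone` for `<s_telephone>`). The names `TASK_PROMPTS`, `RENAMED_KEYS` and `upgrade_v1_record` are hypothetical helpers, not part of the model or of Transformers.

```python
# Task prompt per model version, as stated in this card
TASK_PROMPTS = {"v1": "<s_cord-v2>", "v2": "<s_receipt>"}

# Keys renamed between V1 and V2 (tag names without the <s_...> wrapper)
RENAMED_KEYS = {"telephone": "phone"}


def upgrade_v1_record(record: dict) -> dict:
    """Return a copy of a V1-style record using V2 key names.

    Only top-level keys are renamed; nested values are copied as-is.
    """
    return {RENAMED_KEYS.get(key, key): value for key, value in record.items()}


# Invented sample record for illustration
v1_record = {"telephone": "555-0100", "total": "12.50"}
print(upgrade_v1_record(v1_record))  # {'phone': '555-0100', 'total': '12.50'}
```

Records produced by V1 will never contain the new `svc` and `discount` keys, so consumers should treat those fields as optional.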

V2 performs substantially better than V1 because it was trained on receipts at twice the resolution and on a better dataset. It is still not perfect, owing to a lack of diverse receipts (the training dataset is still only ~1100 receipts); that will be the main focus of a future version.

=== V1 ===

This model is a fine-tune of the Donut base model on the AdamCodd/donut-receipts dataset. Its purpose is to efficiently extract text from receipts.

It achieves the following results on the evaluation set:

Loss: 0.498843
Edit distance: 0.198315
CER: 0.213929
WER: 7.634032
Mean accuracy: 0.843472

Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a tensor of embeddings (of shape batch_size, seq_len, hidden_size), after which the decoder autoregressively generates text, conditioned on the encoding of the encoder.
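
As a rough illustration of the encoder output shape, the sketch below estimates `seq_len` and `hidden_size` for a Swin-style encoder. The default values (patch size 4, four stages, embedding dimension 128) and the 2560x1920 input are assumptions based on the donut-base configuration; this card does not state the values used for this fine-tune, so check the model's `config.json` for the real ones. `swin_output_shape` is a hypothetical helper, not a Transformers function.

```python
import math


def swin_output_shape(height, width, batch_size=1,
                      patch_size=4, embed_dim=128, num_stages=4):
    """Estimate (batch_size, seq_len, hidden_size) for a Swin-style encoder.

    Assumption: patch embedding divides each side by patch_size, and every
    stage after the first halves each side (patch merging) while doubling
    the channel dimension. Sizes are rounded up to mimic padding.
    """
    merges = num_stages - 1
    h = math.ceil(height / patch_size)
    w = math.ceil(width / patch_size)
    for _ in range(merges):
        h = math.ceil(h / 2)
        w = math.ceil(w / 2)
    return (batch_size, h * w, embed_dim * 2 ** merges)


# Assumed input size of 2560x1920 pixels (height x width)
print(swin_output_shape(2560, 1920))  # (1, 4800, 1024)
```

Under these assumptions the decoder attends over 4800 image embeddings per receipt, which is why input resolution has a direct effect on memory use and inference time.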

How to use

import torch
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
processor = DonutProcessor.from_pretrained("AdamCodd/donut-receipts-extract")
model = VisionEncoderDecoderModel.from_pretrained("AdamCodd/donut-receipts-extract")
model.to(device)

def load_and_preprocess_image(image_path: str, processor):
    """
    Load an image and preprocess it for the model.
    """
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    return pixel_values

def generate_text_from_image(model, image_path: str, processor, device):
    """
    Generate text from an image using the trained model.
    """
    # Load and preprocess the image
    pixel_values = load_and_preprocess_image(image_path, processor)
    pixel_values = pixel_values.to(device)

    # Generate output using model
    model.eval()
    with torch.no_grad():
        task_prompt = "<s_receipt>" # <s_cord-v2> for v1
        decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
        decoder_input_ids = decoder_input_ids.to(device)
        generated_outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings, 
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            early_stopping=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True
        )

    # Decode generated output
    decoded_text = processor.batch_decode(generated_outputs.sequences)[0]
    decoded_text = decoded_text.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    decoded_text = re.sub(r"<.*?>", "", decoded_text, count=1).strip()  # remove first task start token
    decoded_text = processor.token2json(decoded_text)
    return decoded_text

# Example usage
image_path = "path_to_your_image"  # Replace with your image path
extracted_text = generate_text_from_image(model, image_path, processor, device)
print("Extracted Text:", extracted_text)
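
To make the last decoding step more concrete, the sketch below shows the idea behind turning the generated tag sequence into a dictionary. It is a simplified stand-in, not the actual `processor.token2json` implementation: it only handles flat, non-nested `<s_key>value</s_key>` pairs, and the sample string is invented for illustration. `parse_flat_fields` is a hypothetical name.

```python
import re

# Matches <s_key>value</s_key>; the backreference \1 requires the same key
# in the opening and closing tag.
FIELD_PATTERN = re.compile(r"<s_([^>]+)>(.*?)</s_\1>", re.DOTALL)


def parse_flat_fields(sequence: str) -> dict:
    """Extract flat <s_key>value</s_key> pairs from a generated sequence."""
    return {key: value.strip() for key, value in FIELD_PATTERN.findall(sequence)}


# Invented sample sequence, not real model output
sample = "<s_total>12.50</s_total><s_date>2024-01-28</s_date><s_phone>555-0100</s_phone>"
print(parse_flat_fields(sample))
# {'total': '12.50', 'date': '2024-01-28', 'phone': '555-0100'}
```

For real outputs, keep using `processor.token2json` as in the code above, since generated sequences can contain nested groups that this sketch does not parse.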

Refer to the documentation for more code examples.

Intended uses & limitations

This fine-tuned model is designed specifically for extracting text from receipts and may not perform well on other types of documents. The dataset is still suboptimal (numerous errors remain), so the model will need to be retrained later to improve its performance.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 2
eval_batch_size: 4
seed: 42
optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 300
num_epochs: 35
weight_decay: 0.01
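
The learning-rate settings above (peak 3e-05, linear schedule, 300 warmup steps) describe a curve that ramps up linearly and then decays linearly to zero. A minimal sketch of that curve follows. The total number of training steps is not given in this card, so the `total_steps` default is purely illustrative, and `linear_lr` is a hypothetical helper rather than the training code.

```python
def linear_lr(step, peak_lr=3e-05, warmup_steps=300, total_steps=10_000):
    """Learning rate at a given step for linear warmup followed by linear decay.

    total_steps is an assumed value; the real one depends on the dataset
    size, batch size and number of epochs.
    """
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from peak_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 150, 300, 5_150, 10_000):
    print(step, f"{linear_lr(step):.2e}")
```

With the assumed 10,000 total steps, the rate reaches its peak at step 300 and falls to half the peak at step 5,150, midway through the decay phase.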

Framework versions

Transformers 4.36.2
Datasets 2.16.1
Tokenizers 0.15.0
Evaluate 0.4.1

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-2111-15664,
  author    = {Geewook Kim and
               Teakgyu Hong and
               Moonbin Yim and
               Jinyoung Park and
               Jinyeong Yim and
               Wonseok Hwang and
               Sangdoo Yun and
               Dongyoon Han and
               Seunghyun Park},
  title     = {Donut: Document Understanding Transformer without {OCR}},
  journal   = {CoRR},
  volume    = {abs/2111.15664},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.15664},
  eprinttype = {arXiv},
  eprint    = {2111.15664},
  timestamp = {Thu, 02 Dec 2021 10:50:44 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
