donut-receipts-extract: An Open-Source Receipt Text Extraction Model for OCR-Free Document Understanding

Donut Receipts Extract

Developed by AdamCodd

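A receipt text extraction model built on the Donut architecture. It pairs a vision encoder with a text decoder to understand documents without OCR.

Image-to-Text

Transformers

#receipt-text-extraction #ocr-free-document-understanding #high-precision-table-recognition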

Downloads 66

Release Time : 1/28/2024

Model Overview

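This model is designed to extract structured text from receipt images. It uses a Swin Transformer vision encoder and a BART text decoder, and supports end-to-end recognition and extraction of receipt information.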

Model Features

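OCR-free document understanding

Processes image input directly and extracts text without a conventional OCR preprocessing step.

Doubled input resolution

V2 processes receipt images at twice the resolution, which markedly improves recognition accuracy.

Structured output

Automatically produces structured JSON containing key receipt fields (such as amount, phone number, and discount).

Improved dataset

Trained on a deduplicated and manually corrected dataset, with markedly better performance than V1.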

Model Capabilities

Receipt image recognition

Text extraction

Structured data generation

Joint parsing of multiple fields

Use Cases

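Retail and finance

Electronic receipt archiving

Automatically extracts key information such as amounts and dates from paper receipts.

Mean accuracy 89.5%, character error rate (CER) 15.8% (V2 evaluation results).

Expense reimbursement systems

Recognizes receipt images submitted by employees and fills in reimbursement forms automatically.

Supports extraction of 12 key fields, such as <s_total> and <s_date>.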

🚀 Donut-receipts-extract

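Donut-receipts-extract is a fine-tuned Donut model built specifically to extract text from receipts. It was trained and tuned on a dedicated receipt dataset and performs well on the receipt text extraction task.

🚀 Quick Start

Environment setup

Make sure the required libraries, such as torch and transformers, are installed.

Code example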

import torch
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
processor = DonutProcessor.from_pretrained("AdamCodd/donut-receipts-extract")
model = VisionEncoderDecoderModel.from_pretrained("AdamCodd/donut-receipts-extract")
model.to(device)

def load_and_preprocess_image(image_path: str, processor):
    """
    Load an image and preprocess it for the model.
    """
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    return pixel_values

def generate_text_from_image(model, image_path: str, processor, device):
    """
    Generate text from an image using the trained model.
    """
    # Load and preprocess the image
    pixel_values = load_and_preprocess_image(image_path, processor)
    pixel_values = pixel_values.to(device)

    # Generate output using model
    model.eval()
    with torch.no_grad():
        task_prompt = "<s_receipt>" # <s_cord-v2> for v1
        decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
        decoder_input_ids = decoder_input_ids.to(device)
        generated_outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings, 
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            early_stopping=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True
        )

    # Decode generated output
    decoded_text = processor.batch_decode(generated_outputs.sequences)[0]
    decoded_text = decoded_text.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    decoded_text = re.sub(r"<.*?>", "", decoded_text, count=1).strip()  # remove first task start token
    decoded_text = processor.token2json(decoded_text)
    return decoded_text

# Example usage
image_path = "path_to_your_image"  # Replace with your image path
extracted_text = generate_text_from_image(model, image_path, processor, device)
print("Extracted Text:", extracted_text)

See the documentation for more code examples.
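
The last step of the example above, processor.token2json, turns the decoder's tagged output into a Python dict. To make that step less opaque, here is a minimal sketch of the idea for the simplest case only: a flat run of <s_key>value</s_key> pairs with no nesting. The function name tokens_to_dict and the sample values are hypothetical, and this is not the library's implementation, which also handles nested and repeated groups.

```python
import re


def tokens_to_dict(sequence: str) -> dict:
    """Parse a flat run of <s_key>value</s_key> pairs into a dict.

    Simplified illustration only: nested or repeated groups are not handled.
    """
    result = {}
    # The backreference \1 forces the closing tag to carry the same key
    # as the opening tag.
    pattern = r"<s_(\w+)>(.*?)</s_\1>"
    for key, value in re.findall(pattern, sequence, flags=re.DOTALL):
        result[key] = value.strip()
    return result


# Hypothetical decoder output, for illustration only.
sample = "<s_total>12.50</s_total><s_date>2024-01-28</s_date><s_phone>555-0100</s_phone>"
print(tokens_to_dict(sample))
# {'total': '12.50', 'date': '2024-01-28', 'phone': '555-0100'}
```

In real use, keep calling processor.token2json; the sketch only shows why the result arrives as key/value pairs named after the field tokens.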

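✨ Key Features

Model architecture: Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings, and the decoder then generates text autoregressively, conditioned on that encoding.
Version update: V2 was retrained on an improved dataset that was deduplicated and manually corrected, and it outperforms V1.
Task-specific tuning: This fine-tuned model is built specifically for extracting text from receipts.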

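📚 Detailed Documentation

Version notes

V2

Dataset: Retrained on the improved AdamCodd/donut-receipts dataset (deduplicated and manually corrected).
License: V2 is released under a new license, cc-by-nc-4.0. For commercial use rights, contact adamcoddml@gmail.com. V1 remains available under the MIT license (on the v1 branch).
Evaluation results:
- Loss: 0.326069
- Edit distance: 0.145293
- CER: 0.158358
- WER: 1.673989
- Mean accuracy: 0.895219
- F1: 0.977897
Task prompt change: The V2 task prompt is now <s_receipt> (V1 used <s_cord-v2>). Two keys were added, <s_svc> and <s_discount>, and <s_telephone> was renamed to <s_phone>.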
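
Code written against V1 output needs small adjustments for the V2 changes listed above. The following is a minimal sketch of those adjustments, assuming the parsed output is a flat dict whose keys are the field names without the <s_ ... > wrapper. The helper names migrate_v1_record and task_prompt_for are hypothetical, and filling the new keys with None is an illustrative choice rather than anything the model card prescribes.

```python
# Key rename and additions taken from the V2 notes above.
V1_TO_V2_KEYS = {"telephone": "phone"}
V2_ONLY_KEYS = ("svc", "discount")

# Task prompts per version, as stated in the V2 notes.
TASK_PROMPTS = {"v1": "<s_cord-v2>", "v2": "<s_receipt>"}


def task_prompt_for(version: str) -> str:
    """Return the decoder task prompt for 'v1' or 'v2'."""
    return TASK_PROMPTS[version]


def migrate_v1_record(record: dict) -> dict:
    """Rename V1 keys to their V2 names and add the V2-only keys as None."""
    migrated = {V1_TO_V2_KEYS.get(key, key): value for key, value in record.items()}
    for key in V2_ONLY_KEYS:
        migrated.setdefault(key, None)
    return migrated


# Hypothetical V1-style record, for illustration only.
old_record = {"telephone": "555-0100", "total": "12.50"}
print(task_prompt_for("v2"))
print(migrate_v1_record(old_record))
```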

V1

Fine-tuning base: This model is the donut base model fine-tuned on the AdamCodd/donut-receipts dataset.
Evaluation results:
- Loss: 0.498843
- Edit distance: 0.198315
- CER: 0.213929
- WER: 7.634032
- Mean accuracy: 0.843472

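Intended use and limitations

Intended use: This fine-tuned model is built specifically for extracting text from receipts and may not perform as well on other document types.
Limitations: The training dataset is still imperfect (it still contains many errors), so the model will need to be retrained later to improve its performance.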

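Training hyperparameters

Learning rate: 3e-05
Training batch size: 2
Evaluation batch size: 4
Random seed: 42
Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
Learning rate scheduler type: linear
Learning rate scheduler warmup steps: 300
Training epochs: 35
Weight decay: 0.01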
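
To make the scheduler settings concrete, here is a minimal sketch of a linear schedule with warmup as it is commonly defined: the learning rate rises linearly from 0 to the peak over the warmup steps, then decays linearly to 0 by the end of training. The peak rate (3e-05) and warmup length (300) come from the list above. The total step count is not given in the model card, so total_steps=10_000 below is purely a placeholder, and the function name linear_warmup_lr is hypothetical.

```python
def linear_warmup_lr(
    step: int,
    total_steps: int,
    base_lr: float = 3e-5,
    warmup_steps: int = 300,
) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps is a placeholder; the real value depends on dataset size,
# batch size (2), and the number of epochs (35).
for step in (0, 150, 300, 5_000, 10_000):
    print(step, linear_warmup_lr(step, total_steps=10_000))
```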

Framework versions

Transformers 4.36.2
Datasets 2.16.1
Tokenizers 0.15.0
Evaluate 0.4.1
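
If results differ from those reported, one thing worth checking is whether the local environment matches the versions listed above. The sketch below compares installed package versions with that list using only the standard library. It assumes the PyPI distribution names are the lowercase forms of the names above; the helper name compare_versions is hypothetical. The list records what was used for training, so other versions may work just as well.

```python
from importlib import metadata

# Versions listed in the model card (used for training).
EXPECTED = {
    "transformers": "4.36.2",
    "datasets": "2.16.1",
    "tokenizers": "0.15.0",
    "evaluate": "0.4.1",
}


def compare_versions(expected: dict) -> dict:
    """Map each package to (listed version, installed version or None)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed
        report[package] = (wanted, installed)
    return report


for package, (wanted, installed) in compare_versions(EXPECTED).items():
    status = "matches" if installed == wanted else "differs"
    print(f"{package}: listed {wanted}, installed {installed} ({status})")
```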

🔧 Technical Details

Model information

Property	Details
Model type	Fine-tuned Donut model
Base model	naver-clova-ix/donut-base
Training data	AdamCodd/donut-receipts

Evaluation metrics

Metric	V2	V1
Loss	0.326069	0.498843
Edit distance	0.145293	0.198315
CER	0.158358	0.213929
WER	1.673989	7.634032
Mean accuracy	0.895219	0.843472
F1	0.977897	not reported
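
The model card does not say how these metrics were computed. As a reading aid, here is a minimal sketch of the most common definitions: CER and WER are the Levenshtein edit distance between reference and prediction, divided by the reference length in characters or words respectively. Under this definition an error rate can exceed 1 when the prediction contains many extra tokens, which would be consistent with WER values such as 1.67 and 7.63. The function names and sample strings are hypothetical.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # delete r from the reference
                current[j - 1] + 1,      # insert h from the hypothesis
                previous[j - 1] + cost,  # substitute (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("total 12.50", "total 12.80"))  # one wrong character out of 11
print(wer("total", "the total is due"))   # 3 inserted words vs 1 reference word: 3.0
```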

📄 License

V2 of this project is released under the cc-by-nc-4.0 license. For commercial use rights, contact adamcoddml@gmail.com. V1 is released under the MIT license (on the v1 branch).

BibTeX citation

@article{DBLP:journals/corr/abs-2111-15664,
  author    = {Geewook Kim and
               Teakgyu Hong and
               Moonbin Yim and
               Jinyoung Park and
               Jinyeong Yim and
               Wonseok Hwang and
               Sangdoo Yun and
               Dongyoon Han and
               Seunghyun Park},
  title     = {Donut: Document Understanding Transformer without {OCR}},
  journal   = {CoRR},
  volume    = {abs/2111.15664},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.15664},
  eprinttype = {arXiv},
  eprint    = {2111.15664},
  timestamp = {Thu, 02 Dec 2021 10:50:44 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}