qwen-vl-2.5-3B-finetuned-cheque: an open-source vision-language model that extracts financial information from cheques, free of charge

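Qwen Vl 2.5 3B Finetuned Cheque

Developed by AJNG

A vision-language model built to extract structured financial information from cheque images. It generates JSON output containing key fields such as the cheque number, payee, amount, and issue date.

Image-to-Text

Transformers

English

#cheque-information-extraction #structured-JSON-output #financial-document-processing

Downloads: 170

Released: 2/18/2025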

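Model Overview

This is a vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct. It focuses on cheque images: it is designed to extract financial information from them and return it as structured JSON.

Model Features

Targeted optimization

Fine-tuned on a personal cheque dataset specifically to extract structured financial information from cheque images

Structured output

Processes a cheque image and generates JSON output containing key fields such as the cheque number, payee, amount, and issue date

Multi-domain applicability

Can be applied to banking and financial services, accounting and payroll systems, AI OCR pipelines, and enterprise document management

Efficient fine-tuning

Fine-tuned with LoRA (Low-Rank Adaptation), which reduces memory overhead

Model Capabilities

Cheque image analysis

Financial information extraction

Structured JSON generation

Vision-language understanding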

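Use Cases

Banking and financial services

Automated cheque verification

Verifies cheque information automatically to improve processing efficiency

Reduces manual verification time

Cheque processing automation

Processes cheque images in batches and extracts key information

Improves processing speed and accuracy

Accounting and payroll systems

Financial record keeping

Automatically extracts cheque information for accounting records

Reduces manual data-entry errors

AI OCR pipelines

Enhancing traditional OCR systems

Augments traditional OCR systems with structured output

Provides richer output information

Enterprise document management

Financial data extraction

Automatically extracts financial data from scanned cheques

Simplifies document management workflows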

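🚀 Qwen2.5-VL-3B-Instruct fine-tuned on a personal cheque dataset

This model is a vision-language model (VLM) built to extract structured financial information from cheque images. It processes a cheque image and generates JSON output containing key fields such as the cheque number, payee, amount, and issue date.

🚀 Quick Start

Install dependencies

pip install -q git+https://github.com/huggingface/transformers accelerate peft bitsandbytes "qwen-vl-utils[decord]==0.0.8"

Running the model with transformers

The following snippet shows how to run the model with the transformers and qwen_vl_utils libraries: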

from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from qwen_vl_utils import process_vision_info
import torch

MODEL_ID = "AJNG/qwen-vl-2.5-3B-finetuned-cheque"

# Load the fine-tuned model in bfloat16 and let accelerate place it on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Bound the pixel area the processor resizes each image to.
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28
processor = Qwen2_5_VLProcessor.from_pretrained(MODEL_ID, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS)

# One user turn: the cheque image plus a short instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/kaggle/input/testch/Handwritten-legal-amount.png",
            },
            {"type": "text", "text": "extract in json"},
        ],
    }
]

# Prepare the inputs for inference.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the model's device rather than hard-coding "cuda".
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
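
The decoded output is a list of raw strings, so downstream code still has to turn each string into a dictionary. A minimal sketch of that step, assuming the model may add some text around the JSON object; the helper name `parse_cheque_json` and the sample string are hypothetical and not part of the model or of transformers:

```python
import json


def parse_cheque_json(raw: str) -> dict:
    """Parse the text between the first '{' and the last '}' as JSON."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start : end + 1])


# Made-up output string, for illustration only.
sample = 'Result: {"check_reference": 1001, "beneficiary": "Jane Doe", "total_amount": 250.0}'
fields = parse_cheque_json(sample)
print(fields["beneficiary"])
```

If generation is cut off by `max_new_tokens`, the closing brace may be missing, in which case this helper raises an error instead of returning a partial record.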

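✨ Key Features

Targeted optimization: Qwen2.5-VL-3B-Instruct fine-tuned on a personal cheque dataset, specifically to extract structured financial information from cheque images.
Structured output: processes a cheque image and generates JSON output containing key fields such as the cheque number, payee, amount, and issue date.
Multi-domain applicability: can be applied to banking and financial services, accounting and payroll systems, AI OCR pipelines, and enterprise document management.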

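📚 Documentation

Model Details

Model Description

This fine-tune of Qwen2.5-VL-3B-Instruct on a personal cheque dataset is a vision-language model (VLM) designed to extract structured financial information from cheque images. It processes a cheque image and outputs structured JSON containing key fields such as the cheque number, payee, total amount, and issue date. The model follows the ChatML format and was fine-tuned on a cheque-specific dataset to improve accuracy on financial document processing.

Developed by: independent fine-tune of Qwen2.5-VL-3B-Instruct
Model type: vision-language model for cheque information extraction
Language(s): primarily English (optimized for financial terminology)
License: [More Information Needed]
Fine-tuned from: Qwen/Qwen2.5-VL-3B-Instruct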

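Uses

The model is intended for automated cheque processing and structured data extraction. It analyzes a cheque image and generates JSON output containing the key financial fields. It can be applied in the following areas:

Banking and financial services: automated cheque verification and processing.
Accounting and payroll systems: extracting financial information for record keeping.
AI OCR pipelines: augmenting traditional OCR systems with structured output.
Enterprise document management: automatically extracting financial data from scanned cheques.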
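
Automated verification typically starts with a sanity check on the extracted fields. A minimal sketch, assuming the model's output uses the same keys as the training annotations (`check_reference`, `beneficiary`, `total_amount`, `customer_issue_date`, `date_issued_by_bank`); the function name `validate_cheque_fields` and the rules it applies are illustrative assumptions, not part of the model:

```python
from decimal import Decimal, InvalidOperation

REQUIRED_KEYS = (
    "check_reference",
    "beneficiary",
    "total_amount",
    "customer_issue_date",
    "date_issued_by_bank",
)


def validate_cheque_fields(fields: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in fields]
    if "total_amount" in fields:
        try:
            amount = Decimal(str(fields["total_amount"]))
            if not amount.is_finite() or amount <= 0:
                problems.append("total_amount must be a positive number")
        except InvalidOperation:
            problems.append("total_amount is not numeric")
    if "beneficiary" in fields and not str(fields["beneficiary"]).strip():
        problems.append("beneficiary is empty")
    return problems


# Made-up record, for illustration only.
incomplete = {"beneficiary": " ", "total_amount": "abc"}
for problem in validate_cheque_fields(incomplete):
    print(problem)
```

Date fields are only checked for presence here, because the source does not specify the date format the model emits.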

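Direct Use

The model can be fine-tuned further or integrated into larger applications, for example:

Custom AI financial processing tools
Multi-document parsing workflows for financial institutions
Intelligent chatbots for banking automation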

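Out-of-Scope Use

General-purpose OCR unrelated to cheques: the model is optimized specifically for cheque images and may perform poorly on other document types.
Handwritten cheque recognition: the model mainly handles printed cheques and may struggle with cursive handwriting.
Non-English cheques: the model was trained in an English-language financial context and may not generalize well to cheques in other languages.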

Training Details

Training Data

The dataset consists of cheque images and corresponding JSON annotations in the following format:

{
  "image": "1.png", 
  "prefix": "Format the json as shown below",  
  "suffix": "{\"check_reference\": , \"beneficiary\": \"\", \"total_amount\": , \"customer_issue_date\": \"\", \"date_issued_by_bank\": \"\"}"
}

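Image folder: contains the corresponding cheque images.
Annotations: structured JSON specifying the cheque details, such as the cheque number, payee, amount, customer issue date, and bank issue date.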
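
One way such a record could be turned into a chat-style training example is sketched below. This is an assumption about the preprocessing, not the author's confirmed pipeline: the function `record_to_messages`, the `image_dir` argument, and the sample field values are hypothetical, and the message layout simply mirrors the image-plus-text user turn used at inference time.

```python
import os


def record_to_messages(record: dict, image_dir: str) -> list[dict]:
    """Map an {image, prefix, suffix} record to user and assistant chat messages."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": os.path.join(image_dir, record["image"])},
                {"type": "text", "text": record["prefix"]},
            ],
        },
        {
            # The annotated JSON string is the target the model learns to produce.
            "role": "assistant",
            "content": [{"type": "text", "text": record["suffix"]}],
        },
    ]


record = {
    "image": "1.png",
    "prefix": "Format the json as shown below",
    "suffix": '{"check_reference": 1001, "beneficiary": "Jane Doe"}',  # made-up values
}
messages = record_to_messages(record, image_dir="images")
print(messages[1]["content"][0]["text"])
```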

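Training Procedure

The model configuration sets minimum and maximum pixel limits for image processing so that inputs stay compatible with Qwen2_5_VLProcessor. The processor is initialized from the pretrained model ID with these limits applied. The Qwen2_5_VLForConditionalGeneration model is then loaded with the torch dtype set to bfloat16 to reduce memory use and speed up computation.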
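
The pixel limits are easier to reason about as visual-token budgets. A rough sketch, assuming (as documented for the Qwen2-VL family) that each 28×28-pixel block of the resized image becomes one visual token, and using the limits from the quick-start snippet; the `approx_visual_tokens` helper is illustrative, and the real processor also rounds each side to a multiple of 28, which this estimate ignores:

```python
MIN_PIXELS = 256 * 28 * 28   # 200,704 pixels
MAX_PIXELS = 1280 * 28 * 28  # 1,003,520 pixels
PIXELS_PER_TOKEN = 28 * 28   # assumption: one visual token per 28x28 block


def approx_visual_tokens(width: int, height: int) -> int:
    """Estimate the visual token count after clamping the pixel area."""
    area = min(max(width * height, MIN_PIXELS), MAX_PIXELS)
    return area // PIXELS_PER_TOKEN


# Token budget implied by the two limits.
print(MIN_PIXELS // PIXELS_PER_TOKEN, MAX_PIXELS // PIXELS_PER_TOKEN)  # 256 1280
# A large scan is capped at the upper budget.
print(approx_visual_tokens(1600, 700))  # 1280
```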

Finally, LoRA (Low-Rank Adaptation) is applied to the model with get_peft_model, which reduces memory overhead by fine-tuning only specific layers.
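
The memory saving comes from training two small low-rank matrices instead of a full weight update. A back-of-the-envelope sketch, assuming a single 2048×2048 projection and rank 8; both numbers are illustrative assumptions, since the source does not state the LoRA rank or which layers were targeted:

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 2048, 2048, 8  # illustrative assumptions
full = d_out * d_in                # parameters in the frozen weight matrix
lora = lora_trainable_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.2%}")
```

Under these assumptions the adapter trains well under 1% of the parameters of the layer it wraps, which is why gradients and optimizer state take far less memory than in full fine-tuning.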

config = {
    "max_epochs": 4,
    "batch_size": 1,
    "lr": 2e-4,
    "check_val_every_n_epoch": 2,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": 8,
    "num_nodes": 1,
    "warmup_steps": 50,
    "result_path": "qwen2.5-3b-instruct-cheque-manifest"
}
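
With `batch_size` 1 and `accumulate_grad_batches` 8, each optimizer step aggregates 8 samples per device. A small sketch of that arithmetic, assuming one GPU per node and a hypothetical dataset of 400 images (the source states neither the device count nor the dataset size):

```python
import math

config = {
    "max_epochs": 4,
    "batch_size": 1,
    "accumulate_grad_batches": 8,
    "num_nodes": 1,
    "warmup_steps": 50,
}

devices_per_node = 1  # assumption
num_samples = 400     # hypothetical dataset size

# Samples contributing to each optimizer step.
effective_batch = (
    config["batch_size"]
    * config["accumulate_grad_batches"]
    * config["num_nodes"]
    * devices_per_node
)
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * config["max_epochs"]

print(effective_batch, steps_per_epoch, total_steps)
```

Under these assumptions the 50 warmup steps would cover the whole first epoch, so the share of training spent in warmup depends strongly on the real dataset size.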

Compute Infrastructure

GPU: NVIDIA A100

📄 License

[More Information Needed]

📚 Citation

If you find this work helpful, feel free to cite it.

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}
@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}