# Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset
A Vision-Language Model optimized for extracting structured financial information from cheque images.
## Quick Start

To get started with the model, first install the necessary dependencies:

```bash
pip install -q git+https://github.com/huggingface/transformers accelerate peft bitsandbytes qwen-vl-utils[decord]==0.0.8
```
### Using 🤗 Transformers to Chat

Here is a code snippet demonstrating how to use the chat model with `transformers` and `qwen_vl_utils`:
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "AJNG/qwen-vl-2.5-3B-finetuned-cheque"

# Load the fine-tuned model in bfloat16 and let Accelerate place it on the GPU.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Bound the visual token count: each image is resized so its pixel count
# stays between these limits (one visual token per 28x28 pixel patch).
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28
processor = Qwen2_5_VLProcessor.from_pretrained(
    MODEL_ID, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS
)

# ChatML-style conversation: one user turn with a cheque image and an instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/kaggle/input/testch/Handwritten-legal-amount.png",
            },
            {"type": "text", "text": "extract in json"},
        ],
    }
]

# Render the chat template and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
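The decoded output is a JSON string. A minimal post-processing sketch for turning it into a Python dict; the fence-stripping is a defensive assumption (models sometimes wrap JSON in a markdown code block), not guaranteed behavior of this model:

```python
import json

def parse_cheque_json(raw: str) -> dict:
    """Parse the model's output into a dict, tolerating ```json fences."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (e.g. ```json) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(cleaned)

fields = parse_cheque_json(output_text[0])
print(fields.get("beneficiary"), fields.get("total_amount"))
```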
## Features

- **Structured Financial Information Extraction**: Extracts structured financial details from cheque images and produces JSON-formatted output.
- **Optimized for Cheque Processing**: Fine-tuned on a cheque-specific dataset for accurate cheque parsing.
- **ChatML Format Compatibility**: Follows the ChatML format, making it suitable for chat-based interactions.
## Installation

Install the required libraries with:

```bash
pip install -q git+https://github.com/huggingface/transformers accelerate peft bitsandbytes qwen-vl-utils[decord]==0.0.8
```
## Usage Examples

### Basic Usage

The Python code above shows a basic example of using the model to extract information from a cheque image and output it in JSON format.
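Because the training annotations pair each image with a "Format the json as shown below" prefix and an explicit schema suffix (see Training Data below), prompting with that same schema may produce more consistent field names. A sketch of such a prompt, reusing the schema from the dataset; the image path is a placeholder:

```python
# Prompt variant that spells out the target schema, mirroring the
# prefix/suffix pair used in the training annotations.
schema = ('{"check_reference": , "beneficiary": "", "total_amount": , '
          '"customer_issue_date": "", "date_issued_by_bank": ""}')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "cheque.png"},  # path to your cheque image
            {"type": "text", "text": f"Format the json as shown below\n{schema}"},
        ],
    }
]
```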
## Documentation

### Model Details

#### Model Description

Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset is a Vision-Language Model (VLM) for extracting structured financial details from cheque images. It processes cheque visuals and outputs structured JSON with key fields such as cheque reference number, beneficiary, total amount, and issue dates. The model follows the ChatML format and was fine-tuned on a cheque-specific dataset to improve accuracy in financial document processing.
| Property | Details |
|----------|---------|
| Developed by | Independent fine-tuning on Qwen2.5-VL-3B-Instruct |
| Model Type | Vision-Language Model for cheque information extraction |
| Language(s) (NLP) | Primarily English (optimized for financial terminology) |
| License | [More Information Needed] |
| Finetuned from model | Qwen/Qwen2.5-VL-3B-Instruct |
### Uses

The model is intended for automated cheque processing and structured data extraction. Typical scenarios include:

- Banking and Financial Services – automating cheque verification and processing.
- Accounting and Payroll Systems – extracting financial details for record-keeping.
- AI-powered OCR Pipelines – enhancing traditional OCR systems with structured output.
- Enterprise Document Management – automating financial data extraction from scanned cheques.
#### Direct Use

The model can be further fine-tuned or integrated into larger applications such as:

- Custom AI-powered financial processing tools
- Multi-document parsing workflows for financial institutions
- Intelligent chatbots for banking automation
#### Out-of-Scope Use

- General OCR applications unrelated to cheques: the model is optimized specifically for cheque image processing and may not perform well on other document types.
- Handwritten cheque recognition: the model primarily targets printed cheques and may struggle with cursive handwriting.
- Non-English cheque processing: the model is trained on English financial contexts and may not generalize to cheques in other languages.
## Technical Details

### Training Data

The dataset consists of cheque images and corresponding JSON annotations in the following format:
```json
{
  "image": "1.png",
  "prefix": "Format the json as shown below",
  "suffix": "{\"check_reference\": , \"beneficiary\": \"\", \"total_amount\": , \"customer_issue_date\": \"\", \"date_issued_by_bank\": \"\"}"
}
```
The images are stored in a folder, and the annotations are structured JSON specifying cheque details such as cheque reference, beneficiary, amount, customer issue date, and bank issue date.
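A minimal sketch of how one annotation record could be turned into a ChatML-style training sample. The exact collation used during fine-tuning is not published, so treating the prefix as the user instruction and the suffix as the target JSON, and the `annotations.jsonl` filename, are assumptions:

```python
import json

def record_to_sample(record: dict, image_dir: str = "images") -> list:
    """Turn one {image, prefix, suffix} record into a ChatML conversation:
    the prefix becomes the user instruction, the suffix the target JSON."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"{image_dir}/{record['image']}"},
                {"type": "text", "text": record["prefix"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": record["suffix"]}]},
    ]

# Hypothetical annotations file, one JSON record per line.
with open("annotations.jsonl") as f:
    samples = [record_to_sample(json.loads(line)) for line in f]
```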
### Training Procedure

The processor is initialized with minimum and maximum pixel limits so that image resolution stays within the bounds supported by `Qwen2_5_VLProcessor`. The `Qwen2_5_VLForConditionalGeneration` model is then loaded with `torch_dtype` set to `bfloat16` for memory-efficient training. Finally, LoRA (Low-Rank Adaptation) is applied with `get_peft_model`, which freezes the base weights and trains only small adapter layers, reducing memory overhead while fine-tuning; see the sketch below.
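A minimal sketch of that LoRA step; the rank, alpha, dropout, and target modules below are illustrative assumptions, as the actual values were not published:

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; the published card does not specify them.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

The fine-tuning run used the following configuration: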
```python
config = {
    "max_epochs": 4,
    "batch_size": 1,
    "lr": 2e-4,
    "check_val_every_n_epoch": 2,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": 8,
    "num_nodes": 1,
    "warmup_steps": 50,
    "result_path": "qwen2.5-3b-instruct-cheque-manifest",
}
```
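With `batch_size` 1 and `accumulate_grad_batches` 8, the effective batch size is 8. The key names mirror PyTorch Lightning `Trainer` arguments, so the run was presumably driven by something like the sketch below; the use of Lightning is inferred from the config keys, not confirmed:

```python
import lightning as L  # assumption: a PyTorch Lightning-style training loop

trainer = L.Trainer(
    max_epochs=config["max_epochs"],
    check_val_every_n_epoch=config["check_val_every_n_epoch"],
    gradient_clip_val=config["gradient_clip_val"],
    accumulate_grad_batches=config["accumulate_grad_batches"],
    num_nodes=config["num_nodes"],
    accelerator="gpu",
    devices=1,
)
# A LightningModule wrapping the PEFT model would define the optimizer
# (lr=config["lr"]) and the warmup schedule (config["warmup_steps"]), e.g.:
# trainer.fit(module, train_dataloader, val_dataloader)
```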
### Compute Infrastructure

- GPU: NVIDIA A100
## License

[More Information Needed]
## Citation

If you find this work helpful, please consider citing:
```bibtex
@misc{qwen2.5-VL,
  title  = {Qwen2.5-VL},
  url    = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month  = {January},
  year   = {2025}
}

@article{Qwen2VL,
  title   = {Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author  = {Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal = {arXiv preprint arXiv:2409.12191},
  year    = {2024}
}

@article{Qwen-VL,
  title   = {Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author  = {Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal = {arXiv preprint arXiv:2308.12966},
  year    = {2023}
}
```