# Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset
A Vision-Language Model optimized for extracting structured financial information from cheque images.
## Quick Start

To get started with the model, first install the necessary dependencies:

```bash
pip install -q git+https://github.com/huggingface/transformers accelerate peft bitsandbytes qwen-vl-utils[decord]==0.0.8
```
### Using 🤗 Transformers to Chat

Here is a code snippet demonstrating how to use the chat model with `transformers` and `qwen_vl_utils`:
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "AJNG/qwen-vl-2.5-3B-finetuned-cheque"

# Load the fine-tuned model in bfloat16 and let Accelerate place it on the GPU.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Bound the visual token count: each image is resized so its pixel count
# stays between these limits (one visual token per 28x28 pixel patch).
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28
processor = Qwen2_5_VLProcessor.from_pretrained(
    MODEL_ID, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS
)

# ChatML-style conversation: one user turn with a cheque image and an instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/kaggle/input/testch/Handwritten-legal-amount.png",
            },
            {"type": "text", "text": "extract in json"},
        ],
    }
]

# Render the chat template and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
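The decoded output is a JSON string. A minimal post-processing sketch for turning it into a Python dict; the fence-stripping is a defensive assumption (models sometimes wrap JSON in a markdown code block), not guaranteed behavior of this model:

```python
import json

def parse_cheque_json(raw: str) -> dict:
    """Parse the model's output into a dict, tolerating ```json fences."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (e.g. ```json) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(cleaned)

fields = parse_cheque_json(output_text[0])
print(fields.get("beneficiary"), fields.get("total_amount"))
```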
## Features

- **Structured Financial Information Extraction**: Extracts structured financial details from cheque images and produces JSON-formatted output.
- **Optimized for Cheque Processing**: Fine-tuned on a cheque-specific dataset for accurate cheque parsing.
- **ChatML Format Compatibility**: Follows the ChatML format, making it suitable for chat-based interactions.
## Installation

Install the required libraries with:

```bash
pip install -q git+https://github.com/huggingface/transformers accelerate peft bitsandbytes qwen-vl-utils[decord]==0.0.8
```
## Usage Examples

### Basic Usage

The Python code above shows a basic example of using the model to extract information from a cheque image and output it in JSON format.
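Because the training annotations pair each image with a "Format the json as shown below" prefix and an explicit schema suffix (see Training Data below), prompting with that same schema may produce more consistent field names. A sketch of such a prompt, reusing the schema from the dataset; the image path is a placeholder:

```python
# Prompt variant that spells out the target schema, mirroring the
# prefix/suffix pair used in the training annotations.
schema = ('{"check_reference": , "beneficiary": "", "total_amount": , '
          '"customer_issue_date": "", "date_issued_by_bank": ""}')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "cheque.png"},  # path to your cheque image
            {"type": "text", "text": f"Format the json as shown below\n{schema}"},
        ],
    }
]
```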
## Documentation

### Model Details

#### Model Description

Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset is a Vision-Language Model (VLM) for extracting structured financial details from cheque images. It processes cheque visuals and outputs structured JSON with key fields such as cheque reference number, beneficiary, total amount, and issue dates. The model follows the ChatML format and was fine-tuned on a cheque-specific dataset to improve accuracy in financial document processing.
| Property | Details |
|----------|---------|
| Developed by | Independent fine-tuning on Qwen2.5-VL-3B-Instruct |
| Model Type | Vision-Language Model for cheque information extraction |
| Language(s) (NLP) | Primarily English (optimized for financial terminology) |
| License | [More Information Needed] |
| Finetuned from model | Qwen/Qwen2.5-VL-3B-Instruct |
### Uses

The model is intended for automated cheque processing and structured data extraction. Typical scenarios include:

- Banking and Financial Services – automating cheque verification and processing.
- Accounting and Payroll Systems – extracting financial details for record-keeping.
- AI-powered OCR Pipelines – enhancing traditional OCR systems with structured output.
- Enterprise Document Management – automating financial data extraction from scanned cheques.
#### Direct Use

The model can be further fine-tuned or integrated into larger applications such as:

- Custom AI-powered financial processing tools
- Multi-document parsing workflows for financial institutions
- Intelligent chatbots for banking automation
#### Out-of-Scope Use

- General OCR applications unrelated to cheques: the model is optimized specifically for cheque image processing and may not perform well on other document types.
- Handwritten cheque recognition: the model primarily targets printed cheques and may struggle with cursive handwriting.
- Non-English cheque processing: the model is trained on English financial contexts and may not generalize to cheques in other languages.
## Technical Details

### Training Data

The dataset consists of cheque images and corresponding JSON annotations in the following format:
```json
{
  "image": "1.png",
  "prefix": "Format the json as shown below",
  "suffix": "{\"check_reference\": , \"beneficiary\": \"\", \"total_amount\": , \"customer_issue_date\": \"\", \"date_issued_by_bank\": \"\"}"
}
```
The images are stored in a folder, and the annotations are structured JSON specifying cheque details such as cheque reference, beneficiary, amount, customer issue date, and bank issue date.
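A minimal sketch of how one annotation record could be turned into a ChatML-style training sample. The exact collation used during fine-tuning is not published, so treating the prefix as the user instruction and the suffix as the target JSON, and the `annotations.jsonl` filename, are assumptions:

```python
import json

def record_to_sample(record: dict, image_dir: str = "images") -> list:
    """Turn one {image, prefix, suffix} record into a ChatML conversation:
    the prefix becomes the user instruction, the suffix the target JSON."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"{image_dir}/{record['image']}"},
                {"type": "text", "text": record["prefix"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": record["suffix"]}]},
    ]

# Hypothetical annotations file, one JSON record per line.
with open("annotations.jsonl") as f:
    samples = [record_to_sample(json.loads(line)) for line in f]
```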
### Training Procedure

The processor is initialized with minimum and maximum pixel limits so that image resolution stays within the bounds supported by `Qwen2_5_VLProcessor`. The `Qwen2_5_VLForConditionalGeneration` model is then loaded with `torch_dtype` set to `bfloat16` for memory-efficient training. Finally, LoRA (Low-Rank Adaptation) is applied with `get_peft_model`, which freezes the base weights and trains only small adapter layers, reducing memory overhead while fine-tuning; see the sketch below.
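A minimal sketch of that LoRA step; the rank, alpha, dropout, and target modules below are illustrative assumptions, as the actual values were not published:

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; the published card does not specify them.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

The fine-tuning run used the following configuration: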
```python
config = {
    "max_epochs": 4,
    "batch_size": 1,
    "lr": 2e-4,
    "check_val_every_n_epoch": 2,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": 8,
    "num_nodes": 1,
    "warmup_steps": 50,
    "result_path": "qwen2.5-3b-instruct-cheque-manifest",
}
```
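With `batch_size` 1 and `accumulate_grad_batches` 8, the effective batch size is 8. The key names mirror PyTorch Lightning `Trainer` arguments, so the run was presumably driven by something like the sketch below; the use of Lightning is inferred from the config keys, not confirmed:

```python
import lightning as L  # assumption: a PyTorch Lightning-style training loop

trainer = L.Trainer(
    max_epochs=config["max_epochs"],
    check_val_every_n_epoch=config["check_val_every_n_epoch"],
    gradient_clip_val=config["gradient_clip_val"],
    accumulate_grad_batches=config["accumulate_grad_batches"],
    num_nodes=config["num_nodes"],
    accelerator="gpu",
    devices=1,
)
# A LightningModule wrapping the PEFT model would define the optimizer
# (lr=config["lr"]) and the warmup schedule (config["warmup_steps"]), e.g.:
# trainer.fit(module, train_dataloader, val_dataloader)
```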
### Compute Infrastructure

- GPU: NVIDIA A100
## License

[More Information Needed]
## Citation

If you find this work helpful, please consider citing:
```bibtex
@misc{qwen2.5-VL,
  title  = {Qwen2.5-VL},
  url    = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month  = {January},
  year   = {2025}
}

@article{Qwen2VL,
  title   = {Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author  = {Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal = {arXiv preprint arXiv:2409.12191},
  year    = {2024}
}

@article{Qwen-VL,
  title   = {Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author  = {Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal = {arXiv preprint arXiv:2308.12966},
  year    = {2023}
}
```