Qwen2.5-VL-7B-Instruct-GPTQ-Int3 Open Source Model - Free Deployment for Image-Text Multimodal Conversion Tasks

Qwen2.5 VL 7B Instruct GPTQ Int3

Developed by hfl

This is an unofficial GPTQ-Int3 quantized version based on the Qwen2.5-VL-7B-Instruct model, suitable for multimodal image-text-to-text tasks.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Multimodal Image-Text Understanding #GPTQ-Int3 Quantization #Low VRAM Inference

Downloads 577

Release Time: 3/20/2025

Model Overview

This is a multimodal model that takes image and text inputs and generates text outputs. It is designed primarily for image understanding and text generation tasks.

Model Features

Efficient Quantization

Uses GPTQ-Int3 quantization to shrink the model on disk from 16.0 GB to 5.8 GB and to lower VRAM requirements accordingly.

Multimodal Support

Capable of processing both image and text inputs for image understanding and text generation.

High Performance

Scores 78.56 on ChartQA (test) and 823 on OCRBench, compared with 83.2 and 846 for the non-quantized Qwen2.5-VL-7B-Instruct.

Strong Compatibility

Compatible with the latest transformers library and allows seamless switching with non-quantized Qwen2.5-VL models.

Model Capabilities

Image Understanding

Text Generation

Multimodal Reasoning

Image Captioning

Visual Question Answering

Use Cases

Image Understanding

Image Captioning

Generates detailed textual descriptions from input images

As shown in examples, accurately describes image content and details

Visual Question Answering

Chart Understanding

Answers questions about chart content

Scored 78.56 on the ChartQA test set

Document Processing

OCR Enhancement

Extracts and understands text content from images

Scored 823 on OCRBench
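The use cases above differ only in the prompt sent alongside the image. As a minimal sketch, assuming the single-turn message layout that the Qwen2.5-VL chat template accepts (one user turn with an image part and a text part), a small helper can build the request for each case. The helper name `build_messages`, the file names, and the prompts are hypothetical and only for illustration.

```python
from typing import Any


def build_messages(image: str, question: str) -> list[dict[str, Any]]:
    """Build a single-turn user message with one image and one text prompt.

    `image` is a URL or local path; `question` is the instruction text.
    """
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]


# Hypothetical file names and prompts; substitute real images and questions.
caption_msgs = build_messages("photo.jpg", "Describe this image in detail.")
chart_msgs = build_messages("chart.png", "Which category has the highest value?")
ocr_msgs = build_messages("scan.png", "Transcribe all text in this image.")

print(chart_msgs[0]["content"][1]["text"])
```

Each returned list can be passed to the processor's chat template in place of a hand-written `messages` value.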

🚀 Qwen2.5-VL-7B-Instruct-GPTQ-Int3

This is an UNOFFICIAL GPTQ-Int3 quantized version of the Qwen2.5-VL-7B-Instruct model, produced with the gptqmodel library. It is compatible with the latest transformers library, that is, any version that can run the non-quantized Qwen2.5-VL models.

✨ Features

Quantization: A GPTQ-Int3 quantized version of the Qwen2.5-VL model, with a smaller disk footprint and lower VRAM usage than the non-quantized model.
Compatibility: Works with the latest transformers library, so it fits into existing Qwen2.5-VL setups.

📦 Installation

Install the required libraries:

pip install git+https://github.com/huggingface/transformers accelerate qwen-vl-utils
pip install git+https://github.com/huggingface/optimum.git
pip install gptqmodel

Optionally, you may need to install:

pip install tokenicer device_smi logbar
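To confirm what is installed before loading the model, the following sketch reports the version of each package named above using only the standard library. It assumes the distribution names match the names given to pip; a package that is missing or registered under a different name shows up as "not installed".

```python
from importlib.metadata import PackageNotFoundError, version

# Names taken from the pip commands above (assumed to match the installed
# distribution names).
REQUIRED = ["transformers", "accelerate", "qwen-vl-utils", "optimum", "gptqmodel"]
OPTIONAL = ["tokenicer", "device_smi", "logbar"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED + OPTIONAL).items():
    print(f"{name}: {ver or 'not installed'}")
```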

💻 Usage Examples

Basic Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Repository ID of this model (7B, GPTQ-Int3).
MODEL_ID = "hfl/Qwen2.5-VL-7B-Instruct-GPTQ-Int3"

# Load the quantized model; flash_attention_2 requires the flash-attn package.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One user turn containing an image and a text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://raw.githubusercontent.com/ymcui/Chinese-LLaMA-Alpaca-3/refs/heads/main/pics/banner.png"},
        {"type": "text", "text": "Please describe this image."},
    ],
}]

# Render the chat prompt and fetch/preprocess the visual inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then drop the prompt tokens so only the new tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text[0])

Response

This image shows a logo in Chinese and English, reading "中文LLaMA & Alpaca大模型" and "Chinese LLaMA & Alpaca Large Language Models". There are two cartoon images on the left side of the logo, one is an alpaca with a red scarf and the other is an alpaca with white fur. The background is a green grassland and a building with a red roof. There is a number 3 on the right side of the logo, along with some circuit patterns. The overall design is simple and clear, using bright colors and cute cartoon images to attract attention.
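The response contains only newly generated text because the usage example slices the prompt tokens off each output sequence before decoding. The sketch below reproduces that slicing step on plain Python lists; the token IDs are arbitrary stand-ins, not real tokenizer output.

```python
# Toy token IDs standing in for real tensors (values are arbitrary).
input_ids = [[11, 12, 13], [21, 22, 23]]
# generate() returns each prompt followed by the newly generated tokens.
generated_ids = [[11, 12, 13, 91, 92], [21, 22, 23, 93, 94]]

# Keep only the tokens that come after the prompt in each sequence.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[91, 92], [93, 94]]
```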

📚 Documentation

Performance

Model	Size (Disk)	ChartQA (test)	OCRBench
Qwen2.5-VL-3B-Instruct	7.1 GB	83.48	791
Qwen2.5-VL-3B-Instruct-AWQ	3.2 GB	82.52	786
Qwen2.5-VL-3B-Instruct-GPTQ-Int4	3.2 GB	82.56	784
Qwen2.5-VL-3B-Instruct-GPTQ-Int3	2.9 GB	76.68	742
Qwen2.5-VL-7B-Instruct	16.0 GB	83.2	846
Qwen2.5-VL-7B-Instruct-AWQ	6.5 GB	79.68	837
Qwen2.5-VL-7B-Instruct-GPTQ-Int4	6.5 GB	81.48	845
Qwen2.5-VL-7B-Instruct-GPTQ-Int3	5.8 GB	78.56	823

Note

Evaluations are performed using lmms-eval with default settings.
GPTQ models are computationally more efficient (lower VRAM usage, faster inference) than the AWQ series in these evaluations.
We recommend using the gptqmodel library instead of autogptq, as autogptq is no longer maintained.
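To make the size and accuracy trade-off in the table easier to read, the sketch below expresses each 7B variant relative to the non-quantized baseline. The numbers are copied from the table above; the function name `relative_to_baseline` is just an illustrative helper.

```python
# (disk size in GB, ChartQA test score, OCRBench score), copied from the table.
RESULTS = {
    "Qwen2.5-VL-7B-Instruct": (16.0, 83.2, 846),
    "Qwen2.5-VL-7B-Instruct-AWQ": (6.5, 79.68, 837),
    "Qwen2.5-VL-7B-Instruct-GPTQ-Int4": (6.5, 81.48, 845),
    "Qwen2.5-VL-7B-Instruct-GPTQ-Int3": (5.8, 78.56, 823),
}


def relative_to_baseline(name, baseline="Qwen2.5-VL-7B-Instruct"):
    """Return size ratio and score differences versus the baseline model."""
    size, chartqa, ocr = RESULTS[name]
    b_size, b_chartqa, b_ocr = RESULTS[baseline]
    return {
        "size_ratio": size / b_size,
        "chartqa_delta": chartqa - b_chartqa,
        "ocrbench_delta": ocr - b_ocr,
    }


for name in RESULTS:
    r = relative_to_baseline(name)
    print(f"{name}: {r['size_ratio']:.0%} of baseline size, "
          f"ChartQA {r['chartqa_delta']:+.2f}, OCRBench {r['ocrbench_delta']:+d}")
```

For the GPTQ-Int3 model this works out to roughly 36% of the baseline disk size, at a cost of 4.64 points on ChartQA and 23 points on OCRBench.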

Disclaimer

This is NOT an official model by Qwen. Use at your own risk.
For detailed usage, please check Qwen2.5-VL's page.

📄 License

The model is licensed under Apache-2.0.
