FIX: bug in Qwen/Qwen2.5-VL-72B-Instruct-AWQ
This repository is a fork of Qwen/Qwen2.5-VL-72B-Instruct-AWQ with identical weights. It fixes an issue in the original model by applying a patch to preprocessor_config.json.
Quick Start
Prerequisites
The code of Qwen2.5-VL is included in the latest Hugging Face Transformers. It's recommended to build from source using the following command:
pip install git+https://github.com/huggingface/transformers accelerate
Otherwise, you might encounter the following error:
KeyError: 'qwen2_5_vl'
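If you want to confirm your environment before downloading the full weights, here is a minimal sketch (it assumes network access to the Hugging Face Hub and only fetches the config, not the weights):

# Sanity check: an older transformers release fails here because the
# 'qwen2_5_vl' architecture is not registered yet.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ")
print(config.model_type)  # expected: "qwen2_5_vl"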
You can also install a toolkit to handle various visual inputs more conveniently:
pip install qwen-vl-utils[decord]==0.0.8
If you're not using Linux, you may not be able to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source to use it when loading videos.
Using Transformers to Chat
Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
)

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
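The same pipeline also handles multiple images in one turn; here is a minimal sketch of the messages structure (the file paths are placeholders, and everything else in the snippet above stays unchanged):

# Two images in a single user turn; process_vision_info collects them in order.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What do these two images have in common?"},
        ],
    }
]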
ModelScope
We strongly recommend that users, especially those in mainland China, use ModelScope; its snapshot_download helper can resolve issues with downloading checkpoints.
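For example, a minimal sketch using ModelScope's snapshot_download (this assumes the same repo id is mirrored on ModelScope, as is the usual convention for Qwen releases):

# Download the checkpoint via ModelScope, then load it from the local path.
from modelscope import snapshot_download

model_dir = snapshot_download("Qwen/Qwen2.5-VL-72B-Instruct-AWQ")
# Pass model_dir to from_pretrained(...) in place of the Hub repo id.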
Features
Key Enhancements
- Understand things visually: Qwen2.5-VL can not only recognize common objects like flowers, birds, fish, and insects but also analyze texts, charts, icons, graphics, and layouts within images.
- Being agentic: It acts as a visual agent, capable of reasoning and dynamically directing tools for computer and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos over 1 hour long and has the new ability to capture events by pinpointing relevant video segments.
- Capable of visual localization in different formats: It can accurately localize objects in an image by generating bounding boxes or points and provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: For data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, which is beneficial for finance, commerce, etc.
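As a concrete illustration of the localization and structured-output points above, here is a hedged sketch of a grounding prompt that reuses the chat pipeline from the Quick Start (the image path, prompt wording, and output schema are illustrative, not a fixed API):

# Ask the model to return detections as JSON; only messages changes.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Locate every person in the image and output each bounding box and label as JSON."},
        ],
    }
]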
Model Architecture Updates
- Dynamic Resolution and Frame Rate Training for Video Understanding:
We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. We update mRoPE in the time dimension with IDs and absolute time alignment, allowing the model to learn temporal sequence and speed and ultimately acquire the ability to pinpoint specific moments.
- Streamlined and Efficient Vision Encoder:
We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
Qwen2.5-VL comes in three sizes, with 3, 7, and 72 billion parameters. This repo contains the AWQ-quantized, instruction-tuned 72B Qwen2.5-VL model. For more information, visit our Blog and GitHub.
Documentation
More Usage Tips
Input Formats
For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
# Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Base64-encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
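Video inputs (local files only, as noted above) use the same message structure; here is a minimal sketch, where the max_pixels and fps keys follow qwen-vl-utils conventions and the values shown are placeholders:

# Local video file; frames are sampled at the requested fps by qwen-vl-utils.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/your/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]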
Image Resolution for Performance Boost
The model supports a wide range of input resolutions. By default, it uses the image's native resolution, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token-count range of 256-1280, to balance speed and memory usage.
# Each visual token corresponds to a 28x28 pixel patch, so these bounds
# translate to roughly 256-1280 visual tokens per image.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels
)
We provide two methods for fine-grained control over the image size passed to the model:
- Define min_pixels and max_pixels: images are resized to keep their aspect ratio within the min_pixels/max_pixels range.
- Specify exact dimensions: directly set resized_height and resized_width. These values are rounded to the nearest multiple of 28.
# Exact dimensions via resized_height and resized_width
# (values are rounded to the nearest multiple of 28)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Per-image pixel budget via min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
Processing Long Texts
The current config.json is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to ensure optimal performance on long texts. For supported frameworks, you can add the following to config.json to enable YaRN:
{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}
However, note that this method significantly impacts the performance of temporal and spatial localization tasks, so it is not recommended. For long video inputs, since mRoPE itself is economical with position IDs, you can instead directly increase max_position_embeddings to a larger value, such as 64k.
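For example, a minimal config.json sketch (65536, i.e. 64k, is an illustrative value):

{
    ...,
    "max_position_embeddings": 65536
}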
Benchmark
Performance of Quantized Models
This section reports the generation performance of quantized models (including GPTQ and AWQ) of the Qwen2.5-VL series. Specifically, we report:
- MMMU_VAL (Accuracy)
- DocVQA_VAL (Accuracy)
- MMBench_DEV_EN (Accuracy)
- MathVista_MINI (Accuracy)
We use VLMEvalKit to evaluate all models.
| Model | Quantization | MMMU_VAL | DocVQA_VAL | MMBench_DEV_EN | MathVista_MINI |
|---|---|---|---|---|---|
| Qwen2.5-VL-72B-Instruct | BF16 | 70.0 | 96.1 | 88.2 | 75.3 |
| Qwen2.5-VL-72B-Instruct | AWQ | 69.1 | 96.0 | 87.9 | 73.8 |
| Qwen2.5-VL-7B-Instruct | BF16 | 58.4 | 94.9 | 84.1 | 67.9 |
| Qwen2.5-VL-7B-Instruct | AWQ | 55.6 | 94.6 | 84.2 | 64.7 |
| Qwen2.5-VL-3B-Instruct | BF16 | 51.7 | 93.0 | 79.8 | 61.4 |
| Qwen2.5-VL-3B-Instruct | AWQ | 49.1 | 91.8 | 78.0 | 58.8 |
License
This project is licensed under the Qwen License.
Citation
If you find our work helpful, please cite us:
@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
    title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
    author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
    journal={arXiv preprint arXiv:2409.12191},
    year={2024}
}

@article{Qwen-VL,
    title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
    author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
    journal={arXiv preprint arXiv:2308.12966},
    year={2023}
}