🚀 Qwen2.5-VL-72B-Instruct-Pointer-AWQ
This model addresses the issue where the official Qwen/Qwen2.5-VL-72B-Instruct-AWQ doesn't support tensor parallelism on vLLM. It enables --tensor-parallel-size with 2, 4, or 8 GPUs. Use vllm==0.7.3.
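For reference, here is a minimal sketch of loading this checkpoint with vLLM's Python API (the CLI equivalent is passing --tensor-parallel-size to vllm serve). The local model path and the tensor_parallel_size value below are placeholders to adapt to your setup; treat this as an illustration under vllm==0.7.3, not an official launch recipe.
# Minimal sketch: tensor-parallel inference with vLLM (2, 4, or 8 GPUs are supported).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/Qwen2.5-VL-72B-Instruct-Pointer-AWQ",  # hypothetical local path to this repo
    tensor_parallel_size=4,  # placeholder: set to 2, 4, or 8
    quantization="awq",
)
outputs = llm.generate(
    ["Describe the Qwen2.5-VL model family in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)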
✨ Features
Key Enhancements
- Understand things visually: Qwen2.5-VL can recognize common objects and analyze texts, charts, icons, graphics, and layouts in images.
- Being agentic: It acts as a visual agent, capable of reasoning and directing tools for computer and phone use.
- Understanding long videos and capturing events: The model can comprehend videos over 1 hour and pinpoint relevant video segments.
- Capable of visual localization in different formats: It can accurately localize objects in images and provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: For data like scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs, beneficial for finance and commerce.
Model Architecture Updates
- Dynamic Resolution and Frame Rate Training for Video Understanding: By adopting dynamic FPS sampling, the model can understand videos at various sampling rates. mRoPE is updated in the time dimension to enable the model to learn temporal sequence and speed.
- Streamlined and Efficient Vision Encoder: Window attention is implemented in the ViT to enhance training and inference speeds. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning with the Qwen2.5 LLM.
There are three models with 3, 7, and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2.5-VL model. For more information, visit our Blog and GitHub.
📦 Installation
The code of Qwen2.5-VL is in the latest Hugging Face Transformers. It is recommended to build from source with the following command:
pip install git+https://github.com/huggingface/transformers accelerate
Otherwise, you might encounter the following error:
KeyError: 'qwen2_5_vl'
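As a quick sanity check (a sketch, not part of the original instructions), you can confirm that your installed Transformers build already registers the qwen2_5_vl model type before downloading any weights:
# If this lookup raises KeyError: 'qwen2_5_vl', the installed transformers is too old;
# reinstall from source as shown above.
from transformers.models.auto.configuration_auto import CONFIG_MAPPING
print(CONFIG_MAPPING["qwen2_5_vl"])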
💻 Usage Examples
Basic Usage
We offer a toolkit to handle various visual inputs more conveniently. Install it using the following command:
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils[decord]==0.0.8
If you are not using Linux, you might not be able to install decord from PyPI. In that case, use pip install qwen-vl-utils, which will fall back to torchvision for video processing. However, you can still install decord from source if you want to use it when loading videos.
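A small check (a sketch) makes it easy to see which backend you will get:
# qwen-vl-utils uses decord when it is importable and falls back to torchvision otherwise.
try:
    import decord  # noqa: F401
    print("decord available: videos will be loaded with the faster decord backend")
except ImportError:
    print("decord not installed: falling back to torchvision for video processing")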
Using 🤗 Transformers to Chat
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
# "Qwen/Qwen2.5-VL-72B-Instruct",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Advanced Usage
Multi image inference
# Messages containing multiple images and a text query
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Video inference
# Messages containing an image list as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a local video path and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video url and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
},
{"type": "text", "text": "Describe this video."},
],
}
]
# In Qwen2.5-VL, frame rate information is also input into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
# fps is supplied via **video_kwargs below
padding=True,
return_tensors="pt",
**video_kwargs,
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Video URL compatibility largely depends on the third-party library version; the details are in the table below. If you prefer not to use the default backend, change it by setting FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord (see the sketch after the table).
Backend | HTTP | HTTPS |
---|---|---|
torchvision >= 0.19.0 | ✅ | ✅ |
torchvision < 0.19.0 | ❌ | ❌ |
decord | ✅ | ❌ |
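For example, a sketch of forcing the torchvision backend from Python; the variable can equally be exported in the shell before launching the script. Set it as early as possible, before any videos are processed:
import os

# Force the qwen-vl-utils video reader backend ("torchvision" or "decord").
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"

from qwen_vl_utils import process_vision_info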
Batch inference
# Sample messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these pictures?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
More Usage Tips
- Input Formats: For input images, local files, base64, and URLs are supported. For videos, currently only local files are supported.
# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Image URL
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "http://path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Base64 encoded image
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "data:image;base64,/9j/..."},
{"type": "text", "text": "Describe this image."},
],
}
]
- Image Resolution for performance boost: The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
"Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
There are two methods for fine-grained control over the image size input to the model:
1. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
2. Specify exact dimensions: Directly set resized_height and resized_width. These values will be rounded to the nearest multiple of 28.
# resized_height and resized_width
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"resized_height": 280,
"resized_width": 420,
},
{"type": "text", "text": "Describe this image."},
],
}
]
# min_pixels and max_pixels
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"min_pixels": 50176,
"max_pixels": 50176,
},
{"type": "text", "text": "Describe this image."},
],
}
]
- Processing Long Texts: The current config.json is set for a context length of up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, YaRN is utilized. For supported frameworks, add the following to config.json to enable YaRN:
{
...,
"type": "yarn",
"mrope_section": [
16,
24,
24
],
"factor": 4,
"original_max_position_embeddings": 32768
}
However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks.
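As an alternative to editing config.json, a hedged sketch: recent Transformers versions allow overriding config fields at load time, so the same YaRN settings can in principle be passed to from_pretrained. The rope_scaling values below simply mirror the JSON snippet above; verify this approach against your Transformers version before relying on it.
from transformers import Qwen2_5_VLForConditionalGeneration

# Sketch: override rope_scaling at load time with the same values as the JSON snippet above.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    rope_scaling={
        "type": "yarn",
        "mrope_section": [16, 24, 24],
        "factor": 4,
        "original_max_position_embeddings": 32768,
    },
)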
📚 Documentation
Evaluation
Image benchmark
Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
---|---|---|---|---|---|---|
MMMU_val | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 |
MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 |
MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 |
Hallusion Bench | 55.0 | 55.16 | 57.4 | 58.1 | 55.16 | |
MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 |
AI2D_TEST | 84.6 | 81.2 | 89.1 | 88.1 | 88.4 | |
ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 |
DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 |
MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 |
MMVet_turbo | 69.1 | 70.1 | 72.3 | 74.0 | 76.19 | |
OCRBench | 736 | 788 | 854 | 877 | 885 | |
OCRBench-V2 (en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 |
CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 | 79.8 |
Video benchmark
Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2VL-72B | Qwen2.5VL-72B |
---|---|---|---|---|---|
VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 |
VideoMME w sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 |
MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 |
MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 |
LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 |
EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 |
PerceptionTest_test | - | - | - | 68.0 | 73.2 |
MLVU_M-Avg_dev | 64.6 | - | 75.7 | 74.6 |
TempCompass_overall | 73.8 | - | - | 74.8 |
Agent benchmark
Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2VL-72B | Qwen2.5VL-72B |
---|---|---|---|---|---|---|
ScreenSpot | 18.1 | 84.0 | 83.0 | 87.1 | ||
ScreenSpot Pro | 17.1 | 1.6 | 43.6 | |||
AITZ_EM | 35.3 | 72.8 | 83.2 | |||
Android Control High_EM | 66.4 | 59.1 | 67.36 | |||
Android Control Low_EM | 84.4 | 59.2 | 93.7 | |||
AndroidWorld_SR | 34.5% (SoM) | 27.9% | 26.1% | 35% | ||
MobileMiniWob++_SR | 66% | 68% | ||||
OSWorld | 14.90 | 10.26 | 8.83 |
📄 License
- License: Other
- License Name: qwen
- License Link: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE