FIX: bug in Qwen/Qwen2.5-VL-72B-Instruct-AWQ
This repository is a fork of Qwen/Qwen2.5-VL-72B-Instruct-AWQ with identical weights. It fixes an issue in the original model by applying a patch to preprocessor_config.json.
Quick Start
Prerequisites
The code of Qwen2.5-VL is included in the latest Hugging Face Transformers. It's recommended to build from source using the following command:
pip install git+https://github.com/huggingface/transformers accelerate
Otherwise, you might encounter the following error:
KeyError: 'qwen2_5_vl'
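If you want to confirm your environment before downloading the full weights, here is a minimal sketch (it assumes network access to the Hugging Face Hub and only fetches the config, not the weights):

# Sanity check: an older transformers release fails here because the
# 'qwen2_5_vl' architecture is not registered yet.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ")
print(config.model_type)  # expected: "qwen2_5_vl"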
You can also install a toolkit to handle various visual inputs more conveniently:
pip install qwen-vl-utils[decord]==0.0.8
If you're not using Linux, you may not be able to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source to use it when loading videos.
Using Transformers to Chat
Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
)

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
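The same pipeline also handles multiple images in one turn; here is a minimal sketch of the messages structure (the file paths are placeholders, and everything else in the snippet above stays unchanged):

# Two images in a single user turn; process_vision_info collects them in order.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What do these two images have in common?"},
        ],
    }
]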
ModelScope
We strongly recommend that users, especially those in mainland China, use ModelScope; its snapshot_download helper can resolve issues with downloading checkpoints.
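For example, a minimal sketch using ModelScope's snapshot_download (this assumes the same repo id is mirrored on ModelScope, as is the usual convention for Qwen releases):

# Download the checkpoint via ModelScope, then load it from the local path.
from modelscope import snapshot_download

model_dir = snapshot_download("Qwen/Qwen2.5-VL-72B-Instruct-AWQ")
# Pass model_dir to from_pretrained(...) in place of the Hub repo id.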
Features
Key Enhancements
- Understand things visually: Qwen2.5-VL can not only recognize common objects like flowers, birds, fish, and insects but also analyze texts, charts, icons, graphics, and layouts within images.
- Being agentic: It acts as a visual agent, capable of reasoning and dynamically directing tools for computer and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos over 1 hour long and has the new ability to capture events by pinpointing relevant video segments.
- Capable of visual localization in different formats: It can accurately localize objects in an image by generating bounding boxes or points and provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: For data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, which is beneficial for finance, commerce, etc.
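As a concrete illustration of the localization and structured-output points above, here is a hedged sketch of a grounding prompt that reuses the chat pipeline from the Quick Start (the image path, prompt wording, and output schema are illustrative, not a fixed API):

# Ask the model to return detections as JSON; only messages changes.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Locate every person in the image and output each bounding box and label as JSON."},
        ],
    }
]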
Model Architecture Updates
- Dynamic Resolution and Frame Rate Training for Video Understanding:
We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. We update mRoPE in the time dimension with IDs and absolute time alignment, allowing the model to learn temporal sequence and speed and ultimately acquire the ability to pinpoint specific moments.
- Streamlined and Efficient Vision Encoder:
We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
Qwen2.5-VL comes in three sizes, with 3, 7, and 72 billion parameters. This repo contains the AWQ-quantized, instruction-tuned 72B Qwen2.5-VL model. For more information, visit our Blog and GitHub.
Documentation
More Usage Tips
Input Formats
For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
# Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Base64-encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
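Video inputs (local files only, as noted above) use the same message structure; here is a minimal sketch, where the max_pixels and fps keys follow qwen-vl-utils conventions and the values shown are placeholders:

# Local video file; frames are sampled at the requested fps by qwen-vl-utils.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/your/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]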
Image Resolution for Performance Boost
The model supports a wide range of input resolutions. By default, it uses the image's native resolution, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token-count range of 256-1280, to balance speed and memory usage.
# Each visual token corresponds to a 28x28 pixel patch, so these bounds
# translate to roughly 256-1280 visual tokens per image.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels
)
We provide two methods for fine-grained control over the image size passed to the model:
- Define min_pixels and max_pixels: images are resized to keep their aspect ratio within the min_pixels/max_pixels range.
- Specify exact dimensions: directly set resized_height and resized_width. These values are rounded to the nearest multiple of 28.
# Exact dimensions via resized_height and resized_width
# (values are rounded to the nearest multiple of 28)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Per-image pixel budget via min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
Processing Long Texts
The current config.json is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to ensure optimal performance on long texts. For supported frameworks, you can add the following to config.json to enable YaRN:
{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}
However, note that this method significantly impacts the performance of temporal and spatial localization tasks, so it is not recommended. For long video inputs, since mRoPE itself is economical with position IDs, you can instead directly increase max_position_embeddings to a larger value, such as 64k.
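For example, a minimal config.json sketch (65536, i.e. 64k, is an illustrative value):

{
    ...,
    "max_position_embeddings": 65536
}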
Benchmark
Performance of Quantized Models
This section reports the generation performance of quantized models (including GPTQ and AWQ) of the Qwen2.5-VL series. Specifically, we report:
- MMMU_VAL (Accuracy)
- DocVQA_VAL (Accuracy)
- MMBench_DEV_EN (Accuracy)
- MathVista_MINI (Accuracy)
We use VLMEvalKit to evaluate all models.
| Model | Quantization | MMMU_VAL | DocVQA_VAL | MMBench_DEV_EN | MathVista_MINI |
|---|---|---|---|---|---|
| Qwen2.5-VL-72B-Instruct | BF16 | 70.0 | 96.1 | 88.2 | 75.3 |
| Qwen2.5-VL-72B-Instruct | AWQ | 69.1 | 96.0 | 87.9 | 73.8 |
| Qwen2.5-VL-7B-Instruct | BF16 | 58.4 | 94.9 | 84.1 | 67.9 |
| Qwen2.5-VL-7B-Instruct | AWQ | 55.6 | 94.6 | 84.2 | 64.7 |
| Qwen2.5-VL-3B-Instruct | BF16 | 51.7 | 93.0 | 79.8 | 61.4 |
| Qwen2.5-VL-3B-Instruct | AWQ | 49.1 | 91.8 | 78.0 | 58.8 |
License
This project is licensed under the Qwen License.
Citation
If you find our work helpful, please cite us:
@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
    title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
    author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
    journal={arXiv preprint arXiv:2409.12191},
    year={2024}
}

@article{Qwen-VL,
    title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
    author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
    journal={arXiv preprint arXiv:2308.12966},
    year={2023}
}