Qwen2.5-VL-72B-Instruct-AWQ: Open-Source Multimodal Model with Multi-Format Input and Strong Visual Understanding

Qwen2.5 VL 72B Instruct AWQ

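Developed by Benasd

Qwen2.5-VL is a multimodal large language model from the Qwen (Tongyi Qianwen) team. It offers strong visual understanding and agent capabilities and accepts images, video, and text as input.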

Image-Text-to-Text

Transformers

Language: English | License: Other | Tags: #Multimodal visual understanding #Long-video analysis #Agent control

Downloads: 173

Release date: 2/13/2025

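Model Overview

Qwen2.5-VL is the latest vision-language model in the Qwen series. It focuses on improved visual understanding, agent capabilities, and structured output, and is suited to domains such as finance and commerce.

Model Features

Enhanced visual understanding

Accurately analyzes text, charts, icons, graphics, and layouts within images, going beyond common object recognition.

Agent capabilities

Acts directly as a visual agent that can reason and dynamically call tools, with the ability to operate computers and phones.

Long-video understanding

Understands videos longer than one hour and adds event capture that pinpoints the relevant video segments.

Multi-format visual localization

Localizes objects in an image precisely by generating bounding boxes or point coordinates, with stable JSON output.

Structured output

Produces structured output for data such as invoices and tables, useful in finance, commerce, and similar domains.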

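Model Capabilities

Image understanding

Video understanding

Text recognition

Chart analysis

Agent tasks

Visual localization

Structured data extraction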

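Use Cases

Business analysis

Invoice processing

Automatically identifies and extracts key information from invoices

Enables automated entry of financial data

Business report analysis

Parses charts and data in business reports

Quickly generates business insights

Agent tasks

Phone automation

Controls phone apps through visual instructions

Enables automated testing and operation

Education

Math problem solving

Parses math problems that contain charts and formulas

Provides step-by-step solutions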

🚀 Qwen2.5-VL-72B-Instruct

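Qwen2.5-VL-72B-Instruct is the latest multimodal model in the Qwen family, with strong image and video understanding and visual-agent capabilities. It can analyze text, charts, and other elements in images, understand long videos and capture events, perform visual localization, and generate structured output, which makes it a solid base for multimodal applications.

🚀 Quick Start

Multi-GPU Inference

Use the following docker command for multi-GPU inference: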

docker run -it --name iddt-ben-qwen25vl72 --gpus '"device=0,1"' -v huggingface:/root/.cache/huggingface --shm-size=32g -p 30000:8000 --ipc=host benasd/vllm:latest --model Benasd/Qwen2.5-VL-72B-Instruct-AWQ  --dtype float16 --quantization awq -tp 2
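
Once the container is up, the mapped host port 30000 can be queried over HTTP. The sketch below assumes the image starts vLLM's OpenAI-compatible server (the image's entrypoint is not shown here, so this is an assumption) and that it is reachable at localhost. The helper names build_chat_request and send_chat_request are hypothetical and exist only for illustration.

```python
import json
import urllib.request

# Assumption: the container serves an OpenAI-compatible API on host port 30000.
BASE_URL = "http://localhost:30000/v1"
MODEL_ID = "Benasd/Qwen2.5-VL-72B-Instruct-AWQ"


def build_chat_request(image_url: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build a chat-completions payload with one image part and one text part."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "Describe this image.",
)
print(json.dumps(payload, indent=2))
# Once the server is running: print(send_chat_request(payload))
```

Building the payload separately from sending it makes the request easy to inspect before any GPU time is spent.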

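Install Dependencies

The Qwen2.5-VL code has been merged into the latest Hugging Face transformers library. Building from source with the following command is recommended: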

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise you may encounter the following error:

KeyError: 'qwen2_5_vl'
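
A quick way to tell whether the installed transformers build is recent enough is to look for the model type in its auto-config registry before loading any weights. This is a minimal sketch: it relies on CONFIG_MAPPING_NAMES, an internal registry whose location may change between releases, and supports_model_type is a hypothetical helper name.

```python
def supports_model_type(model_type: str = "qwen2_5_vl") -> bool:
    """Return True if the installed transformers build registers `model_type`."""
    try:
        # Assumption: this internal registry lists every known model type.
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
    except ImportError:
        # transformers is missing, or too old or new to expose this registry
        return False
    return model_type in CONFIG_MAPPING_NAMES


if supports_model_type():
    print("transformers recognizes 'qwen2_5_vl'")
else:
    print("Upgrade transformers; loading would likely raise KeyError: 'qwen2_5_vl'")
```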

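Using the Toolkit

A toolkit is available to handle various types of visual input more conveniently. Install it with:

# Using the `[decord]` extra is strongly recommended for faster video loading
pip install qwen-vl-utils[decord]==0.0.8

If you are not on Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.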

Chat with 🤗 Transformers

The following code shows how to chat with the model using transformers and qwen_vl_utils:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings,
# especially in multi-image and video scenarios
# import torch  # needed for torch.bfloat16 below
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

# By default, the number of visual tokens per image ranges from 4 to 16384
# Set min_pixels and max_pixels as needed, e.g. a token range of 256 to 1280, to balance performance and cost
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
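
The commented min_pixels and max_pixels settings in the example above express a visual-token budget in pixels, at 28 x 28 pixels per token. The sketch below reproduces that arithmetic. The helper names are hypothetical, and approx_visual_tokens is only a rough estimate that ignores whatever rounding the real preprocessing applies to image dimensions.

```python
PATCH_PIXELS = 28 * 28  # pixels per visual token, as in the 256*28*28 comment above


def token_range_to_pixels(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PATCH_PIXELS, max_tokens * PATCH_PIXELS


def approx_visual_tokens(width: int, height: int, min_pixels: int, max_pixels: int) -> int:
    """Rough estimate: clamp the image area to the budget, then divide by 28*28."""
    area = min(max(width * height, min_pixels), max_pixels)
    return area // PATCH_PIXELS


min_pixels, max_pixels = token_range_to_pixels(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
print(approx_visual_tokens(1920, 1080, min_pixels, max_pixels))  # 1280 (capped by max_pixels)
print(approx_visual_tokens(640, 480, min_pixels, max_pixels))    # 391
```

A tighter max_pixels lowers memory use and latency for large images, at the cost of fine detail such as small text.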

Multi-Image Inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
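
When the image list is only known at runtime, the messages structure used above can be built from local paths. This is a minimal sketch: build_image_messages is a hypothetical helper, and the paths are the same placeholders as in the example.

```python
from pathlib import Path


def build_image_messages(image_paths, prompt):
    """Build a single-turn user message with one entry per image, then the text query."""
    content = [
        # as_uri() yields a file:// URI, the form used in the example above
        {"type": "image", "image": Path(p).resolve().as_uri()}
        for p in image_paths
    ]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_image_messages(
    ["/path/to/image1.jpg", "/path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
print(messages)
```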

Video Inference

# Messages containing a list of images as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

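# In Qwen2.5-VL, frame-rate information is also fed to the model so frames can be aligned with absolute time
# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
# video_kwargs is expected to carry the sampling fps, so no separate fps argument is passed
# (the original passed fps=fps, but no variable named fps was ever defined)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")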

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
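
The fps value in the video message controls how densely frames are sampled, and therefore which absolute time each frame corresponds to. The sketch below only illustrates that mapping (frame i sits at i / fps seconds). It is not the loader's actual sampling code, which may cap or round the number of frames, and sampled_timestamps is a hypothetical helper.

```python
def sampled_timestamps(duration_s: float, fps: float) -> list[float]:
    """Timestamps in seconds of frames taken at a fixed sampling rate."""
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration_s must be >= 0 and fps must be > 0")
    n_frames = int(duration_s * fps)
    return [i / fps for i in range(n_frames)]


print(sampled_timestamps(4.0, 1.0))           # [0.0, 1.0, 2.0, 3.0]
print(sampled_timestamps(2.0, 2.0))           # [0.0, 0.5, 1.0, 1.5]
print(len(sampled_timestamps(3600.0, 1.0)))   # 3600 frames for a one-hour video at 1 fps
```

For long videos, lowering fps or max_pixels is the straightforward way to keep the visual-token count manageable.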

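Video URL compatibility depends largely on the version of the third-party library, as shown in the table below. To override the default backend, set FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS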
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
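
The compatibility table above can be encoded as a small selection function. This is a sketch under stated assumptions: pick_video_backend and torchvision_supports_urls are hypothetical helpers, the installed versions are passed in by the caller rather than detected, and the environment variable is set before qwen_vl_utils is imported to be safe.

```python
import os
import re
from urllib.parse import urlparse


def torchvision_supports_urls(version: str) -> bool:
    """True for torchvision >= 0.19.0, e.g. '0.19.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (0, 19)


def pick_video_backend(video_url, torchvision_version=None, has_decord=False):
    """Return 'torchvision', 'decord', or None, following the table above."""
    scheme = urlparse(video_url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"not an HTTP(S) URL: {video_url!r}")
    if torchvision_version and torchvision_supports_urls(torchvision_version):
        return "torchvision"  # handles both HTTP and HTTPS
    if scheme == "http" and has_decord:
        return "decord"  # HTTP only
    return None  # no listed backend can read this URL


backend = pick_video_backend(
    "https://example.com/clip.mp4", torchvision_version="0.19.1", has_decord=True
)
if backend is not None:
    os.environ["FORCE_QWENVL_VIDEO_READER"] = backend
print(backend)  # torchvision
```

When the function returns None, downloading the video first and passing a local file path avoids the URL limitation.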

Batch Inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
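
Each example above slices the prompt off the generated sequence before decoding, because generate returns the prompt tokens followed by the new tokens. The toy illustration below uses plain lists with made-up token ids (0 stands for a padding id). trim_prompt_tokens is a hypothetical helper that mirrors the list comprehension in the examples.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two prompts padded to the same length (4), each followed by two new tokens.
input_ids = [[0, 11, 12, 13], [21, 22, 23, 24]]
generated_ids = [[0, 11, 12, 13, 101, 102], [21, 22, 23, 24, 201, 202]]

print(trim_prompt_tokens(input_ids, generated_ids))  # [[101, 102], [201, 202]]
```

Because padding gives every prompt in the batch the same length, the same slice offset applies to each row.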

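🤖 Using ModelScope

Users, especially those in mainland China, are strongly advised to use ModelScope. Its snapshot_download function can help resolve problems with downloading checkpoints.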
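
A minimal sketch of that workflow follows. It assumes the modelscope package is installed and that the checkpoint is published on ModelScope under the same ID used elsewhere on this page, which is an assumption to verify before use. download_checkpoint is a hypothetical wrapper name.

```python
def download_checkpoint(model_id: str = "Qwen/Qwen2.5-VL-72B-Instruct", cache_dir=None) -> str:
    """Download a checkpoint via ModelScope and return its local directory."""
    # Imported lazily so this file can be imported without modelscope installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id, cache_dir=cache_dir)


# Usage (downloads a very large checkpoint, so it is left commented out):
# model_dir = download_checkpoint()
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     model_dir, torch_dtype="auto", device_map="auto"
# )
# processor = AutoProcessor.from_pretrained(model_dir)
```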

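✨ Key Features

Key Enhancements

Visual understanding: Qwen2.5-VL recognizes common objects such as flowers, birds, fish, and insects, and also analyzes text, charts, icons, graphics, and layouts within images in depth.
Agent capability: Acts directly as a visual agent that can reason and dynamically call tools, supporting computer and phone use.
Long-video understanding and event capture: Understands videos longer than one hour and can now locate the relevant video segments to capture events.
Multi-format visual localization: Accurately localizes objects in an image by generating bounding boxes or points, and provides stable JSON output for coordinates and attributes.
Structured output generation: For scanned data such as invoices, forms, and tables, generates structured output of the content, which is useful in finance, commerce, and similar domains.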
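
Since localization results are returned as JSON, downstream code usually has to parse the model's text reply. The sketch below assumes a reply shaped as a list of objects with "bbox_2d" and "label" keys, possibly wrapped in a Markdown code fence. That schema is an assumption for illustration and is not specified on this page, and parse_grounding_output is a hypothetical helper.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly for readability
_FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_grounding_output(text: str) -> list[dict]:
    """Extract labelled boxes from a JSON reply, with or without a code fence."""
    match = _FENCED_JSON.search(text)
    payload = match.group(1) if match else text.strip()
    data = json.loads(payload)
    if isinstance(data, dict):  # tolerate a single object instead of a list
        data = [data]
    boxes = []
    for item in data:
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed key name
        boxes.append({"label": item.get("label", ""), "box": (x1, y1, x2, y2)})
    return boxes


sample = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]\n' + FENCE
print(parse_grounding_output(sample))
# [{'label': 'cat', 'box': (10, 20, 110, 220)}]
```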

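Model Architecture Updates

Dynamic resolution and frame-rate training for video understanding: Dynamic resolution is extended to the temporal dimension through dynamic FPS sampling, so the model can understand videos at different sampling rates. mRoPE is also updated in the temporal dimension with IDs and absolute time alignment, which lets the model learn temporal order and speed and ultimately pinpoint specific moments.
Streamlined and efficient vision encoder: Window attention is applied strategically in the ViT to speed up training and inference. The ViT architecture is further optimized with SwiGLU and RMSNorm to align it with the structure of the Qwen2.5 LLM.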
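
For readers unfamiliar with the two building blocks named above, here are their textbook definitions in plain Python. This is a generic illustration, not the model's actual implementation: real layers operate on tensors, and a full SwiGLU feed-forward block also includes the gate, up, and down projections, which are omitted here.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Element-wise SwiGLU gating: silu(gate) * up."""
    return [silu(g) * u for g, u in zip(gate, up)]


normed = rms_norm([3.0, 4.0], [1.0, 1.0])
print(normed)                            # roughly [0.8485, 1.1314]
print(swiglu([0.0, 2.0], [5.0, 1.0]))    # first entry is 0.0 because silu(0) == 0
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which removes one reduction per call.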

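Three models are currently available, with 3 billion, 7 billion, and 72 billion parameters. This repository contains the instruction-tuned 72B Qwen2.5-VL model, here in AWQ-quantized form. For more information, see the blog and GitHub.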

📚 Documentation

Evaluation

Image Benchmarks

Benchmark	GPT4o	Claude3.5 Sonnet	Gemini-2-flash	InternVL2.5-78B	Qwen2-VL-72B	Qwen2.5-VL-72B
MMMU_val	70.3	70.4	70.7	70.1	64.5	70.2
MMMU_Pro	54.5	54.7	57.0	48.6	46.2	51.1
MathVista_MINI	63.8	65.4	73.1	76.6	70.5	74.8
MathVision_FULL	30.4	38.3	41.3	32.2	25.9	38.1
Hallusion Bench	55.0	55.16		57.4	58.1	55.16
MMBench_DEV_EN_V11	82.1	83.4	83.0	88.5	86.6	88
AI2D_TEST	84.6	81.2		89.1	88.1	88.4
ChartQA_TEST	86.7	90.8	85.2	88.3	88.3	89.5
DocVQA_VAL	91.1	95.2	92.1	96.5	96.1	96.4
MMStar	64.7	65.1	69.4	69.5	68.3	70.8
MMVet_turbo	69.1	70.1		72.3	74.0	76.19
OCRBench	736	788		854	877	885
OCRBench-V2(en/zh)	46.5/32.3	45.2/39.6	51.9/43.1	45/46.2	47.8/46.1	61.5/63.7
CC-OCR	66.6	62.7	73.0	64.7	68.7	79.8

Video Benchmarks

Benchmark	GPT4o	Gemini-1.5-Pro	InternVL2.5-78B	Qwen2VL-72B	Qwen2.5VL-72B
VideoMME w/o sub.	71.9	75.0	72.1	71.2	73.3
VideoMME w sub.	77.2	81.3	74.0	77.8	79.1
MVBench	64.6	60.5	76.4	73.6	70.4
MMBench-Video	1.63	1.30	1.97	1.70	2.02
LVBench	30.8	33.1	-	41.3	47.3
EgoSchema	72.2	71.2	-	77.9	76.2
PerceptionTest_test	-	-	-	68.0	73.2
MLVU_M-Avg_dev	64.6	-	75.7		74.6
TempCompass_overall	73.8	-	-		74.8

Agent Benchmarks

Benchmark	GPT4o	Gemini 2.0	Claude	Aguvis-72B	Qwen2VL-72B	Qwen2.5VL-72B
ScreenSpot	18.1	84.0	83.0			87.1
ScreenSpot Pro			17.1		1.6	43.6
AITZ_EM	35.3				72.8	83.2
Android Control High_EM				66.4	59.1	67.36
Android Control Low_EM				84.4	59.2	93.7
AndroidWorld_SR	34.5% (SoM)		27.9%	26.1%		35%
MobileMiniWob++_SR				66%		68%
OSWorld			14.90	10.26		8.83

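🔧 Technical Details

Model Information

Property	Details
Model type	Multimodal image-text-to-text generation model
Training data	Not specified
Base model	Qwen/Qwen2.5-VL-72B-Instruct
Library	transformers
Pipeline tag	image-text-to-text
Tags	multimodal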

License

This project is released under the Qwen license.

📖 Citation

If you find our work helpful, please cite it:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}