Qwen2.5-VL-32B-Instruct: Open-Source Vision-Language Model for Multimodal Tasks

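Developed by Alhdrawi

Qwen2.5-VL-32B-Instruct is the latest vision-language model in the Qwen family. It combines strong visual understanding with agentic capabilities and handles multimodal tasks.

Image-to-Text

Transformers

Multilingual · License: Apache-2.0 · Tags: #multimodal visual understanding #long-video event localization #structured data output

Downloads: 58

Release date: 3/31/2025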

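Model Overview

Qwen2.5-VL-32B-Instruct is a 32-billion-parameter vision-language model focused on improved visual understanding, mathematical reasoning, and problem solving. It supports multimodal interaction over images, video, and text.

Model Features

Enhanced visual understanding

Beyond recognizing common objects, the model analyzes text, charts, icons, graphics, and layouts within images.

Agentic capabilities

Acts directly as a visual agent that can call tools dynamically and operate computers and phones.

Long-video understanding and event capture

Comprehends videos longer than one hour and can now pinpoint the relevant segments.

Multi-format visual grounding

Localizes objects in images by generating bounding boxes or points, and emits stable JSON output for coordinates and attributes.

Structured output

Produces structured output from scanned data such as invoices and tables, useful in finance, commerce, and similar settings.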

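Model Capabilities

Image analysis

Video understanding

Text generation

Mathematical reasoning

Logical reasoning

Knowledge Q&A

Visual grounding

Agentic tasks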

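Use Cases

Finance and Commerce

Invoice processing

Automatically recognizes invoice information and outputs it in structured form

Scores 94.8 on the DocVQA dataset

Education

Math problem solving

Parses and solves math problems that include charts and formulas

Scores 74.7 on the MathVista dataset

Video Analysis

Long-video content understanding

Parses video content longer than one hour and locates key events

Scores 49.00 on LVBench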

🚀 Qwen2.5-VL-32B-Instruct

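Qwen2.5-VL-32B-Instruct is a powerful vision-language model with strong mathematical and problem-solving abilities. It handles a variety of visual inputs and gives precise answers, which makes it suitable for image recognition, video analysis, knowledge Q&A, and other areas.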

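🚀 Quick Start

Installing dependencies

The Qwen2.5-VL code has been merged into the latest Hugging Face transformers. We recommend building from source with the following command: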

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise, you may encounter the following error:

KeyError: 'qwen2_5_vl'
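
The KeyError above typically means the installed transformers build does not recognize the qwen2_5_vl model type. A minimal sketch for checking this before loading a checkpoint follows. It assumes only that a compatible build exports the Qwen2_5_VLForConditionalGeneration class used later on this page; the helper name supports_qwen2_5_vl is hypothetical.

```python
import importlib
import importlib.util


def supports_qwen2_5_vl() -> bool:
    """Return True if the installed transformers exposes the Qwen2.5-VL model class."""
    # find_spec avoids an ImportError when transformers is not installed at all.
    if importlib.util.find_spec("transformers") is None:
        return False
    try:
        transformers = importlib.import_module("transformers")
        # Older releases lack this attribute, which is what leads to KeyError: 'qwen2_5_vl'.
        return hasattr(transformers, "Qwen2_5_VLForConditionalGeneration")
    except Exception:
        # A broken installation can fail in other ways; treat that as unsupported.
        return False


if __name__ == "__main__":
    if supports_qwen2_5_vl():
        print("transformers can load Qwen2.5-VL")
    else:
        print("Upgrade transformers, for example by installing it from source")
```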

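We also provide a toolkit that makes it easier to handle various kinds of visual input. Install it with the following command:

# Using the `[decord]` extra is strongly recommended for faster video loading
pip install qwen-vl-utils[decord]==0.0.8

If you are not on Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source to have it used when loading videos.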

Usage Examples

Chatting with 🤗 Transformers

The following code sample shows how to call the chat model with transformers and qwen_vl_utils:

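import torch  # needed only if you enable the bfloat16 / flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)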
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings, especially with multiple images or video
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-32B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

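# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")

# By default, the number of visual tokens per image ranges from 4 to 16384.
# You can set min_pixels and max_pixels as needed, e.g. a token range of 256 to 1280, to balance performance and cost.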
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
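
The commented min_pixels and max_pixels settings above express a token budget in pixels: the example values multiply a token count by 28*28, which suggests each visual token covers roughly a 28x28 pixel area. A rough sketch of that arithmetic follows. The helper estimate_visual_tokens is hypothetical and only clamps the pixel area; the processor's real resizing also rounds dimensions and preserves aspect ratio, so treat the result as an estimate.

```python
PIXELS_PER_TOKEN = 28 * 28  # taken from the min_pixels / max_pixels example above


def estimate_visual_tokens(height: int, width: int,
                           min_pixels: int = 256 * PIXELS_PER_TOKEN,
                           max_pixels: int = 1280 * PIXELS_PER_TOKEN) -> int:
    """Estimate how many visual tokens an image costs under a pixel budget."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Clamp the image area into [min_pixels, max_pixels], as resizing would.
    area = min(max(height * width, min_pixels), max_pixels)
    return round(area / PIXELS_PER_TOKEN)


if __name__ == "__main__":
    for h, w in [(224, 224), (720, 1280), (2160, 3840)]:
        print(f"{h}x{w}: about {estimate_visual_tokens(h, w)} tokens")
```

With the 256 to 1280 budget, a small 224x224 image is scaled up to the 256-token floor and a 4K frame is capped at 1280 tokens, which is how the two settings trade detail against cost.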

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
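
When the image list is built at runtime, the multi-image message above can be generated rather than written by hand. The sketch below is one way to do that; build_image_messages is a hypothetical helper, not part of qwen_vl_utils, and it only reproduces the message layout shown above.

```python
from pathlib import Path
from typing import Iterable


def build_image_messages(image_paths: Iterable[str], prompt: str) -> list:
    """Build a single-turn user message: one entry per image, then the text query."""
    content = []
    for path in image_paths:
        # as_uri() yields the file:///... form used in the example above;
        # it requires an absolute path, hence resolve().
        content.append({"type": "image", "image": Path(path).resolve().as_uri()})
    if not content:
        raise ValueError("at least one image path is required")
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    messages = build_image_messages(
        ["image1.jpg", "image2.jpg"],
        "Identify the similarities between these images.",
    )
    print(messages)
```

The returned list can be passed to processor.apply_chat_template and process_vision_info exactly like the hand-written messages.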

Video inference

# Messages containing a list of images as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen2.5-VL, frame-rate information is also fed to the model so that it aligns with absolute time.
# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# video_kwargs carries the sampled fps, so it is forwarded via **video_kwargs below
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

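Video URL compatibility depends largely on the version of the third-party library, as shown in the table below. If you do not want the default backend, override it by setting FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS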
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
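
The compatibility table can be encoded as a small lookup that picks a backend per URL before setting FORCE_QWENVL_VIDEO_READER. This is a sketch under assumptions: pick_video_reader and parse_version are hypothetical helpers, the preference for torchvision when both backends work is arbitrary, and the choice of decord for local files is not covered by the table.

```python
import os
from typing import Optional
from urllib.parse import urlparse


def parse_version(version: str) -> tuple:
    """Turn '0.19.1+cu121' into (0, 19, 1); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad so that '0.19' compares equal to (0, 19, 0).
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def pick_video_reader(video_url: str, torchvision_version: Optional[str]) -> str:
    """Pick a backend for a video URL according to the compatibility table above."""
    scheme = urlparse(video_url).scheme.lower()
    tv_ok = (torchvision_version is not None
             and parse_version(torchvision_version) >= (0, 19, 0))
    if scheme == "https":
        if tv_ok:
            return "torchvision"
        raise RuntimeError("HTTPS video URLs need torchvision >= 0.19.0")
    if scheme == "http":
        return "torchvision" if tv_ok else "decord"
    # Local paths and file:// URIs are outside the table; decord is an assumed default.
    return "decord"


if __name__ == "__main__":
    url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"
    # Set the override early, before any video is loaded (assumed to be required).
    os.environ["FORCE_QWENVL_VIDEO_READER"] = pick_video_reader(url, "0.19.0")
    print(os.environ["FORCE_QWENVL_VIDEO_READER"])
```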

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. snapshot_download can help you work around problems downloading checkpoints.
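
A minimal sketch of fetching the checkpoint through ModelScope follows. It assumes the modelscope package is installed and that the model is published there under the same ID as on Hugging Face; the wrapper download_checkpoint is a hypothetical convenience, not part of either library.

```python
from typing import Optional


def download_checkpoint(model_id: str = "Qwen/Qwen2.5-VL-32B-Instruct",
                        cache_dir: Optional[str] = None) -> str:
    """Download a checkpoint from ModelScope and return the local directory."""
    # Imported lazily so this module still loads when modelscope is not installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id, cache_dir=cache_dir)


# Usage (downloads a very large checkpoint, so it is left commented out):
# model_dir = download_checkpoint()
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     model_dir, torch_dtype="auto", device_map="auto"
# )
```

The returned directory can be passed to from_pretrained in place of the model ID used in the examples above.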

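✨ Key Features

Core enhancements

Visual understanding: Qwen2.5-VL is proficient at recognizing common objects such as flowers, birds, fish, and insects, and it is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
Agentic capabilities: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, which makes it capable of computer use and phone use.
Long-video understanding and event capture: Qwen2.5-VL can comprehend videos longer than one hour, and it gains the new ability to capture events by pinpointing the relevant video segments.
Multi-format visual grounding: Qwen2.5-VL accurately localizes objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.
Structured output generation: for scans of invoices, forms, tables, and similar data, Qwen2.5-VL supports structured output of their contents, which benefits applications in finance, commerce, and related fields.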
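
Because grounding results come back as JSON, they can be parsed into typed records for downstream use. The sketch below assumes the model was prompted to return a JSON array whose items carry a bbox_2d list of [x1, y1, x2, y2] and a label; those key names are an assumption about the output schema, and parse_detections is a hypothetical helper.

```python
import json
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels


def parse_detections(model_output: str) -> list:
    """Extract a JSON array of detections from free-form model output."""
    # The reply may wrap the JSON in prose or a code fence, so slice out the array.
    start, end = model_output.find("["), model_output.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in model output")
    items = json.loads(model_output[start:end + 1])
    detections = []
    for item in items:
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed key name
        detections.append(Detection(item.get("label", ""), (x1, y1, x2, y2)))
    return detections


if __name__ == "__main__":
    sample = 'Result: [{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]'
    print(parse_detections(sample))
```

Slicing from the first "[" to the last "]" is deliberately simple; replies whose surrounding prose contains square brackets would need stricter extraction.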

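Model architecture updates

Dynamic resolution and frame-rate training for video understanding

We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, which lets the model comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, so the model can learn temporal order and speed and ultimately pinpoint specific moments.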
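
To illustrate what aligning temporal IDs with absolute time means, the toy sketch below assigns each sampled frame an ID proportional to its timestamp instead of its frame index. It is a simplified illustration, not the model's implementation: temporal_ids and the ids_per_second constant are hypothetical, and the real model's granularity and frame grouping may differ.

```python
def temporal_ids(num_frames: int, fps: float, ids_per_second: float = 2.0) -> list:
    """Assign each sampled frame a temporal ID proportional to its timestamp."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Frame i was captured at i / fps seconds; scaling by a fixed rate ties the
    # ID to absolute time instead of to the frame index.
    return [round(i / fps * ids_per_second) for i in range(num_frames)]


if __name__ == "__main__":
    print(temporal_ids(4, fps=1.0))  # frames one second apart
    print(temporal_ids(4, fps=2.0))  # frames half a second apart
```

With index-based IDs both clips would receive 0, 1, 2, 3. With time-aligned IDs the 1 FPS clip gets IDs spaced twice as far apart as the 2 FPS clip, so the spacing itself encodes how much time passed between frames.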

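Streamlined and efficient vision encoder

We speed up both training and inference by strategically introducing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.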
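
For readers unfamiliar with the two components named above, here is a dependency-free sketch of the generic textbook definitions of RMSNorm and a SwiGLU feed-forward layer. It is not the model's code; the weights and shapes are toy values chosen only to keep the example runnable.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


if __name__ == "__main__":
    x = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(x)
    print(swiglu_mlp(x, identity, identity, identity))
```

Unlike LayerNorm, RMSNorm skips mean subtraction and the bias term, and SwiGLU replaces a single activation with a gated product of two projections.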

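We have four models with 3, 7, 32, and 72 billion parameters. This repository contains the instruction-tuned 32B Qwen2.5-VL model. For more information, visit our blog and GitHub.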

📚 Detailed Documentation

Evaluation

Vision evaluation

Dataset	Qwen2.5-VL-72B	Qwen2-VL-72B	Qwen2.5-VL-32B
MMMU	70.2	64.5	70
MMMU Pro	51.1	46.2	49.5
MMStar	70.8	68.3	69.5
MathVista	74.8	70.5	74.7
MathVision	38.1	25.9	40.0
OCRBenchV2	61.5/63.7	47.8/46.1	57.2/59.1
CC-OCR	79.8	68.7	77.1
DocVQA	96.4	96.5	94.8
InfoVQA	87.3	84.5	83.4
LVBench	47.3	-	49.00
CharadesSTA	50.9	-	54.2
VideoMME	73.3/79.1	71.2/77.8	70.5/77.9
MMBench-Video	2.02	1.7	1.93
AITZ	83.2	-	83.1
Android Control	67.4/93.7	66.4/84.4	69.6/93.3
ScreenSpot	87.1	-	88.5
ScreenSpot Pro	43.6	-	39.4
AndroidWorld	35	-	22.0
OSWorld	8.83	-	5.92

Text evaluation

Model	MMLU	MMLU-PRO	MATH	GPQA-diamond	MBPP	Human Eval
Qwen2.5-VL-32B	78.4	68.8	82.2	46.0	84.0	91.5
Mistral-Small-3.1-24B	80.6	66.8	69.3	46.0	74.7	88.4
Gemma3-27B-IT	76.9	67.5	89	42.4	74.4	87.8
GPT-4o-Mini	82.0	61.7	70.2	39.4	84.8	87.2
Claude-3.5-Haiku	77.6	65.0	69.2	41.6	85.6	88.1

🔧 Technical Details

Model information

Attribute	Details
Model type	Multimodal question-answering model
Base models	deepseek-ai/DeepSeek-V3-0324, sesame/csm-1b, Qwen/QwQ-32B, deepseek-ai/DeepSeek-R1, ds4sd/SmolDocling-256M-preview, mistralai/Mistral-Small-3.1-24B-Instruct-2503
Training datasets	nvidia/Llama-Nemotron-Post-Training-Dataset-v1, FreedomIntelligence/medical-o1-reasoning-SFT, facebook/natural_reasoning, glaiveai/reasoning-v1-20m
Evaluation metrics	accuracy, bertscore, code_eval

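Processing long texts

The current config.json is set for a context length of up to 32,768 tokens. To handle extensive inputs beyond 32,768 tokens, we utilize YaRN, a technique for enhancing a model's length extrapolation that helps maintain performance on long texts. However, this method has a significant impact on temporal and spatial localization tasks and is therefore not recommended. For long video inputs, because MRoPE itself is more economical with IDs, max_position_embeddings can simply be changed to a larger value, such as 64k.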
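
The max_position_embeddings change described above can be scripted against a local copy of the checkpoint. The sketch below makes two assumptions: that the key sits at the top level of config.json, and that 64k means 65536. The helper raise_max_positions is hypothetical.

```python
import json
from pathlib import Path


def raise_max_positions(config_path: str, new_max: int = 65536):
    """Set max_position_embeddings in a config.json and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")  # None if the key was absent
    config["max_position_embeddings"] = new_max
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Usage, assuming the checkpoint was downloaded to a local directory:
# old = raise_max_positions("/path/to/Qwen2.5-VL-32B-Instruct/config.json")
# print(f"max_position_embeddings: {old} -> 65536")
```

Keeping a backup of the original config.json makes it easy to revert if localization quality changes.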

📄 License

This project is licensed under Apache-2.0.

📖 Citation

If you find our work helpful, please cite the following:

@article{Qwen2.5-VL,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025}
}