Qwen2.5-VL-72B-Instruct: Open-Source Vision-Language Model for Multi-Domain Visual Understanding and Video Analysis


Qwen2.5 VL 72B Instruct GGUF

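Developed by unsloth

Qwen2.5-VL-72B-Instruct is the latest vision-language model in the Qwen family. It offers strong visual understanding and video analysis and suits domains such as finance and commerce.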

Image-Text-to-Text

Transformers

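English · License: Other · #Multimodal visual understanding #Long-video event capture #Structured output for finance and commerce

Downloads: 3,285

Released: 5/11/2025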

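Model Overview

Qwen2.5-VL-72B-Instruct is an advanced vision-language model that excels at visual understanding, video analysis, and agentic tasks. It accepts multi-image and video input.

Model Features

Strong visual understanding

Recognizes common objects and also analyzes text, charts, icons, graphics, and layouts in images with high accuracy.

Agentic capability

Acts directly as a visual agent that can reason and dynamically invoke tools, with the ability to operate computers and phones.

Long-video understanding

Understands videos longer than 1 hour and can pinpoint the relevant video segments to capture events.

Visual grounding

Localizes objects in an image by generating bounding boxes or points, and produces stable JSON output for coordinates and attributes.

Structured output

For scanned data such as invoices, forms, and tables, supports structured output of the contents, which benefits applications in finance, commerce, and similar fields.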

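Model Capabilities

Image captioning

Video analysis

Visual grounding

Structured data extraction

Multi-image reasoning

Batch inference

Long-text processing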

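Use Cases

Finance

Invoice processing

Automatically recognize and extract structured data from invoices

Efficient, accurate financial data processing

Commerce

Chart analysis

Automatically analyze chart data in business reports

Fast access to business insights

Video analysis

Video content understanding

Analyze long videos and extract key events

Efficient video content retrieval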

🚀 Qwen2.5-VL-72B-Instruct

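Qwen2.5-VL-72B-Instruct is the latest member of the Qwen family and a powerful vision-language model. Its main strengths are visual understanding, video analysis, and agentic tasks, and it applies to domains such as finance and commerce.

🚀 Quick Start

The code for Qwen2.5-VL is included in the latest Hugging Face transformers. We recommend building from source with the following command: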

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise you may encounter the following error:

KeyError: 'qwen2_5_vl'
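
One way to check ahead of time whether the installed transformers build recognizes the model type named in that error is to look it up in the library's model-type registry. The sketch below assumes that `CONFIG_MAPPING_NAMES` in `transformers.models.auto.configuration_auto` lists the supported model types, which holds for recent releases but is an internal detail that may change. The helper name `supports_model_type` is illustrative and not part of any library.

```python
def supports_model_type(model_type: str = "qwen2_5_vl") -> bool:
    """Return True if the installed transformers build registers `model_type`."""
    try:
        # Assumption: this internal registry maps model types to config class names.
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
    except Exception:  # transformers is missing or the install is broken
        return False
    return model_type in CONFIG_MAPPING_NAMES


if supports_model_type():
    print("transformers recognizes qwen2_5_vl")
else:
    print("qwen2_5_vl not found; consider installing transformers from source")
```

If the check fails, reinstalling with the command above is the likely remedy.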

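We provide a toolkit that helps you handle various types of visual input more conveniently, as if you were calling an API. Install it with:

# Using the `[decord]` extra is strongly recommended for faster video loading
pip install qwen-vl-utils[decord]==0.0.8

If you are not on Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.

Chat with 🤗 Transformers

The following snippet shows how to use the chat model with transformers and qwen_vl_utils: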

import torch  # only needed for the flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings,
# especially in multi-image and video scenarios
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

# The default range for the number of visual tokens per image is 4 - 16384
# Set min_pixels and max_pixels as needed, e.g. a token range of 256 - 1280, to balance performance and cost
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
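
The commented-out `min_pixels` and `max_pixels` lines in the snippet above convert a visual-token budget into a pixel budget by multiplying by 28*28. The small sketch below spells out that arithmetic in both directions. The helper names are illustrative, and the 28-pixel factor is taken from those comments rather than from the library.

```python
PATCH_SIDE = 28  # pixels per side behind one visual token, per the comments above


def tokens_to_pixels(num_tokens: int) -> int:
    """Pixel budget that corresponds to a given number of visual tokens."""
    return num_tokens * PATCH_SIDE * PATCH_SIDE


def pixels_to_tokens(num_pixels: int) -> int:
    """Whole visual tokens that fit into a given pixel budget."""
    return num_pixels // (PATCH_SIDE * PATCH_SIDE)


min_pixels = tokens_to_pixels(256)
max_pixels = tokens_to_pixels(1280)
print(min_pixels, max_pixels)      # 200704 1003520
print(pixels_to_tokens(360 * 420))  # 192
```

The resulting values can be passed as `min_pixels` and `max_pixels` to `AutoProcessor.from_pretrained`, as in the commented-out line of the snippet.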

Multi-Image Inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
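
When the number of images varies, the message structure used above can be built programmatically. The following is a minimal sketch: `build_image_messages` is a hypothetical helper, not part of qwen_vl_utils, and it assumes that bare local paths should be turned into `file://` URIs like the ones in the example.

```python
from pathlib import Path
from typing import Iterable, Union


def build_image_messages(images: Iterable[Union[str, Path]], prompt: str) -> list:
    """Build a single-turn user message with any number of images plus a text query."""
    content = []
    for image in images:
        ref = str(image)
        # Leave URLs and file:// URIs untouched; convert bare paths to file:// URIs.
        if "://" not in ref:
            ref = Path(ref).resolve().as_uri()
        content.append({"type": "image", "image": ref})
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_image_messages(
    ["file:///path/to/image1.jpg", "file:///path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
print(len(messages[0]["content"]))  # 3
```

The returned list has the same shape as the hand-written `messages` above and can be passed to `processor.apply_chat_template` and `process_vision_info` unchanged.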

Video Inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen2.5-VL, frame rate information is also fed to the model to align with absolute time
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
# The frame rate is passed through video_kwargs; a separate fps argument would reference an undefined name
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
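
The three message variants above (a list of frames, a local file, and a URL) differ only in the `video` value and the optional `fps` and `max_pixels` keys. A small builder can cover all three. This is an illustrative sketch; `build_video_messages` is a hypothetical name and not part of qwen_vl_utils.

```python
from typing import Optional, Sequence, Union


def build_video_messages(
    video: Union[str, Sequence[str]],
    prompt: str,
    fps: Optional[float] = None,
    max_pixels: Optional[int] = None,
) -> list:
    """Build a single-turn user message holding one video and a text query."""
    # A string is one video file or URL; any other sequence is treated as a list of frames.
    source = video if isinstance(video, str) else list(video)
    item = {"type": "video", "video": source}
    if fps is not None:
        item["fps"] = fps
    if max_pixels is not None:
        item["max_pixels"] = max_pixels
    return [{"role": "user", "content": [item, {"type": "text", "text": prompt}]}]


messages = build_video_messages(
    "file:///path/to/video1.mp4", "Describe this video.", fps=1.0, max_pixels=360 * 420
)
print(messages[0]["content"][0]["max_pixels"])  # 151200
```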

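Video URL compatibility depends largely on the version of the third-party library, as shown in the table below. If you do not want the default backend, switch it with FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS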
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
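
The table can be read as a small decision rule. The sketch below encodes it in a hypothetical helper, `pick_video_backend`, and then sets the `FORCE_QWENVL_VIDEO_READER` variable mentioned above. The preference for decord on local files and plain HTTP is an assumption based on the earlier note that decord loads videos faster, and the example URL is a placeholder. Setting the variable before the first video is loaded is likewise assumed to be necessary.

```python
import os
from urllib.parse import urlparse


def pick_video_backend(url: str, torchvision_version: tuple, decord_available: bool) -> str:
    """Choose a video reader backend according to the compatibility table."""
    scheme = urlparse(url).scheme.lower()
    new_torchvision = torchvision_version >= (0, 19, 0)
    if scheme == "https":
        # Only torchvision >= 0.19.0 is listed as supporting HTTPS.
        if new_torchvision:
            return "torchvision"
        raise RuntimeError("HTTPS video URLs need torchvision >= 0.19.0")
    if scheme == "http":
        if decord_available:
            return "decord"
        if new_torchvision:
            return "torchvision"
        raise RuntimeError("HTTP video URLs need decord or torchvision >= 0.19.0")
    # Local files are outside the table; prefer decord when it is installed.
    return "decord" if decord_available else "torchvision"


backend = pick_video_backend("https://example.com/clip.mp4", (0, 19, 0), decord_available=True)
os.environ["FORCE_QWENVL_VIDEO_READER"] = backend
print(backend)  # torchvision
```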

Batch Inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
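
Every snippet above strips the prompt from the output of `generate`, because each returned row contains the (padded) input tokens followed by the newly generated ones. The toy example below reproduces that slicing with plain Python lists so the step can be followed without a model. The token values are made up for illustration.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two prompts padded to the same width of 4 tokens (0 stands in for padding).
input_ids = [[0, 0, 11, 12], [21, 22, 23, 24]]
# Each output row repeats its input row and appends two new tokens.
generated_ids = [[0, 0, 11, 12, 101, 102], [21, 22, 23, 24, 201, 202]]

print(trim_prompt_tokens(input_ids, generated_ids))  # [[101, 102], [201, 202]]
```

Because padding makes all input rows equally long, slicing by the input length removes the prompt from every row of the batch.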

🤖 ModelScope

We strongly recommend that users, especially those in mainland China, use ModelScope. snapshot_download can help you resolve issues with downloading checkpoints.
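
A minimal sketch of that download step follows. It assumes the `modelscope` package is installed and that the checkpoint is published on ModelScope under the same ID used in the examples above. The wrapper name `download_checkpoint` is illustrative.

```python
def download_checkpoint(model_id: str = "Qwen/Qwen2.5-VL-72B-Instruct") -> str:
    """Download a checkpoint from ModelScope and return its local directory."""
    # Imported lazily so this module loads even without modelscope installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id)


# Example usage (downloads the full checkpoint, so it is left commented out):
# model_dir = download_checkpoint()
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     model_dir, torch_dtype="auto", device_map="auto"
# )
```

The returned directory can be used in place of the model ID in `from_pretrained` calls.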

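Processing Long Texts

The current config.json sets the context length to at most 32,768 tokens. To handle inputs longer than 32,768 tokens, we use YaRN, a technique that enhances the model's length extrapolation, to maintain performance on long texts. For supported frameworks, you can add the following to config.json to enable YaRN: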

{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}

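Note, however, that this method significantly affects performance on temporal and spatial localization tasks and is therefore not recommended. For long video inputs, because MRoPE itself is more economical with position ids, you can instead set max_position_embeddings directly to a larger value, such as 64k.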
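
The `max_position_embeddings` change can be scripted. The sketch below makes two assumptions: the key sits at the top level of config.json, and "64k" means 65,536. The helper name `raise_position_limit` is illustrative, and the demo runs on a throwaway file rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def raise_position_limit(config_path, new_limit: int = 65536) -> dict:
    """Rewrite max_position_embeddings in a config.json and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["max_position_embeddings"] = new_limit  # assumed to be a top-level key
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demonstration on a temporary file that stands in for a checkpoint's config.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"max_position_embeddings": 32768}), encoding="utf-8")
    updated = raise_position_limit(demo_path)

print(updated["max_position_embeddings"])  # 65536
```

Keeping a backup of the original config.json before editing makes the change easy to revert.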

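✨ Key Features

Key Enhancements

Strong visual understanding: Qwen2.5-VL is proficient at recognizing common objects such as flowers, birds, fish, and insects, and it also analyzes text, charts, icons, graphics, and layouts in images with high accuracy.
Agentic capability: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically invoke tools, with the ability to operate computers and phones.
Long-video understanding and event capture: Qwen2.5-VL can understand videos longer than 1 hour, and this release adds the ability to capture events by pinpointing the relevant video segments.
Visual grounding in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.
Structured output: for scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of the contents, which benefits applications in finance, commerce, and similar fields.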
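
To make the grounding and structured-output points concrete, the sketch below parses a JSON answer of the kind such a prompt might return. The key names `bbox_2d` and `label`, the optional Markdown fence around the payload, and the sample string are all assumptions for illustration; the text above only states that coordinates and attributes come back as JSON.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built indirectly to keep this listing readable


def parse_detections(model_output: str) -> list:
    """Extract labelled boxes from a JSON list, tolerating a surrounding code fence."""
    text = model_output.strip()
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, flags=re.DOTALL)
    if match:
        text = match.group(1)
    detections = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed key: [x1, y1, x2, y2] in pixels
        detections.append(
            {"label": item.get("label", ""), "box": (x1, y1, x2, y2), "area": (x2 - x1) * (y2 - y1)}
        )
    return detections


sample = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]\n' + FENCE
print(parse_detections(sample))
# [{'label': 'cat', 'box': (10, 20, 110, 220), 'area': 20000}]
```

Validating the parsed boxes against the image size before use is a sensible extra step, since model output is not guaranteed to be well formed.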

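Model Architecture Updates

Dynamic resolution and frame rate training for video understanding: dynamic resolution is extended to the temporal dimension through dynamic FPS sampling, so the model can understand videos at different sampling rates. Accordingly, mRoPE is updated in the time dimension with IDs and absolute time alignment, which lets the model learn temporal sequence and speed and ultimately pinpoint specific moments.
Streamlined and efficient vision encoder: window attention is applied strategically in the ViT to speed up training and inference. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.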
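
The idea of aligning temporal position IDs with absolute time can be illustrated with a few lines of arithmetic. This is a simplified sketch, not the model's implementation: the parameters `temporal_patch_size` (frames merged per temporal grid step) and `tokens_per_second` (ID units per second), and their default values of 2, are assumptions chosen for illustration.

```python
def temporal_position_ids(num_grids, fps, temporal_patch_size=2, tokens_per_second=2):
    """Temporal IDs that grow with elapsed seconds instead of with the frame index."""
    seconds_per_grid = temporal_patch_size / fps
    return [int(i * seconds_per_grid * tokens_per_second) for i in range(num_grids)]


# The same 8-second clip sampled at two different frame rates.
slow = temporal_position_ids(num_grids=4, fps=1.0)  # 8 frames at 1 fps
fast = temporal_position_ids(num_grids=8, fps=2.0)  # 16 frames at 2 fps
print(slow)  # [0, 4, 8, 12]
print(fast)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Both samplings cover roughly the same ID range because the IDs track seconds rather than frame counts, which is what allows a position to be mapped back to a moment in the video regardless of sampling rate.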

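We offer three models, with 3, 7, and 72 billion parameters. This repository contains the instruction-tuned 72B Qwen2.5-VL model. For more information, visit our blog and GitHub.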

📚 Documentation

Evaluation

Image Benchmarks

Benchmark	GPT4o	Claude3.5 Sonnet	Gemini-2-flash	InternVL2.5-78B	Qwen2-VL-72B	Qwen2.5-VL-72B
MMMU_val	70.3	70.4	70.7	70.1	64.5	70.2
MMMU_Pro	54.5	54.7	57.0	48.6	46.2	51.1
MathVista_MINI	63.8	65.4	73.1	76.6	70.5	74.8
MathVision_FULL	30.4	38.3	41.3	32.2	25.9	38.1
Hallusion Bench	55.0	55.16		57.4	58.1	55.16
MMBench_DEV_EN_V11	82.1	83.4	83.0	88.5	86.6	88
AI2D_TEST	84.6	81.2		89.1	88.1	88.4
ChartQA_TEST	86.7	90.8	85.2	88.3	88.3	89.5
DocVQA_VAL	91.1	95.2	92.1	96.5	96.1	96.4
MMStar	64.7	65.1	69.4	69.5	68.3	70.8
MMVet_turbo	69.1	70.1		72.3	74.0	76.19
OCRBench	736	788		854	877	885
OCRBench-V2(en/zh)	46.5/32.3	45.2/39.6	51.9/43.1	45/46.2	47.8/46.1	61.5/63.7
CC-OCR	66.6	62.7	73.0	64.7	68.7	79.8

Video Benchmarks

Benchmark	GPT4o	Gemini-1.5-Pro	InternVL2.5-78B	Qwen2VL-72B	Qwen2.5VL-72B
VideoMME w/o sub.	71.9	75.0	72.1	71.2	73.3
VideoMME w sub.	77.2	81.3	74.0	77.8	79.1
MVBench	64.6	60.5	76.4	73.6	70.4
MMBench-Video	1.63	1.30	1.97	1.70	2.02
LVBench	30.8	33.1	-	41.3	47.3
EgoSchema	72.2	71.2	-	77.9	76.2
PerceptionTest_test	-	-	-	68.0	73.2
MLVU_M-Avg_dev	64.6	-	75.7		74.6
TempCompass_overall	73.8	-	-		74.8

Agent Benchmarks

Benchmark	GPT4o	Gemini 2.0	Claude	Aguvis-72B	Qwen2VL-72B	Qwen2.5VL-72B
ScreenSpot	18.1	84.0	83.0			87.1
ScreenSpot Pro			17.1		1.6	43.6
AITZ_EM	35.3				72.8	83.2
Android Control High_EM				66.4	59.1	67.36
Android Control Low_EM				84.4	59.2	93.7
AndroidWorld_SR	34.5% (SoM)		27.9%	26.1%		35%
MobileMiniWob++_SR				66%		68%
OSWorld			14.90	10.26		8.83