Qwen2.5-VL-72B-Instruct-Pointer-AWQ: Open-Source Vision-Language Model

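Developed by PointerHQ

Qwen2.5-VL is the latest vision-language model in the Qwen family, with enhanced visual understanding, agentic capabilities, and structured output generation.

Task: Image-to-Text

Library: Transformers

Language: English. License: Other. Tags: #multimodal-video-understanding #visual-agent-tool-calling #dynamic-resolution-processing

Downloads: 5,592

Released: 2/9/2025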

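Model Overview

Qwen2.5-VL is a multimodal vision-language model for image-text-to-text tasks. It supports visual localization, long-video understanding, and structured output generation.

Model Features

Enhanced visual understanding

Beyond recognizing common objects, the model is highly capable of analyzing text, charts, icons, graphics, and layouts within images.

Agentic capabilities

The model can act directly as a visual agent that reasons and dynamically invokes tools, and it is capable of computer and phone use.

Long-video understanding and event capture

The model can comprehend videos longer than 1 hour and adds the ability to capture events by pinpointing the relevant video segments.

Visual localization in multiple formats

The model can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON output for coordinates and attributes.

Structured output generation

For scans of data such as invoices and tables, the model supports structured output of their contents, which benefits applications in finance, commerce, and similar fields.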

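Model Capabilities

Image and text understanding

Visual localization

Long-video analysis

Structured data extraction

Multimodal reasoning

Tool calling

Use Cases

Business and finance

Invoice processing

Automatically extract structured data from invoices

Improve the efficiency of financial processing

Table analysis

Parse table data in scanned documents

Simplify data entry

Video analysis

Long-video understanding

Analyze video content longer than 1 hour

Pinpoint specific event segments

Visual agent

Computer operation

Guide computer operation through visual understanding

Automate workflows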

🚀 Qwen2.5-VL-72B-Instruct-Pointer-AWQ

Because the official Qwen/Qwen2.5-VL-72B-Instruct-AWQ does not yet work with tensor parallelism on vLLM, this model fixes the issue and supports --tensor-parallel-size with 2, 4, or 8 GPUs. Use vllm==0.7.3.
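
As a minimal sketch of launching the model with tensor parallelism, the helper below assembles a vllm serve command line. The repository id PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ and the helper name build_vllm_command are assumptions for illustration; the accepted GPU counts are the ones listed above.

```python
import shlex

# GPU counts the text above says are supported for tensor parallelism.
SUPPORTED_TP_SIZES = (2, 4, 8)

# Assumed Hugging Face repository id; replace it with the path you actually use.
MODEL_ID = "PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ"


def build_vllm_command(tp_size, model_id=MODEL_ID):
    """Return the argument list for `vllm serve` with tensor parallelism."""
    if tp_size not in SUPPORTED_TP_SIZES:
        raise ValueError(f"tp_size must be one of {SUPPORTED_TP_SIZES}, got {tp_size}")
    return ["vllm", "serve", model_id, "--tensor-parallel-size", str(tp_size)]


if __name__ == "__main__":
    # Print a shell-ready command for a 4-GPU machine.
    print(shlex.join(build_vllm_command(4)))
```

The list form can be passed to subprocess.run directly, which avoids shell-quoting issues when the model path contains spaces.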

🚀 Quick Start

The examples below show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.

The code for Qwen2.5-VL is in the latest Hugging Face transformers, and we recommend building from source with the following command:

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise, you may encounter the following error:

KeyError: 'qwen2_5_vl'

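We provide a toolkit that lets you handle various kinds of visual input more conveniently, as if you were using an API. It supports base64, URLs, and interleaved images and videos. Install it with the following command:

# Using the `[decord]` extra is highly recommended for faster video loading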
pip install qwen-vl-utils[decord]==0.0.8

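If you are not on Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source to have it used when loading videos.

Chat with 🤗 Transformers

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils: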

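import torch  # only needed if you uncomment the flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)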
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better acceleration and memory savings, especially in multi-image and video scenarios
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

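# By default, the number of visual tokens per image ranges from 4 to 16384
# You can set min_pixels and max_pixels to suit your needs, e.g. a token range of 256 to 1280, to balance performance and cost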
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
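
The min_pixels and max_pixels comments above multiply a token count by 28*28, that is, one visual token per 28 x 28 pixel area. The sketch below works through that arithmetic and estimates how many visual tokens an image would produce under a given budget. estimate_visual_tokens is a hypothetical helper that only approximates the resizing done during preprocessing; it is not the library's implementation.

```python
import math

PIXELS_PER_TOKEN = 28 * 28  # one visual token per 28 x 28 pixel area, as in the comments above


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


def estimate_visual_tokens(height, width, min_pixels, max_pixels, factor=28):
    """Approximate the token count after resizing into the pixel budget.

    Assumption: both sides are snapped to multiples of `factor` and the image is
    rescaled, preserving aspect ratio, so its area fits [min_pixels, max_pixels].
    """
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink, rounding down so the area stays under the cap.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: enlarge, rounding up so the area reaches the floor.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return (h // factor) * (w // factor)


if __name__ == "__main__":
    min_pixels, max_pixels = pixel_budget(256, 1280)
    print(min_pixels, max_pixels)
    print(estimate_visual_tokens(1080, 1920, min_pixels, max_pixels))
```

A tighter max_pixels lowers the token count per image, which reduces memory use and latency at the cost of fine detail.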

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
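
When the number of images varies, the message structure used above can be assembled programmatically. The following is a minimal sketch; build_image_message is a hypothetical helper, not part of transformers or qwen_vl_utils, and the file paths are placeholders.

```python
def build_image_message(image_uris, prompt):
    """Build a single-turn user message with several images followed by a text query."""
    content = [{"type": "image", "image": uri} for uri in image_uris]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    messages = build_image_message(
        ["file:///path/to/image1.jpg", "file:///path/to/image2.jpg"],
        "Identify the similarities between these images.",
    )
    print(messages)
```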

Video inference

# Messages containing a list of images as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

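# In Qwen2.5-VL, frame rate information is also fed to the model so that frames align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# video_kwargs carries the sampling fps returned by process_vision_info,
# so fps is not passed to the processor a second time
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)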
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video URL compatibility depends largely on the version of the third-party library, as detailed in the table below. To override the default backend, set FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
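
The slicing step out_ids[len(in_ids):] used in every example removes the prompt tokens that generate() returns at the start of each output sequence, so that only newly generated tokens are decoded. A minimal sketch with plain Python lists and made-up token ids shows the effect:

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence, keeping only new tokens."""
    return [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)]


if __name__ == "__main__":
    # Made-up token ids: each output starts with its (padded) prompt.
    input_ids = [[0, 0, 11, 12], [21, 22, 23, 24]]
    generated_ids = [[0, 0, 11, 12, 101, 102], [21, 22, 23, 24, 201, 202]]
    print(trim_prompt_tokens(input_ids, generated_ids))
```

Because padding=True makes every prompt in the batch the same length, the same slice offset applies to each row.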

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. snapshot_download can help you solve issues with downloading checkpoints.
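
A minimal sketch of fetching a checkpoint through ModelScope follows. download_checkpoint is a hypothetical wrapper written for this example, and it assumes the modelscope package is installed; the model id in the usage note is copied from the examples above and may need adjusting for the checkpoint you want.

```python
def download_checkpoint(model_id, cache_dir=None):
    """Download a model snapshot from ModelScope and return the local directory."""
    # Imported lazily so this module can be loaded without modelscope installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id, cache_dir=cache_dir)
```

For example, download_checkpoint("Qwen/Qwen2.5-VL-72B-Instruct") should return a local directory that can be passed to from_pretrained in place of the model id.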

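Processing long texts

The current config.json is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation that ensures optimal performance on lengthy texts. For supported frameworks, you can add the following to config.json to enable YaRN: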

{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}

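Note, however, that this method has a significant impact on the performance of temporal and spatial localization tasks, so it is not recommended. For long video inputs, since MRoPE itself is more economical with position IDs, max_position_embeddings can instead be changed directly to a larger value, such as 64k.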
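
Raising max_position_embeddings amounts to editing one key in config.json. The sketch below does this with the standard library. It assumes the key sits at the top level of the file and reads 64k as 64 * 1024 = 65536; set_max_position_embeddings is a hypothetical helper, and the demonstration runs on a throwaway file rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def set_max_position_embeddings(config_path, new_value):
    """Rewrite max_position_embeddings in a config.json and return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old_value = config.get("max_position_embeddings")
    config["max_position_embeddings"] = new_value
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return old_value


if __name__ == "__main__":
    # Demonstrate on a temporary file so no real checkpoint is modified.
    with tempfile.TemporaryDirectory() as tmp:
        cfg = Path(tmp) / "config.json"
        cfg.write_text(json.dumps({"max_position_embeddings": 32768}), encoding="utf-8")
        print(set_max_position_embeddings(cfg, 64 * 1024))
        print(json.loads(cfg.read_text(encoding="utf-8")))
```

Keeping a backup copy of the original config.json before editing makes it easy to revert.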

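✨ Key Features

In the five months since Qwen2-VL's release, numerous developers have built new models on the Qwen2-VL vision-language models and given us valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key enhancements:

Visual understanding: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
Agentic capabilities: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically invokes tools, and it is capable of computer and phone use.
Long-video understanding and event capture: Qwen2.5-VL can comprehend videos longer than 1 hour and adds the ability to capture events by pinpointing the relevant video segments.
Visual localization in multiple formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON output for coordinates and attributes.
Structured output generation: For scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of their contents, which benefits applications in finance, commerce, and similar fields.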
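
Because localization results are returned as JSON text, they can be parsed directly. The sketch below assumes the model returns a JSON list of objects with a bbox_2d field ([x1, y1, x2, y2]) and a label field. These field names and the sample response are assumptions for illustration and should be checked against actual model output.

```python
import json


def parse_detections(response_text):
    """Extract (label, box) pairs from a response that contains one JSON array.

    Text around the array (for example a Markdown code fence) is ignored.
    """
    start = response_text.find("[")
    end = response_text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in response")
    detections = json.loads(response_text[start : end + 1])
    return [(d["label"], tuple(d["bbox_2d"])) for d in detections]


if __name__ == "__main__":
    # Hypothetical response text, not real model output.
    sample = 'Detected objects: [{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]'
    print(parse_detections(sample))
```

The bracket search is deliberately simple: it would fail if the surrounding text itself contained square brackets, so stricter parsing may be needed in production.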

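Model architecture updates:

Dynamic resolution and frame rate training for video understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed and ultimately acquire the ability to pinpoint specific moments.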
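
To illustrate the idea of aligning temporal position IDs with absolute time, the sketch below derives IDs from timestamps rather than from frame indices, so the same moment in a video receives the same ID regardless of the sampling rate. The function name and the constants temporal_patch_size=2 and ids_per_second=2 are illustrative assumptions, not values taken from the model's code or configuration.

```python
def temporal_position_ids(num_frames, fps, temporal_patch_size=2, ids_per_second=2):
    """Return one temporal position ID per group of `temporal_patch_size` frames.

    Each ID is proportional to the group's start time in seconds, so the spacing
    between IDs grows when the video is sampled at a lower frame rate.
    """
    seconds_per_group = temporal_patch_size / fps
    num_groups = num_frames // temporal_patch_size
    return [int(group * seconds_per_group * ids_per_second) for group in range(num_groups)]


if __name__ == "__main__":
    # The same 8-second clip sampled at two different frame rates.
    print(temporal_position_ids(num_frames=8, fps=1.0))
    print(temporal_position_ids(num_frames=16, fps=2.0))
```

With index-based IDs, the fourth group would always get ID 3 whether it starts at 3 seconds or 6 seconds; with time-based IDs, the gap between consecutive IDs encodes how much real time has passed.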

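Streamlined and efficient vision encoder: We improve both training and inference speed by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.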
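
For readers unfamiliar with the two components named above, here is a minimal pure-Python sketch of RMSNorm and a SwiGLU feed-forward layer in their commonly published forms. The dimensions and weights are toy values, and nothing here is taken from the Qwen2.5-VL implementation.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0, -4.0]
    normed = rms_norm(x, weight=[1.0] * 4)
    # Toy weights: model size 4, hidden size 2.
    w_gate = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
    w_up = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    print(normed)
    print(swiglu_ffn(normed, w_gate, w_up, w_down))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which removes one reduction per call; SwiGLU replaces the single activation of a plain MLP with a gated product of two projections.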

We have three models with 3, 7, and 72 billion parameters. This repository contains the instruction-tuned 72B Qwen2.5-VL model. For more information, visit our blog and GitHub.

📚 Documentation

Evaluation

Image benchmarks

Benchmark	GPT4o	Claude3.5 Sonnet	Gemini-2-flash	InternVL2.5-78B	Qwen2-VL-72B	Qwen2.5-VL-72B
MMMU_val	70.3	70.4	70.7	70.1	64.5	70.2
MMMU_Pro	54.5	54.7	57.0	48.6	46.2	51.1
MathVista_MINI	63.8	65.4	73.1	76.6	70.5	74.8
MathVision_FULL	30.4	38.3	41.3	32.2	25.9	38.1
Hallusion Bench	55.0	55.16		57.4	58.1	55.16
MMBench_DEV_EN_V11	82.1	83.4	83.0	88.5	86.6	88
AI2D_TEST	84.6	81.2		89.1	88.1	88.4
ChartQA_TEST	86.7	90.8	85.2	88.3	88.3	89.5
DocVQA_VAL	91.1	95.2	92.1	96.5	96.1	96.4
MMStar	64.7	65.1	69.4	69.5	68.3	70.8
MMVet_turbo	69.1	70.1		72.3	74.0	76.19
OCRBench	736	788		854	877	885
OCRBench-V2(en/zh)	46.5/32.3	45.2/39.6	51.9/43.1	45/46.2	47.8/46.1	61.5/63.7
CC-OCR	66.6	62.7	73.0	64.7	68.7	79.8

Video benchmarks

Benchmark	GPT4o	Gemini-1.5-Pro	InternVL2.5-78B	Qwen2VL-72B	Qwen2.5VL-72B
VideoMME w/o sub.	71.9	75.0	72.1	71.2	73.3
VideoMME w sub.	77.2	81.3	74.0	77.8	79.1
MVBench	64.6	60.5	76.4	73.6	70.4
MMBench-Video	1.63	1.30	1.97	1.70	2.02
LVBench	30.8	33.1		41.3	47.3
EgoSchema	72.2	71.2		77.9	76.2
PerceptionTest_test				68.0	73.2
MLVU_M-Avg_dev	64.6		75.7		74.6
TempCompass_overall	73.8				74.8

Agent benchmarks

Benchmark	GPT4o	Gemini 2.0	Claude	Aguvis-72B	Qwen2VL-72B	Qwen2.5VL-72B
ScreenSpot	18.1	84.0	83.0			87.1
ScreenSpot Pro			17.1		1.6	43.6
AITZ_EM	35.3				72.8	83.2
Android Control High_EM				66.4	59.1	67.36
Android Control Low_EM				84.4	59.2	93.7
AndroidWorld_SR	34.5% (SoM)		27.9%	26.1%		35%
MobileMiniWob++_SR				66%		68%
OSWorld			14.90	10.26		8.83

📄 License

This project is released under the Qwen license.

📚 Citation

If you find our work helpful, please cite the following:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}