Qwen2.5 VL 32B Instruct GGUF

Developed by unsloth

Qwen2.5-VL-32B-Instruct is a powerful vision-language model with enhanced mathematical and problem-solving abilities, suited to multimodal tasks.

Image-to-Text | English | Open Source License: Apache-2.0 | #multimodal-video-understanding #dynamic-visual-grounding #structured-data-extraction

Release Time : 5/11/2025

Model Overview

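Qwen2.5-VL-32B-Instruct is an instruction-tuned vision-language model that handles image analysis, text understanding, chart parsing, and video understanding, and supports visual localization in multiple formats as well as structured output.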

Model Features

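Enhanced Visual Understanding

Efficiently analyzes text, charts, icons, graphics, and layouts within images.

Agent Capabilities

Can act as a visual agent that calls tools dynamically and can operate computers and phones.

Long Video Understanding

Understands videos longer than 1 hour and pinpoints the relevant video segments.

Visual Localization

Generates bounding boxes or points to locate objects in images, and produces stable JSON output for coordinates and attributes.

Structured Output

Supports structured output for data such as scanned invoices and tables, which is useful in finance, commerce, and similar fields.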

Model Capabilities

Image Analysis

Text Understanding

Chart Parsing

Video Understanding

Visual Localization

Structured Output

Tool Calling

Use Cases

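Finance

Invoice Processing

Automatically parses invoice content and generates structured data.

Improves data processing efficiency and accuracy.

Commerce

Table Parsing

Extracts structured information from scanned tables.

Simplifies data entry.

Education

Chart Understanding

Parses charts and figures in educational materials.

Supports learning and teaching.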

🚀 Qwen2.5-VL-32B-Instruct

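Qwen2.5-VL-32B-Instruct is a multimodal model that performs well on image and text processing. Beyond strong visual understanding, it handles complex mathematical and problem-solving tasks, which makes for a better interactive experience.

✨ Key Features

Key Enhancements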

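Visual understanding: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
Agentic behavior: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, and is capable of computer use and phone use.
Long video understanding and event capture: Qwen2.5-VL can comprehend videos longer than 1 hour, and it has a new ability to capture events by pinpointing the relevant video segments.
Visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.
Structured output generation: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured output of their contents, which benefits applications in finance, commerce, and similar fields.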
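As a rough illustration of consuming the JSON grounding output described above, the sketch below parses a list of detections and discards degenerate boxes. The key names `bbox_2d` and `label`, the coordinate order `[x1, y1, x2, y2]`, and the helper `parse_detections` are assumptions made for illustration; the actual schema depends on how the model is prompted.

```python
import json


def parse_detections(raw_output: str) -> list[dict]:
    """Parse a JSON list of detections and keep only boxes with positive area."""
    detections = json.loads(raw_output)
    parsed = []
    for item in detections:
        # Assumed schema: {"bbox_2d": [x1, y1, x2, y2], "label": "..."}
        x1, y1, x2, y2 = item["bbox_2d"]
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes
        parsed.append(
            {
                "label": item["label"],
                "box": (x1, y1, x2, y2),
                "area": (x2 - x1) * (y2 - y1),
            }
        )
    return parsed


# Hypothetical model output used only to exercise the parser.
sample = (
    '[{"bbox_2d": [10, 20, 110, 220], "label": "invoice number"},'
    ' {"bbox_2d": [5, 5, 5, 9], "label": "empty"}]'
)
result = parse_detections(sample)
print(result)
```

Validating the boxes before use is a defensive choice: generated JSON can be syntactically valid while still containing unusable coordinates.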

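Model Architecture Updates

Dynamic resolution and frame rate training for video understanding: we extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, which lets the model comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, so the model can learn temporal sequence and speed and ultimately pinpoint specific moments.
Streamlined and efficient vision encoder: we improve both training and inference speed by strategically introducing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.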
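To make the idea of absolute time alignment more concrete, here is a simplified sketch; it is not the model's actual mRoPE code. It gives each sampled frame a temporal ID proportional to its timestamp, so the gap between IDs reflects elapsed time rather than frame count. The function name `temporal_ids` and the `ids_per_second` value are hypothetical.

```python
def temporal_ids(num_frames: int, fps: float, ids_per_second: float = 2.0) -> list[int]:
    """Map frame indices to temporal IDs that scale with absolute time.

    Frame i is sampled at timestamp i / fps seconds; its ID is that
    timestamp multiplied by a fixed (assumed) number of IDs per second.
    """
    return [round(i / fps * ids_per_second) for i in range(num_frames)]


# Four frames sampled at 2 FPS span 1.5 seconds.
dense = temporal_ids(4, fps=2.0)
# Four frames sampled at 1 FPS span 3 seconds, so the IDs are spaced twice as far apart.
sparse = temporal_ids(4, fps=1.0)
print(dense, sparse)
```

With index-based IDs, both clips would get the same positions even though the second covers twice as much time; tying IDs to timestamps is what lets the spacing carry speed information.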

We have four models with 3, 7, 32, and 72 billion parameters. This repository contains the instruction-tuned 32B Qwen2.5-VL model. For more information, visit our blog and GitHub.

📚 Detailed Documentation

Evaluation

Vision Evaluation

Dataset	Qwen2.5-VL-72B	Qwen2-VL-72B	Qwen2.5-VL-32B
MMMU	70.2	64.5	70
MMMU Pro	51.1	46.2	49.5
MMStar	70.8	68.3	69.5
MathVista	74.8	70.5	74.7
MathVision	38.1	25.9	40.0
OCRBenchV2	61.5/63.7	47.8/46.1	57.2/59.1
CC-OCR	79.8	68.7	77.1
DocVQA	96.4	96.5	94.8
InfoVQA	87.3	84.5	83.4
LVBench	47.3	-	49.00
CharadesSTA	50.9	-	54.2
VideoMME	73.3/79.1	71.2/77.8	70.5/77.9
MMBench-Video	2.02	1.7	1.93
AITZ	83.2	-	83.1
Android Control	67.4/93.7	66.4/84.4	69.6/93.3
ScreenSpot	87.1	-	88.5
ScreenSpot Pro	43.6	-	39.4
AndroidWorld	35	-	22.0
OSWorld	8.83	-	5.92

Text Evaluation

Model	MMLU	MMLU-PRO	MATH	GPQA-diamond	MBPP	HumanEval
Qwen2.5-VL-32B	78.4	68.8	82.2	46.0	84.0	91.5
Mistral-Small-3.1-24B	80.6	66.8	69.3	46.0	74.7	88.4
Gemma3-27B-IT	76.9	67.5	89	42.4	74.4	87.8
GPT-4o-Mini	82.0	61.7	70.2	39.4	84.8	87.2
Claude-3.5-Haiku	77.6	65.0	69.2	41.6	85.6	88.1

Requirements

The code for Qwen2.5-VL has been merged into the latest Hugging Face transformers. We recommend building from source with the following command:

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise, you may encounter the following error:

KeyError: 'qwen2_5_vl'

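We provide a toolkit that helps you handle various types of visual input more conveniently, as if you were using an API. It covers base64, URLs, and interleaved images and videos. Install it with the following command:

# Using the `[decord]` extra is strongly recommended for faster video loading.
pip install qwen-vl-utils[decord]==0.0.8

If you are not using Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.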
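If it is unclear which video reader will be available on a given machine, a small check like the one below can help. It is only a sketch of the fallback described above, not the toolkit's own selection logic; the helper names `module_available` and `pick_video_backend` are hypothetical.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


def pick_video_backend(available=module_available) -> str:
    """Prefer decord when it is installed, otherwise fall back to torchvision."""
    return "decord" if available("decord") else "torchvision"


print(pick_video_backend())
```

Passing the availability check as a parameter keeps the function easy to test without installing or uninstalling packages.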

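Image Resolution for Performance Boost

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. You can set the minimum and maximum number of pixels to find the configuration that fits your needs, for example a token range of 256-1280, to balance speed and memory usage.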

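from transformers import AutoProcessor

# Each visual token covers a 28 x 28 pixel area, so 256-1280 tokens map to these pixel budgets.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)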
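The arithmetic behind these values can be sketched as follows, assuming (as the `* 28 * 28` factors above suggest) that one visual token corresponds to a 28 x 28 pixel area. The helper names are hypothetical.

```python
PATCH_SIDE = 28  # assumed side length, in pixels, of the area covered by one visual token


def pixel_budget(num_tokens: int) -> int:
    """Convert a visual-token count into a pixel budget."""
    return num_tokens * PATCH_SIDE * PATCH_SIDE


def approx_tokens(height: int, width: int) -> int:
    """Estimate the visual-token count for an image whose sides are multiples of 28."""
    return (height // PATCH_SIDE) * (width // PATCH_SIDE)


print(pixel_budget(256), pixel_budget(1280))
print(approx_tokens(280, 420))
```

For example, a 280 x 420 image splits into 10 x 15 such areas, so it would cost roughly 150 visual tokens under this assumption.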

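In addition, we provide two methods for fine-grained control over the size of images fed into the model:

Define min_pixels and max_pixels: images are resized to fit within the range of min_pixels and max_pixels while keeping their aspect ratio.
Specify exact dimensions: set resized_height and resized_width directly. These values are rounded to the nearest multiple of 28.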

# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
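The resizing behavior described above (sides rounded to multiples of 28, total pixels kept within a range, aspect ratio roughly preserved) can be sketched as follows. This is an illustrative reimplementation under those stated rules, not the toolkit's code; the function name `fit_to_pixel_range` is hypothetical.

```python
import math


def fit_to_pixel_range(height, width, min_pixels, max_pixels, factor=28):
    """Return (h, w) as multiples of `factor` with h * w inside the pixel range."""
    # Start from the nearest multiples of `factor`.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Shrink both sides by the same scale, rounding down to stay under the cap.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Grow both sides by the same scale, rounding up to reach the floor.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


lo, hi = 256 * 28 * 28, 1280 * 28 * 28
full_hd = fit_to_pixel_range(1080, 1920, lo, hi)
print(full_hd)
```

Setting `min_pixels` equal to `max_pixels`, as in the second message above, pins every image to roughly the same pixel count regardless of its original size.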

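Processing Long Texts

The current config.json sets the context length to a maximum of 32,768 tokens. To handle inputs longer than 32,768 tokens, we use YaRN, a technique that improves the model's length extrapolation and keeps performance strong on long texts. For supported frameworks, add the following to config.json to enable YaRN: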

{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}

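Note, however, that this method has a significant impact on the performance of temporal and spatial localization tasks, so it is not recommended. For long video inputs, because mRoPE itself uses IDs more economically, you can instead change max_position_embeddings directly to a larger value, such as 64k.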
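As a quick sanity check on the numbers involved, the sketch below computes the nominal context length implied by the YaRN fragment (the scaling factor times the original maximum position embeddings) and the token count behind "64k". The dictionary mirrors only the fields shown above; where it sits inside config.json is left out because the source elides it, and the helper name is hypothetical.

```python
# Fields copied from the fragment above; the enclosing config structure is not shown in the source.
yarn_fragment = {
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768,
}


def nominal_extended_context(fragment: dict) -> int:
    """Nominal context length targeted by the scaling factor."""
    return fragment["factor"] * fragment["original_max_position_embeddings"]


extended = nominal_extended_context(yarn_fragment)
sixty_four_k = 64 * 1024
print(extended, sixty_four_k)
```

So a factor of 4 targets 131,072 tokens, while the alternative of raising max_position_embeddings to 64k doubles the default 32,768.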

💻 Usage Examples

Chat with 🤗 Transformers

The following snippet shows how to use the chat model with transformers and qwen_vl_utils:

import torch  # only needed for torch.bfloat16 in the optional flash_attention_2 setup below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-32B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
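The trimming step in the snippet above exists because `generate` returns the prompt tokens followed by the newly generated ones. A toy example with plain lists, using made-up token IDs and a hypothetical helper `trim_prompt`, shows what the list comprehension does:

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each output row starts with a copy of its prompt row.
prompt_batch = [[101, 7, 8], [101, 9, 10]]
generated_batch = [[101, 7, 8, 55, 56], [101, 9, 10, 77, 78]]

new_tokens = trim_prompt(prompt_batch, generated_batch)
print(new_tokens)
```

Only the trimmed sequences are decoded, so the printed text contains the model's answer without the prompt.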

<details>
<summary>Multi-image inference</summary>

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

</details>
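When the number of images varies, the message structure used for multi-image inference can be built programmatically. The helper below is a hypothetical convenience function, not part of transformers or qwen_vl_utils; it only assembles the same dictionary layout shown in the examples.

```python
def build_user_message(image_refs, prompt):
    """Build a single-turn conversation with any number of images and one text query."""
    content = [{"type": "image", "image": ref} for ref in image_refs]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_user_message(
    ["file:///path/to/image1.jpg", "file:///path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
print(messages)
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as a hand-written one.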

<details>
<summary>Video inference</summary>
```python
# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

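# In Qwen2.5-VL, frame rate information is also passed to the model so it can align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
# fps is not passed explicitly: no such variable is defined in this snippet,
# and video_kwargs is expected to carry the sampling rate.
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)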
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
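print(output_text)
```

Video URL compatibility depends largely on the version of the third-party library, as detailed in the table below. If you do not want the default backend, switch it by setting FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌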
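The compatibility table can be encoded as a small lookup, which is handy for failing early when a video URL will not load. The functions below are hypothetical helpers that only restate the table. The environment variable name comes from the text above; setting it before the first video is loaded is an assumption about when it is read.

```python
import os


def version_tuple(version: str) -> tuple:
    """Turn '0.19.0' or '0.18.1+cu121' into a comparable tuple of ints."""
    core = version.split("+")[0]
    return tuple(int(part) for part in core.split(".")[:3])


def supports_url(backend: str, scheme: str, torchvision_version: str = "0.19.0") -> bool:
    """Return whether the backend can read a video URL with the given scheme."""
    if backend == "decord":
        return scheme == "http"  # per the table: HTTP works, HTTPS does not
    if backend == "torchvision":
        return version_tuple(torchvision_version) >= (0, 19, 0)
    raise ValueError(f"unknown backend: {backend}")


# Force the torchvision reader (assumed to take effect if set before loading any video).
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"
print(supports_url("torchvision", "https", "0.19.0"))
```

Note that `version_tuple` only handles plain numeric versions with an optional local suffix; pre-release strings would need a proper version parser.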

</details>

<details>
<summary>Batch inference</summary>

```python
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```

</details>
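In a batch, conversations may mix multimodal and text-only turns, as in the batch inference example. The sketch below walks a list of conversations and gathers the image references in order, which illustrates why a text-only conversation contributes no visual inputs. It is a hypothetical helper written for illustration, not the implementation of `process_vision_info`.

```python
def collect_images(conversations):
    """Gather image references from a list of conversations, in order."""
    images = []
    for conversation in conversations:
        for message in conversation:
            content = message["content"]
            if isinstance(content, str):
                continue  # plain-text message: no visual parts
            for part in content:
                if part.get("type") == "image":
                    images.append(part["image"])
    return images


batch = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "file:///path/to/image1.jpg"},
                {"type": "image", "image": "file:///path/to/image2.jpg"},
                {"type": "text", "text": "What are the common elements in these pictures?"},
            ],
        }
    ],
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
]
found = collect_images(batch)
print(found)
```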

### 🤖 ModelScope
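We strongly recommend that users, especially those in mainland China, use ModelScope. `snapshot_download` can help you resolve issues with downloading checkpoints.

### More Usage Tips
For input images, we support local files, base64, and URLs. For videos, local files are supported; URL support depends on the video backend, as described in the video inference section.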
```python
# You can insert a local file path, a URL, or a base64-encoded image directly at the desired position in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64-encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```
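To produce a base64 string in the form shown in the last example, the image bytes can be encoded with the standard library. The helper names below are hypothetical; the `data:image;base64,` prefix is copied from the example above.

```python
import base64


def to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as a data URI in the form used above."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:image;base64," + encoded


def image_part_from_file(path: str) -> dict:
    """Read an image file and wrap it as a message content part."""
    with open(path, "rb") as f:
        return {"type": "image", "image": to_data_uri(f.read())}


# The first three bytes of a JPEG file encode to "/9j/", matching the prefix in the example.
jpeg_magic = to_data_uri(b"\xff\xd8\xff")
print(jpeg_magic)
```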

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

If you find our work helpful, please cite:

@article{Qwen2.5-VL,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025}
}