Qwen2.5 VL 32B Instruct GGUF

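Developed by unsloth

Qwen2.5-VL-32B-Instruct is a vision-language model with enhanced math and problem-solving abilities, suited to multimodal tasks.

Image-to-Text | English | License: Apache-2.0 | #MultimodalVideoUnderstanding #DynamicVisualGrounding #StructuredDataExtraction

Downloads: 464

Released: 5/11/2025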

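Model Overview

Qwen2.5-VL-32B-Instruct is an instruction-tuned vision-language model that handles image analysis, text understanding, chart parsing, and video understanding, and supports visual grounding in several formats as well as structured output.

Model Features

Enhanced visual understanding

Analyzes text, charts, icons, graphics, and layouts within images.

Agent capabilities

Acts as a visual agent that calls tools dynamically and can operate computers and phones.

Long-video understanding

Understands videos longer than 1 hour and pinpoints the relevant video segments.

Visual grounding

Locates objects in an image by generating bounding boxes or points, and outputs coordinates and attributes as stable JSON.

Structured output

Produces structured output from scanned invoices, tables, and similar data, which is useful in finance, commerce, and related fields.

Model Capabilities

Image analysis

Text understanding

Chart parsing

Video understanding

Visual grounding

Structured output

Tool calling

Use Cases

Finance

Invoice processing

Automatically parses invoice content and generates structured data.

Improves the efficiency and accuracy of data processing.

Commerce

Table parsing

Extracts structured information from scanned tables.

Simplifies the data-entry workflow.

Education

Chart understanding

Parses charts and graphics in educational materials.

Supports learning and teaching.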

🚀 Qwen2.5-VL-32B-Instruct

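Qwen2.5-VL-32B-Instruct is a multimodal model that performs well on image and text processing. In addition to strong visual understanding, it handles complex math and problem-solving tasks, giving users a better interactive experience.

✨ Key Features

Key Enhancements

Visual understanding: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
Agentic behavior: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, and is capable of computer use and phone use.
Long-video understanding and event capture: Qwen2.5-VL can comprehend videos longer than 1 hour, and it gains the new ability to capture events by pinpointing the relevant video segments.
Visual grounding in different formats: Qwen2.5-VL can accurately locate objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.
Structured output generation: For scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of their contents, which benefits applications in finance, commerce, and similar fields.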
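
As a rough illustration of consuming the JSON grounding output described above, the sketch below parses a model reply into box records. The reply text and the `bbox_2d` and `label` keys are assumptions made for this example, not a format confirmed by this page, so check the actual output of your own prompts before relying on them.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence that may wrap the JSON payload


def parse_grounding_reply(reply: str) -> list[dict]:
    """Extract a list of detections from a model reply.

    Assumes (hypothetically) that the reply is a JSON array, optionally
    wrapped in a Markdown code fence, whose items carry "bbox_2d" and "label".
    """
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    boxes = []
    for item in json.loads(payload):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append({"label": item.get("label", ""), "box": (x1, y1, x2, y2)})
    return boxes


# Hand-written reply used only for this example.
sample_reply = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]\n' + FENCE
print(parse_grounding_reply(sample_reply))
```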

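Model Architecture Updates

Dynamic resolution and frame-rate training for video understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, so that the model can understand videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, so that the model can learn temporal order and speed and, ultimately, pinpoint specific moments.
Streamlined and efficient vision encoder: We speed up training and inference by strategically introducing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.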
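
To make the idea of dynamic FPS sampling with absolute time more concrete, here is a minimal sketch. It is not the model's actual implementation. It only shows that sampling the same clip at different rates yields different frame counts while each sampled frame keeps a timestamp in absolute seconds. The function name and parameters are hypothetical.

```python
def sample_frames(duration_s: float, native_fps: float, target_fps: float):
    """Return (frame_index, timestamp_seconds) pairs sampled at target_fps."""
    total_frames = int(duration_s * native_fps)
    step = native_fps / target_fps  # native frames skipped per sampled frame
    samples = []
    position = 0.0
    while int(position) < total_frames:
        index = int(position)
        # The timestamp depends only on the native frame index, not on target_fps.
        samples.append((index, index / native_fps))
        position += step
    return samples


# The same 4-second, 30 fps clip sampled at 1 fps and at 2 fps.
print(len(sample_frames(4.0, 30.0, 1.0)))
print(len(sample_frames(4.0, 30.0, 2.0)))
```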

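We have four models with 3, 7, 32, and 72 billion parameters. This repo contains the instruction-tuned 32B Qwen2.5-VL model. For more information, visit our blog and GitHub.

📚 Documentation

Evaluation

Vision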

| Dataset | Qwen2.5-VL-72B | Qwen2-VL-72B | Qwen2.5-VL-32B |
|---|---|---|---|
| MMMU | 70.2 | 64.5 | 70 |
| MMMU Pro | 51.1 | 46.2 | 49.5 |
| MMStar | 70.8 | 68.3 | 69.5 |
| MathVista | 74.8 | 70.5 | 74.7 |
| MathVision | 38.1 | 25.9 | 40.0 |
| OCRBenchV2 | 61.5/63.7 | 47.8/46.1 | 57.2/59.1 |
| CC-OCR | 79.8 | 68.7 | 77.1 |
| DocVQA | 96.4 | 96.5 | 94.8 |
| InfoVQA | 87.3 | 84.5 | 83.4 |
| LVBench | 47.3 | - | 49.00 |
| CharadesSTA | 50.9 | - | 54.2 |
| VideoMME | 73.3/79.1 | 71.2/77.8 | 70.5/77.9 |
| MMBench-Video | 2.02 | 1.7 | 1.93 |
| AITZ | 83.2 | - | 83.1 |
| Android Control | 67.4/93.7 | 66.4/84.4 | 69.6/93.3 |
| ScreenSpot | 87.1 | - | 88.5 |
| ScreenSpot Pro | 43.6 | - | 39.4 |
| AndroidWorld | 35 | - | 22.0 |
| OSWorld | 8.83 | - | 5.92 |

Text

| Model | MMLU | MMLU-PRO | MATH | GPQA-diamond | MBPP | Human Eval |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-32B | 78.4 | 68.8 | 82.2 | 46.0 | 84.0 | 91.5 |
| Mistral-Small-3.1-24B | 80.6 | 66.8 | 69.3 | 46.0 | 74.7 | 88.4 |
| Gemma3-27B-IT | 76.9 | 67.5 | 89 | 42.4 | 74.4 | 87.8 |
| GPT-4o-Mini | 82.0 | 61.7 | 70.2 | 39.4 | 84.8 | 87.2 |
| Claude-3.5-Haiku | 77.6 | 65.0 | 69.2 | 41.6 | 85.6 | 88.1 |

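Requirements

The code for Qwen2.5-VL is included in the latest Hugging Face transformers, and we advise building from source with the following command:

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise, you may encounter the following error:

KeyError: 'qwen2_5_vl'

We provide a toolkit that helps you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it with the following command:

# It is highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils[decord]==0.0.8

If you are not using Linux, you might not be able to install decord from PyPI. In that case, you can use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.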
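
A quick way to see which of the two video readers is importable in your environment is sketched below. It mirrors the fallback described above but is only an assumption about the outcome. The selection logic inside qwen-vl-utils itself is not shown on this page and may differ.

```python
import importlib.util


def available_video_backend() -> str:
    """Return "decord" if that package is importable, otherwise "torchvision"."""
    has_decord = importlib.util.find_spec("decord") is not None
    return "decord" if has_decord else "torchvision"


print(available_video_backend())
```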

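Image Resolution for Performance Boost

The model supports a wide range of input resolutions. By default it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. You can set the minimum and maximum number of pixels to find the configuration that best fits your needs, for example a token range of 256-1280, to balance speed and memory usage.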

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
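
The pixel limits above imply that one visual token covers a 28 x 28 pixel area. Under that assumption, the sketch below estimates the resized dimensions and token count for a given image. It is a simplified approximation written for this example, not the library's own resizing routine, and the function name is hypothetical.

```python
import math

PATCH = 28  # assumed side length, in pixels, of the area covered by one visual token


def estimate_resize(height, width, min_pixels=256 * PATCH * PATCH, max_pixels=1280 * PATCH * PATCH):
    """Approximate the resized (height, width) and the resulting visual token count."""
    # Snap both sides to the nearest multiple of 28, keeping at least one patch.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Shrink while keeping the aspect ratio, rounding down to stay under the cap.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Enlarge while keeping the aspect ratio, rounding up to reach the floor.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    tokens = (h // PATCH) * (w // PATCH)
    return h, w, tokens


print(estimate_resize(1080, 1920))
```

With the 256-1280 token range, a 1080 x 1920 frame is scaled down until its token count fits under 1280, while a tiny thumbnail is scaled up to at least 256 tokens.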

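In addition, we provide two methods for fine-grained control over the size of the images fed to the model:

Define min_pixels and max_pixels: Images are resized so that their pixel count falls within the range from min_pixels to max_pixels while their aspect ratio is preserved.
Specify exact dimensions: Set resized_height and resized_width directly. These values are rounded to the nearest multiple of 28.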

# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
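
The value 50176 used above equals 64 * 28 * 28, so setting both limits to it pins the image to roughly 64 visual tokens, assuming one token per 28 x 28 pixel area. A small helper along these lines (hypothetical, not part of qwen-vl-utils) can make such budgets easier to read:

```python
PATCH_AREA = 28 * 28  # assumed number of pixels covered by one visual token


def image_entry_with_token_budget(image_uri: str, min_tokens: int, max_tokens: int) -> dict:
    """Build an image content entry whose pixel limits match a token budget."""
    return {
        "type": "image",
        "image": image_uri,
        "min_pixels": min_tokens * PATCH_AREA,
        "max_pixels": max_tokens * PATCH_AREA,
    }


entry = image_entry_with_token_budget("file:///path/to/your/image.jpg", 64, 64)
print(entry["min_pixels"], entry["max_pixels"])  # 50176 50176
```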

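Processing Long Texts

The current config.json sets the context length to at most 32,768 tokens. To handle inputs longer than 32,768 tokens, we use YaRN, a technique that improves the model's length extrapolation and helps maintain performance on long texts. For supported frameworks, you can add the following to config.json to enable YaRN: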

{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}

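Note, however, that this method significantly affects performance on temporal and spatial localization tasks, so it is not recommended. For long video inputs, because MRoPE itself uses position IDs more economically, you can instead set max_position_embeddings directly to a larger value, such as 64k.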
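
For a sense of scale, the sketch below works out the context length implied by the YaRN settings above, assuming the usual reading of `factor` as a multiplier on the original window. Where the fragment sits inside config.json is not specified here, so the dictionary is only a standalone copy of the listed fields.

```python
# Standalone copy of the fields listed above; its place in config.json is not shown here.
yarn_fragment = {
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768,
}


def extended_context(fragment: dict) -> int:
    """Context length implied by scaling the original window by the YaRN factor."""
    return fragment["factor"] * fragment["original_max_position_embeddings"]


print(extended_context(yarn_fragment))  # 131072
print(64 * 1024)  # 65536, the "64k" alternative for max_position_embeddings
```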

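💻 Usage Examples

Chat with 🤗 Transformers

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils:

import torch  # needed only if you enable the flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info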

# Default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory savings, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-32B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")

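# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.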
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
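
The trimming step in the snippet above removes the prompt tokens that `generate` echoes at the start of each output sequence. The toy example below reproduces that slicing on plain Python lists with made-up token IDs, so it can be run without loading the model.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each output starts with a copy of its prompt.
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 44]]
print(trim_prompt_tokens(prompts, outputs))  # [[42, 43], [44]]
```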

<details>
<summary>Multi-image inference</summary>

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

</details>

<details>
<summary>Video inference</summary>

```python
# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen2.5-VL, frame-rate information is also fed to the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
# video_kwargs carries the sampling fps returned by process_vision_info,
# so fps is not passed as a separate argument (the original `fps=fps` was undefined).
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

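Video URL compatibility depends largely on the version of the third-party library. The details are in the table below. If you do not want to use the default backend, you can change it with FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.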

| Backend | HTTP | HTTPS |
|---|---|---|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision < 0.19.0 | ❌ | ❌ |
| decord | ✅ | ❌ |
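
The compatibility table can be turned into a small lookup, sketched below. The helper name and the example URL are hypothetical, and the assumption that the environment variable should be set before any video is loaded is mine, not something stated on this page.

```python
import os
from urllib.parse import urlparse


def pick_video_reader(url: str, torchvision_version: tuple = (0, 19, 0)) -> str:
    """Choose a backend for an HTTP(S) video URL according to the table above."""
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        raise ValueError("the table only covers HTTP and HTTPS URLs")
    if torchvision_version >= (0, 19, 0):
        return "torchvision"  # reads both HTTP and HTTPS
    if scheme == "http":
        return "decord"  # decord reads HTTP but not HTTPS
    raise ValueError("no listed backend reads HTTPS with torchvision < 0.19.0")


# Hypothetical URL; set the variable before loading any video.
os.environ["FORCE_QWENVL_VIDEO_READER"] = pick_video_reader("https://example.com/video.mp4")
print(os.environ["FORCE_QWENVL_VIDEO_READER"])
```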

Batch inference

```python
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```

</details>

### 🤖 ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you resolve issues with downloading checkpoints.
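
A minimal sketch of that download step follows. It assumes the `modelscope` package is installed and that the checkpoint is published on ModelScope under the same ID used in the examples above. Neither assumption is confirmed on this page.

```python
def download_checkpoint(model_id: str = "Qwen/Qwen2.5-VL-32B-Instruct") -> str:
    """Download a checkpoint through ModelScope and return the local directory."""
    # Imported lazily so this snippet can be loaded without modelscope installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id)


# Example (requires network access and the modelscope package):
# model_dir = download_checkpoint()
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")
```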

### More Usage Tips
For input images, we support local files, base64, and URLs. For videos, we currently support only local files.
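
For the base64 form, the sketch below shows one way to build a `data:image;base64,...` string from raw image bytes. The helper name is hypothetical. In practice you would read the bytes from your own file.

```python
import base64


def image_bytes_to_data_uri(raw: bytes) -> str:
    """Encode raw image bytes as a data URI of the form used in the messages."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:image;base64,{encoded}"


jpeg_magic = b"\xff\xd8\xff"  # the first three bytes of a JPEG file
print(image_bytes_to_data_uri(jpeg_magic))  # data:image;base64,/9j/
```

The `/9j/` prefix in the output is why base64-encoded JPEG data typically starts with those characters.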
```python
# You can insert a local file path, a URL, or a base64-encoded image directly at the desired position in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64-encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

If you find our work helpful, please cite the following:

@article{Qwen2.5-VL,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025}
}