Qwen2.5-VL-32B-Instruct: an open-source vision-language model for multimodal tasks

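Developed by Alhdrawi

Qwen2.5-VL-32B-Instruct is the latest vision-language model in the Qwen family. It offers strong visual understanding and agent capabilities and supports multimodal task processing.

Image-to-Text

Transformers

Supports multiple languages
License: Apache-2.0
#Multimodal visual understanding #Long-video event localization #Structured data output

Downloads: 58

Released: 3/31/2025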

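Model overview

Qwen2.5-VL-32B-Instruct is a 32-billion-parameter vision-language model focused on improving visual understanding, mathematical reasoning, and problem solving. It supports multimodal interaction over images, video, and text.

Model features

Enhanced visual understanding

Beyond recognizing common objects, it is good at analyzing text, charts, icons, graphics, and layouts within images.

Agent capabilities

Can act directly as a visual agent that dynamically calls tools and supports computer and phone operation.

Long-video understanding and event capture

Can parse videos longer than 1 hour, with a new ability to pinpoint the relevant segments.

Multi-format visual localization

Locates objects in images by generating bounding boxes or point coordinates, and outputs coordinates and attributes in a stable JSON format.

Structured output

Supports structured output for scanned data such as invoices and tables, which suits finance, commerce, and similar scenarios.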

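Model capabilities

Image analysis

Video understanding

Text generation

Mathematical reasoning

Logical reasoning

Knowledge question answering

Visual localization

Agent tasks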

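Use cases

Finance and business

Invoice processing

Automatically recognizes invoice information and outputs it in structured form

Scores 94.8 on DocVQA

Education

Solving math problems

Parses and solves math problems that contain charts and formulas

Scores 74.7 on MathVista

Video analysis

Long-video content understanding

Parses video content longer than 1 hour and locates key events

Scores 49.00 on LVBench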

🚀 Qwen2.5-VL-32B-Instruct

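Qwen2.5-VL-32B-Instruct is a powerful vision-language model that performs well on mathematics and problem solving. It handles a variety of visual inputs and is suited to image recognition, video analysis, knowledge question answering, and other areas.

🚀 Quick start

Install dependencies

The Qwen2.5-VL code is included in the latest Hugging Face transformers. We recommend building from source with the following command: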

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise you may encounter the following error:

KeyError: 'qwen2_5_vl'

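We also provide a toolkit that makes it easier to handle various types of visual input. You can install it with the following command:

# Using the `[decord]` extra is strongly recommended for faster video loading
pip install qwen-vl-utils[decord]==0.0.8

If you are not on Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source to have it used when loading videos.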
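To check which of the two video readers mentioned above is installed in your environment, a minimal sketch like the following may help. The helper name `available_video_backend` is hypothetical and not part of qwen-vl-utils; it only tests whether the `decord` package is importable and otherwise reports the torchvision fallback.

```python
import importlib.util


def available_video_backend() -> str:
    """Return "decord" if the decord package is importable, else "torchvision".

    Hypothetical helper for illustration only; qwen-vl-utils performs its
    own backend selection internally.
    """
    if importlib.util.find_spec("decord") is not None:
        return "decord"
    return "torchvision"


print(available_video_backend())
```

If this prints `torchvision` on a machine where you expected decord, the `[decord]` extra was probably not installed.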

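Usage examples

Chatting with 🤗 Transformers

The following code example shows how to call the chat model with transformers and qwen_vl_utils:

import torch  # only needed for the commented-out flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)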
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings, especially with multiple images and video
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-32B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

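# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")

# By default, the number of visual tokens per image ranges from 4 to 16384
# You can set min_pixels and max_pixels as needed, for example a token range of 256 to 1280, to balance performance and cost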
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
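The commented-out `min_pixels` and `max_pixels` lines in the example above multiply a visual-token count by `28*28`. The sketch below makes that arithmetic explicit. The helper name `pixel_budget` is hypothetical, and the 28x28 factor is taken directly from the example rather than from any library constant.

```python
PIXELS_PER_TOKEN = 28 * 28  # factor used in the commented example above


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

The resulting values can be passed as `min_pixels` and `max_pixels` to `AutoProcessor.from_pretrained`, as in the commented line of the example.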

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen2.5-VL, frame-rate information is also passed to the model so it can align with absolute time
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# video_kwargs holds the video metadata returned by process_vision_info and is forwarded to the processor
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

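Video URL compatibility depends largely on the version of the third-party library, as detailed in the table below. If you do not want to use the default backend, change it by setting FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS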
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
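Each example above builds `generated_ids_trimmed` before decoding. The reason is that `generate` returns the prompt tokens followed by the newly generated tokens for every sequence, so the prompt prefix has to be sliced off. The sketch below reproduces that step with plain Python lists standing in for tensors; the function name `trim_prompt_tokens` is hypothetical.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence.

    Mirrors the list comprehension used in the examples above:
    out_ids[len(in_ids):] for each (in_ids, out_ids) pair.
    """
    return [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)]


# Two prompts of equal (padded) length, each followed by new tokens
prompts = [[101, 7, 8], [101, 0, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 0, 9, 55, 56]]
print(trim_prompt_tokens(prompts, outputs))  # [[42, 43], [55, 56]]
```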

🤖 ModelScope

We strongly recommend that users, especially those in mainland China, use ModelScope. snapshot_download can help you resolve problems with downloading checkpoints.
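A minimal sketch of fetching the checkpoint through ModelScope is shown below. It assumes the `modelscope` package is installed and that the model is published on ModelScope under the same ID used in the examples above, which this document does not confirm. The wrapper `download_checkpoint` is a hypothetical name; only `snapshot_download` comes from ModelScope.

```python
def download_checkpoint(model_id: str = "Qwen/Qwen2.5-VL-32B-Instruct", cache_dir=None) -> str:
    """Download a model snapshot via ModelScope and return its local directory.

    Assumption: the model ID on ModelScope matches the Hugging Face ID.
    """
    # Imported lazily so this module loads even when modelscope is not installed
    from modelscope import snapshot_download  # requires `pip install modelscope`

    return snapshot_download(model_id, cache_dir=cache_dir)


# Example (downloads the full checkpoint, so it is left commented out):
# model_dir = download_checkpoint()
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")
```

The returned directory can be passed to `from_pretrained` in place of the hub ID.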

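✨ Key features

Core enhancements

Visual understanding: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but is also highly effective at analyzing text, charts, icons, graphics, and layouts within images.
Agent capability: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, and is capable of computer and phone use.
Long-video understanding and event capture: Qwen2.5-VL can comprehend videos longer than 1 hour, and now has the new ability to capture events by pinpointing the relevant video segments.
Multi-format visual localization: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and provides stable JSON output for coordinates and attributes.
Structured output generation: for scans of invoices, forms, tables, and similar data, Qwen2.5-VL supports structured output of their contents, which benefits applications in finance, commerce, and other fields.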
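To illustrate how the JSON localization output described above might be consumed, here is a hedged sketch of a parser. The exact output schema is not specified in this document: the `bbox_2d` and `label` keys and the `[x1, y1, x2, y2]` ordering are assumptions for illustration, and `parse_detections` is a hypothetical helper.

```python
import json


def parse_detections(model_output: str) -> list:
    """Extract a JSON list of detections from free-form model output.

    Assumed schema (not confirmed by this document):
    [{"bbox_2d": [x1, y1, x2, y2], "label": "..."}]
    """
    start = model_output.find("[")
    end = model_output.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON list found in model output")
    detections = []
    for item in json.loads(model_output[start : end + 1]):
        x1, y1, x2, y2 = item["bbox_2d"]
        detections.append(
            {
                "label": item.get("label", ""),
                "box": (x1, y1, x2, y2),
                "area": max(0, x2 - x1) * max(0, y2 - y1),
            }
        )
    return detections


sample = 'Here are the objects:\n[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]'
print(parse_detections(sample))
```

Slicing from the first `[` to the last `]` tolerates extra prose around the JSON, though it would need tightening if the surrounding text itself contained brackets.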

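Model architecture updates

Dynamic resolution and frame-rate training for video understanding

We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, which lets the model understand videos at various sampling rates. Accordingly, we update mRoPE in the temporal dimension with IDs and absolute time alignment, so the model can learn temporal order and speed and ultimately gain the ability to pinpoint specific moments.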
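The idea of aligning temporal IDs with absolute time can be illustrated with a toy calculation. This is not the model's implementation: the function `temporal_ids` and the `ids_per_second` parameter are hypothetical, chosen only to show that when IDs are proportional to elapsed time, the same moment gets the same ID regardless of sampling rate.

```python
def temporal_ids(num_frames: int, fps: float, ids_per_second: float = 2.0) -> list:
    """Assign each sampled frame a temporal ID proportional to its timestamp.

    Toy illustration: frame i is sampled at i / fps seconds, and the ID is
    that timestamp scaled by a fixed (assumed) number of IDs per second.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [round(i / fps * ids_per_second) for i in range(num_frames)]


print(temporal_ids(4, fps=1.0))  # [0, 2, 4, 6]
print(temporal_ids(4, fps=2.0))  # [0, 1, 2, 3]
```

In both calls the frame taken at the 1-second mark receives ID 2 (index 1 at 1 fps, index 2 at 2 fps), whereas plain frame indices would differ between the two sampling rates.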

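Streamlined and efficient vision encoder

We improve both training and inference speed by strategically introducing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.

The Qwen2.5-VL family was released with models of 3, 7, and 72 billion parameters. This repository contains the instruction-tuned 32B Qwen2.5-VL model. For more information, visit our blog and GitHub.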

📚 Detailed documentation

Evaluation

Vision evaluation

Dataset	Qwen2.5-VL-72B	Qwen2-VL-72B	Qwen2.5-VL-32B
MMMU	70.2	64.5	70
MMMU Pro	51.1	46.2	49.5
MMStar	70.8	68.3	69.5
MathVista	74.8	70.5	74.7
MathVision	38.1	25.9	40.0
OCRBenchV2	61.5/63.7	47.8/46.1	57.2/59.1
CC-OCR	79.8	68.7	77.1
DocVQA	96.4	96.5	94.8
InfoVQA	87.3	84.5	83.4
LVBench	47.3	-	49.00
CharadesSTA	50.9	-	54.2
VideoMME	73.3/79.1	71.2/77.8	70.5/77.9
MMBench-Video	2.02	1.7	1.93
AITZ	83.2	-	83.1
Android Control	67.4/93.7	66.4/84.4	69.6/93.3
ScreenSpot	87.1	-	88.5
ScreenSpot Pro	43.6	-	39.4
AndroidWorld	35	-	22.0
OSWorld	8.83	-	5.92

Text evaluation

Model	MMLU	MMLU-PRO	MATH	GPQA-diamond	MBPP	Human Eval
Qwen2.5-VL-32B	78.4	68.8	82.2	46.0	84.0	91.5
Mistral-Small-3.1-24B	80.6	66.8	69.3	46.0	74.7	88.4
Gemma3-27B-IT	76.9	67.5	89	42.4	74.4	87.8
GPT-4o-Mini	82.0	61.7	70.2	39.4	84.8	87.2
Claude-3.5-Haiku	77.6	65.0	69.2	41.6	85.6	88.1

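🔧 Technical details

Model information

Attribute	Details
Model type	Multimodal question-answering model
Base models	deepseek-ai/DeepSeek-V3-0324, sesame/csm-1b, Qwen/QwQ-32B, deepseek-ai/DeepSeek-R1, ds4sd/SmolDocling-256M-preview, mistralai/Mistral-Small-3.1-24B-Instruct-2503
Training datasets	nvidia/Llama-Nemotron-Post-Training-Dataset-v1, FreedomIntelligence/medical-o1-reasoning-SFT, facebook/natural_reasoning, glaiveai/reasoning-v1-20m
Evaluation metrics	accuracy, bertscore, code_eval

Long-text processing

The current config.json is set for a context length of up to 32,768 tokens. To handle inputs longer than 32,768 tokens, we use YaRN, a technique that improves the model's length extrapolation and helps maintain performance on long texts. However, this approach has a significant impact on temporal and spatial localization tasks, so it is not recommended. For long video inputs, because MRoPE is more economical in its use of position IDs, max_position_embeddings can instead be changed directly to a larger value, such as 64k.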
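The `max_position_embeddings` change described above can be made by editing a local copy of config.json. The sketch below assumes the key sits at the top level of the file and interprets "64k" as 65536; both are assumptions, and `set_max_position_embeddings` is a hypothetical helper.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, new_value: int = 65536):
    """Rewrite max_position_embeddings in a local config.json.

    Returns the previous value (or None if the key was absent).
    Assumes the key is stored at the top level of the JSON document.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")
    config["max_position_embeddings"] = new_value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Example (path is a placeholder for your local checkpoint directory):
# set_max_position_embeddings("/path/to/Qwen2.5-VL-32B-Instruct/config.json")
```

Keeping a backup of the original file is advisable, since the change affects every subsequent load of that checkpoint.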

📄 License

This project is licensed under Apache-2.0.

📖 Citation

If you find our work helpful, please cite the following:

@article{Qwen2.5-VL,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025}
}