Qwen2.5-VL-72B-Instruct-AWQ: Open-Source Multimodal Model with Multi-Format Input and Strong Visual Understanding


Qwen2.5 VL 72B Instruct AWQ

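Developed by Benasd

Qwen2.5-VL is a multimodal large language model from the Tongyi Qianwen (Qwen) team. It offers strong visual understanding and agent capabilities and accepts images, video, and text as input.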

Image-Text-to-Text

Transformers

Language: English | License: Other | Tags: #multimodal-visual-understanding #long-video-analysis #agent-control

Downloads: 173

Released: 2/13/2025

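Model Overview

Qwen2.5-VL is the latest vision-language model in the Qwen series. It focuses on better visual understanding, agent behavior, and structured output, and suits domains such as finance and commerce.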

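Model Features

Enhanced visual understanding

Goes beyond common object recognition to analyze text, charts, icons, graphics, and layouts within images

Agent capabilities

Acts directly as a visual agent that reasons and calls tools dynamically, and can operate computers and phones

Long video understanding

Understands videos longer than one hour and adds event capture by pinpointing the relevant video segments

Multi-format visual localization

Localizes objects in images by generating bounding boxes or point coordinates, with stable JSON output

Structured output

Produces structured output for data such as invoices and tables, which is useful in finance, commerce, and similar fields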

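Model Capabilities

Image understanding

Video understanding

Text recognition

Chart analysis

Agent tasks

Visual localization

Structured data extraction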

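Use Cases

Business analysis

Invoice processing

Automatically recognize and extract key information from invoices

Automate financial data entry

Business report analysis

Parse charts and data in business reports

Generate business insights quickly

Agents

Phone automation

Control phone apps through visual instructions

Automate testing and operation

Education

Solving math problems

Parse math problems that contain charts and formulas

Provide step-by-step solutions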

🚀 Qwen2.5-VL-72B-Instruct

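Qwen2.5-VL-72B-Instruct is the latest multimodal model in the Qwen family, with strong image and video understanding and visual agent capabilities. It analyzes text, charts, and other elements in images, understands long videos and captures events in them, performs visual localization, and generates structured output.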

🚀 Quick Start

Multi-GPU inference

Run multi-GPU inference with the following docker command:

docker run -it --name iddt-ben-qwen25vl72 --gpus '"device=0,1"' -v huggingface:/root/.cache/huggingface --shm-size=32g -p 30000:8000 --ipc=host benasd/vllm:latest --model Benasd/Qwen2.5-VL-72B-Instruct-AWQ  --dtype float16 --quantization awq -tp 2
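Once the container is running, the server should be reachable on host port 30000 (mapped from container port 8000). The sketch below assumes that the benasd/vllm image starts vLLM's OpenAI-compatible server, as the official vLLM image does; that assumption, the endpoint path, and the helper names build_chat_payload and ask are illustrative and not confirmed by the instructions above.

```python
import json
import urllib.request

# Assumption: the container exposes vLLM's OpenAI-compatible API on host port 30000.
BASE_URL = "http://localhost:30000/v1"
MODEL_ID = "Benasd/Qwen2.5-VL-72B-Instruct-AWQ"


def build_chat_payload(image_url: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build a chat-completions request with one image and one text prompt."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def ask(image_url: str, prompt: str) -> str:
    """Send the request to the running server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(image_url, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires the container to be up):
# print(ask("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
#           "Describe this image."))
```

Only the standard library is used here so the request shape stays visible; an OpenAI-style client library pointed at the same base URL would likely work as well.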

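Installing dependencies

The Qwen2.5-VL code is included in the latest Hugging Face transformers library. Building from source with the following command is recommended: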

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise, you may encounter the following error:

KeyError: 'qwen2_5_vl'
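To check ahead of time whether an installed transformers build will raise this error, you can look for the model type in the library's auto-config registry. This is a minimal sketch; the helper name supports_qwen2_5_vl is hypothetical, and it relies on the CONFIG_MAPPING_NAMES mapping in transformers.models.auto.configuration_auto.

```python
def supports_qwen2_5_vl() -> bool:
    """Return True if the installed transformers registers the 'qwen2_5_vl' model type."""
    try:
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
    except ImportError:
        # transformers is not installed, or is too old to expose the mapping.
        return False
    return "qwen2_5_vl" in CONFIG_MAPPING_NAMES


if not supports_qwen2_5_vl():
    print("transformers lacks 'qwen2_5_vl'; install it from source with the pip command above.")
```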

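Using the toolkit

To handle the various kinds of visual input more conveniently, install the following toolkit:

# Using the `[decord]` extra is strongly recommended for faster video loading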
pip install qwen-vl-utils[decord]==0.0.8

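If you are not on Linux, you may be unable to install decord from PyPI. In that case, run pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.

Chatting with 🤗 Transformers

The following example shows how to chat with the model using transformers and qwen_vl_utils: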

import torch  # only needed for the optional flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings, especially with multiple images or video
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

# By default, the number of visual tokens per image ranges from 4 to 16384.
# Set min_pixels and max_pixels as needed, e.g. a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
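The min_pixels and max_pixels comments above imply that each visual token covers a 28x28 pixel area. The sketch below estimates how many tokens an image would use under given bounds. It is an approximation written for illustration, not the library's own resizing routine; the function name estimate_visual_tokens and the rounding choices are assumptions.

```python
import math

PATCH = 28  # side length in pixels of the area covered by one visual token


def estimate_visual_tokens(width, height,
                           min_pixels=256 * 28 * 28,
                           max_pixels=1280 * 28 * 28):
    """Approximate the visual token count of an image after resizing."""
    # Snap both sides to multiples of the patch size.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink while keeping the aspect ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge while keeping the aspect ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


for size in [(56, 56), (640, 480), (1920, 1080)]:
    print(size, estimate_visual_tokens(*size))
```

Lowering max_pixels reduces the token count of large images, and with it memory use and latency, at some cost in fine detail.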

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
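When the image list is built at run time, a small helper can assemble the messages structure shown above. This is an illustrative sketch; build_image_message is a hypothetical name and not part of qwen_vl_utils.

```python
from pathlib import Path


def build_image_message(image_paths, prompt):
    """Return a single-turn `messages` list with one entry per local image."""
    content = [
        # as_uri() needs an absolute path, so resolve() is applied first.
        {"type": "image", "image": Path(p).resolve().as_uri()}
        for p in image_paths
    ]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_image_message(
    ["/path/to/image1.jpg", "/path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
```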

Video inference

# Messages containing a list of images treated as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen2.5-VL, frame-rate information is also passed to the model so that it aligns with absolute time
# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # already carries the fps values returned by process_vision_info
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

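Video URL compatibility depends largely on the version of the third-party library in use; the table below gives the details. To override the default backend, set FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS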
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
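The compatibility table can be turned into a small selection routine. The sketch below encodes only what the table states; the helper names, the version parsing, and the choice of decord for local files are assumptions made for illustration.

```python
import os
from urllib.parse import urlparse


def parse_version(version: str) -> tuple:
    """Turn a string such as '0.19.0+cu121' into (0, 19, 0), ignoring the local suffix."""
    core = version.split("+")[0]
    return tuple(int(part) for part in core.split(".")[:3] if part.isdigit())


def pick_video_reader(video_url: str, torchvision_version: str) -> str:
    """Choose a backend able to read the URL, following the compatibility table."""
    scheme = urlparse(video_url).scheme
    torchvision_ok = parse_version(torchvision_version) >= (0, 19, 0)
    if scheme == "https":
        if not torchvision_ok:
            raise ValueError("HTTPS video URLs need torchvision >= 0.19.0")
        return "torchvision"
    if scheme == "http":
        return "torchvision" if torchvision_ok else "decord"
    # Assumption: local files are not covered by the table; decord is chosen
    # here because it is the recommended backend for faster loading.
    return "decord"


# Set the variable before importing qwen_vl_utils, to be safe.
os.environ["FORCE_QWENVL_VIDEO_READER"] = pick_video_reader(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
    "0.19.0",
)
```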

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
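The trimming step in the examples above works because generate returns each prompt followed by its newly generated tokens, and every prompt in a padded batch has the same length. The toy example below shows the slicing on plain lists; the token IDs are made up.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy batch: 0 stands in for a padding token added to the shorter prompt.
input_ids = [[1, 2, 3], [0, 4, 5]]
generated_ids = [[1, 2, 3, 9, 9], [0, 4, 5, 7, 8]]
print(trim_prompt_tokens(input_ids, generated_ids))  # [[9, 9], [7, 8]]
```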

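🤖 Using ModelScope

Users, especially those in mainland China, are strongly encouraged to use ModelScope. Its snapshot_download function can help resolve problems with downloading checkpoints.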
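A minimal sketch of that download path follows. It assumes the modelscope package is installed and that the base checkpoint is published on ModelScope under the ID Qwen/Qwen2.5-VL-72B-Instruct; the wrapper name download_checkpoint is hypothetical.

```python
def download_checkpoint(model_id: str = "Qwen/Qwen2.5-VL-72B-Instruct") -> str:
    """Download a checkpoint from ModelScope and return the local directory."""
    # Imported lazily so this module still loads when modelscope is absent.
    from modelscope import snapshot_download

    return snapshot_download(model_id)


# Example (downloads a large checkpoint, so it is left commented out):
# model_dir = download_checkpoint()
# The returned directory can then be passed to from_pretrained(...) in place
# of the hub ID used in the examples above.
```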

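✨ Key Features

Key enhancements

Visual understanding: Qwen2.5-VL not only recognizes common objects such as flowers, birds, fish, and insects, but also analyzes text, charts, icons, graphics, and layouts within images in depth.
Agent behavior: it acts directly as a visual agent that reasons and calls tools dynamically, and supports computer-use and phone-use scenarios.
Long video understanding and event capture: it understands videos longer than one hour and can now capture events by pinpointing the relevant video segments.
Multi-format visual localization: it localizes objects in images accurately by generating bounding boxes or points, and provides stable JSON output for coordinates and attributes.
Structured output generation: for scanned data such as invoices, forms, and tables, it generates structured output of their contents, which is useful in finance, commerce, and similar fields.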
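As an illustration of consuming the JSON localization output, the sketch below parses a reply into labeled boxes. The exact schema is not given here: the bbox_2d and label keys and the optional Markdown fence around the JSON are assumptions, so adjust them to match what the model actually returns for your prompt.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker the model may wrap its JSON in


def parse_boxes(model_output: str) -> list:
    """Extract a list of {label, box} dicts from JSON text returned by the model."""
    text = model_output.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1]
        text = text.rsplit(FENCE, 1)[0]
    boxes = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed key name
        boxes.append({"label": item.get("label", ""), "box": (x1, y1, x2, y2)})
    return boxes


# Hand-written sample reply, not real model output.
sample = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]\n' + FENCE
print(parse_boxes(sample))
```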

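Model architecture updates

Dynamic resolution and frame-rate training for video understanding: dynamic resolution is extended to the temporal dimension through dynamic FPS sampling, so the model can handle videos at different sampling rates. mRoPE is also updated in the temporal dimension with IDs and absolute time alignment, which lets the model learn temporal order and speed and ultimately pinpoint specific moments.
Streamlined, efficient vision encoder: window attention is applied strategically in the ViT, which speeds up training and inference. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.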
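For readers unfamiliar with the two building blocks named above, here is a dependency-free sketch of their standard definitions. It is not the model's implementation: the linear projections that produce the gate and up vectors, and the down projection that follows, are omitted, and the function names are illustrative.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a per-element gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate, up):
    """SwiGLU gating: SiLU(gate) multiplied element-wise with the 'up' vector."""
    return [silu(g) * u for g, u in zip(gate, up)]


normalized = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
print(normalized)
print(swiglu_gate(gate=[0.0, 2.0], up=[5.0, 1.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs one fewer reduction per vector.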

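Three models are available, with 3, 7, and 72 billion parameters. This repository contains the instruction-tuned 72B Qwen2.5-VL model, quantized with AWQ. For more information, see the blog and GitHub.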

📚 Detailed Documentation

Evaluation

Image benchmarks

Benchmark	GPT4o	Claude3.5 Sonnet	Gemini-2-flash	InternVL2.5-78B	Qwen2-VL-72B	Qwen2.5-VL-72B
MMMU_val	70.3	70.4	70.7	70.1	64.5	70.2
MMMU_Pro	54.5	54.7	57.0	48.6	46.2	51.1
MathVista_MINI	63.8	65.4	73.1	76.6	70.5	74.8
MathVision_FULL	30.4	38.3	41.3	32.2	25.9	38.1
Hallusion Bench	55.0	55.16		57.4	58.1	55.16
MMBench_DEV_EN_V11	82.1	83.4	83.0	88.5	86.6	88
AI2D_TEST	84.6	81.2		89.1	88.1	88.4
ChartQA_TEST	86.7	90.8	85.2	88.3	88.3	89.5
DocVQA_VAL	91.1	95.2	92.1	96.5	96.1	96.4
MMStar	64.7	65.1	69.4	69.5	68.3	70.8
MMVet_turbo	69.1	70.1		72.3	74.0	76.19
OCRBench	736	788		854	877	885
OCRBench-V2(en/zh)	46.5/32.3	45.2/39.6	51.9/43.1	45/46.2	47.8/46.1	61.5/63.7
CC-OCR	66.6	62.7	73.0	64.7	68.7	79.8

Video benchmarks

Benchmark	GPT4o	Gemini-1.5-Pro	InternVL2.5-78B	Qwen2VL-72B	Qwen2.5VL-72B
VideoMME w/o sub.	71.9	75.0	72.1	71.2	73.3
VideoMME w sub.	77.2	81.3	74.0	77.8	79.1
MVBench	64.6	60.5	76.4	73.6	70.4
MMBench-Video	1.63	1.30	1.97	1.70	2.02
LVBench	30.8	33.1	-	41.3	47.3
EgoSchema	72.2	71.2	-	77.9	76.2
PerceptionTest_test	-	-	-	68.0	73.2
MLVU_M-Avg_dev	64.6	-	75.7		74.6
TempCompass_overall	73.8	-	-		74.8

Agent benchmarks

Benchmark	GPT4o	Gemini 2.0	Claude	Aguvis-72B	Qwen2VL-72B	Qwen2.5VL-72B
ScreenSpot	18.1	84.0	83.0			87.1
ScreenSpot Pro			17.1		1.6	43.6
AITZ_EM	35.3				72.8	83.2
Android Control High_EM				66.4	59.1	67.36
Android Control Low_EM				84.4	59.2	93.7
AndroidWorld_SR	34.5% (SoM)		27.9%	26.1%		35%
MobileMiniWob++_SR				66%		68%
OSWorld			14.90	10.26		8.83

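🔧 Technical Details

Model information

Property	Details
Model type	Multimodal image-text-to-text model
Training data	Not specified
Base model	Qwen/Qwen2.5-VL-72B-Instruct
Library	transformers
Pipeline tag	image-text-to-text
Tags	multimodal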

License information

This project is released under the Qwen license.


📖 Citation

If you find our work helpful, please cite:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}