Qwen2.5-VL-72B-Instruct-Pointer-AWQ: Open-Source Vision-Language Model

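Qwen2.5 VL 72B Instruct Pointer AWQ

Developed by PointerHQ

Qwen2.5-VL is the latest vision-language model in the Qwen family, with enhanced visual understanding, agent capabilities, and structured output generation.

Image-to-Text

Transformers

Language: English · License: Other · Tags: #Multimodal video understanding #Visual agent tool calling #Dynamic resolution processing

Downloads: 5,592

Released: 2/9/2025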

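Model overview

Qwen2.5-VL is a multimodal vision-language model that excels at image-text-to-text tasks and supports visual grounding, long-video understanding, and structured output generation.

Model features

Enhanced visual understanding

Recognizes common objects and is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.

Agent capabilities

Can act directly as a visual agent that reasons and dynamically calls tools, and is capable of operating computers and phones.

Long-video understanding and event capture

Can understand videos longer than 1 hour, and adds the ability to capture events by pinpointing the relevant video segments.

Visual grounding in multiple formats

Can accurately localize objects in an image by generating bounding boxes or points, and produces stable JSON output for coordinates and attributes.

Structured output generation

For scanned data such as invoices and tables, supports structured output of the contents, which benefits applications in finance, commerce, and similar fields.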

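Model capabilities

Image-text understanding

Visual grounding

Long-video analysis

Structured data extraction

Multimodal reasoning

Tool calling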

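Use cases

Business and finance

Invoice processing

Automatically extract structured data from invoices

Improves the efficiency of financial processing

Table analysis

Parse tabular data in scanned documents

Simplifies data entry workflows

Video analysis

Long-video understanding

Analyze video content longer than 1 hour

Pinpoints specific event segments

Visual agent

Computer operation

Guide computer operation through visual understanding

Automates workflows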

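🚀 Qwen2.5-VL-72B-Instruct-Pointer-AWQ

The official Qwen/Qwen2.5-VL-72B-Instruct-AWQ does not yet work with tensor parallelism on vLLM. This model fixes that issue and supports tensor parallelism (the --tensor-parallel-size option) across 2, 4, or 8 GPUs. Use vllm==0.7.3.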
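As a rough illustration of the note above, the sketch below assembles a vllm serve command line for this checkpoint and rejects GPU counts other than 2, 4, or 8. The repository ID PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ and the helper name build_vllm_command are assumptions made for the example; only the supported GPU counts and the vLLM version come from the text.

```python
import shlex

# GPU counts the note above lists as supported for tensor parallelism.
SUPPORTED_TP_SIZES = (2, 4, 8)

# Assumed Hugging Face repository ID; adjust to wherever the weights actually live.
MODEL_ID = "PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ"


def build_vllm_command(tp_size, model_id=MODEL_ID):
    """Return the argv list for serving the model with tensor parallelism."""
    if tp_size not in SUPPORTED_TP_SIZES:
        raise ValueError(f"tp_size must be one of {SUPPORTED_TP_SIZES}, got {tp_size}")
    return ["vllm", "serve", model_id, "--tensor-parallel-size", str(tp_size)]


if __name__ == "__main__":
    # Print a shell-ready command for a 4-GPU node.
    print(shlex.join(build_vllm_command(4)))
```

Keeping the command as a list makes it easy to pass to subprocess.run without shell quoting issues.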

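🚀 Quick start

The examples below show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.

The code for Qwen2.5-VL is included in the latest Hugging Face transformers. We advise building from source with the following command:

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise, you may encounter the following error:

KeyError: 'qwen2_5_vl'

We provide a toolkit that lets you handle various kinds of visual input more conveniently, as if you were using an API. It supports base64, URLs, and interleaved images and videos. You can install it with the following command:

# It is highly recommended to use the `[decord]` feature for faster video loading
pip install qwen-vl-utils[decord]==0.0.8

If you are not using Linux, you may not be able to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.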
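To make the base64 input mentioned above more concrete, here is a minimal sketch that wraps raw image bytes as an image content entry. The helper name to_base64_image_entry is hypothetical, and the data:image;base64, prefix is an assumption about the format the toolkit accepts; check the toolkit's own documentation before relying on it.

```python
import base64


def to_base64_image_entry(image_bytes):
    """Wrap raw image bytes as an image content entry using a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # Assumed URI prefix; not confirmed by this document.
    return {"type": "image", "image": f"data:image;base64,{encoded}"}


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real image file.
    entry = to_base64_image_entry(b"\xff\xd8\xff\xe0 placeholder")
    print(entry["image"][:40])
```

The returned dictionary has the same shape as the image entries used in the messages lists in the examples that follow.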

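Chatting with 🤗 Transformers

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils:

import torch  # used only by the optional flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory savings, especially in multi-image and video scenarios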
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

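# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

# By default, the number of visual tokens per image ranges from 4 to 16384
# You can set min_pixels and max_pixels to suit your needs, e.g. a token range of 256-1280, to balance performance and cost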
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
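The min_pixels and max_pixels comments in the snippet above imply that one visual token covers a 28 x 28 pixel patch. The sketch below estimates how many tokens an image would use under a given pixel budget. It is an approximation written for illustration, not the processor's actual resizing code, and the function name estimate_visual_tokens is hypothetical.

```python
import math

PATCH = 28  # pixels per side covered by one visual token, per the 28*28 factor above


def estimate_visual_tokens(height, width, min_pixels=256 * PATCH * PATCH,
                           max_pixels=1280 * PATCH * PATCH):
    """Estimate (resized_height, resized_width, tokens) for an image under a pixel budget."""
    # Snap both sides to multiples of the patch size.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink while keeping the aspect ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge while keeping the aspect ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w, (h // PATCH) * (w // PATCH)


if __name__ == "__main__":
    # A 1080p frame under the 256-1280 token budget.
    print(estimate_visual_tokens(1080, 1920))
```

A tighter max_pixels lowers the token count, and therefore memory and latency, at the cost of fine visual detail.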

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
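When the number of images varies, the messages list above can be generated rather than written by hand. The following is a small sketch with a hypothetical helper, build_image_message, that assumes absolute POSIX paths and produces the same file:// entries used in the example.

```python
def build_image_message(image_paths, prompt):
    """Build a single-turn user message with several local images and one text query."""
    content = []
    for path in image_paths:
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute path, got {path!r}")
        # file:// plus an absolute path yields the file:///... form used above.
        content.append({"type": "image", "image": f"file://{path}"})
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    messages = build_image_message(
        ["/path/to/image1.jpg", "/path/to/image2.jpg"],
        "Identify the similarities between these images.",
    )
    print(messages[0]["content"][0]["image"])
```

The result can be passed to processor.apply_chat_template and process_vision_info exactly as in the snippet above.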

Video inference

# Option 1: messages containing a list of image frames as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Option 2: messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Option 3: messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen2.5-VL, frame-rate information is also fed to the model so that it can align with absolute time
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # carries the sampled fps; a separate fps= argument is undefined here and would duplicate it
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

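Video URL compatibility depends largely on the version of the third-party library in use, as detailed in the table below. If you do not want to use the default backend, change it by setting FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS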
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
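The compatibility table above can be turned into a small selection routine. The sketch below is illustrative only: the helper names are hypothetical, the version parsing handles plain release strings such as 0.19.0 only, and it assumes the environment variable is read when the toolkit first loads a video.

```python
import os
from urllib.parse import urlparse


def _version_tuple(version):
    """Turn '0.19.0' (or '0.19.0+cu121') into (0, 19, 0) for comparison."""
    return tuple(int(part) for part in version.split("+")[0].split(".")[:3])


def pick_video_reader(url, torchvision_version):
    """Choose a backend for a video URL according to the compatibility table above."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"expected an http(s) URL, got {url!r}")
    if _version_tuple(torchvision_version) >= (0, 19, 0):
        return "torchvision"  # listed as supporting both HTTP and HTTPS
    if scheme == "http":
        return "decord"  # listed as supporting HTTP but not HTTPS
    raise RuntimeError("no listed backend supports HTTPS with torchvision < 0.19.0")


if __name__ == "__main__":
    backend = pick_video_reader("https://example.com/clip.mp4", "0.19.0")
    # Assumption: set this before the toolkit loads any video so the choice takes effect.
    os.environ["FORCE_QWENVL_VIDEO_READER"] = backend
    print(backend)
```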

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
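Each example above slices the prompt off the generated sequence before decoding, because generate returns the prompt tokens followed by the new tokens. The sketch below reproduces that trimming step on plain Python lists so the indexing is easy to follow; the token values are made up and the helper name is hypothetical.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence, keeping only new tokens."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


if __name__ == "__main__":
    # Two padded prompts of length 4; each output is the prompt plus a continuation.
    prompts = [[0, 11, 12, 13], [21, 22, 23, 24]]
    outputs = [[0, 11, 12, 13, 101, 102], [21, 22, 23, 24, 201, 202]]
    print(trim_prompt_tokens(prompts, outputs))
```

Because the batch is padded to a common length, every prompt row has the same length and the same slice offset applies to each row.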

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. snapshot_download can help you resolve issues with downloading checkpoints.
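A minimal sketch of the snapshot_download call mentioned above follows. The wrapper name download_checkpoint is hypothetical, and the model ID is assumed to match the one used in the Transformers examples; substitute the ID of the checkpoint you actually need.

```python
def download_checkpoint(model_id="Qwen/Qwen2.5-VL-72B-Instruct", cache_dir=None):
    """Download a checkpoint snapshot from ModelScope and return its local directory."""
    # Imported lazily so this module can be loaded without modelscope installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id, cache_dir=cache_dir)
```

The returned directory can then be passed to from_pretrained in place of the model ID, for example model_dir = download_checkpoint().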

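Processing long texts

The current config.json sets the context length to at most 32,768 tokens. To handle inputs longer than 32,768 tokens, we use YaRN, a technique that improves the model's length extrapolation so that it performs well on long texts. For supported frameworks, you can add the following to config.json to enable YaRN: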

{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}

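Note, however, that this method has a significant impact on the performance of temporal and spatial localization tasks, so it is not recommended. For long video inputs, because MRoPE itself is more economical with position ids, you can instead directly change max_position_embeddings to a larger value, such as 64k.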
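The max_position_embeddings change described above amounts to a one-field edit of config.json. The sketch below shows one way to script it, assuming the key sits at the top level of the file and taking 64k to mean 65536; both are assumptions, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, new_value=65536):
    """Rewrite max_position_embeddings in a config.json and return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old_value = config.get("max_position_embeddings")
    # Assumption: the key lives at the top level of config.json.
    config["max_position_embeddings"] = new_value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old_value
```

Returning the previous value makes it simple to restore the original setting later. Back up config.json before editing a downloaded checkpoint in place.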

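✨ Key features

In the five months since the release of Qwen2-VL, many developers have built new models on top of the Qwen2-VL vision-language models and given us valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest member of the Qwen family: Qwen2.5-VL.

Key enhancements:

Visual understanding: Qwen2.5-VL is proficient at recognizing common objects such as flowers, birds, fish, and insects, and is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
Agent capabilities: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically calls tools, and supports computer and phone usage scenarios.
Long-video understanding and event capture: Qwen2.5-VL can understand videos longer than 1 hour, and adds the ability to capture events by pinpointing the relevant video segments.
Visual grounding in multiple formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and provides stable JSON output for coordinates and attributes.
Structured output generation: for scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of the contents, which benefits applications in finance, commerce, and similar fields.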
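To show how the JSON grounding output described above might be consumed, here is a minimal parsing sketch. The bbox_2d and label keys are assumptions about the output schema, not something this document specifies, and parse_boxes is a hypothetical helper.

```python
import json


def parse_boxes(model_output):
    """Parse a JSON list of detections into (label, (x1, y1, x2, y2)) tuples."""
    detections = json.loads(model_output)
    boxes = []
    for item in detections:
        # Assumed schema: each item has a 4-element "bbox_2d" and an optional "label".
        x1, y1, x2, y2 = item["bbox_2d"]
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes
        boxes.append((item.get("label", ""), (x1, y1, x2, y2)))
    return boxes


if __name__ == "__main__":
    sample = '[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]'
    print(parse_boxes(sample))
```

Validating coordinates before use guards against the occasional malformed box, even when the output format is generally stable.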

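Model architecture updates:

Dynamic resolution and frame-rate training for video understanding: we extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to understand videos at different sampling rates. Accordingly, we update mRoPE in the temporal dimension with IDs and absolute-time alignment, enabling the model to learn temporal sequence and speed and ultimately to pinpoint specific moments.

Streamlined and efficient vision encoder: we improve both training and inference speed by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.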
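The absolute-time alignment described above can be pictured as scaling temporal position IDs by elapsed seconds rather than by frame index. The sketch below is a simplified illustration, not the model's implementation; the temporal_patch_size and tokens_per_second defaults are assumptions chosen for the example.

```python
def temporal_position_ids(num_grids, fps, temporal_patch_size=2, tokens_per_second=2):
    """Return one temporal position ID per temporal grid, scaled by elapsed seconds."""
    # Each temporal grid spans temporal_patch_size sampled frames.
    seconds_per_grid = temporal_patch_size / fps
    return [round(i * seconds_per_grid * tokens_per_second) for i in range(num_grids)]


if __name__ == "__main__":
    # The same 4-second mark gets the same ID at either sampling rate.
    print(temporal_position_ids(3, fps=1.0))
    print(temporal_position_ids(5, fps=2.0))
```

With these assumed parameters, a video sampled at 1 FPS gets IDs spaced 4 apart while one sampled at 2 FPS gets IDs spaced 2 apart, so a given timestamp maps to the same ID regardless of the sampling rate.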

We offer three models, with 3, 7, and 72 billion parameters. This repository contains the instruction-tuned 72B Qwen2.5-VL model. For more information, visit our blog and GitHub.

📚 Detailed documentation

Evaluation

Image benchmarks

Benchmark	GPT4o	Claude3.5 Sonnet	Gemini-2-flash	InternVL2.5-78B	Qwen2-VL-72B	Qwen2.5-VL-72B
MMMU_val	70.3	70.4	70.7	70.1	64.5	70.2
MMMU_Pro	54.5	54.7	57.0	48.6	46.2	51.1
MathVista_MINI	63.8	65.4	73.1	76.6	70.5	74.8
MathVision_FULL	30.4	38.3	41.3	32.2	25.9	38.1
Hallusion Bench	55.0	55.16		57.4	58.1	55.16
MMBench_DEV_EN_V11	82.1	83.4	83.0	88.5	86.6	88
AI2D_TEST	84.6	81.2		89.1	88.1	88.4
ChartQA_TEST	86.7	90.8	85.2	88.3	88.3	89.5
DocVQA_VAL	91.1	95.2	92.1	96.5	96.1	96.4
MMStar	64.7	65.1	69.4	69.5	68.3	70.8
MMVet_turbo	69.1	70.1		72.3	74.0	76.19
OCRBench	736	788		854	877	885
OCRBench-V2(en/zh)	46.5/32.3	45.2/39.6	51.9/43.1	45/46.2	47.8/46.1	61.5/63.7
CC-OCR	66.6	62.7	73.0	64.7	68.7	79.8

Video benchmarks

Benchmark	GPT4o	Gemini-1.5-Pro	InternVL2.5-78B	Qwen2VL-72B	Qwen2.5VL-72B
VideoMME w/o sub.	71.9	75.0	72.1	71.2	73.3
VideoMME w sub.	77.2	81.3	74.0	77.8	79.1
MVBench	64.6	60.5	76.4	73.6	70.4
MMBench-Video	1.63	1.30	1.97	1.70	2.02
LVBench	30.8	33.1		41.3	47.3
EgoSchema	72.2	71.2		77.9	76.2
PerceptionTest_test				68.0	73.2
MLVU_M-Avg_dev	64.6		75.7		74.6
TempCompass_overall	73.8				74.8

Agent benchmarks

Benchmark	GPT4o	Gemini 2.0	Claude	Aguvis-72B	Qwen2VL-72B	Qwen2.5VL-72B
ScreenSpot	18.1	84.0	83.0			87.1
ScreenSpot Pro			17.1		1.6	43.6
AITZ_EM	35.3				72.8	83.2
Android Control High_EM				66.4	59.1	67.36
Android Control Low_EM				84.4	59.2	93.7
AndroidWorld_SR	34.5% (SoM)		27.9%	26.1%		35%
MobileMiniWob++_SR				66%		68%
OSWorld			14.90	10.26		8.83

📄 License

This project is released under the qwen license.

📚 Citation

If you find our work helpful, please cite the following:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}