Qwen2.5-VL-7B-Instruct-GGUF Open-Source Vision-Language Model - Free Image and Video Analysis and Structured Output

Qwen2.5 VL 7B Instruct GGUF

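Developed by unsloth

Qwen2.5-VL is the latest vision-language model in the Qwen family. It offers strong visual understanding and multimodal processing, and supports image analysis, video analysis, and structured output.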

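Image-to-Text | English | License: Apache-2.0 | #Multimodal agent #Long-video understanding #Structured data extraction

Downloads: 8,427

Release date: 5/11/2025

Model Overview

Qwen2.5-VL is a multimodal vision-language model focused on better visual understanding, agent capabilities, and structured output. It suits scenarios such as finance and commerce.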

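Model Features

Enhanced visual understanding

Accurately recognizes objects, text, charts, icons, and page layout, and supports analysis of complex visual content

Agent capabilities

Can run directly as a visual agent that calls tools dynamically, with support for computer and phone operation scenarios

Long-video understanding

Can parse videos longer than one hour and capture events by precisely locating the relevant segments

Structured output

Produces structured output for data such as invoices and tables, suited to professional scenarios such as finance and commerce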

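Model Capabilities

Image analysis

Video understanding

Text recognition

Chart parsing

Visual grounding

Structured data extraction

Multimodal reasoning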

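Use Cases

Business analysis

Invoice processing

Automatically extracts structured data from invoices

Scores 95.7 on the DocVQA test set

Education

Chart understanding

Parses chart information in teaching materials

Scores 87.3 on the ChartQA test set

Intelligent assistant

Visual agent

Performs on-screen operation tasks as an agent

Scores 84.7 on the ScreenSpot test set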

🚀 Qwen2.5-VL-7B-Instruct

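Qwen2.5-VL-7B-Instruct is the latest vision-language model in the Qwen series. It offers strong visual understanding, analysis, and reasoning, handles multimodal data such as images and video, and applies to fields such as finance and commerce.

🚀 Quick Start

Install dependencies

The Qwen2.5-VL code is included in the latest Hugging Face Transformers library. Installing from source with the following command is recommended: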

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise, you may run into the following error:

KeyError: 'qwen2_5_vl'
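
Before loading any weights, it can help to check whether the installed Transformers build already knows this model type. The sketch below assumes that `CONFIG_MAPPING` in `transformers.models.auto.configuration_auto` lists every registered `model_type`; the helper name `supports_qwen2_5_vl` is hypothetical and not part of any library.

```python
def supports_qwen2_5_vl() -> bool:
    """Return True if the installed transformers registers the 'qwen2_5_vl' model type."""
    try:
        # Imported lazily so the check also works when transformers is not installed.
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING
    except ImportError:
        return False
    return "qwen2_5_vl" in CONFIG_MAPPING


# Example: decide whether a source install is needed.
# if not supports_qwen2_5_vl():
#     print("qwen2_5_vl is not registered; install transformers from source")
```

If the function returns False, the `KeyError` above is the likely outcome of calling `from_pretrained` on this model.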

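A utility package is also available that makes it easier to handle different kinds of visual input, including base64, URLs, and interleaved images and videos:

# Using the `[decord]` extra is strongly recommended for faster video loading
pip install qwen-vl-utils[decord]==0.0.8

On non-Linux systems, decord may not be installable from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. decord can still be installed from source so that it is used when loading videos.

Chat with 🤗 Transformers

The following example shows how to call the chat model with transformers and qwen_vl_utils: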

import torch  # needed for torch.bfloat16 in the optional flash-attention block below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings,
# especially in multi-image and video scenarios
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# By default, each image uses between 4 and 16384 visual tokens.
# Set min_pixels and max_pixels as needed, e.g. a token range of 256 - 1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
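
The `min_pixels` and `max_pixels` comments in the example imply roughly one visual token per 28x28 pixel area (256 tokens correspond to `256*28*28` pixels). The sketch below estimates the token count for an image under that assumption. `estimate_visual_tokens` is a hypothetical helper, and its rounding rules only approximate what the processor does; it is meant for budgeting, not as the library's implementation.

```python
import math

PATCH = 28  # pixels per visual-token side, inferred from 256*28*28 in the example


def estimate_visual_tokens(height: int, width: int,
                           min_pixels: int = 256 * PATCH * PATCH,
                           max_pixels: int = 1280 * PATCH * PATCH) -> int:
    """Approximate the number of visual tokens for an image after resizing."""
    # Snap both sides to a multiple of PATCH (at least one patch each).
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink while keeping the aspect ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge while keeping the aspect ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


print(estimate_visual_tokens(1000, 1000))  # 1225, just under the 1280 cap
print(estimate_visual_tokens(56, 56))      # 256, raised to the lower bound
```

Lowering `max_pixels` trades detail for fewer tokens per image, which matters most when several images or video frames share one context window.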

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
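
The trimming step in these examples relies on `generate` returning each prompt followed by its completion, so slicing by the prompt length leaves only the newly generated tokens. A toy illustration with plain lists standing in for tensors (the token ids are made up, and `trim_prompt` is a hypothetical name):

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two prompts of equal (padded) length, each followed by two generated tokens.
input_ids = [[0, 11, 12], [21, 22, 23]]
generated_ids = [[0, 11, 12, 101, 102], [21, 22, 23, 201, 202]]

print(trim_prompt(input_ids, generated_ids))  # [[101, 102], [201, 202]]
```

Without this step, the decoded text would repeat the chat template and the user prompt before the answer.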

Video inference

# Messages containing a video given as a list of image frames, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

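# The `messages` definitions above are alternatives; only the last assignment takes effect, so keep the one you need.
# In Qwen2.5-VL, frame-rate information is also fed to the model so that it can align with absolute time.
# Prepare inputs for inference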
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
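inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # carries the fps information returned by process_vision_info
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference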
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
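
The `fps` field in a video message controls how densely frames are sampled. The sketch below shows plain uniform sampling at a target rate. It is a generic illustration rather than the qwen-vl-utils implementation, which may additionally clamp the number of frames; `sample_frame_indices` is a hypothetical name.

```python
def sample_frame_indices(total_frames: int, native_fps: float, target_fps: float = 1.0):
    """Pick evenly spaced frame indices so that about target_fps frames per second remain."""
    duration = total_frames / native_fps               # video length in seconds
    n_samples = max(1, round(duration * target_fps))   # number of frames to keep
    n_samples = min(n_samples, total_frames)           # cannot keep more than exist
    step = total_frames / n_samples
    # Take the first frame of each of the n_samples equal intervals.
    return [int(i * step) for i in range(n_samples)]


# A 10-second clip recorded at 30 fps, sampled at 1 fps.
print(sample_frame_indices(300, 30.0, 1.0))  # [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
```

A lower `fps` reduces the number of visual tokens for long videos at the cost of temporal resolution.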

Video URL compatibility depends largely on the version of the third-party library; see the table below. To override the default backend, set FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
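
The compatibility table can be turned into a small lookup that picks a backend for a given URL and then sets the environment variable named above. This is a sketch under stated assumptions: `pick_video_reader` is a hypothetical helper, the preference for decord on local paths is a guess (the table does not cover them), and the variable is set before qwen_vl_utils is imported to be safe.

```python
import os
import re
from typing import Optional


def pick_video_reader(url: str, torchvision_version: Optional[str],
                      has_decord: bool) -> Optional[str]:
    """Return a backend able to read `url` according to the table, or None."""
    scheme = url.split("://", 1)[0].lower()
    tv_ok = False
    if torchvision_version:
        m = re.match(r"(\d+)\.(\d+)", torchvision_version)
        tv_ok = bool(m) and (int(m.group(1)), int(m.group(2))) >= (0, 19)
    if scheme in ("http", "https"):
        if tv_ok:
            return "torchvision"   # torchvision >= 0.19.0 handles HTTP and HTTPS
        if has_decord and scheme == "http":
            return "decord"        # decord handles HTTP only
        return None
    # Local paths are outside the table; preferring decord here is an assumption.
    if has_decord:
        return "decord"
    return "torchvision" if torchvision_version else None


backend = pick_video_reader("https://example.com/clip.mp4", "0.19.1+cu121", has_decord=True)
if backend:
    # Set early, before qwen_vl_utils is imported, so the override is picked up.
    os.environ["FORCE_QWENVL_VIDEO_READER"] = backend
```

Returning None makes the unsupported cases explicit, for example an HTTPS URL with an old torchvision, where downloading the file first is the practical workaround.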

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

🤖 ModelScope

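Users, especially those in mainland China, are strongly encouraged to use ModelScope. snapshot_download can help with problems encountered when downloading checkpoints.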
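
A minimal sketch of that download path follows. It assumes the `modelscope` package is installed and that the checkpoint is published on ModelScope under the same id as on Hugging Face (`Qwen/Qwen2.5-VL-7B-Instruct`), which this page does not confirm; `download_checkpoint` is a hypothetical wrapper.

```python
def download_checkpoint(model_id: str = "Qwen/Qwen2.5-VL-7B-Instruct", cache_dir=None) -> str:
    """Download a model snapshot from ModelScope and return the local directory."""
    # Imported lazily so this module loads even when modelscope is not installed.
    from modelscope import snapshot_download
    return snapshot_download(model_id, cache_dir=cache_dir)


# Example (requires network access):
# local_dir = download_checkpoint()
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(local_dir, torch_dtype="auto", device_map="auto")
```

The returned directory can be passed to `from_pretrained` in place of the hub id used in the examples above.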

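✨ Key Features

Key enhancements

Visual understanding: Qwen2.5-VL is not only good at recognizing common objects such as flowers, birds, fish, and insects, it is also highly capable of analyzing text, charts, icons, graphics, and layout within images.
Agent capability: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, and is capable of computer use and phone use.
Long-video understanding and event capture: Qwen2.5-VL can understand videos longer than one hour, and this release adds the ability to capture events by pinpointing the relevant video segments.
Multi-format visual grounding: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and can provide stable JSON output for coordinates and attributes.
Structured output generation: for scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of the contents, which benefits applications in finance, commerce, and similar fields.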
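
Because grounding results come back as JSON, downstream code usually needs a small parser. The sketch below assumes the prompt asked for a list of objects with `bbox_2d` (`[x1, y1, x2, y2]`) and `label` fields, and that the reply may be wrapped in a markdown code fence. Both the schema and the helper name `parse_grounding` are assumptions for illustration, not a format guaranteed by this page.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this example readable


def parse_grounding(reply: str):
    """Extract (label, [x1, y1, x2, y2]) pairs from a JSON reply."""
    # Strip an optional markdown fence (possibly tagged "json") around the payload.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    boxes = []
    for item in json.loads(payload):
        box = item["bbox_2d"]
        if len(box) != 4:
            raise ValueError(f"expected 4 coordinates, got {box!r}")
        boxes.append((item.get("label", ""), [int(v) for v in box]))
    return boxes


reply = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]\n' + FENCE
print(parse_grounding(reply))  # [('cat', [10, 20, 110, 220])]
```

Validating the coordinate count up front keeps malformed replies from propagating into drawing or cropping code.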

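Model architecture updates

Dynamic resolution and frame-rate training for video understanding: dynamic resolution is extended to the temporal dimension through dynamic FPS sampling, so the model can understand videos at various sampling rates. Correspondingly, mRoPE is updated in the temporal dimension with IDs and absolute time alignment, which lets the model learn temporal order and speed and ultimately pinpoint specific moments.
Streamlined and efficient vision encoder: window attention is strategically introduced into the ViT, which speeds up both training and inference. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.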
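
To make the idea of aligning temporal positions with absolute time concrete: instead of numbering frame groups 0, 1, 2, ... regardless of sampling rate, the temporal ids can be spaced in proportion to the seconds elapsed, so the same moment receives the same id at any frame rate. The sketch below is a simplified illustration. The defaults `temporal_patch_size=2` and `tokens_per_second=2.0` are assumptions chosen for the example, not values stated on this page, and `temporal_position_ids` is a hypothetical name.

```python
def temporal_position_ids(num_grids: int, fps: float,
                          temporal_patch_size: int = 2,
                          tokens_per_second: float = 2.0):
    """Temporal ids for `num_grids` frame groups of a video sampled at `fps`."""
    seconds_per_grid = temporal_patch_size / fps  # time covered by one frame group
    return [int(i * seconds_per_grid * tokens_per_second) for i in range(num_grids)]


# The same 4-second clip sampled at two different rates.
fast = temporal_position_ids(num_grids=4, fps=2.0)  # one group per second
slow = temporal_position_ids(num_grids=2, fps=1.0)  # one group per two seconds
print(fast)  # [0, 2, 4, 6]
print(slow)  # [0, 4]
```

The group starting at the 2-second mark gets id 4 in both cases, which is what lets the model relate positions to real time rather than to frame count.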

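Three models are currently available, with 3 billion, 7 billion, and 72 billion parameters. This repository contains the instruction-tuned 7B Qwen2.5-VL model. For more information, see the blog and GitHub.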

📚 Documentation

Evaluation

Image benchmarks

Benchmark	InternVL2.5-8B	MiniCPM-o 2.6	GPT-4o-mini	Qwen2-VL-7B	Qwen2.5-VL-7B
MMMU_val	56	50.4	60	54.1	58.6
MMMU-Pro_val	34.3	-	37.6	30.5	41.0
DocVQA_test	93	93	-	94.5	95.7
InfoVQA_test	77.6	-	-	76.5	82.6
ChartQA_test	84.8	-	-	83.0	87.3
TextVQA_val	79.1	80.1	-	84.3	84.9
OCRBench	822	852	785	845	864
CC_OCR	57.7	-	-	61.6	77.8
MMStar	62.8	-	-	60.7	63.9
MMBench-V1.1-En_test	79.4	78.0	76.0	80.7	82.6
MMT-Bench_test	-	-	-	63.7	63.6
MMStar	61.5	57.5	54.8	60.7	63.9
MMVet_GPT-4-Turbo	54.2	60.0	66.9	62.0	67.1
HallBench_avg	45.2	48.1	46.1	50.6	52.9
MathVista_testmini	58.3	60.6	52.4	58.2	68.2
MathVision	-	-	-	16.3	25.07

Video benchmarks

Benchmark	Qwen2-VL-7B	Qwen2.5-VL-7B
MVBench	67.0	69.6
PerceptionTest_test	66.9	70.5
Video-MME_{wo/w subs}	63.3/69.0	65.1/71.6
LVBench	-	45.3
LongVideoBench	-	54.7
MMBench-Video	1.44	1.79
TempCompass	-	71.7
MLVU	-	70.2
CharadesSTA/mIoU	-	43.6

Agent benchmarks

Benchmark	Qwen2.5-VL-7B
ScreenSpot	84.7
ScreenSpot Pro	29.0
AITZ_EM	81.9
Android Control High_EM	60.1
Android Control Low_EM	93.7
AndroidWorld_SR	25.5
MobileMiniWob++_SR	91.4

🔧 Technical Details

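Long-text processing

The current config.json sets a maximum context length of 32,768 tokens. To handle inputs longer than 32,768 tokens, YaRN can be used to improve the model's length extrapolation and maintain performance on long texts. However, this method significantly degrades performance on temporal and spatial localization tasks, so it is not recommended. For long video inputs, because MRoPE itself is more economical with ids, max_position_embeddings can simply be changed to a larger value, such as 64k.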
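
The `max_position_embeddings` change described above amounts to editing one key in the configuration. A minimal sketch on an in-memory dictionary follows; the dictionary is a stand-in whose values come from the text above (a real config.json has many more keys), 65536 is used as the reading of "64k", and `extend_context` is a hypothetical helper.

```python
import json


def extend_context(config: dict, new_max: int = 65536) -> dict:
    """Return a copy of a config dict with a larger max_position_embeddings."""
    updated = dict(config)
    if new_max < updated.get("max_position_embeddings", 0):
        raise ValueError("new_max is smaller than the current limit")
    updated["max_position_embeddings"] = new_max
    return updated


# Stand-in for the contents of config.json.
config = {"model_type": "qwen2_5_vl", "max_position_embeddings": 32768}
print(json.dumps(extend_context(config), indent=2))
```

Working on a copy keeps the original settings available, which makes it easy to compare localization quality before and after the change.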

📄 License

This project is released under the Apache 2.0 license.

📚 Citation

If you find our work helpful, please cite the following:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}