Qwen2.5-VL-7B-Instruct-GGUF: Open-Source Multimodal Model for Image Understanding and Text Generation

Qwen2.5 VL 7B Instruct GGUF

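Developed by Mungert

Qwen2.5-VL-7B-Instruct is a multimodal vision-language model that supports image understanding and text generation tasks.

Image-to-Text | English | License: Apache-2.0 | #Multimodal visual understanding #Ultra-low-bit quantization #Edge device deployment

Downloads: 17.10k

Released: 3/27/2025

Model Overview

This is a multimodal model based on the Qwen2.5 architecture. It accepts image and text inputs and generates text output, and is suited to tasks such as image captioning and visual question answering.

Model Features

Multimodal support

Processes image and text inputs together and generates the corresponding text output.

Ultra-low-bit quantization

Uses the IQ-DynamicGate technique to support 1-2 bit quantization, substantially reducing model size while preserving as much accuracy as possible.

Dynamic precision allocation

Applies a layer-wise strategy that assigns different quantization precision to different layers to optimize model performance.

Model Capabilities

Image captioning

Visual question answering

Multimodal reasoning

Use Cases

Image understanding

Image caption generation

Given an image, the model generates a detailed description of it.

Produces an accurate, detailed image description.

Visual question answering

Image-based question answering

Given an image and a related question, the model generates an answer.

Produces an accurate answer grounded in the image content.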

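🚀 Qwen2.5-VL-7B-Instruct GGUF Models

The Qwen2.5-VL-7B-Instruct GGUF models are a family of multimodal models designed for image-text-to-text processing. Built on the transformers library, they understand and process both image and text information and perform well on vision-language tasks.

🚀 Quick Start

Running the Qwen 2.5 VL Instruct model with llama.cpp

Download the Qwen 2.5 VL gguf file: go to https://huggingface.co/Mungert/Qwen2.5-VL-7B-Instruct-GGUF/tree/main and choose a gguf file whose name does not contain mmproj. Copy the file to a folder of your choice. Example gguf file: https://huggingface.co/Mungert/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/Qwen2.5-VL-7B-Instruct-q8_0.gguf
Download the Qwen 2.5 VL mmproj file: from the same page, choose a file whose name contains mmproj. Copy the file to the same folder. Example mmproj file: https://huggingface.co/Mungert/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf
Copy an image file: place the image in the same folder as the gguf files, or adjust the paths accordingly. Example image: https://huggingface.co/Mungert/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/car-1.jpg
Run the CLI tool: from that folder, run the following command: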

llama-mtmd-cli -m Qwen2.5-VL-7B-Instruct-q8_0.gguf --mmproj Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf  -p "Describe this image." --image ./car-1.jpg
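
For scripting, the same download URLs and invocation can be assembled in Python. The following is a minimal sketch, assuming `llama-mtmd-cli` is on your PATH and the three files sit in the current folder. The helper names `resolve_url` and `build_mtmd_command` are illustrative and are not part of llama.cpp or any Hugging Face tooling; only the flags already shown in the command above are used.

```python
import shlex

REPO_ID = "Mungert/Qwen2.5-VL-7B-Instruct-GGUF"


def resolve_url(filename: str, repo_id: str = REPO_ID) -> str:
    """Build the direct-download URL for a file on the repo's main branch."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"


def build_mtmd_command(model: str, mmproj: str, image: str, prompt: str) -> list[str]:
    """Return the argument list for the llama-mtmd-cli call shown above."""
    return [
        "llama-mtmd-cli",
        "-m", model,
        "--mmproj", mmproj,
        "-p", prompt,
        "--image", image,
    ]


cmd = build_mtmd_command(
    model="Qwen2.5-VL-7B-Instruct-q8_0.gguf",
    mmproj="Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf",
    image="./car-1.jpg",
    prompt="Describe this image.",
)
print(resolve_url("car-1.jpg"))
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the prompt contains spaces or quotes.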

Chatting with 🤗 Transformers

import torch  # needed for torch.bfloat16 in the flash-attention variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings, especially in multi-image and video scenarios
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# By default, each image is encoded into between 4 and 16384 visual tokens
# You can set min_pixels and max_pixels as needed, e.g. a token range of 256-1280, to balance performance and cost
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
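
The list comprehension that builds `generated_ids_trimmed` is easy to misread. `model.generate` returns each prompt followed by the newly generated tokens, so slicing off the first `len(in_ids)` entries keeps only the new tokens. A minimal sketch with plain Python lists standing in for tensors (the token IDs are made up for illustration):

```python
# Stand-ins for inputs.input_ids and the output of model.generate (made-up IDs).
input_ids = [[101, 7, 8, 9], [101, 5, 6, 0]]
generated_ids = [[101, 7, 8, 9, 42, 43], [101, 5, 6, 0, 77, 78]]

# Same slicing as in the example above: drop the echoed prompt from each row.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)  # [[42, 43], [77, 78]]
```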

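✨ Key Features

Improved visual understanding

Qwen2.5-VL not only recognizes common objects such as flowers, birds, fish, and insects, but also analyzes text, charts, icons, graphics, and layouts within images in depth.

Agentic capabilities

Qwen2.5-VL can act directly as a visual agent that reasons and dynamically invokes tools, and is capable of computer use and phone use.

Long-video understanding and event capture

Qwen2.5-VL can comprehend videos longer than one hour, and adds the ability to capture events by pinpointing the relevant video segments.

Visual localization in multiple formats

Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and provides stable JSON output for coordinates and attributes.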
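
JSON output of this kind can be consumed directly by downstream code. The sketch below parses a hypothetical reply; the key names `bbox_2d` and `label`, the coordinate order (x1, y1, x2, y2), and the helper `parse_boxes` are assumptions for illustration, since the actual schema depends on how you prompt the model.

```python
import json

# Hypothetical model reply; real output depends on the prompt you use.
reply = '[{"bbox_2d": [120, 48, 360, 290], "label": "car"}]'


def parse_boxes(text: str) -> list[dict]:
    """Parse a JSON list of detections and add width/height to each entry."""
    detections = json.loads(text)
    results = []
    for det in detections:
        x1, y1, x2, y2 = det["bbox_2d"]
        results.append(
            {
                "label": det["label"],
                "box": (x1, y1, x2, y2),
                "width": x2 - x1,
                "height": y2 - y1,
            }
        )
    return results


boxes = parse_boxes(reply)
print(boxes)
```

In practice it is worth wrapping `json.loads` in error handling, because a generated reply is not guaranteed to be valid JSON.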

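Structured output generation

For scanned data such as invoices, forms, and tables, Qwen2.5-VL can generate structured outputs of their contents, which benefits applications in finance, commerce, and similar fields.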

📦 Installation

Install dependencies

pip install git+https://github.com/huggingface/transformers accelerate

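Install the utilities package

# Using the `[decord]` extra is strongly recommended for faster video loading
# (the quotes keep shells such as zsh from expanding the square brackets)
pip install "qwen-vl-utils[decord]==0.0.8"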

💻 Usage Examples

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
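
When the number of images varies, the `messages` structure can be generated rather than written by hand. A minimal sketch, assuming the same content schema as the example above; `build_image_message` is a hypothetical helper, not part of `qwen_vl_utils`.

```python
def build_image_message(image_uris: list[str], question: str) -> list[dict]:
    """Build a single-turn user message with several images and one question."""
    content = [{"type": "image", "image": uri} for uri in image_uris]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_image_message(
    ["file:///path/to/image1.jpg", "file:///path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
print(messages)
```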

Video inference

# Messages containing a list of images treated as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen 2.5 VL, frame-rate information is also passed to the model so it can align with absolute time
# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# video_kwargs carries the sampling fps returned by process_vision_info
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

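📚 Documentation

Choosing the right model format

The right model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16)

A 16-bit floating-point format designed for faster computation while retaining good accuracy.
Offers a dynamic range similar to FP32 with lower memory usage.
Recommended if your hardware supports BF16 acceleration (check your device's specifications).
Suited to high-performance inference with a smaller memory footprint than FP32.

F16 (Float 16)

A 16-bit floating-point format with higher precision than BF16 but a smaller range of representable values.
Works on most devices with FP16 acceleration (including many GPUs and some CPUs).
Its narrower range makes overflow and underflow more likely than with BF16, but it is generally sufficient for inference.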
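
The range and precision trade-off between the two 16-bit formats follows from how their bits are split: FP16 uses 5 exponent bits and 10 mantissa bits, while BF16 uses 8 exponent bits and 7 mantissa bits. The sketch below derives the largest finite value and the machine epsilon from those bit counts; `float_format_stats` is an illustrative helper.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int) -> dict:
    """Largest finite value and machine epsilon of an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_value = (2 - 2 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits  # spacing between 1.0 and the next value
    return {"max": max_value, "epsilon": epsilon}


fp16 = float_format_stats(exponent_bits=5, mantissa_bits=10)
bf16 = float_format_stats(exponent_bits=8, mantissa_bits=7)
print("FP16:", fp16)  # max 65504.0, epsilon 0.0009765625
print("BF16:", bf16)  # max about 3.39e38, epsilon 0.0078125
```

FP16 resolves finer steps near 1.0, whereas BF16 covers roughly the same range as FP32.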

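Quantized models (Q4_K, Q6_K, Q8, etc.)

Quantization reduces model size and memory usage while preserving as much accuracy as possible.

Low-bit models (Q4_K): best for minimizing memory usage, but may be less accurate.
Higher-bit models (Q6_K, Q8_0): more accurate, but require more memory.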
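
To get a feel for the memory side of this trade-off, weight size can be estimated as parameter count times bits per weight. The sketch below uses the nominal bits per weight of the llama.cpp block formats (block scales included). The parameter count is an assumption for illustration, and real files will differ because they mix tensor types and include metadata and the separate mmproj file.

```python
# Nominal bits per weight of llama.cpp block formats (block scales included).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_K": 4.5,
    "Q4_0": 4.5,
}


def estimate_size_gib(num_params: float, quant: str) -> float:
    """Rough weight size in GiB if every tensor used the given format."""
    total_bits = num_params * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


ASSUMED_PARAMS = 7.6e9  # assumption: roughly 7.6B language-model weights

for quant in BITS_PER_WEIGHT:
    print(f"{quant:>5}: ~{estimate_size_gib(ASSUMED_PARAMS, quant):.1f} GiB")
```

Leave headroom beyond these figures for the KV cache and the vision projector.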

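Very-low-bit quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These models are optimized for extreme memory efficiency, making them well suited to low-power devices or large-scale deployments where memory is the key constraint.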

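Model file details

`Qwen2.5-VL-7B-Instruct-bf16.gguf`

Model weights stored in BF16.
Use this file if you want to requantize the model into a different format.
The best choice if your device supports BF16 acceleration.

`Qwen2.5-VL-7B-Instruct-f16.gguf`

Model weights stored in F16.
Use this file if your device supports FP16, especially when BF16 is not available.

`Qwen2.5-VL-7B-Instruct-bf16-q8_0.gguf`

Output and embedding layers kept in BF16.
All other layers quantized to Q8_0.
Use this file if your device supports BF16 and you want a quantized version.

`Qwen2.5-VL-7B-Instruct-f16-q8_0.gguf`

Output and embedding layers kept in F16.
All other layers quantized to Q8_0.

`Qwen2.5-VL-7B-Instruct-q4_k.gguf`

Output and embedding layers quantized to Q8_0.
All other layers quantized to Q4_K.
Suited to CPU inference with limited memory.

`Qwen2.5-VL-7B-Instruct-q4_k_s.gguf`

The smallest Q4_K variant, trading accuracy for lower memory usage.
Best for very-low-memory setups.

`Qwen2.5-VL-7B-Instruct-q6_k.gguf`

Output and embedding layers quantized to Q8_0.
All other layers quantized to Q6_K.

`Qwen2.5-VL-7B-Instruct-q8_0.gguf`

Fully Q8-quantized model.
Requires more memory than the lower-bit files but offers higher accuracy.

`Qwen2.5-VL-7B-Instruct-iq3_xs.gguf`

IQ3_XS quantization, optimized for extreme memory efficiency.
Best for ultra-low-memory devices.

`Qwen2.5-VL-7B-Instruct-iq3_m.gguf`

IQ3_M quantization, offering a medium block size for better accuracy.
Suited to low-memory devices.

`Qwen2.5-VL-7B-Instruct-q4_0.gguf`

Pure Q4_0 quantization, optimized for ARM devices.
Best for ARM-based devices or low-memory environments.
For better accuracy, IQ4_NL is recommended.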
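
The guidance above can be condensed into a small selection helper. This is only one possible reading of the list: the function `suggest_file`, its coarse memory tiers, and the specific mapping are hypothetical choices made for illustration, not a recommendation from the model author.

```python
PREFIX = "Qwen2.5-VL-7B-Instruct-"


def suggest_file(supports_bf16: bool, supports_fp16: bool, memory: str) -> str:
    """Map hardware traits to one of the files listed above.

    memory is a coarse, hypothetical tier: "high", "medium", "low" or "very_low".
    """
    if memory == "high":
        if supports_bf16:
            return PREFIX + "bf16.gguf"
        if supports_fp16:
            return PREFIX + "f16.gguf"
        return PREFIX + "q8_0.gguf"
    if memory == "medium":
        return PREFIX + ("bf16-q8_0.gguf" if supports_bf16 else "q8_0.gguf")
    if memory == "low":
        return PREFIX + "q4_k.gguf"
    if memory == "very_low":
        return PREFIX + "iq3_xs.gguf"
    raise ValueError(f"unknown memory tier: {memory!r}")


print(suggest_file(supports_bf16=False, supports_fp16=True, memory="low"))
```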

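Processing long texts

The current config.json sets a maximum context length of 32,768 tokens. To handle inputs longer than 32,768 tokens, the model uses YaRN, a technique that improves length extrapolation so that performance holds up on long texts.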
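
As a rough illustration of what enabling YaRN involves on the Transformers side, the scaling factor is the ratio of the target context length to the original 32,768 tokens. The sketch below builds such an entry; the key names follow the common Hugging Face-style `rope_scaling` convention and are an assumption here, and the real Qwen2.5-VL configuration may need additional fields, so check the upstream model card before editing config.json.

```python
import json

ORIGINAL_MAX_POSITIONS = 32768  # context length stated for the current config.json


def yarn_rope_scaling(target_context: int) -> dict:
    """Build a YaRN rope_scaling entry that stretches the context to target_context."""
    if target_context <= ORIGINAL_MAX_POSITIONS:
        raise ValueError("YaRN scaling is only needed beyond the original context")
    factor = target_context / ORIGINAL_MAX_POSITIONS
    return {
        "type": "yarn",  # assumed key names, see the note above
        "factor": factor,
        "original_max_position_embeddings": ORIGINAL_MAX_POSITIONS,
    }


print(json.dumps(yarn_rope_scaling(131072), indent=2))
```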

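Adjusting image resolution

The model accepts a wide range of input resolutions. By default it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. You can set minimum and maximum pixel counts to suit your needs, for example a token-count range of 256-1280, to balance speed and memory usage.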
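
The relationship between token counts and pixel limits comes from the `28*28` factor used in the `min_pixels`/`max_pixels` settings of the Transformers example: each visual token covers roughly a 28x28 pixel area. The sketch below converts a token range into pixel limits and approximates the token count of an image; both helpers are illustrative, and the estimate ignores the rounding and resizing the processor performs.

```python
PATCH_PIXELS = 28 * 28  # pixels covered by one visual token


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PATCH_PIXELS, max_tokens * PATCH_PIXELS


def approx_tokens(width: int, height: int) -> int:
    """Approximate visual-token count of an image, ignoring rounding and resizing."""
    return (width * height) // PATCH_PIXELS


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)    # 200704 1003520
print(approx_tokens(1280, 720))  # 1175
```

The resulting `min_pixels` and `max_pixels` values are what you would pass to `AutoProcessor.from_pretrained`.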

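🔧 Technical Details

Ultra-low-bit quantization with IQ-DynamicGate (1-2 bit)

Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with improvements demonstrated by benchmarks on Llama-3-8B. The method uses layer-specific strategies to preserve accuracy while remaining extremely memory-efficient.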
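
The text does not spell out the layer-specific rule, so the following is only a generic sketch of the idea of layer-dependent precision, not the actual IQ-DynamicGate algorithm. The split fraction, the chosen quantization types, and the layer count are all hypothetical values picked for illustration.

```python
def assign_quant_types(
    num_layers: int,
    edge_fraction: float = 0.25,  # hypothetical share of layers at each end
    edge_type: str = "Q4_K",      # hypothetical higher-precision type
    middle_type: str = "IQ2_XS",  # hypothetical lower-precision type
) -> list[str]:
    """Give the first and last edge_fraction of layers a higher-precision type."""
    edge = int(num_layers * edge_fraction)
    plan = []
    for i in range(num_layers):
        is_edge = i < edge or i >= num_layers - edge
        plan.append(edge_type if is_edge else middle_type)
    return plan


plan = assign_quant_types(num_layers=28)  # 28 layers is an illustrative count
for index, quant in enumerate(plan):
    print(index, quant)
```

Any scheme of this shape spends more bits on the layers it protects and fewer on the rest, which is how the average bit width stays low.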

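Model architecture updates

Dynamic resolution and frame-rate training for video understanding

We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, which lets the model comprehend videos at different sampling rates. Accordingly, we update mRoPE in the temporal dimension with IDs and absolute time alignment, so the model can learn temporal order and speed and ultimately acquire the ability to pinpoint specific moments.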
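
The effect of absolute time alignment can be shown with a small calculation: if temporal position IDs grow with elapsed seconds instead of frame index, the same moment in a clip receives the same ID regardless of sampling rate. This is a simplified sketch of that idea, not the model's actual implementation; the values of `temporal_patch_size` and `tokens_per_second` are assumptions chosen for illustration.

```python
def temporal_position_ids(
    num_frames: int,
    fps: float,
    temporal_patch_size: int = 2,  # assumed: frames merged into one temporal step
    tokens_per_second: int = 2,    # assumed: ID increments per second of video
) -> list[int]:
    """Temporal IDs that grow with elapsed seconds rather than frame count."""
    seconds_per_grid = temporal_patch_size / fps
    num_grids = num_frames // temporal_patch_size
    return [int(g * seconds_per_grid * tokens_per_second) for g in range(num_grids)]


# The same 8-second clip sampled at two different rates.
ids_2fps = temporal_position_ids(num_frames=16, fps=2.0)
ids_1fps = temporal_position_ids(num_frames=8, fps=1.0)
print(ids_2fps)  # [0, 2, 4, 6, 8, 10, 12, 14]
print(ids_1fps)  # [0, 4, 8, 12]
```

In both samplings, ID 4 refers to the step starting at second 2 and ID 8 to the step starting at second 4, which is what lets the model relate positions to real time.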

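Streamlined and efficient vision encoder

We improve both training and inference speed by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.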
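
For readers unfamiliar with the two building blocks named above, here is a minimal pure-Python sketch of RMSNorm and the SwiGLU gating operation on plain lists. It shows the standard definitions only; shapes, learned weights, and the surrounding linear projections in the real ViT are omitted.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm: divide x by its root mean square, then scale by weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate: list[float], up: list[float]) -> list[float]:
    """SwiGLU gating: SiLU(gate) multiplied elementwise by up.

    In a full feed-forward block, gate and up are two linear projections of the
    same input, and a third projection follows this product.
    """
    return [silu(g) * u for g, u in zip(gate, up)]


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
gated = swiglu([0.0, 1.0], [2.0, 2.0])
print(normed)
print(gated)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which removes one reduction per call.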

📄 License

This project is licensed under the Apache-2.0 license.

📈 Evaluation

Image benchmarks

Benchmark	InternVL2.5-8B	MiniCPM-o 2.6	GPT-4o-mini	Qwen2-VL-7B	Qwen2.5-VL-7B
MMMU_val	56	50.4	60	54.1	58.6
MMMU-Pro_val	34.3	-	37.6	30.5	41.0
DocVQA_test	93	93	-	94.5	95.7
InfoVQA_test	77.6	-	-	76.5	82.6
ChartQA_test	84.8	-	-	83.0	87.3
TextVQA_val	79.1	80.1	-	84.3	84.9
OCRBench	822	852	785	845	864
CC_OCR	57.7	-	-	61.6	77.8
MMStar	62.8	-	-	60.7	63.9
MMBench-V1.1-En_test	79.4	78.0	76.0	80.7	82.6
MMT-Bench_test	-	-	-	63.7	63.6
MMStar	61.5	57.5	54.8	60.7	63.9
MMVet_GPT-4-Turbo	54.2	60.0	66.9	62.0	67.1
HallBench_avg	45.2	48.1	46.1	50.6	52.9
MathVista_testmini	58.3	60.6	52.4	58.2	68.2
MathVision	-	-	-	16.3	25.07

Video benchmarks

Benchmark	Qwen2-VL-7B	Qwen2.5-VL-7B
MVBench	67.0	69.6
PerceptionTest_test	66.9	70.5
Video-MME_{wo/w subs}	63.3/69.0	65.1/71.6
LVBench	-	45.3
LongVideoBench	-	54.7
MMBench-Video	1.44	1.79
TempCompass	-	71.7
MLVU	-	70.2
CharadesSTA/mIoU	43.6	-

Agent benchmarks

Benchmark	Qwen2.5-VL-7B
ScreenSpot	84.7
ScreenSpot Pro	29.0
AITZ_EM	81.9
Android Control High_EM	60.1
Android Control Low_EM	93.7
AndroidWorld_SR	25.5
MobileMiniWob++_SR	91.4

📖 Citation

If you find our work helpful, please cite the following:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}