Qwen2.5-VL-3B-Instruct-GGUF: a free, open-source vision-language model for visual understanding and multimodal processing

Qwen2.5 VL 3B Instruct GGUF

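Developed by unsloth

Qwen2.5-VL is the latest vision-language model in the Qwen family, with strong visual understanding and multimodal processing capabilities.

Image-to-Text, English, #multimodal visual understanding, #video temporal analysis, #structured data extraction

Downloads: 4,645

Released: 5/11/2025

Model overview

Qwen2.5-VL is a multimodal vision-language model focused on improved visual understanding, agentic capabilities, and structured output generation.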

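Model features

Enhanced visual understanding

Accurately recognizes common objects and is strong at analyzing text, charts, icons, graphics, and layouts within images.

Agentic capabilities

Acts directly as a visual agent that can reason and dynamically call tools, supporting computer and phone use scenarios.

Long video understanding

Comprehends videos longer than 1 hour and can capture events by pinpointing the relevant video segments.

Multi-format visual localization

Localizes objects in images precisely by generating bounding boxes or points, and produces stable JSON output for coordinates and attributes.

Structured output generation

Produces structured output for data such as scanned invoices, forms, and tables.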

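Model capabilities

Image-text understanding

Visual object localization

Video content analysis

Structured data extraction

Multimodal reasoning

Tool calling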

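Use cases

Business applications

Invoice processing

Automatically recognizes and extracts structured data from invoices

Improves the efficiency of financial processing

Form analysis

Parses the content of various business forms

Simplifies data entry

Intelligent assistants

Visual agent

Reasons visually as an agent and calls tools

Enables automated operation

Content analysis

Video content understanding

Parses long videos and locates key events

Improves the efficiency of video analysis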

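🚀 Qwen2.5-VL-3B-Instruct

Qwen2.5-VL-3B-Instruct is the latest vision-language model in the Qwen family. It can understand image and video content, localize objects visually, and generate structured output, which makes it broadly applicable in finance, commerce, and other fields.

🚀 Quick start

Installing dependencies

The code for Qwen2.5-VL is included in the latest Hugging Face transformers. Installing from source with the following command is recommended:

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise you may encounter the following error:

KeyError: 'qwen2_5_vl'

We also provide a toolkit that helps you handle various types of visual input more conveniently, as if you were using an API. It supports base64, URLs, and interleaved images and videos. Install it with the following command:

# The `[decord]` extra is strongly recommended for faster video loading
pip install qwen-vl-utils[decord]==0.0.8

If you are not on Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.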
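
Before running the examples, it can help to confirm which of the packages above are actually installed. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical, and the package names are taken from the commands above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names come from the install commands above.
    for name in ("transformers", "accelerate", "qwen-vl-utils", "decord"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If transformers is reported as missing or old, reinstalling it from source as described above is the likely remedy for the KeyError.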

Chatting with 🤗 Transformers

The following snippet shows how to use the chat model with transformers and qwen_vl_utils:

import torch  # only needed for the optional flash_attention_2 configuration below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings,
# especially in multi-image and video scenarios
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-3B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# By default, the number of visual tokens per image ranges from 4 to 16384.
# Set min_pixels and max_pixels as needed, e.g. a token range of 256 to 1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
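
The trimming step in the snippet above is needed because generate returns the prompt tokens followed by the newly generated ones. The following sketch shows the same list comprehension on plain Python lists; the token ids are made up for illustration and stand in for real tensors.

```python
# Toy token ids; the values are invented and only illustrate the slicing.
input_ids = [[101, 7, 8, 9], [101, 4, 5, 6]]  # two prompts of equal (padded) length
generated_ids = [[101, 7, 8, 9, 50, 51], [101, 4, 5, 6, 60, 61]]  # prompt + new tokens

# Drop the first len(prompt) ids from each output row, keeping only new tokens.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)  # [[50, 51], [60, 61]]
```

Decoding only the trimmed ids keeps the prompt text out of the printed answer.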

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen2.5-VL, frame rate information is also passed to the model so that frames align with absolute time.
# Prepare inputs for inference. video_kwargs carries the sampled fps,
# so no separate fps argument is passed to the processor.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

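Video URL compatibility depends largely on the version of the third-party library, as detailed in the table below. If you do not want the default backend, change it by setting FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS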
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
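
The table can be turned into a small rule for choosing a backend. The sketch below is illustrative: `pick_video_backend` is a hypothetical helper, not part of qwen-vl-utils, and it assumes the environment variable is set before the first video is loaded. It also assumes torchvision >= 0.19.0 is installed whenever torchvision is chosen.

```python
import os
from urllib.parse import urlparse


def pick_video_backend(video_url: str) -> str:
    """Hypothetical helper: choose a backend name from the URL scheme, per the table."""
    scheme = urlparse(video_url).scheme.lower()
    if scheme == "https":
        return "torchvision"  # the table lists decord as not supporting HTTPS
    return "decord"  # HTTP is listed as supported; other schemes are assumed to work too


url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"
os.environ["FORCE_QWENVL_VIDEO_READER"] = pick_video_backend(url)
print(os.environ["FORCE_QWENVL_VIDEO_READER"])  # torchvision
```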

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

🤖 ModelScope

Users, especially those in mainland China, are strongly advised to use ModelScope. snapshot_download can help resolve problems encountered when downloading checkpoints.
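
A download through ModelScope might look like the sketch below. It assumes the modelscope package is installed and that the repository name on ModelScope matches the one used elsewhere on this page; `download_checkpoint` is a hypothetical wrapper, and the actual call is left commented out because it needs network access.

```python
MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumed to match the ModelScope repository name


def download_checkpoint(model_id: str = MODEL_ID) -> str:
    """Download the checkpoint via ModelScope and return the local directory."""
    # Imported lazily so this module loads even where modelscope is not installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id)


# model_dir = download_checkpoint()
# The returned directory can then be passed to from_pretrained(...) in place of the model id.
```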

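✨ Key features

Visual understanding

Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.

Agentic capabilities

Qwen2.5-VL acts directly as a visual agent that can reason and dynamically call tools, and is capable of computer and phone use.

Long video understanding and event capture

Qwen2.5-VL can comprehend videos longer than 1 hour, and this release adds the ability to capture events by pinpointing the relevant video segments.

Multi-format visual localization

Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.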
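
Because the coordinates arrive as JSON, downstream code can parse them with the standard library. The sketch below uses a made-up reply; the `bbox_2d` and `label` keys and the `[x1, y1, x2, y2]` ordering are assumptions for illustration, so check them against what the model actually returns for your prompt.

```python
import json

# Hypothetical model reply; key names and coordinate order are assumed.
reply = (
    '[{"bbox_2d": [10, 20, 110, 220], "label": "cat"},'
    ' {"bbox_2d": [150, 40, 300, 260], "label": "dog"}]'
)


def parse_boxes(text: str) -> list:
    """Parse a JSON list of detections into (label, width, height) tuples."""
    boxes = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append((item["label"], x2 - x1, y2 - y1))
    return boxes


print(parse_boxes(reply))  # [('cat', 100, 200), ('dog', 150, 220)]
```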

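Structured output generation

For scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of their contents, which benefits applications in finance, commerce, and other fields.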

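📚 Detailed documentation

Supported input formats for images and videos

Input images may be local files, base64 data, or URLs. For videos, only local files are currently supported.

# You can insert a local file path, a URL, or a base64-encoded image directly at the desired position in the text
## Local file path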
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64-encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
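
The base64 form shown above can be produced from raw image bytes with the standard library. This is a minimal sketch: `to_data_uri` is a hypothetical helper, and the four bytes used here are only the start of a JPEG file, standing in for the contents of a real image.

```python
import base64


def to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as a "data:image;base64,..." string."""
    return "data:image;base64," + base64.b64encode(image_bytes).decode("ascii")


# The first bytes of a JPEG file; a real call would read the whole file from disk.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])
uri = to_data_uri(jpeg_header)
print(uri)  # data:image;base64,/9j/4A==
```

The resulting string goes in the "image" field of a message, in place of a file path or URL.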

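Tuning image resolution for performance

The model supports a wide range of input resolutions. By default it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. You can set minimum and maximum pixel counts to suit your needs, for example a token range of 256 to 1280, to balance speed and memory usage.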

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
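
The `28 * 28` factor in the values above suggests that one visual token covers a 28 x 28 pixel area. Under that assumption, the arithmetic linking token budgets and pixel counts can be sketched as follows; the helper names are hypothetical.

```python
PATCH = 28  # assumed side length, in pixels, of the area covered by one visual token


def pixels_for_tokens(num_tokens: int) -> int:
    """Pixel budget that corresponds to a given number of visual tokens."""
    return num_tokens * PATCH * PATCH


def tokens_for_image(height: int, width: int) -> int:
    """Visual tokens for an image whose sides are already multiples of 28."""
    return (height // PATCH) * (width // PATCH)


print(pixels_for_tokens(256))      # 200704
print(pixels_for_tokens(1280))     # 1003520
print(tokens_for_image(280, 420))  # 150
```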

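Two further methods give fine-grained control over the size of the image passed to the model:

Define min_pixels and max_pixels: the image is resized, preserving its aspect ratio, so that its pixel count falls between min_pixels and max_pixels.
Specify exact dimensions: set resized_height and resized_width directly. These values are rounded to the nearest multiple of 28.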

# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
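
To make the resizing behavior concrete, here is a rough sketch of how an image size could be snapped to multiples of 28 while keeping the pixel count inside the configured range. It is an illustration under stated assumptions, not the library's implementation: `sketch_resize` is a hypothetical function, and edge cases such as extreme aspect ratios are ignored.

```python
import math

FACTOR = 28


def sketch_resize(height, width, min_pixels, max_pixels, factor=FACTOR):
    """Illustrative only: snap sides to multiples of `factor`, keeping area in range."""
    # Start from the nearest multiples of the factor.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


h, w = sketch_resize(1080, 1920, 256 * 28 * 28, 1280 * 28 * 28)
print(h, w, (h // FACTOR) * (w // FACTOR))  # 728 1316 1222
```

In this example a 1080 x 1920 frame is reduced to 728 x 1316, which corresponds to 1222 tokens under the 28 x 28 assumption and so stays within the 1280-token ceiling.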

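Processing long texts

The current config.json sets the context length to at most 32,768 tokens. To handle inputs longer than 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to ensure optimal performance on long texts. For supported frameworks, add the following to config.json to enable YaRN: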

{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}

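Note, however, that this method has a significant impact on the performance of temporal and spatial localization tasks and is therefore not recommended. For long video inputs, because MRoPE itself is more economical with position ids, max_position_embeddings can instead be changed directly to a larger value, such as 64k.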
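
The second option, raising max_position_embeddings, amounts to a one-field change in config.json. The sketch below applies it to a stand-in dictionary rather than a real file; `extend_context` is a hypothetical helper, the starting value is taken from the text above, and 64k is assumed to mean 65536.

```python
import json


def extend_context(config: dict, new_length: int = 65536) -> dict:
    """Return a copy of a config dict with a larger max_position_embeddings."""
    updated = dict(config)
    # Never lower an existing value; only raise it.
    updated["max_position_embeddings"] = max(
        new_length, config.get("max_position_embeddings", 0)
    )
    return updated


# A stand-in for the parsed config.json; real files contain many more keys.
config = {"max_position_embeddings": 32768}
print(json.dumps(extend_context(config)))  # {"max_position_embeddings": 65536}
```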

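🔧 Technical details

Model architecture updates

Dynamic resolution and frame rate training for video understanding

Dynamic resolution is extended to the temporal dimension through dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, mRoPE is updated in the temporal dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately to pinpoint specific moments.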
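
The idea of aligning temporal IDs with absolute time can be illustrated with a toy calculation. This is a simplified sketch, not the model's actual position-encoding code: the function `temporal_ids` and the `ids_per_second` scale are assumptions chosen so that the numbers are easy to follow.

```python
def temporal_ids(num_frames: int, fps: float, ids_per_second: int = 2) -> list:
    """Illustrative only: map frame indices to position ids proportional to time."""
    return [round(i / fps * ids_per_second) for i in range(num_frames)]


slow = temporal_ids(num_frames=4, fps=1.0)  # frames at 0 s, 1 s, 2 s, 3 s
fast = temporal_ids(num_frames=8, fps=2.0)  # frames at 0 s, 0.5 s, ..., 3.5 s
print(slow)  # [0, 2, 4, 6]
print(fast)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The frame at the 1-second mark receives id 2 in both cases, even though it is the second frame at 1 fps and the third at 2 fps. IDs tied to frame index alone would not have this property.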

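Streamlined and efficient vision encoder

Training and inference speed are improved by strategically introducing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.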

📄 License

This project is released under the qwen-research license.

📈 Evaluation

Image benchmarks

Benchmark	InternVL2.5-4B	Qwen2-VL-7B	Qwen2.5-VL-3B
MMMU_val	52.3	54.1	53.1
MMMU-Pro_val	32.7	30.5	31.6
AI2D_test	81.4	83.0	81.5
DocVQA_test	91.6	94.5	93.9
InfoVQA_test	72.1	76.5	77.1
TextVQA_val	76.8	84.3	79.3
MMBench-V1.1_test	79.3	80.7	77.6
MMStar	58.3	60.7	55.9
MathVista_testmini	60.5	58.2	62.3
MathVision_full	20.9	16.3	21.2

Video benchmarks

Benchmark	InternVL2.5-4B	Qwen2-VL-7B	Qwen2.5-VL-3B
MVBench	71.6	67.0	67.0
VideoMME	63.6/62.3	69.0/63.3	67.6/61.5
MLVU	48.3	-	68.2
LVBench	-	-	43.3
MMBench-Video	1.73	1.44	1.63
EgoSchema	-	-	64.8
PerceptionTest	-	-	66.9
TempCompass	-	-	64.4
LongVideoBench	55.2	55.6	54.2
CharadesSTA/mIoU	-	-	38.8

Agent benchmarks

Benchmark	Qwen2.5-VL-3B
ScreenSpot	55.5
ScreenSpot Pro	23.9
AITZ_EM	76.9
Android Control High_EM	63.7
Android Control Low_EM	22.2
AndroidWorld_SR	90.8
MobileMiniWob++_SR	67.9

📖 Citation

If you find our work helpful, please cite the following:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}