Qwen2-VL-2B-Instruct-GPTQ-Int4 Open-Source Model: Powerful Image and Video Multimodal Processing, Free to Use

Qwen2 VL 2B Instruct GPTQ Int4

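Developed by h2oai

Qwen2-VL is the latest version of the Qwen-VL model, with significant improvements in image understanding, video processing, and multimodal interaction.

Image-to-Text

Safetensors

English | License: Apache-2.0 | #Dynamic-resolution visual understanding #20-minute video processing #Multimodal agent control

Downloads: 3,074

Release date: 11/14/2024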

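Model Overview

Qwen2-VL is a vision-language model that supports image and video understanding and multimodal interaction. It handles multiple languages and is suited to a wide range of vision-language tasks.

Model Features

Dynamic resolution support

Handles images of arbitrary resolution by mapping them to a dynamic number of visual tokens, giving a more human-like visual processing experience.

Multimodal rotary position embedding

Decomposes the positional embedding into several components that capture 1D textual, 2D visual, and 3D video positional information, strengthening multimodal processing.

Long video understanding

Understands videos longer than 20 minutes for high-quality video-based question answering, dialogue, content creation, and more.

Multilingual support

Understands text in different languages inside images, including English, Chinese, most European languages, Japanese, Korean, Arabic, Vietnamese, and others.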

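Model Capabilities

Image understanding

Video processing

Multimodal interaction

Multilingual text recognition

Visual question answering

Content creation

Use Cases

Visual question answering

Image captioning

Generates descriptive text for an input image.

Accurately describes the image content

Video question answering

Answers questions about an input video.

Understands the video content and answers questions

Agent integration

Phone operation

Operates a phone automatically based on the visual environment and text instructions.

Enables automated operation

Robot control

Controls a robot based on the visual environment and text instructions.

Enables intelligent decision-making and operation

Content creation

Video content generation

Generates descriptions or related creative content from a video.

Produces high-quality content descriptions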

🚀 Qwen2-VL-2B-Instruct-GPTQ-Int4

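Qwen2-VL is the latest version of the Qwen-VL model and represents nearly a year of innovation. It brings significant improvements in image understanding, video processing, and multimodal interaction, giving users stronger vision-language capabilities.

🚀 Quick Start

The code for Qwen2-VL has been merged into the latest Hugging Face transformers. We recommend building from source with the following command: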

pip install git+https://github.com/huggingface/transformers

Otherwise you may encounter the following error:

KeyError: 'qwen2_vl'

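We provide a toolkit that makes it easier to handle various kinds of visual input, including base64-encoded data, URLs, and interleaved images and videos. Install it with: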

pip install qwen-vl-utils
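
As a rough illustration of the base64 input path mentioned above, the sketch below wraps raw image bytes in a data URI and places it in a user message. The helper name `image_message_from_bytes` is hypothetical, and the `data:image;base64,` prefix is the form this sketch assumes the toolkit accepts; check the qwen-vl-utils README before relying on it.

```python
import base64


def image_message_from_bytes(image_bytes: bytes, prompt: str) -> list:
    """Build a single-turn chat message with an inline base64 image (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                # Assumed data-URI form for base64 input; verify against qwen-vl-utils.
                {"type": "image", "image": f"data:image;base64,{encoded}"},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Placeholder bytes stand in for a real image file read with open(path, "rb").read().
messages = image_message_from_bytes(b"\xff\xd8\xff", "Describe this image.")
print(messages[0]["content"][0]["image"])
```

The resulting `messages` list has the same shape as the ones used in the examples in this section, so it could be passed to the same preprocessing steps.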

The following example shows how to call the chat model using transformers and qwen_vl_utils:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
)

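# We recommend enabling flash_attention_2 for better acceleration and memory savings, especially in multi-image and video scenarios (the snippet below also needs `import torch`)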
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4")

# By default, the number of visual tokens per image ranges from 4 to 16384. You can set min_pixels and max_pixels as needed, e.g. a token range of 256-1280, to balance speed and memory usage
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4", min_pixels=min_pixels, max_pixels=max_pixels)


messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare the inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
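
The trimming step above can be illustrated with plain lists. This is a minimal sketch, assuming (as is the case for decoder-only models in transformers) that `generate` returns each prompt followed by the newly generated tokens; the function name `trim_prompt_tokens` and the token IDs are made up for illustration.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs: each generated row starts with a copy of its (padded) prompt row.
prompt_rows = [[0, 11, 12], [21, 22, 23]]
generated_rows = [[0, 11, 12, 101, 102], [21, 22, 23, 201, 202]]

new_tokens = trim_prompt_tokens(prompt_rows, generated_rows)
print(new_tokens)  # [[101, 102], [201, 202]]
```

Only the trimmed rows are passed to `batch_decode`, so the decoded text contains the model's answer without the prompt.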

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: generate the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare the inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Alternatively: messages containing a video file and a text query (this assignment replaces the one above)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Prepare the inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
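
The `fps` and `max_pixels` fields in the video message bound how much visual input a clip contributes. The back-of-the-envelope sketch below assumes frames are sampled uniformly at `fps` and that every 28x28 pixel area costs roughly one visual token, as suggested by the `28*28` factors in the processor comments earlier. It ignores any additional resizing or frame merging the library may apply, so treat the result as a rough estimate only. The function name `video_budget` is hypothetical.

```python
def video_budget(duration_s: float, fps: float = 1.0,
                 max_pixels: int = 360 * 420, patch: int = 28):
    """Estimate (sampled frames, total visual tokens) for a video clip."""
    frames = int(duration_s * fps)
    # Each frame is capped at max_pixels; one token per patch x patch area.
    tokens_per_frame = max_pixels // (patch * patch)
    return frames, frames * tokens_per_frame


frames, tokens = video_budget(20 * 60)  # a 20-minute clip at 1 frame per second
print(frames, tokens)
```

Lowering `fps` or `max_pixels` shrinks this estimate proportionally, which is one way to trade detail for speed and memory on long videos.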

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Prepare the inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

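✨ Key Features

What's New in Qwen2-VL

Key Enhancements

State-of-the-art understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos longer than 20 minutes: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialogue, content creation, and more.
Agent that can operate phones, robots, and other devices: with complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots to operate them automatically based on the visual environment and text instructions.
Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now understands text in different languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

Model Architecture Updates

Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them to a dynamic number of visual tokens and offering a more human-like visual processing experience.

Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.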
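
To make the M-ROPE decomposition concrete, the toy sketch below assigns each token a (temporal, height, width) position triple. It follows one reading of the scheme described in the Qwen2-VL paper: text tokens share the same index in all three components, vision tokens vary by frame, row, and column, and the next segment starts after the largest index used so far. The function name and segment format are hypothetical, and this is not the model's actual code.

```python
def mrope_position_ids(segments):
    """Return (temporal, height, width) position triples for a token sequence.

    segments: list of ("text", n_tokens) or ("vision", frames, rows, cols).
    """
    positions = []
    start = 0  # first free position index for the next segment
    for seg in segments:
        if seg[0] == "text":
            n = seg[1]
            # Text: all three components advance together, like 1D RoPE.
            positions.extend((start + i, start + i, start + i) for i in range(n))
            start += n
        else:
            _, frames, rows, cols = seg
            # Vision: each component tracks its own axis of the patch grid.
            for t in range(frames):
                for r in range(rows):
                    for c in range(cols):
                        positions.append((start + t, start + r, start + c))
            start += max(frames, rows, cols)
    return positions


# Two text tokens, a single-frame 2x2 image, then one more text token.
print(mrope_position_ids([("text", 2), ("vision", 1, 2, 2), ("text", 1)]))
```

For a single image the temporal component stays constant, while for video it would increase with each frame.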

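We have three models with 2, 7, and 72 billion parameters. This repository contains the quantized version of the instruction-tuned 2B Qwen2-VL model. For more information, see our blog and GitHub.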

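Benchmarks

Performance of Quantized Models

This section reports the generation performance of quantized models (including GPTQ and AWQ) in the Qwen2-VL series. Specifically, we report the following metrics:

MMMU_VAL (accuracy)
DocVQA_VAL (accuracy)
MMBench_DEV_EN (accuracy)
MathVista_MINI (accuracy)

We use VLMEvalKit to evaluate all models.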

Model	Quantization	MMMU	DocVQA	MMBench	MathVista
Qwen2-VL-2B-Instruct	BF16	41.88	88.34	72.07	44.40
	GPTQ-Int8	41.55	88.28	71.99	44.60
	GPTQ-Int4	39.22	87.21	70.87	41.69
	AWQ	41.33	86.96	71.64	39.90
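
To read the table as a cost of quantization, one can subtract the BF16 score from each quantized score. The short sketch below does that arithmetic with the numbers copied from the table; the helper name `delta_vs_bf16` is just for illustration.

```python
# Scores copied from the table above (Qwen2-VL-2B-Instruct).
SCORES = {
    "BF16": {"MMMU": 41.88, "DocVQA": 88.34, "MMBench": 72.07, "MathVista": 44.40},
    "GPTQ-Int8": {"MMMU": 41.55, "DocVQA": 88.28, "MMBench": 71.99, "MathVista": 44.60},
    "GPTQ-Int4": {"MMMU": 39.22, "DocVQA": 87.21, "MMBench": 70.87, "MathVista": 41.69},
    "AWQ": {"MMMU": 41.33, "DocVQA": 86.96, "MMBench": 71.64, "MathVista": 39.90},
}


def delta_vs_bf16(quant: str) -> dict:
    """Score difference (quantized minus BF16) per benchmark, in points."""
    base = SCORES["BF16"]
    return {bench: round(SCORES[quant][bench] - base[bench], 2) for bench in base}


for quant in ("GPTQ-Int8", "GPTQ-Int4", "AWQ"):
    print(quant, delta_vs_bf16(quant))
```

For the GPTQ-Int4 model in this repository, the differences come out to roughly 1 to 3 points below BF16 across the four benchmarks.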

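Speed Benchmark

This section reports the speed performance of the bf16 and quantized (GPTQ-Int4, GPTQ-Int8, and AWQ) models in the Qwen2-VL series. Specifically, we report inference speed (tokens/s) and memory footprint (GB) under different context lengths.

The environment for the evaluation with Hugging Face transformers is:

NVIDIA A100 80GB
CUDA 11.8
PyTorch 2.2.1+cu118
Flash Attention 2.6.1
Transformers 4.38.2
AutoGPTQ 0.6.0+cu118
AutoAWQ 0.2.5+cu118 (autoawq_kernels 0.0.6+cu118)

Notes:

We use a batch size of 1 and the smallest possible number of GPUs for the evaluation.
We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens.

2B (transformers)

Model	Input length	Quantization	GPU count	Speed (tokens/s)	GPU memory (GB)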
Qwen2-VL-2B-Instruct	1	BF16	1	35.29	4.68
		GPTQ-Int8	1	28.59	3.55
		GPTQ-Int4	1	39.76	2.91
		AWQ	1	29.89	2.88
	6144	BF16	1	36.58	10.01
		GPTQ-Int8	1	29.53	8.87
		GPTQ-Int4	1	39.27	8.21
		AWQ	1	33.42	8.18
	14336	BF16	1	36.31	17.20
		GPTQ-Int8	1	31.03	16.07
		GPTQ-Int4	1	39.89	15.40
		AWQ	1	32.28	15.40
	30720	BF16	1	32.53	31.64
		GPTQ-Int8	1	27.76	30.51
		GPTQ-Int4	1	30.73	29.84
		AWQ	1	31.55	29.84
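
As a worked reading of the table, the sketch below compares GPTQ-Int4 against BF16 at the shortest and longest input lengths shown. The numbers are copied from the rows above; the helper name `compare_int4_to_bf16` is hypothetical.

```python
# (speed in tokens/s, GPU memory in GB) copied from the table above.
ROWS = {
    1: {"BF16": (35.29, 4.68), "GPTQ-Int4": (39.76, 2.91)},
    30720: {"BF16": (32.53, 31.64), "GPTQ-Int4": (30.73, 29.84)},
}


def compare_int4_to_bf16(input_len: int):
    """Return (speed ratio Int4/BF16, percent of GPU memory saved by Int4)."""
    bf_speed, bf_mem = ROWS[input_len]["BF16"]
    q_speed, q_mem = ROWS[input_len]["GPTQ-Int4"]
    speed_ratio = round(q_speed / bf_speed, 2)
    mem_saved_pct = round((1 - q_mem / bf_mem) * 100, 1)
    return speed_ratio, mem_saved_pct


for length in (1, 30720):
    print(length, compare_int4_to_bf16(length))
```

In these measurements the absolute memory saving stays near 1.8 GB at both lengths, so the relative saving shrinks as the input grows. That pattern would be consistent with quantization reducing weight memory while leaving context-dependent memory unchanged, though the table alone does not establish the cause.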

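🔧 Technical Details

Model Architecture

Naive Dynamic Resolution: Qwen2-VL can handle arbitrary image resolutions, mapping them to a dynamic number of visual tokens and offering a more human-like visual processing experience.
Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.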
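
The mapping from resolution to token count under naive dynamic resolution can be approximated in a few lines. This is a minimal sketch, assuming one visual token per 28x28 pixel area (the factor that appears in the `min_pixels` and `max_pixels` examples in the quick start) and a simple aspect-preserving rescale when the image falls outside the pixel budget. The function `estimate_visual_tokens` is hypothetical and the library's own resizing logic may differ in detail.

```python
import math

PATCH = 28  # assumed side length, in pixels, of the area covered by one visual token


def estimate_visual_tokens(height: int, width: int,
                           min_pixels: int = 4 * 28 * 28,
                           max_pixels: int = 16384 * 28 * 28) -> int:
    """Rough estimate of the visual tokens produced for one image."""
    # Snap each side to the nearest multiple of the patch size (at least one patch).
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: scale down, keeping the aspect ratio, and round sides down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: scale up, keeping the aspect ratio, and round sides up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


print(estimate_visual_tokens(280, 280))                               # within budget
print(estimate_visual_tokens(2800, 2800, max_pixels=1280 * 28 * 28))  # capped
```

With the default budget a 280x280 image maps to a 10x10 grid of tokens, while tightening `max_pixels` forces large images down to at most the configured token count.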

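📄 License

This project is licensed under Apache-2.0.

📚 Documentation

Limitations

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Known limitations include:

Lack of audio support: the current model cannot understand audio information in videos.
Data timeliness: our image dataset is updated through June 2023, so information after this date may not be covered.
Limits in recognizing individuals and intellectual property (IP): the model's ability to recognize specific individuals or IPs is limited, and it may not cover all well-known personalities or brands.
Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution need improvement.
Insufficient counting accuracy: particularly in complex scenes, object counting accuracy is low and needs further improvement.
Weak spatial reasoning: especially in 3D spaces, the model's inference of positional relationships between objects is inadequate, making it hard to judge relative positions precisely.

These limitations are ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.

Citation

If you find our work helpful, feel free to cite us.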

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}