Qwen2-VL-2B-Instruct-GPTQ-Int4 open-source model - free, powerful multimodal processing for images and video

Qwen2 VL 2B Instruct GPTQ Int4

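Developed by h2oai

Qwen2-VL is the latest version of the Qwen-VL model. It brings significant improvements in image understanding, video processing, and multimodal interaction.

Image-to-Text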

Safetensors

English

License: Apache-2.0

#Dynamic-resolution visual understanding #20-minute video processing #Multimodal agent control

Downloads: 3,074

Release date: 11/14/2024

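Model Overview

Qwen2-VL is a vision-language model that supports image and video understanding and multimodal interaction, recognizes text in multiple languages, and suits a wide range of vision-language tasks.

Model Features

Dynamic resolution support

Handles images of arbitrary resolution by mapping them to a dynamic number of visual tokens, giving a more human-like visual processing experience.

Multimodal Rotary Position Embedding

Decomposes the position embedding into multiple components to capture 1D text, 2D visual, and 3D video positional information, strengthening multimodal processing.

Long video understanding

Understands videos longer than 20 minutes for high-quality video-based question answering, dialogue, content creation, and more.

Multilingual support

Understands text in different languages within images, including English, Chinese, most European languages, Japanese, Korean, Arabic, Vietnamese, and others.

Model Capabilities

Image understanding

Video processing

Multimodal interaction

Multilingual text recognition

Visual question answering

Content creation

Use Cases

Visual question answering

Image captioning

Generates descriptive text from an input image.

Describes image content accurately

Video question answering

Answers questions about an input video.

Understands video content and answers questions

Agent integration

Phone operation

Operates a phone automatically based on the visual environment and text instructions.

Enables automated operation

Robot control

Controls a robot based on the visual environment and text instructions.

Enables intelligent decision-making and operation

Content creation

Video content generation

Generates descriptions or related content based on a video.

Produces high-quality content descriptions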

🚀 Qwen2-VL-2B-Instruct-GPTQ-Int4

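Qwen2-VL is the latest version of the Qwen-VL model, representing nearly a year of innovation. It brings significant improvements in image understanding, video processing, and multimodal interaction, giving users stronger vision-language capabilities.

🚀 Quick Start

The code for Qwen2-VL has been merged into the latest Hugging Face transformers. We recommend building from source with the following command: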

pip install git+https://github.com/huggingface/transformers

Otherwise you may encounter the following error:

KeyError: 'qwen2_vl'
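
To check whether the installed transformers build already includes Qwen2-VL support, one option is to try the same import the examples on this page rely on. The following is a minimal sketch; the helper name `has_qwen2_vl` is illustrative and not part of any library.

```python
def has_qwen2_vl() -> bool:
    """Return True if the installed transformers exposes the Qwen2-VL model class."""
    try:
        from transformers import Qwen2VLForConditionalGeneration  # noqa: F401
    except (ImportError, RuntimeError):
        # ImportError: transformers is missing or predates Qwen2-VL support.
        # RuntimeError: a lazily loaded transformers submodule failed to import.
        return False
    return True


if has_qwen2_vl():
    print("Qwen2-VL support found.")
else:
    print("Qwen2-VL support missing; install transformers from source as shown above.")
```

If the check fails, loading the model is likely to end in the `KeyError` shown above.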

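We provide a toolkit that helps you handle various types of visual input, including base64-encoded data, URLs, and interleaved images and videos. Install it with the following command: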

pip install qwen-vl-utils
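
As an illustration of the base64 input form mentioned above, the sketch below wraps raw image bytes in a chat message. The `data:image;base64,` prefix is an assumption about the convention the toolkit accepts and should be checked against the qwen-vl-utils documentation. The helper name `image_bytes_to_message` is hypothetical.

```python
import base64


def image_bytes_to_message(image_bytes: bytes, prompt: str) -> list:
    """Wrap raw image bytes and a text prompt in a single-turn chat message."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                # Assumed data-URI form for base64 input; verify before relying on it.
                {"type": "image", "image": f"data:image;base64,{encoded}"},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Placeholder bytes stand in for the contents of a real JPEG or PNG file.
messages = image_bytes_to_message(b"\xff\xd8\xff\xe0", "Describe this image.")
print(messages[0]["content"][0]["image"])
```

A URL or a `file://` path can be placed in the same `"image"` field, as the examples on this page do.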

The following code example shows how to use the chat model with transformers and qwen_vl_utils:

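import torch  # needed only for the optional flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)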
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory savings, especially in multi-image and video scenarios
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4")

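# The default range for the number of visual tokens per image is 4 to 16384. You can set min_pixels and max_pixels as needed, for example a token range of 256 to 1280, to balance speed and memory usage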
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4", min_pixels=min_pixels, max_pixels=max_pixels)


messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
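
The `min_pixels` and `max_pixels` comment above treats one visual token as a 28x28 pixel area (256 tokens correspond to 256*28*28 pixels). The sketch below estimates how an image size maps to a token count under that budget. It is a simplified illustration, not the processor's actual implementation, and the function name `estimate_visual_tokens` is hypothetical.

```python
import math


def estimate_visual_tokens(height, width, min_pixels=256 * 28 * 28,
                           max_pixels=1280 * 28 * 28, factor=28):
    """Estimate the resized dimensions and visual-token count for one image."""
    # Snap each side to the nearest multiple of `factor` (at least one unit).
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Shrink while keeping the aspect ratio, rounding down to stay in budget.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Enlarge while keeping the aspect ratio, rounding up to meet the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    tokens = (h // factor) * (w // factor)
    return h, w, tokens


print(estimate_visual_tokens(560, 840))    # already within the budget
print(estimate_visual_tokens(2800, 2800))  # scaled down to fit max_pixels
print(estimate_visual_tokens(28, 28))      # scaled up to reach min_pixels
```

Lowering `max_pixels` caps the token count per image, which is what trades detail for speed and memory.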

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: generate the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
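
The slicing step in the examples above exists because `generate` returns each prompt followed by the newly generated tokens, so the prompt must be cut off before decoding. A minimal sketch with toy token ids (the values and the helper name `trim_prompt_tokens` are made up for illustration):

```python
def trim_prompt_tokens(input_ids, output_ids):
    """Drop the echoed prompt from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Toy token ids: each output starts with its prompt, then the new tokens.
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 2]]
print(trim_prompt_tokens(prompt_batch, generated_batch))  # [[42, 43, 2]]
```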

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
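# Alternatively, messages containing a video file and a text query (this assignment replaces the frame-list messages above)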
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
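
The `fps` and `max_pixels` fields in the video message above bound how much visual input a video produces. As a rough back-of-the-envelope sketch, assuming one visual token per 28x28 pixel area (as in the `min_pixels`/`max_pixels` comments earlier) and ignoring any frame caps or temporal merging the toolkit may apply, the budget for a 20-minute video can be estimated as follows. The function name `video_budget` is hypothetical.

```python
def video_budget(duration_s, fps=1.0, max_pixels=360 * 420, pixels_per_token=28 * 28):
    """Rough upper bound on sampled frames and visual tokens for one video."""
    frames = int(duration_s * fps)
    tokens_per_frame = max_pixels // pixels_per_token
    return frames, tokens_per_frame, frames * tokens_per_frame


frames, per_frame, total = video_budget(duration_s=20 * 60)
print(frames, per_frame, total)
```

Lowering `fps` or `max_pixels` shrinks this bound proportionally, which is the main lever for fitting long videos into memory.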

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the conversations for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
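
Batching prompts of different lengths requires padding, which is why the processor call above passes `padding=True`. For decoder-only generation, padding is commonly applied on the left so that each prompt ends directly before its generated tokens. Whether the processor pads on the left by default is not stated here and should be checked. The sketch below only illustrates the idea on plain lists; `left_pad` and `pad_id=0` are hypothetical.

```python
def left_pad(batch, pad_id=0):
    """Left-pad token-id lists to a common length and build attention masks."""
    width = max(len(seq) for seq in batch)
    input_ids = [[pad_id] * (width - len(seq)) + seq for seq in batch]
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in batch]
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7, 8], [9, 10]])
print(ids)   # [[5, 6, 7, 8], [0, 0, 9, 10]]
print(mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```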

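✨ Key Features

What's new in Qwen2-VL

Key enhancements

State-of-the-art understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos longer than 20 minutes: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialogue, content creation, and more.
An agent that can operate phones, robots, and similar devices: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots to operate them automatically based on the visual environment and text instructions.
Multilingual support: to serve users worldwide, Qwen2-VL now understands text in images in languages beyond English and Chinese, including most European languages, Japanese, Korean, Arabic, Vietnamese, and others.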

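Model architecture updates

Naive Dynamic Resolution: unlike previous versions, Qwen2-VL can handle images of arbitrary resolution, mapping them to a dynamic number of visual tokens and giving a more human-like visual processing experience.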

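Multimodal Rotary Position Embedding (M-ROPE): decomposes the position embedding into multiple components to capture 1D text, 2D visual, and 3D video positional information, strengthening the model's multimodal processing.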
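
To make the M-ROPE decomposition concrete, the sketch below assigns each token a (temporal, height, width) position triple. Text tokens use the same index in all three components, while visual tokens index their place in a grid. It is a simplified illustration of the idea, not the model's implementation; the function name and the segment format are hypothetical.

```python
def mrope_position_ids(segments):
    """Build (temporal, height, width) position ids for a mixed sequence.

    `segments` is a list of ("text", n_tokens) or ("vision", (t, h, w)) entries,
    where (t, h, w) is the visual-token grid of an image or video.
    """
    positions = []
    next_start = 0
    for kind, spec in segments:
        if kind == "text":
            # Text tokens share one index across all components (plain 1D RoPE).
            block = [(next_start + i,) * 3 for i in range(spec)]
        else:
            t, h, w = spec
            # Visual tokens are indexed by frame, row, and column within the grid.
            block = [
                (next_start + ti, next_start + hi, next_start + wi)
                for ti in range(t) for hi in range(h) for wi in range(w)
            ]
        positions.extend(block)
        if block:
            # The next segment starts after the largest index used so far.
            next_start = max(max(p) for p in block) + 1
    return positions


# Two text tokens, a single image with a 2x2 token grid, then one text token.
print(mrope_position_ids([("text", 2), ("vision", (1, 2, 2)), ("text", 1)]))
```

For a single image the temporal component stays constant, and for a video it advances once per frame group.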

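We have three models with 2, 7, and 72 billion parameters. This repository contains the quantized version of the instruction-tuned 2B Qwen2-VL model. For more information, visit our blog and GitHub.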

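Benchmarks

Performance of quantized models

This section reports the generation performance of the quantized models (including GPTQ and AWQ) of the Qwen2-VL series. Specifically, we report the following metrics:

MMMU_VAL (accuracy)
DocVQA_VAL (accuracy)
MMBench_DEV_EN (accuracy)
MathVista_MINI (accuracy)

We use VLMEvalKit to evaluate all models.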

Model	Quantization	MMMU	DocVQA	MMBench	MathVista
Qwen2-VL-2B-Instruct	BF16	41.88	88.34	72.07	44.40
	GPTQ-Int8	41.55	88.28	71.99	44.60
	GPTQ-Int4	39.22	87.21	70.87	41.69
	AWQ	41.33	86.96	71.64	39.90
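
To read the table as a cost of quantization, the short sketch below computes each quantized variant's score change relative to BF16, using only the numbers reported above. The variable and function names are illustrative.

```python
# Scores copied from the table above: (MMMU, DocVQA, MMBench, MathVista).
scores = {
    "BF16": (41.88, 88.34, 72.07, 44.40),
    "GPTQ-Int8": (41.55, 88.28, 71.99, 44.60),
    "GPTQ-Int4": (39.22, 87.21, 70.87, 41.69),
    "AWQ": (41.33, 86.96, 71.64, 39.90),
}
benchmarks = ("MMMU", "DocVQA", "MMBench", "MathVista")


def delta_vs_bf16(method):
    """Per-benchmark score change of a quantized variant relative to BF16."""
    return {
        name: round(q - b, 2)
        for name, q, b in zip(benchmarks, scores[method], scores["BF16"])
    }


for method in ("GPTQ-Int8", "GPTQ-Int4", "AWQ"):
    print(method, delta_vs_bf16(method))
```

For the GPTQ-Int4 model in this repository, the differences range from about 1.1 to 2.7 points below BF16.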

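Speed benchmark

This section reports the speed performance of the bf16 and quantized models (including GPTQ-Int4, GPTQ-Int8, and AWQ) of the Qwen2-VL series. Specifically, we report the inference speed (tokens/s) and memory footprint (GB) under different context lengths.

The environment for the evaluation with Hugging Face transformers is: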

NVIDIA A100 80GB
CUDA 11.8
PyTorch 2.2.1+cu118
Flash Attention 2.6.1
Transformers 4.38.2
AutoGPTQ 0.6.0+cu118
AutoAWQ 0.2.5+cu118 (autoawq_kernels 0.0.6+cu118)

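Notes:

We use a batch size of 1 and the smallest possible number of GPUs for the evaluation.
We test the speed and memory when generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens.

2B (transformers)

Model	Input length	Quantization	GPU count	Speed (tokens/s)	GPU memory (GB)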
Qwen2-VL-2B-Instruct	1	BF16	1	35.29	4.68
		GPTQ-Int8	1	28.59	3.55
		GPTQ-Int4	1	39.76	2.91
		AWQ	1	29.89	2.88
	6144	BF16	1	36.58	10.01
		GPTQ-Int8	1	29.53	8.87
		GPTQ-Int4	1	39.27	8.21
		AWQ	1	33.42	8.18
	14336	BF16	1	36.31	17.20
		GPTQ-Int8	1	31.03	16.07
		GPTQ-Int4	1	39.89	15.40
		AWQ	1	32.28	15.40
	30720	BF16	1	32.53	31.64
		GPTQ-Int8	1	27.76	30.51
		GPTQ-Int4	1	30.73	29.84
		AWQ	1	31.55	29.84
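
As a worked reading of the table, the sketch below takes the input-length-1 rows and derives the memory saved relative to BF16 and the approximate time to generate the 2048 tokens used in the benchmark. The time estimate assumes the reported throughput holds for the whole generation, which is a simplification. The names are illustrative.

```python
# (speed in tokens/s, GPU memory in GB) at input length 1, from the table above.
at_len_1 = {
    "BF16": (35.29, 4.68),
    "GPTQ-Int8": (28.59, 3.55),
    "GPTQ-Int4": (39.76, 2.91),
    "AWQ": (29.89, 2.88),
}


def summarize(method, new_tokens=2048):
    """Memory saved vs BF16 (percent) and seconds to generate `new_tokens`."""
    speed, mem = at_len_1[method]
    _, bf16_mem = at_len_1["BF16"]
    saved_pct = round(100 * (bf16_mem - mem) / bf16_mem, 1)
    seconds = round(new_tokens / speed, 1)
    return saved_pct, seconds


for method in at_len_1:
    print(method, summarize(method))
```

At longer input lengths the memory gap narrows in relative terms: the reported figures differ by a similar absolute amount (under 2 GB) while the total grows with context length.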

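🔧 Technical Details

Model architecture

Naive Dynamic Resolution: Qwen2-VL can handle images of arbitrary resolution, mapping them to a dynamic number of visual tokens and giving a more human-like visual processing experience.
Multimodal Rotary Position Embedding (M-ROPE): decomposes the position embedding into multiple components to capture 1D text, 2D visual, and 3D video positional information, strengthening the model's multimodal processing.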

📄 License

This project is licensed under Apache-2.0.

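📚 Detailed Documentation

Limitations

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Known limitations include:

Lack of audio support: the current model cannot understand audio information in videos.
Data timeliness: our image dataset is updated through June 2023, and information after that date may not be covered.
Limited recognition of individuals and intellectual property: the model's ability to recognize specific individuals or IP is limited, and it may not cover all well-known people or brands.
Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution need improvement.
Insufficient counting accuracy: particularly in complex scenes, object counting accuracy is low and needs further improvement.
Weak spatial reasoning: especially in 3D space, the model's inference of positional relationships between objects is inadequate, making it hard to judge their relative positions precisely.

These limitations are ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.

Citation

If you find our work helpful, feel free to cite us.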

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}