Qwen2-VL: Open-Source Multilingual Image-Text Recognition Model with Full-Resolution Image Understanding and Ultra-Long Video Analysis

Uground V1 72B Preview

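Developed by osunlp

Qwen2-VL is the latest iteration of the Qwen-VL model series, with full-resolution image understanding, ultra-long video analysis, and multilingual image-text recognition.

Image-to-Text

Transformers

Language: English | License: Other | Tags: #full-resolution-visual-understanding #ultra-long-video-analysis #multilingual-image-text-recognition

Downloads: 21

Released: 1/7/2025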

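Model Overview

A 72-billion-parameter multimodal vision-language model that supports image understanding, video analysis, multilingual text recognition, and agent operation.

Model Features

Full-resolution image understanding

Maps images to a dynamic number of visual tokens for a more human-like visual processing experience, and reaches state-of-the-art results on benchmarks such as MathVista and DocVQA.

Ultra-long video understanding

Can analyze videos longer than 20 minutes, supporting high-quality video question answering, dialog, and content creation.

Agent operation

Combines complex reasoning with decision making, and can be integrated with devices such as mobile phones and robots to automate operations driven by the visual environment.

Multilingual image-text understanding

Recognizes multilingual text inside images, covering most European languages, Japanese, Korean, Arabic, Vietnamese, and others.

Model Capabilities

Image understanding

Video analysis

Multilingual text recognition

Agent operation

Complex reasoning

Decision support

Use Cases

Document processing

Document question answering

Parses document images and answers questions about them.

Reaches 96.5% accuracy on the DocVQA test set.

Education

Math problem solving

Parses mathematical figures and charts and solves the associated problems.

Reaches 70.5% accuracy on MathVista (testmini).

Smart devices

Android device operation

Controls Android devices through visual understanding.

Reaches 89.6% type-match accuracy on the AITZ benchmark.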

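🚀 Qwen2-VL-72B-Instruct

Qwen2-VL-72B-Instruct is the latest version of the Qwen-VL model and represents nearly a year of innovation. It brings significant improvements in visual understanding, video processing, and multimodal interaction, supports multiple languages, handles images of varying resolutions and aspect ratios, and can be integrated into mobile devices and robots for automated operation.

🚀 Quick Start

Installing Dependencies

The Qwen2-VL code has been merged into the latest Hugging Face transformers. We recommend building from source with the following command: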

pip install git+https://github.com/huggingface/transformers

Otherwise, you may encounter the following error:

KeyError: 'qwen2_vl'
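
One way to catch this before loading the model is to check the installed transformers version. The following is a minimal sketch, not an official check: the 4.45.0 threshold is an assumption (to our understanding, the first release that registered the `qwen2_vl` model type), and `parse_version` and `supports_qwen2_vl` are hypothetical helper names that are not part of transformers.

```python
import re
from importlib import metadata

# Assumption: qwen2_vl support first appeared in transformers 4.45.0.
MIN_VERSION = (4, 45, 0)


def parse_version(version: str) -> tuple:
    """Return the first three numeric components of a version string."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    numbers += [0] * (3 - len(numbers))  # pad short versions such as "4.46"
    return tuple(numbers)


def supports_qwen2_vl(version: str) -> bool:
    """True if the given transformers version is at least MIN_VERSION."""
    return parse_version(version) >= MIN_VERSION


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if supports_qwen2_vl(installed) else "too old for qwen2_vl"
        print(installed, status)
```

A source build reports a development version string (for example one ending in `.dev0`); the helper only looks at the first three numeric components, so such builds compare by their base version.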

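We also provide a toolkit, qwen-vl-utils, that makes it easier to handle various kinds of visual input, including base64, URLs, and interleaved images and videos. Install it with the following command: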

pip install qwen-vl-utils
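
As an illustration of the base64 input mentioned above, the sketch below wraps raw image bytes in a chat message. It assumes the `data:image;base64,` prefix used in the upstream Qwen2-VL examples; `image_bytes_to_message` is a hypothetical helper name, not part of qwen-vl-utils.

```python
import base64


def image_bytes_to_message(image_bytes: bytes, question: str) -> list:
    """Build a single-turn chat message that embeds an image as base64."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                # Assumption: the "data:image;base64," prefix marks inline image data.
                {"type": "image", "image": f"data:image;base64,{encoded}"},
                {"type": "text", "text": question},
            ],
        }
    ]


if __name__ == "__main__":
    # Placeholder bytes; in practice, read them from an image file.
    messages = image_bytes_to_message(b"abc", "Describe this image.")
    print(messages[0]["content"][0]["image"])
```

The resulting list has the same shape as the `messages` used in the code examples on this page, so it can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way.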

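Code Examples

The following snippet shows how to call the chat model with transformers and qwen_vl_utils:

import torch  # only needed for the optional flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)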
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better acceleration and memory savings, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# By default, each image maps to between 4 and 16384 visual tokens. Set min_pixels and max_pixels as needed, e.g. a token range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare the inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
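
The `min_pixels` and `max_pixels` values in the commented-out lines above are token counts multiplied by 28*28, which suggests that one visual token covers a 28x28 pixel area. The sketch below works through that arithmetic. `pixel_budget` and `estimate_visual_tokens` are hypothetical helpers, and the estimate is a deliberate simplification: it only clamps the image area into the budget and ignores the rounding and aspect-ratio handling that the real processor performs.

```python
PATCH = 28  # assumption: one visual token covers a 28x28 pixel area


def pixel_budget(num_tokens: int) -> int:
    """Convert a visual-token count into a pixel count."""
    return num_tokens * PATCH * PATCH


def estimate_visual_tokens(height: int, width: int,
                           min_tokens: int = 256, max_tokens: int = 1280) -> int:
    """Rough token estimate: clamp the image area into the budget, then count 28x28 cells."""
    area = height * width
    clamped = min(max(area, pixel_budget(min_tokens)), pixel_budget(max_tokens))
    return clamped // (PATCH * PATCH)


if __name__ == "__main__":
    print(pixel_budget(256), pixel_budget(1280))  # min_pixels, max_pixels
    for size in [(224, 224), (700, 700), (1080, 1920)]:
        print(size, estimate_visual_tokens(*size))
```

Under these assumptions a small image is scaled up to the 256-token floor and a 1080x1920 image is capped at 1280 tokens, which is the speed and memory trade-off the comment describes.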

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half precision onto the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: generate the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
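
Both snippets above slice each generated sequence by the length of its prompt before decoding, because `generate` returns the prompt tokens followed by the new tokens. The sketch below isolates that step with plain Python lists in place of tensors; `trim_prompt` is a hypothetical helper name and the token ids are made up for illustration.

```python
def trim_prompt(input_ids: list, output_ids: list) -> list:
    """Remove the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


if __name__ == "__main__":
    prompts = [[1, 2, 3], [4, 5, 6]]                      # made-up prompt token ids
    generated = [[1, 2, 3, 10, 11], [4, 5, 6, 12, 13]]    # prompt + new tokens
    print(trim_prompt(prompts, generated))
```

Without this step, the decoded text would repeat the full chat prompt before the model's answer.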

Multi-Image Inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare the inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video Inference

# Option 1: messages containing a list of image frames treated as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Option 2: messages containing a video file and a text query (this assignment replaces option 1)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Prepare the inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
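
The two video message layouts above differ only in the value of the `video` field (a list of frame paths or a single file path) and in the optional `max_pixels` key. The sketch below builds either layout from one function. `make_video_message` is a hypothetical helper, not part of qwen-vl-utils, and the paths are placeholders.

```python
def make_video_message(video, question: str, fps: float = 1.0, max_pixels=None) -> list:
    """Build a single-turn video message.

    `video` is either one path/URL string or a list of frame paths.
    """
    video_item = {"type": "video", "video": video, "fps": fps}
    if max_pixels is not None:
        # Optional per-video pixel cap, as in the file-based example above.
        video_item["max_pixels"] = max_pixels
    return [
        {
            "role": "user",
            "content": [video_item, {"type": "text", "text": question}],
        }
    ]


if __name__ == "__main__":
    frames = [f"file:///path/to/frame{i}.jpg" for i in range(1, 5)]
    print(make_video_message(frames, "Describe this video."))
    print(make_video_message("file:///path/to/video1.mp4", "Describe this video.",
                             max_pixels=360 * 420))
```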

Batch Inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
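# Combine the messages into a batch
messages = [messages1, messages2]
# Prompts of different lengths are padded; decoder-only generation generally expects left padding
processor.tokenizer.padding_side = "left"

# Prepare the inputs for batch inference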
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

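✨ Key Features

What's New in Qwen2-VL

Key Enhancements

State-of-the-art understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos longer than 20 minutes: Qwen2-VL can understand videos of more than 20 minutes for high-quality video-based question answering, dialog, and content creation.
Agents that can operate mobile devices, robots, and more: with complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots to operate them automatically based on the visual environment and text instructions.
Multilingual support: to serve users worldwide, Qwen2-VL now understands text in different languages inside images in addition to English and Chinese, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

Model Architecture Updates

Naive Dynamic Resolution: unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them to a dynamic number of visual tokens for a more human-like visual processing experience.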

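Multimodal Rotary Position Embedding (M-RoPE): decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.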
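
To make the decomposition concrete, the sketch below assigns each token a (temporal, height, width) position triple. It is a simplified illustration under stated assumptions, not the model's implementation: text tokens are assumed to use the same index in all three components, vision tokens to follow their own grid axis, and text after a vision block to resume from the largest index used so far. All function names are hypothetical.

```python
def text_positions(start: int, length: int) -> list:
    """Text tokens: the temporal, height, and width components are identical (1D)."""
    return [(start + i, start + i, start + i) for i in range(length)]


def vision_positions(start: int, frames: int, rows: int, cols: int) -> list:
    """Vision tokens: each component follows its own axis of the (frames, rows, cols) grid."""
    return [
        (start + t, start + r, start + c)
        for t in range(frames)
        for r in range(rows)
        for c in range(cols)
    ]


def next_start(positions: list) -> int:
    """Assumption: the next segment starts one past the largest index used so far."""
    return max(max(p) for p in positions) + 1


if __name__ == "__main__":
    prefix = text_positions(0, 2)                                          # two text tokens
    image = vision_positions(next_start(prefix), frames=1, rows=2, cols=2)  # 2x2 image grid
    suffix = text_positions(next_start(prefix + image), 1)                # one trailing text token
    print(prefix + image + suffix)
```

An image is the `frames=1` case, where the temporal component stays constant; a video varies all three components, which is the 3D case named above.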

📚 Detailed Documentation

Model Evaluation

Image Benchmarks

Benchmark	Previous SoTA (open-source LVLM)	Claude-3.5 Sonnet	GPT-4o	Qwen2-VL-72B
MMMU_val	58.3	68.3	69.1	64.5
DocVQA_test	94.1	95.2	92.8	96.5
InfoVQA_test	82.0	-	-	84.5
ChartQA_test	88.4	90.8	85.7	88.3
TextVQA_val	84.4	-	-	85.5
OCRBench	852	788	736	877
MTVQA	17.3	25.7	27.8	30.9
VCR_{en easy}	84.67	63.85	91.55	91.93
VCR_{zh easy}	22.09	1.0	14.87	65.37
RealWorldQA	72.2	60.1	75.4	77.8
MME_sum	2414.7	1920.0	2328.7	2482.7
MMBench-EN_test	86.5	79.7	83.4	86.5
MMBench-CN_test	86.3	80.7	82.1	86.6
MMBench-V1.1_test	85.5	78.5	82.2	85.9
MMT-Bench_test	63.4	-	65.5	71.7
MMStar	67.1	62.2	63.9	68.3
MMVet_{GPT-4-Turbo}	65.7	66.0	69.1	74.0
HallBench_avg	55.2	49.9	55.0	58.1
MathVista_testmini	67.5	67.7	63.8	70.5
MathVision	16.97	-	30.4	25.9

Video Benchmarks

Benchmark	Previous SoTA (open-source LVLM)	Gemini 1.5-Pro	GPT-4o	Qwen2-VL-72B
MVBench	69.6	-	-	73.6
PerceptionTest_test	66.9	-	-	68.0
EgoSchema_test	62.0	63.2	72.2	77.9
Video-MME (without/with subtitles)	66.3/69.6	75.0/81.3	71.9/77.2	71.2/77.8

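Agent Benchmarks

	Benchmark	Metric	Previous SoTA	GPT-4o	Qwen2-VL-72B
General	FnCall [1]	TM	-	90.2	93.1
		EM	-	50.0	53.2
Game	Number Line	SR	89.4 [2]	91.5	100.0
	BlackJack	SR	40.2 [2]	34.5	42.6
	EZPoint	SR	50.0 [2]	85.5	100.0
	Point24	SR	2.6 [2]	3.0	4.5
Android	AITZ	TM	83.0 [3]	70.0	89.6
		EM	47.7 [3]	35.3	72.1
AI2THOR	ALFRED_{valid-unseen}	SR	67.7 [4]	-	67.8
		GC	75.3 [4]	-	75.8
VLN	R2R_{valid-unseen}	SR	79.0	43.7 [5]	51.7
	REVERIE_{valid-unseen}	SR	61.0	31.6 [5]	31.0

SR, GC, TM, and EM stand for success rate, goal-condition success, type match, and exact match, respectively. ALFRED is supported by SAM [6].

[1] Self-curated function-calling benchmark by the Tongyi team
[2] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
[3] Android in the Zoo: Chain-of-Action-Thought for GUI Agents
[4] ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
[5] MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
[6] Segment Anything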
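
For readers unfamiliar with the TM and EM columns in the agent table, the sketch below shows one simplified way such step-level metrics can be computed. It assumes that type match compares only the action type and that exact match requires the whole action, including its arguments, to equal the reference; the benchmarks' actual scoring rules may be more lenient or more detailed. The action dictionaries and the `type_and_exact_match` name are made up for illustration.

```python
def type_and_exact_match(predictions: list, references: list) -> tuple:
    """Return (TM, EM) as fractions of steps, under the simplified definitions above."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    total = len(references)
    tm = sum(p["type"] == r["type"] for p, r in zip(predictions, references)) / total
    em = sum(p == r for p, r in zip(predictions, references)) / total
    return tm, em


if __name__ == "__main__":
    references = [
        {"type": "click", "target": "OK"},
        {"type": "scroll", "direction": "up"},
        {"type": "click", "target": "Send"},
    ]
    predictions = [
        {"type": "click", "target": "OK"},         # type and arguments match
        {"type": "scroll", "direction": "down"},   # type matches, argument differs
        {"type": "type", "text": "hi"},            # wrong action type
    ]
    tm, em = type_and_exact_match(predictions, references)
    print(f"TM={tm:.1%} EM={em:.1%}")
```

Under these definitions EM can never exceed TM, which is consistent with the TM/EM pairs reported in the table.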

Multilingual Benchmarks

Model	AR	DE	FR	IT	JA	KO	RU	TH	VI	Avg.
Qwen2-VL-72B	20.7	36.5	44.1	42.8	21.6	37.4	15.6	17.7	41.6	30.9
GPT-4o	20.2	34.2	41.2	32.7	20.0	33.9	11.5	22.5	34.2	27.8
Claude3 Opus	15.1	33.4	40.6	34.4	19.4	27.2	13.0	19.5	29.1	25.7
Gemini Ultra	14.7	32.3	40.0	31.8	12.3	17.2	11.8	20.3	28.6	23.2
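
The final column of the multilingual table is the unweighted mean of the nine per-language scores, rounded to one decimal place. The short check below recomputes it from the rows above; the dictionary layout and the `average` helper are only for illustration.

```python
# Per-language scores copied from the table (AR, DE, FR, IT, JA, KO, RU, TH, VI).
SCORES = {
    "Qwen2-VL-72B": [20.7, 36.5, 44.1, 42.8, 21.6, 37.4, 15.6, 17.7, 41.6],
    "GPT-4o": [20.2, 34.2, 41.2, 32.7, 20.0, 33.9, 11.5, 22.5, 34.2],
    "Claude3 Opus": [15.1, 33.4, 40.6, 34.4, 19.4, 27.2, 13.0, 19.5, 29.1],
    "Gemini Ultra": [14.7, 32.3, 40.0, 31.8, 12.3, 17.2, 11.8, 20.3, 28.6],
}


def average(values: list) -> float:
    """Unweighted mean rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


if __name__ == "__main__":
    for model, values in SCORES.items():
        print(model, average(values))
```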

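🔧 Technical Details

This preview model was trained with LoRA for 1 epoch. A fully trained checkpoint is also available: https://huggingface.co/osunlp/UGround-V1-72B (it performs slightly better on ScreenSpot-Pro and ScreenSpot).

Qwen2-VL comes in three sizes, with 2, 7, and 72 billion parameters. This repository contains the instruction-tuned 72-billion-parameter Qwen2-VL model. For more information, see the blog and GitHub.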

📄 License

This model is released under the Tongyi Qianwen license.

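⚠️ Model Limitations

Although Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Known limitations include:

No audio support: the current model cannot understand audio information in videos.
Data timeliness: the image dataset is updated until June 2023, so information after this date may not be covered.
Limited recognition of individuals and intellectual property: the model's ability to recognize specific individuals or IP is limited and may not cover all well-known people or brands.
Limited handling of complex instructions: the model's understanding and execution of intricate multi-step instructions needs improvement.
Insufficient counting accuracy: object counting is not highly accurate, particularly in complex scenes, and needs further improvement.
Weak spatial reasoning: the model's inference of object positional relationships is inadequate, particularly in 3D space, making it difficult to judge the relative positions of objects precisely.

These limitations are ongoing directions for model optimization and improvement, and the team is committed to continually enhancing the model's performance and scope of application.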

📖 Citation

If you find our work helpful, please cite the following:

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}