Qwen2-VL: Open-Source Multilingual Image-Text Recognition Model with Full-Resolution Image Understanding and Long-Video Analysis

Uground V1 72B Preview

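Developed by osunlp

Qwen2-VL is the latest iteration of the Qwen-VL model series, offering full-resolution image understanding, long-video analysis, and multilingual image-text recognition.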

Image-to-Text

Transformers

English  License: Other  #Full-resolution visual understanding #Long-video analysis #Multilingual image-text recognition

Downloads: 21

Released: 1/7/2025

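Model Overview

A 72-billion-parameter multimodal vision-language model that supports image understanding, video analysis, multilingual text recognition, and agent operation.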

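Model Features

Full-resolution image understanding

Maps images to a dynamic number of visual tokens for a more human-like visual processing experience, and reaches state-of-the-art results on benchmarks such as MathVista and DocVQA.

Long-video understanding

Can analyze videos longer than 20 minutes, supporting high-quality video question answering, dialogue, and content creation.

Agent operation of devices

Combines complex reasoning with decision-making, and can be integrated with devices such as mobile phones and robots to operate them automatically based on the visual environment.

Multilingual image-text understanding

Recognizes multilingual text inside images, covering major European languages, Japanese, Korean, Arabic, Vietnamese, and others.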

Model Capabilities

Image understanding

Video analysis

Multilingual text recognition

Agent operation

Complex reasoning

Decision support

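Use Cases

Document processing

Document question answering

Parses document images and answers questions about them.

Achieves 96.5% accuracy on the DocVQA test set.

Education

Math problem solving

Parses mathematical figures and solves the associated problems.

Achieves 70.5% accuracy on MathVista (testmini).

Smart devices

Android device operation

Controls Android devices through visual understanding.

Achieves 89.6% type-match accuracy on the AITZ benchmark.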

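🚀 Qwen2-VL-72B-Instruct

Qwen2-VL-72B-Instruct is the latest version of the Qwen-VL model and represents nearly a year of innovation. It brings significant improvements in visual understanding, video processing, and multimodal interaction, supports multiple languages, handles images of varying resolutions and aspect ratios, and can be integrated with mobile devices and robots for automatic operation.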

🚀 Quick Start

Installing Dependencies

The Qwen2-VL code has been merged into the latest Hugging Face transformers. We recommend building from source with the following command:

pip install git+https://github.com/huggingface/transformers

Otherwise, you may encounter the following error:

KeyError: 'qwen2_vl'
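A quick way to tell whether the installed transformers build is recent enough is to check that it exposes the `Qwen2VLForConditionalGeneration` class used in the examples on this page. This is a minimal sketch rather than an official diagnostic; the helper name `supports_qwen2_vl` is hypothetical.

```python
import importlib


def supports_qwen2_vl() -> bool:
    """Return True if the installed transformers package exposes the Qwen2-VL model class."""
    try:
        transformers = importlib.import_module("transformers")
    except ImportError:
        # transformers is not installed at all
        return False
    try:
        # Releases that predate Qwen2-VL do not define this attribute.
        getattr(transformers, "Qwen2VLForConditionalGeneration")
    except Exception:
        return False
    return True


if __name__ == "__main__":
    if supports_qwen2_vl():
        print("transformers exposes Qwen2-VL")
    else:
        print("Qwen2-VL not found; install transformers from source as shown above")
```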

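We also provide a toolkit, qwen-vl-utils, that makes it easier to handle various types of visual input, including base64, URLs, and interleaved images and videos. Install it with the following command: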

pip install qwen-vl-utils
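For base64 input, the image bytes have to be embedded in the message as a string. The sketch below assumes that the toolkit accepts a `data:image;base64,` prefix followed by the encoded bytes; treat that prefix as an assumption and confirm it against the qwen-vl-utils documentation. Both helper names are hypothetical.

```python
import base64


def image_bytes_to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as a base64 data URI (prefix format is an assumption)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image;base64,{encoded}"


def build_base64_image_message(image_bytes: bytes, question: str) -> list:
    """Build a single-turn chat message list with one base64 image and one text query."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_bytes_to_data_uri(image_bytes)},
                {"type": "text", "text": question},
            ],
        }
    ]
```

In practice the bytes would come from reading an image file in binary mode, for example `open(path, "rb").read()`.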

Code Examples

The following snippet shows how to call the chat model with transformers and qwen_vl_utils:

import torch  # only needed if you enable the bfloat16 / flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory savings, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# By default, the number of visual tokens per image ranges from 4 to 16384. Set min_pixels and max_pixels as needed, for example a token range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
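The commented `min_pixels` and `max_pixels` settings in the snippet above multiply a token count by 28*28, which suggests that each visual token covers roughly a 28 × 28 pixel area. The sketch below works through that arithmetic. It is a rough estimate only: it ignores any rounding of image sides that the real processor may apply, and the helper names are hypothetical.

```python
PIXELS_PER_TOKEN = 28 * 28  # taken from the 256*28*28 / 1280*28*28 example above


def pixel_budget(num_tokens: int) -> int:
    """Convert a visual-token count into the matching pixel budget."""
    return num_tokens * PIXELS_PER_TOKEN


def approx_visual_tokens(width: int, height: int, min_pixels: int, max_pixels: int) -> int:
    """Estimate the token count after clamping the image area to [min_pixels, max_pixels]."""
    area = width * height
    clamped = min(max(area, min_pixels), max_pixels)
    return round(clamped / PIXELS_PER_TOKEN)


min_pixels = pixel_budget(256)    # 200704
max_pixels = pixel_budget(1280)   # 1003520

# A 1920x1080 image exceeds the upper budget, so it is capped at about 1280 tokens.
print(approx_visual_tokens(1920, 1080, min_pixels, max_pixels))
# A 224x224 image is below the lower budget, so it is raised to about 256 tokens.
print(approx_visual_tokens(224, 224, min_pixels, max_pixels))
```

A narrower token range trades visual detail for lower memory use and faster generation, which is the balance the comment in the snippet refers to.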

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: generate the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so that only the newly generated tokens remain
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
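Both snippets above slice each generated sequence by the length of its input, because `generate` returns the prompt tokens followed by the new tokens. The toy sketch below shows that trimming step on plain Python lists instead of tensors; the token values are made up and the helper name `trim_prompt` is hypothetical.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two prompts of three tokens each, and the sequences returned after generation.
prompts = [[1, 2, 3], [4, 5, 6]]
outputs = [[1, 2, 3, 10, 11], [4, 5, 6, 12, 13]]

trimmed = trim_prompt(prompts, outputs)
print(trimmed)  # [[10, 11], [12, 13]]
```

Without this step, decoding would reproduce the chat template and the user question in front of the model's answer.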

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
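When the number of images varies, the message can be assembled programmatically instead of written by hand. The sketch below builds the same structure as the multi-image example above from a list of local paths, converting each to a `file://` URI with `pathlib`. The helper name `build_multi_image_message` is hypothetical.

```python
from pathlib import Path


def build_multi_image_message(image_paths, question: str) -> list:
    """Build a single-turn message list with one entry per image, followed by the text query."""
    content = [
        {"type": "image", "image": Path(p).resolve().as_uri()}
        for p in image_paths
    ]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_multi_image_message(
    ["/path/to/image1.jpg", "/path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
print(len(messages[0]["content"]))  # 3 entries: two images and one text query
```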

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
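# Alternatively: messages containing a video file and a text query
# (this assignment replaces the frame-list messages defined above; use one or the other)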
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
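The `fps` field in the video messages above sets how densely the video is sampled. As an illustration of what sampling at a target rate means, the sketch below picks evenly spaced frame indices from a clip with a known native frame rate. It is a generic sketch and not the sampling logic of qwen_vl_utils; the function name and parameters are hypothetical.

```python
def sample_frame_indices(total_frames: int, native_fps: float, target_fps: float = 1.0) -> list:
    """Return indices of frames spaced so that about target_fps frames are kept per second."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames, native_fps and target_fps must be positive")
    duration_s = total_frames / native_fps
    # Keep at least one frame, even for clips shorter than one sampling interval.
    num_samples = max(1, int(duration_s * target_fps))
    step = native_fps / target_fps
    return [min(total_frames - 1, round(i * step)) for i in range(num_samples)]


# A 10-second clip recorded at 30 fps, sampled at 1 fps, keeps every 30th frame.
print(sample_frame_indices(300, 30.0, 1.0))
```

A lower target rate yields fewer frames and therefore fewer visual tokens, which matters for long videos.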

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
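In the batch example above, `padding=True` makes the tokenized prompts the same length so they can be stacked into one tensor. The toy sketch below shows the idea on plain lists. It assumes left padding, which decoder-only generation usually relies on; which side the processor actually pads depends on its tokenizer configuration, so treat that as an assumption. The helper name `pad_left` is hypothetical.

```python
def pad_left(sequences, pad_id: int):
    """Left-pad token sequences to a common length and return (padded, attention_mask)."""
    width = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        gap = width - len(seq)
        padded.append([pad_id] * gap + list(seq))
        # 0 marks padding positions, 1 marks real tokens
        mask.append([0] * gap + [1] * len(seq))
    return padded, mask


padded, mask = pad_left([[1, 2, 3], [4]], pad_id=0)
print(padded)  # [[1, 2, 3], [0, 0, 4]]
print(mask)    # [[1, 1, 1], [0, 0, 1]]
```

Because every padded row has the same length, slicing the output by `len(in_ids)` removes the padded prompt from each row in one step.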

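✨ Key Features

What's New in Qwen2-VL

Key Enhancements

- **State-of-the-art understanding of images of various resolutions and aspect ratios**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
- **Understanding videos of 20+ minutes**: Qwen2-VL can understand videos longer than 20 minutes for high-quality video-based question answering, dialogue, and content creation.
- **Agent that can operate mobile devices, robots, and more**: With complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots and operate them automatically based on the visual environment and text instructions.
- **Multilingual support**: To serve global users, Qwen2-VL now understands text in images in languages beyond English and Chinese, including most European languages, Japanese, Korean, Arabic, and Vietnamese.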

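Model Architecture Updates

- **Naive Dynamic Resolution**: Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them to a dynamic number of visual tokens and offering a more human-like visual processing experience.
- **Multimodal Rotary Position Embedding (M-RoPE)**: Decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.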
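The decomposition behind M-RoPE can be pictured as giving every token a triple of position ids (temporal, height, width). The sketch below is an illustration based on the description in the Qwen2-VL paper, not the library implementation: text tokens repeat one index in all three components, image tokens share a temporal index while height and width follow the patch grid, and video tokens advance the temporal index per frame. All function names are hypothetical.

```python
def text_positions(start: int, length: int) -> list:
    """Text tokens: the three components are identical, like ordinary 1D positions."""
    return [(p, p, p) for p in range(start, start + length)]


def image_positions(start: int, grid_h: int, grid_w: int) -> list:
    """Image tokens: constant temporal id; height and width ids follow the patch grid."""
    return [(start, start + h, start + w) for h in range(grid_h) for w in range(grid_w)]


def video_positions(start: int, frames: int, grid_h: int, grid_w: int) -> list:
    """Video tokens: the temporal id increases with each frame."""
    return [
        (start + t, start + h, start + w)
        for t in range(frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]


def next_start(positions: list) -> int:
    """The next segment starts one past the largest id used so far."""
    return max(max(triple) for triple in positions) + 1


prefix = text_positions(0, 2)                    # [(0, 0, 0), (1, 1, 1)]
image = image_positions(next_start(prefix), 2, 2)
suffix = text_positions(next_start(image), 1)
print(prefix + image + suffix)
```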

📚 Detailed Documentation

Model Evaluation

Image Benchmarks

Benchmark	Previous SoTA (open-source LVLM)	Claude-3.5 Sonnet	GPT-4o	Qwen2-VL-72B
MMMU_val	58.3	68.3	69.1	64.5
DocVQA_test	94.1	95.2	92.8	96.5
InfoVQA_test	82.0	-	-	84.5
ChartQA_test	88.4	90.8	85.7	88.3
TextVQA_val	84.4	-	-	85.5
OCRBench	852	788	736	877
MTVQA	17.3	25.7	27.8	30.9
VCR_en_easy	84.67	63.85	91.55	91.93
VCR_zh_easy	22.09	1.0	14.87	65.37
RealWorldQA	72.2	60.1	75.4	77.8
MME_sum	2414.7	1920.0	2328.7	2482.7
MMBench-EN_test	86.5	79.7	83.4	86.5
MMBench-CN_test	86.3	80.7	82.1	86.6
MMBench-V1.1_test	85.5	78.5	82.2	85.9
MMT-Bench_test	63.4	-	65.5	71.7
MMStar	67.1	62.2	63.9	68.3
MMVet_GPT-4-Turbo	65.7	66.0	69.1	74.0
HallBench_avg	55.2	49.9	55.0	58.1
MathVista_testmini	67.5	67.7	63.8	70.5
MathVision	16.97	-	30.4	25.9

Video Benchmarks

Benchmark	Previous SoTA (open-source LVLM)	Gemini 1.5-Pro	GPT-4o	Qwen2-VL-72B
MVBench	69.6	-	-	73.6
PerceptionTest_test	66.9	-	-	68.0
EgoSchema_test	62.0	63.2	72.2	77.9
Video-MME (without/with subtitles)	66.3/69.6	75.0/81.3	71.9/77.2	71.2/77.8

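Agent Benchmarks

Category	Benchmark	Metric	Previous SoTA	GPT-4o	Qwen2-VL-72B
General	FnCall [1]	TM	-	90.2	93.1
General	FnCall [1]	EM	-	50.0	53.2
Game	Number Line	SR	89.4 [2]	91.5	100.0
Game	BlackJack	SR	40.2 [2]	34.5	42.6
Game	EZPoint	SR	50.0 [2]	85.5	100.0
Game	Point24	SR	2.6 [2]	3.0	4.5
Android	AITZ	TM	83.0 [3]	70.0	89.6
Android	AITZ	EM	47.7 [3]	35.3	72.1
AI2THOR	ALFRED_valid-unseen	SR	67.7 [4]	-	67.8
AI2THOR	ALFRED_valid-unseen	GC	75.3 [4]	-	75.8
Vision-language navigation	R2R_valid-unseen	SR	79.0	43.7 [5]	51.7
Vision-language navigation	REVERIE_valid-unseen	SR	61.0	31.6 [5]	31.0

SR, GC, TM, and EM stand for success rate, goal-condition success, type match, and exact match, respectively. ALFRED is supported by SAM [6].

[1] Self-curated function call benchmark by the Qwen team
[2] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
[3] Android in the Zoo: Chain-of-Action-Thought for GUI Agents
[4] ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
[5] MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
[6] Segment Anything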
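The type match (TM) and exact match (EM) metrics in the agent table can be illustrated on a handful of predicted actions. The sketch below assumes that each action is a dictionary with a `type` field plus arguments, and that EM means strict equality of the whole action. Real benchmarks such as AITZ may apply looser matching rules to arguments, so this is an illustration under stated assumptions rather than the official scoring code, and the action data is made up.

```python
def type_match(predictions, references) -> float:
    """Fraction of steps where the predicted action type equals the reference type."""
    hits = sum(p["type"] == r["type"] for p, r in zip(predictions, references))
    return hits / len(references)


def exact_match(predictions, references) -> float:
    """Fraction of steps where the predicted action equals the reference in full."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


references = [
    {"type": "click", "target": "Settings"},
    {"type": "type", "text": "hello"},
    {"type": "scroll", "direction": "down"},
    {"type": "click", "target": "OK"},
]
predictions = [
    {"type": "click", "target": "Settings"},   # type and arguments correct
    {"type": "type", "text": "helo"},          # type correct, argument wrong
    {"type": "scroll", "direction": "down"},   # type and arguments correct
    {"type": "press", "key": "back"},          # wrong type
]

print(type_match(predictions, references))   # 0.75
print(exact_match(predictions, references))  # 0.5
```

EM can never exceed TM under this definition, which matches the pattern in the table where every EM score is lower than the corresponding TM score.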

Multilingual Benchmarks

Model	AR	DE	FR	IT	JA	KO	RU	TH	VI	AVG
Qwen2-VL-72B	20.7	36.5	44.1	42.8	21.6	37.4	15.6	17.7	41.6	30.9
GPT-4o	20.2	34.2	41.2	32.7	20.0	33.9	11.5	22.5	34.2	27.8
Claude3 Opus	15.1	33.4	40.6	34.4	19.4	27.2	13.0	19.5	29.1	25.7
Gemini Ultra	14.7	32.3	40.0	31.8	12.3	17.2	11.8	20.3	28.6	23.2
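The last column of the multilingual table is the plain mean of the nine per-language scores, rounded to one decimal place. The short check below recomputes it from the table values; the variable and function names are hypothetical.

```python
from statistics import mean

# Per-language scores in table order: AR, DE, FR, IT, JA, KO, RU, TH, VI
SCORES = {
    "Qwen2-VL-72B": [20.7, 36.5, 44.1, 42.8, 21.6, 37.4, 15.6, 17.7, 41.6],
    "GPT-4o": [20.2, 34.2, 41.2, 32.7, 20.0, 33.9, 11.5, 22.5, 34.2],
    "Claude3 Opus": [15.1, 33.4, 40.6, 34.4, 19.4, 27.2, 13.0, 19.5, 29.1],
    "Gemini Ultra": [14.7, 32.3, 40.0, 31.8, 12.3, 17.2, 11.8, 20.3, 28.6],
}
REPORTED_AVG = {
    "Qwen2-VL-72B": 30.9,
    "GPT-4o": 27.8,
    "Claude3 Opus": 25.7,
    "Gemini Ultra": 23.2,
}


def recompute_averages(scores: dict) -> dict:
    """Return the mean score per model, rounded to one decimal place."""
    return {model: round(mean(values), 1) for model, values in scores.items()}


averages = recompute_averages(SCORES)
for model, value in averages.items():
    print(model, value, "matches" if value == REPORTED_AVG[model] else "differs")
```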

🔧 Technical Details

This preview model was trained with LoRA for 1 epoch. A fully trained checkpoint is also available at https://huggingface.co/osunlp/UGround-V1-72B (it performs slightly better on ScreenSpot-Pro and ScreenSpot).
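LoRA trains a pair of small low-rank matrices per adapted weight matrix instead of updating the full matrix, which is why it is practical for a model of this size. The sketch below compares the trainable parameter counts for one weight matrix. The source does not state the rank or which layers were adapted, so the dimensions and rank here are purely illustrative assumptions, and the function names are hypothetical.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank pair B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative values only: an 8192 x 8192 projection with rank 16.
d_out, d_in, rank = 8192, 8192, 16
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)
print(full, lora, lora / full)  # 67108864 262144 0.00390625
```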

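Qwen2-VL comes in three sizes, with 2, 8, and 72 billion parameters. This repository contains the instruction-tuned 72B Qwen2-VL model. For more information, see the blog and GitHub.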

📄 License

This model is released under the Tongyi Qianwen license.

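⚠️ Model Limitations

Although Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Known limitations include:

- **No audio support**: The current model cannot understand audio information in videos.
- **Data timeliness**: The image dataset is updated through June 2023, so information after that date may not be covered.
- **Limited recognition of individuals and intellectual property**: The model's ability to recognize specific individuals or intellectual property is limited, and it may not cover all well-known people or brands.
- **Limited handling of complex instructions**: When faced with complex multi-step instructions, the model's understanding and execution still need improvement.
- **Insufficient counting accuracy**: Object counting is not very accurate, particularly in complex scenes.
- **Weak spatial reasoning**: The model's inference of positional relationships between objects is inadequate, particularly in 3D space, making it difficult to judge relative positions precisely.

These limitations are ongoing directions for optimization, and the team is committed to continually improving the model's performance and range of applications.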

📖 Citation

If you find our work helpful, please cite the following:

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}