Qwen2-VL: an open-source multilingual image-text recognition model with any-resolution image understanding and long-video analysis

Uground V1 72B Preview

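Developed by osunlp

Qwen2-VL is the latest version of the Qwen-VL model series. It offers any-resolution image understanding, long-video analysis, and multilingual recognition of text in images.

Image-to-Text

Transformers

English | Open-source license: Other | #Any-resolution visual understanding #Long-video analysis #Multilingual text-in-image recognition

Downloads: 21

Release date: 1/7/2025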

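Model overview

A 72-billion-parameter multimodal vision-language model that supports image understanding, video analysis, multilingual text recognition, and agent operation.

Model features

Any-resolution image understanding

Dynamic visual-token mapping enables human-like visual processing and achieves state-of-the-art performance on benchmarks such as MathVista and DocVQA.

Long-video understanding

Can analyze videos longer than 20 minutes, supporting high-quality video Q&A, dialogue, and content creation.

Agent operation

Integrates complex reasoning and decision-making to work with devices such as smartphones and robots, enabling automated operation driven by the visual environment.

Multilingual text-in-image understanding

Supports recognition of multilingual text in images, covering major European languages, Japanese, Korean, Arabic, Vietnamese, and others.

Model capabilities

Image understanding

Video analysis

Multilingual text recognition

Agent operation

Complex reasoning

Decision support

Use cases

Document processing

Document Q&A

Analyzes document images and answers related questions.

Achieves 96.5% accuracy on the DocVQA test set.

Education

Math problem solving

Analyzes math charts and solves the problems they pose.

Achieves 70.5% accuracy on the MathVista test set.

Smart devices

Android device operation

Controls Android devices through visual understanding.

Achieves 89.6% type-match accuracy on the AITZ benchmark.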

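🚀 Qwen2-VL-72B-Instruct

We introduce Qwen2-VL, the latest version of the Qwen-VL model and the result of nearly a year of innovation.

🚀 Quick start

This preview model was trained with LoRA for one epoch. A separate, fully trained checkpoint is available here (it performs slightly better on ScreenSpot-Pro and ScreenSpot).

The Qwen2-VL code is included in the latest Hugging Face transformers. Without a recent build you may encounter the error below; installing from source with the command that follows it avoids this.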

KeyError: 'qwen2_vl'

pip install git+https://github.com/huggingface/transformers
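The KeyError above typically indicates that the installed transformers predates Qwen2-VL support. As a rough pre-flight check, the sketch below compares a version string against a minimum release. The threshold 4.45.0 and both helper names are assumptions made for illustration and are not stated in this model card; the transformers release notes are the place to confirm the actual cutoff.

```python
import re

# Assumption: first transformers release that ships the qwen2_vl model type.
ASSUMED_MIN_VERSION = (4, 45, 0)


def parse_version(text):
    """Return the first three numeric components of a version string as a tuple."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    # Pad short versions such as "5" to three components.
    return tuple(numbers + [0] * (3 - len(numbers)))


def likely_supports_qwen2_vl(version_text, minimum=ASSUMED_MIN_VERSION):
    """Hypothetical helper: True when the version is at or above the assumed minimum."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    for candidate in ("4.44.2", "4.45.0", "4.46.0.dev0"):
        print(candidate, likely_supports_qwen2_vl(candidate))
```

In practice the string passed in would be `transformers.__version__`.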

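A toolkit is provided to make handling various kinds of visual input more convenient, including base64, URLs, and interleaved images and videos. Install it with the following command.

pip install qwen-vl-utils

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils.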

import torch  # needed only if you enable the torch.bfloat16 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
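The slicing step in the snippet above relies on the generated sequences starting with the prompt tokens, so that only the newly generated part is decoded. A minimal sketch of that trimming on plain Python lists follows; the token IDs are made up for illustration and `trim_prompt_tokens` is a hypothetical helper name.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence.

    Assumes each generated sequence begins with its own prompt,
    as the slicing in the snippet above implies.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: two prompts of length 3, each followed by two new tokens.
prompt_batch = [[101, 7, 8], [101, 9, 10]]
generated_batch = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

trimmed = trim_prompt_tokens(prompt_batch, generated_batch)
print(trimmed)
```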

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Alternatively: messages containing a video file and a text query (this assignment replaces the one above)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
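The `fps` and `max_pixels` keys in the video messages above bound how much visual input a video contributes. The sketch below gives a back-of-the-envelope estimate, assuming frames are sampled uniformly at the requested rate; it ignores any clamping or rounding the toolkit may apply, and both function names are hypothetical.

```python
def estimated_frame_count(duration_seconds, fps=1.0):
    """Rough number of sampled frames, assuming uniform sampling at `fps`."""
    return max(1, int(duration_seconds * fps))


def estimated_pixel_budget(duration_seconds, fps, max_pixels_per_frame):
    """Upper bound on total pixels if every sampled frame uses its full budget."""
    return estimated_frame_count(duration_seconds, fps) * max_pixels_per_frame


# A 20-minute video sampled at 1 frame per second with max_pixels = 360 * 420.
frames = estimated_frame_count(20 * 60, fps=1.0)
budget = estimated_pixel_budget(20 * 60, 1.0, 360 * 420)
print(frames, budget)
```

Lowering `fps` or `max_pixels` shrinks this bound proportionally, which is the main lever for long videos.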

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

Other usage

Input images may be local files, base64 data, or URLs. For videos, only local files are currently supported.

# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64 encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
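For the base64 case, the image bytes have to be encoded and prefixed before they go into the message. The sketch below builds such a message with the standard library only, reusing the `data:image;base64,` prefix from the example above; `to_data_uri` and `image_message` are hypothetical helper names.

```python
import base64


def to_data_uri(image_bytes):
    """Encode raw image bytes using the data-URI prefix shown in the example above."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:image;base64," + encoded


def image_message(image_bytes, prompt):
    """Build a single-turn message list with one base64 image and one text query."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": to_data_uri(image_bytes)},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# The first three bytes of a JPEG file encode to "/9j/", matching the example above.
jpeg_magic = b"\xff\xd8\xff"
messages = image_message(jpeg_magic, "Describe this image.")
print(messages[0]["content"][0]["image"])
```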

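Image resolution for better performance

The model supports a wide range of input resolutions. By default it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. You can set the minimum and maximum pixel counts to balance speed against memory usage.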

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
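The `256 * 28 * 28` pattern above suggests that each visual token corresponds to a 28 x 28 pixel area, so a token budget converts directly into a pixel budget. The sketch below makes that conversion explicit; the 28-pixel figure is read off the snippet above, and the helper names are hypothetical.

```python
# Assumption read off the min_pixels / max_pixels pattern above:
# one visual token covers a 28 x 28 pixel area.
TOKEN_SIDE = 28
PIXELS_PER_TOKEN = TOKEN_SIDE * TOKEN_SIDE


def pixels_for_tokens(token_count):
    """Pixel budget that corresponds to a given number of visual tokens."""
    return token_count * PIXELS_PER_TOKEN


def tokens_for_size(height, width):
    """Approximate visual tokens for an image whose sides are multiples of 28."""
    return (height // TOKEN_SIDE) * (width // TOKEN_SIDE)


print(pixels_for_tokens(256), pixels_for_tokens(1280))
print(tokens_for_size(280, 420))
```

Under this reading, a 224 x 224 image (50176 pixels) corresponds to 64 visual tokens.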

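There are also two ways to control the size of the images passed to the model more precisely.

Define min_pixels and max_pixels: images are resized to fall within the min_pixels to max_pixels range while preserving their aspect ratio.
Specify exact dimensions: set resized_height and resized_width directly. These values are rounded to the nearest multiple of 28.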
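The two resizing behaviours described above can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the toolkit's implementation: the function names are hypothetical, and the choice to round down when shrinking and up when enlarging is an assumption made so the result stays inside the requested range.

```python
import math

FACTOR = 28  # sides are kept at multiples of 28, as described above


def round_to_factor(value, factor=FACTOR):
    """Round a side length to the nearest multiple of `factor` (at least one factor)."""
    return max(factor, round(value / factor) * factor)


def sketch_resize(height, width, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28):
    """Return (height, width) within the pixel range, roughly preserving aspect ratio."""
    h, w = round_to_factor(height), round_to_factor(width)
    if h * w > max_pixels:
        # Shrink both sides by the same scale, rounding down to stay under the cap.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(FACTOR, math.floor(height / scale / FACTOR) * FACTOR)
        w = max(FACTOR, math.floor(width / scale / FACTOR) * FACTOR)
    elif h * w < min_pixels:
        # Enlarge both sides by the same scale, rounding up to reach the floor.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / FACTOR) * FACTOR
        w = math.ceil(width * scale / FACTOR) * FACTOR
    return h, w


print(round_to_factor(300))        # nearest multiple of 28
print(sketch_resize(3000, 4000))   # large image shrunk under max_pixels
print(sketch_resize(100, 150))     # small image enlarged above min_pixels
```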

# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

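✨ Key features

What's new in Qwen2-VL

Key enhancements

State-of-the-art understanding of images at various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding of videos longer than 20 minutes: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialogue, and content creation.
Agent that can operate devices such as mobile phones and robots: with complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots to operate them automatically based on the visual environment and text instructions.
Multilingual support: to serve global users, Qwen2-VL understands text in images in many languages besides English and Chinese, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

Model architecture updates

Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens for a more human-like visual processing experience.

Multimodal Rotary Position Embedding (M-ROPE): decomposes the position embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, strengthening multimodal processing.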
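To make the M-ROPE decomposition more concrete, the sketch below assigns a three-part position ID to each token: text tokens share one value across all parts, while image tokens keep one part fixed and vary the other two by row and column. The (temporal, height, width) layout, the offset convention for the text that follows an image, and all function names are assumptions for illustration, not the model's actual code.

```python
def text_positions(start, length):
    """Text tokens: all three components carry the same 1D position."""
    return [(p, p, p) for p in range(start, start + length)]


def image_positions(start, grid_h, grid_w):
    """Image tokens: fixed temporal component, row and column offsets for the rest."""
    return [
        (start, start + row, start + col)
        for row in range(grid_h)
        for col in range(grid_w)
    ]


def build_positions(segments):
    """Assign position IDs to a sequence of non-empty text and image segments.

    Each segment is ("text", length) or ("image", grid_h, grid_w).
    Assumption: the next segment starts one past the largest ID used so far.
    """
    positions, start = [], 0
    for segment in segments:
        if segment[0] == "text":
            new = text_positions(start, segment[1])
        else:
            new = image_positions(start, segment[1], segment[2])
        positions.extend(new)
        start = max(max(p) for p in new) + 1
    return positions


# Two text tokens, a 2 x 2 image grid, then one more text token.
example = build_positions([("text", 2), ("image", 2, 2), ("text", 1)])
for position in example:
    print(position)
```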

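There are three models, with 2 billion, 8 billion, and 72 billion parameters. This repository contains the instruction-tuned 72-billion-parameter Qwen2-VL model. For more information, see the blog and GitHub.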

📚 Documentation

Image benchmarks

Benchmark	Previous SoTA (open-source LVLM)	Claude-3.5 Sonnet	GPT-4o	Qwen2-VL-72B
MMMU_val	58.3	68.3	69.1	64.5
DocVQA_test	94.1	95.2	92.8	96.5
InfoVQA_test	82.0	-	-	84.5
ChartQA_test	88.4	90.8	85.7	88.3
TextVQA_val	84.4	-	-	85.5
OCRBench	852	788	736	877
MTVQA	17.3	25.7	27.8	30.9
VCR_{en easy}	84.67	63.85	91.55	91.93
VCR_{zh easy}	22.09	1.0	14.87	65.37
RealWorldQA	72.2	60.1	75.4	77.8
MME_sum	2414.7	1920.0	2328.7	2482.7
MMBench-EN_test	86.5	79.7	83.4	86.5
MMBench-CN_test	86.3	80.7	82.1	86.6
MMBench-V1.1_test	85.5	78.5	82.2	85.9
MMT-Bench_test	63.4	-	65.5	71.7
MMStar	67.1	62.2	63.9	68.3
MMVet_GPT-4-Turbo	65.7	66.0	69.1	74.0
HallBench_avg	55.2	49.9	55.0	58.1
MathVista_testmini	67.5	67.7	63.8	70.5
MathVision	16.97	-	30.4	25.9

Video benchmarks

Benchmark	Previous SoTA (open-source LVLM)	Gemini 1.5-Pro	GPT-4o	Qwen2-VL-72B
MVBench	69.6	-	-	73.6
PerceptionTest_test	66.9	-	-	68.0
EgoSchema_test	62.0	63.2	72.2	77.9
Video-MME _{(wo/w subs)}	66.3/69.6	75.0/81.3	71.9/77.2	71.2/77.8

Agent benchmarks

	Benchmark	Metric	Previous SoTA	GPT-4o	Qwen2-VL-72B
General	FnCall^[1]	TM	-	90.2	93.1
		EM	-	50.0	53.2
Game	Number Line	SR	89.4^[2]	91.5	100.0
	BlackJack	SR	40.2^[2]	34.5	42.6
	EZPoint	SR	50.0^[2]	85.5	100.0
	Point24	SR	2.6^[2]	3.0	4.5
Android	AITZ	TM	83.0^[3]	70.0	89.6
		EM	47.7^[3]	35.3	72.1
AI2THOR	ALFRED_valid-unseen	SR	67.7^[4]	-	67.8
		GC	75.3^[4]	-	75.8
VLN	R2R_valid-unseen	SR	79.0	43.7^[5]	51.7
	REVERIE_valid-unseen	SR	61.0	31.6^[5]	31.0

SR, GC, TM, and EM stand for success rate, goal-condition success, type match, and exact match, respectively. ALFRED is supported by SAM^[6].

[1] Self-curated function-calling benchmark by the Qwen team
[2] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
[3] Android in the Zoo: Chain-of-Action-Thought for GUI Agents
[4] ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
[5] MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
[6] Segment Anything
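The TM and EM columns in the agent table can be read as two levels of strictness when comparing a predicted action with a reference action. The sketch below assumes type match checks only the action type, and exact match additionally requires identical arguments; the action format, the sample data, and the function names are hypothetical, and real benchmarks may apply tolerances that this ignores.

```python
def type_match_rate(predictions, references):
    """Fraction of steps where the predicted action type equals the reference type."""
    hits = sum(p["type"] == r["type"] for p, r in zip(predictions, references))
    return hits / len(references)


def exact_match_rate(predictions, references):
    """Fraction of steps where both the action type and its arguments are identical."""
    hits = sum(
        p["type"] == r["type"] and p["args"] == r["args"]
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Made-up actions for illustration only.
references = [
    {"type": "click", "args": {"x": 10, "y": 20}},
    {"type": "type", "args": {"text": "hello"}},
    {"type": "scroll", "args": {"direction": "down"}},
    {"type": "click", "args": {"x": 5, "y": 5}},
]
predictions = [
    {"type": "click", "args": {"x": 10, "y": 20}},   # type and args match
    {"type": "type", "args": {"text": "hello"}},     # type and args match
    {"type": "scroll", "args": {"direction": "up"}}, # type matches, args differ
    {"type": "type", "args": {"text": "oops"}},      # type differs
]

tm = type_match_rate(predictions, references)
em = exact_match_rate(predictions, references)
print(tm, em)
```

By construction EM can never exceed TM, which is consistent with the TM and EM pairs reported in the table.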

Multilingual benchmarks

Model	AR	DE	FR	IT	JA	KO	RU	TH	VI	AVG
Qwen2-VL-72B	20.7	36.5	44.1	42.8	21.6	37.4	15.6	17.7	41.6	30.9
GPT-4o	20.2	34.2	41.2	32.7	20.0	33.9	11.5	22.5	34.2	27.8
Claude3 Opus	15.1	33.4	40.6	34.4	19.4	27.2	13.0	19.5	29.1	25.7
Gemini Ultra	14.7	32.3	40.0	31.8	12.3	17.2	11.8	20.3	28.6	23.2
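The AVG column appears to be the plain mean of the nine per-language scores. The short check below recomputes it from the rows of the table above; the per-language numbers are copied from the table, while the interpretation of AVG as an unweighted mean is an assumption that the arithmetic happens to confirm.

```python
# Per-language scores copied from the table above (AR, DE, FR, IT, JA, KO, RU, TH, VI).
scores = {
    "Qwen2-VL-72B": [20.7, 36.5, 44.1, 42.8, 21.6, 37.4, 15.6, 17.7, 41.6],
    "GPT-4o": [20.2, 34.2, 41.2, 32.7, 20.0, 33.9, 11.5, 22.5, 34.2],
    "Claude3 Opus": [15.1, 33.4, 40.6, 34.4, 19.4, 27.2, 13.0, 19.5, 29.1],
    "Gemini Ultra": [14.7, 32.3, 40.0, 31.8, 12.3, 17.2, 11.8, 20.3, 28.6],
}

# Unweighted mean per model, rounded to one decimal like the table.
averages = {
    model: round(sum(values) / len(values), 1) for model, values in scores.items()
}

for model, average in averages.items():
    print(f"{model}: {average}")
```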

📄 License

This project is released under the tongyi-qianwen license.

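Limitations

Qwen2-VL is applicable to a wide range of visual tasks, but it is equally important to understand its limitations. The known limitations are listed below.

No audio support: the current model does not understand audio information in videos.
Data timeliness: the image dataset is updated through June 2023, so information after that date may not be covered.
Limited recognition of individuals and intellectual property (IP): the model's ability to recognize specific individuals or IP is limited and may not cover all well-known people or brands.
Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution need improvement.
Insufficient counting accuracy in complex scenes: object counting accuracy is low, particularly in complex scenes, and needs further improvement.
Weak spatial reasoning: the model's inference of object positional relationships is inadequate, especially in 3D space, making it difficult to judge the relative positions of objects precisely.

These limitations guide ongoing optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.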

Citation

If you find our work helpful, please cite it as follows.

@art
