Ovis2-1B-dev: an open-source multimodal large language model that handles video and multi-image input with strong performance and enhanced reasoning.

Ovis2 1B Dev

Developed by Isotr0py

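Ovis2-1B is the latest member of the Ovis series of multimodal large language models (MLLMs). It focuses on structurally aligning visual and textual embeddings and offers strong performance at a small model size, enhanced reasoning, video and multi-image processing, and improved multilingual OCR.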

Image-text-to-text

Transformers

Multilingual; open-source license: Apache-2.0 #MultimodalLargeLanguageModel #VisionTextAlignment #MultilingualOCR

Downloads: 79

Release date: 4/9/2025

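Model Overview

Ovis2-1B is a multimodal large language model released by AIDC-AI that aims to structurally align visual and textual embeddings. As the successor to Ovis1.6, Ovis2 brings significant improvements in both data construction and training methods, and it is particularly well suited to complex visual information and multilingual OCR tasks.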

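Model Features

Small model, strong performance

An optimized training strategy gives the small-scale models higher capability density, demonstrating a leading advantage across size tiers.

Enhanced reasoning

Combining instruction tuning with preference learning significantly strengthens chain-of-thought (CoT) reasoning.

Video and multi-image processing

Video and multi-image data are included in training, which improves the handling of complex visual information across frames and images.

Enhanced multilingual OCR

Building on its English and Chinese bilingual base, the model improves multilingual OCR and extracts structured data from complex visual elements such as tables and charts more effectively.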
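
For multi-image and video input, the Quick Start code on this page places one `<image>` placeholder in the query per image or frame. The helper below is a minimal sketch of that convention; `build_query` and its `numbered` flag are hypothetical names introduced here for illustration and are not part of the Ovis API.

```python
IMAGE_PLACEHOLDER = "<image>"  # placeholder string used in the Quick Start queries


def build_query(text: str, num_images: int, numbered: bool = True) -> str:
    """Build a query with one placeholder per image, followed by the prompt text.

    Hypothetical helper: it only mirrors the string patterns shown in the
    Quick Start code and does not call the model.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    if num_images == 0:
        return text  # text-only input: no placeholders
    if num_images == 1 or not numbered:
        # single image, or video frames: bare placeholders, one per line
        lines = [IMAGE_PLACEHOLDER] * num_images
    else:
        # multiple images: label each placeholder "Image 1:", "Image 2:", ...
        lines = [f"Image {i + 1}: {IMAGE_PLACEHOLDER}" for i in range(num_images)]
    return "\n".join(lines) + "\n" + text
```

The number of placeholders should match the length of the `images` list passed to `model.preprocess_inputs`; for video frames, pass `numbered=False` to get the unlabeled form used in the video example.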

Model Capabilities

Image understanding

Text generation

Video understanding

Multi-image analysis

Multilingual OCR

Complex reasoning

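Use Cases

Visual question answering

Image description

Describes an input image in detail

Scores 68.4 on MMBench-V1.1 test

Visual reasoning

Reasons logically about image content

Scores 59.4 on MathVista testmini

Document understanding

Table data extraction

Extracts structured data from complex tables

Scores 89.0 on OCRBench

Video understanding

Video content analysis

Understands actions and scenes in a video

Scores 49.5 on VideoMME (with subtitles)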

🚀 Ovis2-1B

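Ovis2 is the latest advance in multimodal large language models (MLLMs). It inherits the innovative architectural design of the Ovis series, which aims to structurally align visual and textual embeddings. As the successor to Ovis1.6, it brings significant improvements in both dataset curation and training methods.

🚀 Quick Start

This section covers the basic usage of Ovis2-1B. Use the code below as a reference for running the model.

Installation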

pip install torch==2.4.0 transformers==4.46.2 numpy==1.25.0 pillow==10.3.0
pip install flash-attn==2.7.0.post2 --no-build-isolation
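
Because the install commands pin exact versions, it can help to confirm what is actually installed before loading the model. The following is a minimal sketch using the standard-library `importlib.metadata`; the `PINS` dictionary and `check_pins` function are hypothetical names introduced here, and the version strings are copied from the commands above.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINS = {
    "torch": "2.4.0",
    "transformers": "4.46.2",
    "numpy": "1.25.0",
    "pillow": "10.3.0",
    "flash-attn": "2.7.0.post2",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for packages that deviate from their pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Some torch wheels carry a local suffix such as "+cu121";
        # compare only the public part of the version string.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is present at the expected version. A mismatch is not necessarily fatal, since the pins only record the versions named in the commands above.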

Running the model

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# single-image input
image_path = '/data/images/example_1.jpg'
images = [Image.open(image_path)]
max_partition = 9
text = 'Describe the image.'
query = f'<image>\n{text}'

## cot-style input
# cot_suffix = "Provide a step-by-step solution to the problem, and conclude with 'the answer is' followed by the final solution."
# image_path = '/data/images/example_1.jpg'
# images = [Image.open(image_path)]
# max_partition = 9
# text = "What's the area of the shape?"
# query = f'<image>\n{text}\n{cot_suffix}'

## multiple-images input
# image_paths = [
#     '/data/images/example_1.jpg',
#     '/data/images/example_2.jpg',
#     '/data/images/example_3.jpg'
# ]
# images = [Image.open(image_path) for image_path in image_paths]
# max_partition = 4
# text = 'Describe each image.'
# query = '\n'.join([f'Image {i+1}: <image>' for i in range(len(images))]) + '\n' + text

## video input (require `pip install moviepy==1.0.3`)
# from moviepy.editor import VideoFileClip
# video_path = '/data/videos/example_1.mp4'
# num_frames = 12
# max_partition = 1
# text = 'Describe the video.'
# with VideoFileClip(video_path) as clip:
#     total_frames = int(clip.fps * clip.duration)
#     if total_frames <= num_frames:
#         sampled_indices = range(total_frames)
#     else:
#         stride = total_frames / num_frames
#         sampled_indices = [min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2)) for i in range(num_frames)]
#     frames = [clip.get_frame(index / clip.fps) for index in sampled_indices]
#     frames = [Image.fromarray(frame, mode='RGB') for frame in frames]
# images = frames
# query = '\n'.join(['<image>'] * len(images)) + '\n' + text

## text-only input
# images = []
# max_partition = None
# text = 'Hello'
# query = text

# format conversation
prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]

# generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
    print(f'Output:\n{output}')
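
The commented-out video branch above divides the clip into `num_frames` equal strides and takes the frame at the midpoint of each one. The sketch below isolates that index computation as a standalone function so it can be checked without moviepy; `sample_frame_indices` is a hypothetical name, and the guard for non-positive inputs is an addition that the original snippet does not have.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list:
    """Pick up to num_frames indices, each at the midpoint of an equal-width stride."""
    if total_frames <= 0 or num_frames <= 0:
        return []  # added guard, not in the original snippet
    if total_frames <= num_frames:
        # short clip: use every frame
        return list(range(total_frames))
    stride = total_frames / num_frames
    return [
        # midpoint of stride i, clamped to the last valid frame index
        min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2))
        for i in range(num_frames)
    ]
```

For example, `sample_frame_indices(24, 12)` has a stride of exactly 2 and returns the odd indices 1, 3, ..., 23. Sampling at midpoints rather than stride starts keeps the selected frames away from the first frame of the clip.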

Batch inference

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# preprocess inputs
batch_inputs = [
    ('/data/images/example_1.jpg', 'What colors dominate the image?'),
    ('/data/images/example_2.jpg', 'What objects are depicted in this image?'),
    ('/data/images/example_3.jpg', 'Is there any text in the image?')
]

batch_input_ids = []
batch_attention_mask = []
batch_pixel_values = []

for image_path, text in batch_inputs:
    image = Image.open(image_path)
    query = f'<image>\n{text}'
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    batch_input_ids.append(input_ids.to(device=model.device))
    batch_attention_mask.append(attention_mask.to(device=model.device))
    batch_pixel_values.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

batch_input_ids = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_input_ids], batch_first=True,
                                                  padding_value=0.0).flip(dims=[1])
batch_input_ids = batch_input_ids[:, -model.config.multimodal_max_length:]
batch_attention_mask = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_attention_mask],
                                                       batch_first=True, padding_value=False).flip(dims=[1])
batch_attention_mask = batch_attention_mask[:, -model.config.multimodal_max_length:]

# generate outputs
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(batch_input_ids, pixel_values=batch_pixel_values, attention_mask=batch_attention_mask,
                                **gen_kwargs)

for i in range(len(batch_inputs)):
    output = text_tokenizer.decode(output_ids[i], skip_special_tokens=True)
    print(f'Output {i + 1}:\n{output}\n')
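
The batch code pads on the left: `pad_sequence` only pads on the right, so each sequence is flipped, padded, and flipped back, then the slice `[:, -multimodal_max_length:]` keeps the last positions. Left padding is the usual choice for decoder-only generation because new tokens continue from the final position of each row. The sketch below shows the same result on plain Python lists; `left_pad_batch` is a hypothetical helper for illustration, not the torch code, and it assumes the input sequences contain no padding tokens of their own.

```python
def left_pad_batch(sequences, pad_id=0, max_length=None):
    """Left-pad token-id lists to a common length and build matching attention masks.

    Assumes `sequences` is non-empty. Padded positions get `pad_id` and a
    False mask entry; real tokens get True.
    """
    if max_length is not None and max_length < 1:
        raise ValueError("max_length must be at least 1")
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = width - len(seq)
        input_ids.append([pad_id] * pad + list(seq))
        attention_mask.append([False] * pad + [True] * len(seq))
    if max_length is not None:
        # keep the last max_length positions, like the slice [:, -max_length:]
        input_ids = [row[-max_length:] for row in input_ids]
        attention_mask = [row[-max_length:] for row in attention_mask]
    return input_ids, attention_mask
```

With `[[1, 2, 3], [4]]` as input, the ids become `[[1, 2, 3], [0, 0, 4]]` and the second row's mask is `[False, False, True]`, so the padded positions are masked out during generation.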

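✨ Key Features

Small model, strong performance: An optimized training strategy lets small-scale models achieve higher capability density, demonstrating a leading advantage across size tiers.
Enhanced reasoning: Combining instruction tuning with preference learning significantly strengthens Chain-of-Thought (CoT) reasoning.
Video and multi-image processing: Video and multi-image data are included in training, which improves the handling of complex visual information across frames and images.
Multilingual support and OCR: Multilingual OCR is enhanced beyond English and Chinese, and structured data extraction from complex visual elements such as tables and charts is improved.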
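
To exercise the CoT capability listed above, the Quick Start code appends a suffix asking the model to conclude with "the answer is". The sketch below builds such a query and pulls the final answer out of a response string. `build_cot_query` and `extract_final_answer` are hypothetical helpers, and the parsing assumes the model actually follows the requested phrasing, which is not guaranteed.

```python
import re

# Suffix copied from the commented "cot-style input" in the Quick Start code.
COT_SUFFIX = (
    "Provide a step-by-step solution to the problem, and conclude with "
    "'the answer is' followed by the final solution."
)


def build_cot_query(question: str) -> str:
    """Single-image CoT query: placeholder, question, then the CoT suffix."""
    return f"<image>\n{question}\n{COT_SUFFIX}"


def extract_final_answer(output: str):
    """Return the text after the last 'the answer is', or None if it is absent."""
    matches = list(re.finditer(r"the answer is", output, flags=re.IGNORECASE))
    if not matches:
        return None
    # take the rest of the line after the last occurrence of the marker
    tail = output[matches[-1].end():].strip()
    if not tail:
        return None
    first_line = tail.splitlines()[0]
    answer = first_line.lstrip(":： ").rstrip(". ").strip()
    return answer or None
```

A `None` result signals that the response did not contain the marker, in which case the full output should be inspected instead.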

📦 Model Zoo

Ovis MLLMs	ViT	LLM	Model Weights	Demo
Ovis2-1B	aimv2-large-patch14-448	Qwen2.5-0.5B-Instruct	Huggingface	Space
Ovis2-2B	aimv2-large-patch14-448	Qwen2.5-1.5B-Instruct	Huggingface	Space
Ovis2-4B	aimv2-huge-patch14-448	Qwen2.5-3B-Instruct	Huggingface	Space
Ovis2-8B	aimv2-huge-patch14-448	Qwen2.5-7B-Instruct	Huggingface	Space
Ovis2-16B	aimv2-huge-patch14-448	Qwen2.5-14B-Instruct	Huggingface	Space
Ovis2-34B	aimv2-1B-patch14-448	Qwen2.5-32B-Instruct	Huggingface	-

📊 Performance

Ovis2 is evaluated with VLMEvalKit, the toolkit used by the OpenCompass multimodal and reasoning leaderboards.

Image Benchmarks

Benchmark	Qwen2.5-VL-3B	SAIL-VL-2B	InternVL2.5-2B-MPO	Ovis1.6-3B	InternVL2.5-1B-MPO	Ovis2-1B	Ovis2-2B
MMBench-V1.1_test	77.1	73.6	70.7	74.1	65.8	68.4	76.9
MMStar	56.5	56.5	54.9	52.0	49.5	52.1	56.7
MMMU_val	51.4	44.1	44.6	46.7	40.3	36.1	45.6
MathVista_testmini	60.1	62.8	53.4	58.9	47.7	59.4	64.1
HallusionBench	48.7	45.9	40.7	43.8	34.8	45.2	50.2
AI2D	81.4	77.4	75.1	77.8	68.5	76.4	82.7
OCRBench	83.1	83.1	83.8	80.1	84.3	89.0	87.3
MMVet	63.2	44.2	64.2	57.6	47.2	50.0	58.3
MMBench_test	78.6	77	72.8	76.6	67.9	70.2	78.9
MMT-Bench_val	60.8	57.1	54.4	59.2	50.8	55.5	61.7
RealWorldQA	66.5	62	61.3	66.7	57	63.9	66.0
BLINK	48.4	46.4	43.8	43.8	41	44.0	47.9
QBench	74.4	72.8	69.8	75.8	63.3	71.3	76.2
ABench	75.5	74.5	71.1	75.2	67.5	71.3	76.6
MTVQA	24.9	20.2	22.6	21.1	21.7	23.7	25.6
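
One way to read the table is to compare Ovis2-1B with the other 1B-class column, InternVL2.5-1B-MPO. The sketch below counts wins and losses using the scores transcribed from the two columns above; it assumes that a higher score is better on every benchmark listed, and the function names are hypothetical.

```python
# (benchmark, InternVL2.5-1B-MPO, Ovis2-1B), transcribed from the table above
SCORES = [
    ("MMBench-V1.1_test", 65.8, 68.4),
    ("MMStar", 49.5, 52.1),
    ("MMMU_val", 40.3, 36.1),
    ("MathVista_testmini", 47.7, 59.4),
    ("HallusionBench", 34.8, 45.2),
    ("AI2D", 68.5, 76.4),
    ("OCRBench", 84.3, 89.0),
    ("MMVet", 47.2, 50.0),
    ("MMBench_test", 67.9, 70.2),
    ("MMT-Bench_val", 50.8, 55.5),
    ("RealWorldQA", 57.0, 63.9),
    ("BLINK", 41.0, 44.0),
    ("QBench", 63.3, 71.3),
    ("ABench", 67.5, 71.3),
    ("MTVQA", 21.7, 23.7),
]


def compare(rows):
    """Split benchmark names into those where Ovis2-1B scores higher and lower."""
    wins = [name for name, baseline, ovis in rows if ovis > baseline]
    losses = [name for name, baseline, ovis in rows if ovis < baseline]
    return wins, losses


def largest_gain(rows):
    """Name of the benchmark with the largest Ovis2-1B margin over the baseline."""
    return max(rows, key=lambda row: row[2] - row[1])[0]


if __name__ == "__main__":
    wins, losses = compare(SCORES)
    print(f"higher on {len(wins)} of {len(SCORES)}; lower on: {losses}")
    print(f"largest margin: {largest_gain(SCORES)}")
```

On these numbers Ovis2-1B scores higher on 14 of the 15 image benchmarks, with MMMU_val the exception and MathVista_testmini the largest margin.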

Video Benchmarks

Benchmark	Qwen2.5-VL-3B	InternVL2.5-2B	InternVL2.5-1B	Ovis2-1B	Ovis2-2B
VideoMME(wo/w-subs)	61.5/67.6	51.9/54.1	50.3/52.3	48.6/49.5	57.2/60.8
MVBench	67.0	68.8	64.3	60.32	64.9
MLVU(M-Avg/G-Avg)	68.2/-	61.4/-	57.3/-	58.5/3.66	68.6/3.86
MMBench-Video	1.63	1.44	1.36	1.26	1.57
TempCompass	64.4	-	-	51.43	62.64

📚 Documentation

For additional usage instructions, inference wrappers, and the Gradio UI, see the Ovis GitHub repository.

📄 Citation

If you find Ovis helpful, please consider citing the following paper.

@article{lu2024ovis,
  title={Ovis: Structural Embedding Alignment for Multimodal Large Language Model},
  author={Shiyin Lu and Yang Li and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Han-Jia Ye},
  year={2024},
  journal={arXiv:2405.20797}
}

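📄 License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

⚠️ Disclaimer

We used compliance-checking algorithms during training to ensure the compliance of the trained model to the best of our ability. However, because of the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or inappropriate content. If you believe anything infringes on your rights or generates inappropriate content, please contact us and we will address the matter promptly.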