mPLUG-Owl3-7B-241101: an open-source multimodal large language model for efficient long image-sequence understanding

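Developed by mPLUG

mPLUG-Owl3 is an advanced multimodal large language model focused on long image-sequence understanding. Its Hyper Attention mechanism substantially improves processing speed and the supported sequence length.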

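Task: image-text-to-text

Format: Safetensors

Language: English | License: Apache-2.0 | Tags: Hyper Attention, long visual-sequence understanding, multimodal large language model

Downloads: 302

Released: November 26, 2024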

Model overview

mPLUG-Owl3 is designed to process long visual sequences. It supports single-image, multi-image, and video tasks and performs well on all three.

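Model features

Hyper Attention

Speeds up long visual-sequence understanding in multimodal large language models by 6x and supports visual sequences that are 8x longer.

Multimodal support

Supports single-image, multi-image, and video tasks while maintaining strong performance.

Optimized media input template

Adds support for image splitting with multi-image input, and simplifies the attention computation with a unified operation.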

Model capabilities

Long image-sequence understanding

Multimodal question answering

Video content analysis

Multi-image processing

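Use cases

Video understanding

Video question answering

Answers questions about video content.

Reaches 82.3% accuracy on the NextQA dataset.

Multi-image understanding

Multi-image reasoning

Reasons over several images together.

Reaches 92.7% accuracy on the NLVR2 dataset.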

🚀 mPLUG-Owl3

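mPLUG-Owl3 is a state-of-the-art multimodal large language model designed to tackle long image-sequence understanding. It proposes Hyper Attention, which speeds up long visual-sequence understanding in multimodal large language models by 6x and allows visual sequences that are 8x longer to be processed. It also maintains strong performance on single-image, multi-image, and video tasks.

GitHub: mPLUG-Owl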

🚀 Quick Start

Load mPLUG-Owl3. Currently, attn_implementation supports only 'sdpa' and 'flash_attention_2'.

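import torch
from modelscope import AutoConfig, AutoModel

# NOTE: this ID points to the 2B checkpoint; substitute the ID of the 7B-241101 checkpoint to load the model described here.
model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
# attn_implementation must be 'sdpa' or 'flash_attention_2'.
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"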
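
Because only 'sdpa' and 'flash_attention_2' are accepted, it can be convenient to choose between them at runtime. The helper below is a minimal sketch, not part of mPLUG-Owl3: the name pick_attn_implementation is hypothetical, and it assumes FlashAttention is installed as the flash_attn Python package.

```python
import importlib.util

# The two values the text above lists as supported.
SUPPORTED_ATTN = ("sdpa", "flash_attention_2")

def pick_attn_implementation(preferred="flash_attention_2"):
    """Return a supported attn_implementation value, falling back to 'sdpa'.

    Hypothetical helper: it only checks whether the flash_attn package
    can be imported, not whether the GPU actually supports it.
    """
    if preferred not in SUPPORTED_ATTN:
        raise ValueError(f"attn_implementation must be one of {SUPPORTED_ATTN}")
    if preferred == "flash_attention_2" and importlib.util.find_spec("flash_attn") is None:
        return "sdpa"
    return preferred
```

The returned string can then be passed as attn_implementation=pick_attn_implementation() in the from_pretrained call.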

Example: chat with an image.

from PIL import Image
from modelscope import AutoTokenizer

# Build the tokenizer and the model's own processor.
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# A solid red placeholder image; use a real image for actual input.
image = Image.new('RGB', (500, 500), color='red')

# <|image|> marks where the image goes; the assistant turn is left empty for the model to fill in.
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

# Move the inputs to the model's device and add the generation options.
inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
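
For multi-image input, the same message layout can be built programmatically. The sketch below assumes one <|image|> placeholder per image, placed before the prompt as in the single-image example above; build_image_messages is a hypothetical helper, not part of the mPLUG-Owl3 API.

```python
def build_image_messages(prompt, num_images):
    """Build a chat message list with one <|image|> placeholder per image.

    Assumption: placeholders are concatenated ahead of the prompt,
    mirroring the single-image example.
    """
    placeholders = "<|image|>" * num_images
    return [
        {"role": "user", "content": f"{placeholders}\n{prompt}"},
        {"role": "assistant", "content": ""},  # left empty for the model to fill in
    ]

messages_two_images = build_image_messages("Compare these images.", 2)
```

The resulting list would be passed to the processor together with a list of the same number of images, for example images=[image_a, image_b].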

Example: chat with a video.

from PIL import Image
from modelscope import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# <|video|> marks where the video goes in the prompt.
messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    """Decode a video into at most MAX_NUM_FRAMES PIL images."""
    def uniform_sample(l, n):
        # Pick n items from l, one from the middle of each equal-width bin.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    # Take roughly one frame per second; guard against a zero step.
    sample_fps = max(1, round(vr.get_avg_fps()))
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(path) for path in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
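
The frame selection in encode_video can be checked without a video file. The sketch below reproduces only its index arithmetic; select_frame_indices is a hypothetical name, and the frame counts and frame rates used are made-up inputs.

```python
def uniform_sample(items, n):
    """Pick n items, one from the middle of each equal-width bin."""
    gap = len(items) / n
    return [items[int(i * gap + gap / 2)] for i in range(n)]

def select_frame_indices(total_frames, avg_fps, max_frames=16):
    """Return the frame indices encode_video would decode.

    Takes about one frame per second, then thins the list to
    max_frames entries if it is longer than that.
    """
    step = max(1, round(avg_fps))
    candidates = list(range(0, total_frames, step))
    if len(candidates) > max_frames:
        candidates = uniform_sample(candidates, max_frames)
    return candidates

short_clip = select_frame_indices(100, 25.0)    # 4 seconds of video
long_clip = select_frame_indices(1000, 25.0)    # 40 seconds of video
```

A 4-second clip keeps all four one-per-second frames, while a 40-second clip is thinned to 16 frames spread across its whole length.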

Saving memory with Liger-Kernel

mPLUG-Owl3 is based on Qwen2, so it can be optimized with Liger-Kernel to reduce memory usage.

pip install liger-kernel

def apply_liger_kernel_to_mplug_owl3(
    rms_norm: bool = True,
    swiglu: bool = True,
    model = None,
) -> None:
    """
    Apply Liger kernels to replace original implementation in HuggingFace Qwen2 models

    Args:
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is True.
        swiglu (bool): Whether to apply Liger's SwiGLU MLP. Default is True.
        model (PreTrainedModel): The already-loaded model instance to apply Liger kernels to. Required.
    """
    from liger_kernel.transformers.monkey_patch import _patch_rms_norm_module
    from liger_kernel.transformers.monkey_patch import _bind_method_to_module
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP

    if model is None:
        raise ValueError("model must be a loaded mPLUG-Owl3 instance")

    # The Qwen2 decoder sits under language_model.model.
    base_model = model.language_model.model

    if rms_norm:
        _patch_rms_norm_module(base_model.norm)

    # Patch the MLP and both layer norms in every decoder layer.
    for decoder_layer in base_model.layers:
        if swiglu:
            _bind_method_to_module(
                decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
            )
        if rms_norm:
            _patch_rms_norm_module(decoder_layer.input_layernorm)
            _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
    print("Applied Liger kernels to Qwen2 in mPLUG-Owl3")

import torch
from modelscope import AutoConfig, AutoModel

# NOTE: this ID points to the 2B checkpoint; substitute the ID of the 7B-241101 checkpoint to load the model described here.
model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
# Patch the loaded model in place.
apply_liger_kernel_to_mplug_owl3(model=model)

Saving memory by setting device_map

If you have multiple GPUs, you can set device_map='auto' to split mPLUG-Owl3 across them. Note that this slows down inference.

model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval()
first_layer_name = list(model.hf_device_map.keys())[0]
device = model.hf_device_map[first_layer_name]
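
To see how a model was split, the device map can be summarized per device. This is a minimal sketch on a plain dictionary: the module names in example_map are invented for illustration and are not the real mPLUG-Owl3 module names, and both helper functions are hypothetical.

```python
from collections import Counter

def summarize_device_map(device_map):
    """Count how many mapped modules were placed on each device."""
    return dict(Counter(str(device) for device in device_map.values()))

def first_device(device_map):
    """Return the device of the first mapped module, as the snippet above does."""
    return next(iter(device_map.values()))

# Hypothetical map in the shape of model.hf_device_map (module name -> device).
example_map = {
    "vision_model": 0,
    "language_model.model.layers.0": 0,
    "language_model.model.layers.1": 1,
}
summary = summarize_device_map(example_map)
input_device = first_device(example_map)
```

With a loaded model, the same functions could be called on model.hf_device_map; inputs go to the first device because that is where the forward pass starts.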

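✨ Key Features

What's new

mPLUG-Owl3-7B-241101 is an improved version of mPLUG-Owl3-7B-240728.

Fused Hyper Attention

mPLUG-Owl3 previously computed cross-attention and self-attention separately and fused their outputs through an adaptive gate. It now uses a unified operation that requires only a single attention computation.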
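
Why one attention pass can stand in for two passes plus a fusion step can be seen from a general property of softmax attention. The sketch below is not the mPLUG-Owl3 implementation and its gate is not the model's learned adaptive gate. Under the simplifying assumptions of a single query, no masking, and toy made-up vectors, it shows that attending once over concatenated text and image keys equals blending the two separate results with a gate derived from their softmax normalizers.

```python
import math

def attend(query, keys, values):
    """Single-query scaled dot-product attention.

    Returns the output vector and the log-sum-exp of the scores,
    which measures how much total weight this key set attracts.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, top + math.log(total)

# Toy data (made up for illustration).
query = [0.2, -0.1]
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[1.0, 2.0], [3.0, 4.0]]
image_keys = [[0.5, 0.5], [-1.0, 1.0], [0.3, -0.7]]
image_values = [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

# Two passes: self-attention over text, cross-attention over images, then a gate.
self_out, self_lse = attend(query, text_keys, text_values)
cross_out, cross_lse = attend(query, image_keys, image_values)
gate = 1.0 / (1.0 + math.exp(self_lse - cross_lse))  # weight on the image branch
two_pass = [(1.0 - gate) * s + gate * c for s, c in zip(self_out, cross_out)]

# One pass: a single attention over the concatenated keys and values.
one_pass, _ = attend(query, text_keys + image_keys, text_values + image_values)
```

Here one_pass and two_pass agree up to floating-point error, which is the sense in which a unified attention already contains a fusion of the two branches.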

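New template for media inputs

The following format is now used to represent a split high-resolution image. Image splitting can also be enabled when the input consists of multiple images, which yields further performance gains. Earlier versions of mPLUG-Owl3 do not support this combination.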

<|start_cut|>2*3
<|image|> <|image|> <|image|>
<|image|> <|image|> <|image|>
<|image|><|end_cut|>

The following format is used to represent a video.

<|start_video_frame|><|image|><|image|><|image|><|end_video_frame|>
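
The two templates above can be reproduced with a few lines of string formatting. This is only an illustration of the layout, since the processor builds these strings itself. The function names are hypothetical, and reading "2*3" as rows times columns is an assumption based on the example grid. The trailing <|image|> before <|end_cut|> is copied from the example as-is.

```python
IMAGE = "<|image|>"

def cut_template(rows, cols):
    """Build the split-image template for a rows x cols grid of crops."""
    grid = "\n".join(" ".join([IMAGE] * cols) for _ in range(rows))
    # One extra <|image|> follows the grid, as in the example above.
    return f"<|start_cut|>{rows}*{cols}\n{grid}\n{IMAGE}<|end_cut|>"

def video_template(num_frames):
    """Build the video template with one <|image|> per sampled frame."""
    return "<|start_video_frame|>" + IMAGE * num_frames + "<|end_video_frame|>"

print(cut_template(2, 3))
print(video_template(3))
```

Running it prints the same two layouts shown above for a 2*3 split and a three-frame video.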

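Adjusted media_offset

Previously, media_offset recorded the range of images each token could see. Because images from multiple samples are concatenated along the batch dimension during training, media_offset had to be adjusted carefully; otherwise it could point to the wrong image. To prevent this, media_offset is now a List[List[int]] that gives, for each sample in the batch, the position of each image in that sample's original sequence. This design also makes computing the cross-attention mask and MI-Rope more efficient and convenient.

All of these changes are handled by the processor, so existing calling code does not need to change.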
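
The shape of the new media_offset can be illustrated on toy data. The sketch below is not the processor's code. It assumes each image occupies a single placeholder token and works on string tokens rather than token IDs, and image_positions is a hypothetical helper.

```python
from typing import List

def image_positions(batch_tokens: List[List[str]], image_token: str = "<|image|>") -> List[List[int]]:
    """For each sample, list the positions of its images in its own sequence."""
    return [
        [index for index, token in enumerate(tokens) if token == image_token]
        for tokens in batch_tokens
    ]

# Toy batch of two samples (made up for illustration).
batch = [
    ["<|image|>", "Describe", "this", "image."],
    ["Compare", "<|image|>", "and", "<|image|>"],
]
media_offset = image_positions(batch)
```

Because each inner list refers only to positions within its own sample, concatenating images across the batch does not require renumbering the offsets.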

High performance in video and multi-image scenarios

Model	NextQA	MVBench	VideoMME w/o sub	LongVideoBench-val	MLVU	LVBench
mPLUG-Owl3-7B-240728	78.6	54.5	53.5	52.1	63.7	-
mPLUG-Owl3-7B-241101	82.3	59.5	59.3	59.7	70.0	43.5
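
The gains on the video benchmarks can be computed directly from the two rows above. The short script below uses only the numbers in this table; LVBench is left out because the 240728 release has no reported score for it.

```python
# (240728 score, 241101 score) for each video benchmark in the table above.
video_scores = {
    "NextQA": (78.6, 82.3),
    "MVBench": (54.5, 59.5),
    "VideoMME w/o sub": (53.5, 59.3),
    "LongVideoBench-val": (52.1, 59.7),
    "MLVU": (63.7, 70.0),
}

# Absolute improvement in points, rounded to one decimal place.
gains = {name: round(new - old, 1) for name, (old, new) in video_scores.items()}
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

The gains range from 3.7 points on NextQA to 7.6 points on LongVideoBench-val, the largest of the five.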

Model	NLVR2	Mantis-Eval	MathVerse-mv	SciVerse-mv	BLINK	Q-Bench2
mPLUG-Owl3-7B-240728	90.8	63.1	65.0	86.2	50.3	74.0
mPLUG-Owl3-7B-241101	92.7	67.3	65.1	82.7	53.8	77.7

Model	VQAv2	OK-VQA	GQA	VizWizQA	TextVQA
mPLUG-Owl3-7B-240728	82.1	60.1	65.0	63.5	69.0
mPLUG-Owl3-7B-241101	83.2	61.4	64.7	62.9	71.4

Model	MMB-EN	MMB-CN	MM-Vet	POPE	AI2D
mPLUG-Owl3-7B-240728	77.6	74.3	40.1	88.2	73.8
mPLUG-Owl3-7B-241101	80.4	79.1	39.8	88.1	77.8

📄 License

This project is released under the Apache-2.0 license.

📚 Documentation

Citation

If you find this work useful, please cite it as follows.

@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models}, 
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840}, 
}