mPLUG-Owl3-1B-241014: Open-Source Multimodal Large Model for Fast Understanding of Long Image Sequences

Mplug Owl3 1B 241014

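Developed by mPLUG

mPLUG-Owl3 is an advanced multimodal large language model focused on the challenge of understanding long image sequences. Its Hyper Attention mechanism substantially improves processing speed and the length of the visual sequences it can handle.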

Image-to-text

Safetensors

English | Open-source license: Apache-2.0 | #Hyper Attention mechanism #Long visual sequence understanding #Multimodal dialogue

Downloads: 617

Release date: 10/15/2024

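Model overview

mPLUG-Owl3 is a multimodal large language model designed to address the challenge of understanding long image sequences. Its Hyper Attention mechanism speeds up processing and lets the model handle longer visual sequences, while maintaining strong performance on single-image, multi-image, and video tasks.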

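Model features

Hyper Attention mechanism

The Hyper Attention mechanism makes long visual sequence understanding 6 times faster and allows the model to process visual sequences 8 times longer.

Multimodal support

Supports single-image, multi-image, and video tasks, with strong multimodal understanding.

Efficient processing

Substantially improves the efficiency of processing long visual sequences while maintaining high performance.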

Model capabilities

Image description

Video description

Multimodal dialogue

Long-sequence visual understanding

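Use cases

Visual question answering

Image description

The user uploads an image, and the model generates a description of it.

Produces an accurate, detailed image description.

Video description

The user uploads a video, and the model generates a description of it.

Produces an accurate, detailed video description.

Multimodal dialogue

Chatting about an image

The user uploads an image and converses with the model. The model answers the user's questions based on the image content.

Provides accurate answers grounded in the image content.

Chatting about a video

The user uploads a video and converses with the model. The model answers the user's questions based on the video content.

Provides accurate answers grounded in the video content.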

🚀 mPLUG-Owl3

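mPLUG-Owl3 is a state-of-the-art multimodal large language model designed to tackle the challenge of long image sequence understanding. It proposes Hyper Attention, a technique that makes long visual sequence understanding in multimodal large language models 6 times faster and enables processing of visual sequences 8 times longer. At the same time, it maintains strong performance on single-image, multi-image, and video tasks.

GitHub: mPLUG-Owl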

🚀 Quick Start

Load mPLUG-Owl3. Currently, attn_implementation supports only 'sdpa' and 'flash_attention_2'.

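import torch
from transformers import AutoConfig, AutoModel

model_path = 'mPLUG/mPLUG-Owl3-1B-241014'

# trust_remote_code=True is needed because the model code ships with the checkpoint.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
# model = mPLUGOwl3Model(config).cuda().half()

# Load the weights in half precision with the SDPA attention backend.
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa', torch_dtype=torch.half, trust_remote_code=True)
# Switch to inference mode and move the model to the GPU.
model.eval().cuda()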
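
Because only two attention backends are listed as supported, it can help to choose and validate the value before loading the model. The following is a minimal sketch, not part of the original model card: the helper names `pick_attn_implementation` and `validate_attn_implementation` are hypothetical, and the check assumes that the `flash_attention_2` backend needs the separately installed `flash_attn` package.

```python
import importlib.util

# The two values the text above lists as supported.
SUPPORTED_ATTN = ("sdpa", "flash_attention_2")


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Hypothetical helper: choose an attention backend name.

    Returns 'flash_attention_2' only when the flash_attn package can be
    found (assumption: that package is what the backend requires),
    otherwise falls back to 'sdpa'.
    """
    has_flash = importlib.util.find_spec("flash_attn") is not None
    if prefer_flash and has_flash:
        return "flash_attention_2"
    return "sdpa"


def validate_attn_implementation(name: str) -> str:
    """Hypothetical helper: reject values outside the supported list."""
    if name not in SUPPORTED_ATTN:
        raise ValueError(
            f"attn_implementation must be one of {SUPPORTED_ATTN}, got {name!r}"
        )
    return name


attn = validate_attn_implementation(pick_attn_implementation())
print(attn)
```

The resulting string can be passed as `attn_implementation=` in the `AutoModel.from_pretrained` call above. Falling back to `sdpa` keeps the snippet usable on machines where `flash_attn` is not installed.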

Chat with an image

from PIL import Image
from transformers import AutoTokenizer

# Assumes `model` has already been loaded as in the Quick Start snippet.
model_path = 'mPLUG/mPLUG-Owl3-1B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# Placeholder input: a solid red 500x500 image.
image = Image.new('RGB', (500, 500), color='red')

# The <|image|> placeholder stands for the image in the prompt; the empty
# assistant turn is left for the model to complete.
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
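
Since the model is described as supporting multi-image tasks, the message list above can be generated for any number of images. The sketch below uses a hypothetical helper, `build_image_messages`, and assumes that each image in the `images` list is matched by one `<|image|>` placeholder in the same order; the model card itself only shows the single-image case.

```python
from typing import Dict, List


def build_image_messages(prompt: str, num_images: int = 1) -> List[Dict[str, str]]:
    """Hypothetical helper: build a chat message list with one <|image|>
    placeholder per image, followed by the text prompt.

    Assumption: placeholders map to the `images` list in order.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    placeholders = "\n".join(["<|image|>"] * num_images)
    return [
        {"role": "user", "content": f"{placeholders}\n{prompt}"},
        # Empty assistant turn, as in the example above.
        {"role": "assistant", "content": ""},
    ]


messages = build_image_messages("Describe the differences between these images.", num_images=2)
print(messages[0]["content"])
```

With `num_images=1` and the prompt "Describe this image.", the helper reproduces the message list used in the single-image example, so it can be passed to `processor(messages, images=[...], videos=None)` in the same way.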

Chat with a video

from PIL import Image
from transformers import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

# Assumes `model` has already been loaded as in the Quick Start snippet.
model_path = 'mPLUG/mPLUG-Owl3-1B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

# Example path; replace with a video file available on your machine.
videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16  # upper bound on the frames kept per video

def encode_video(video_path):
    def uniform_sample(l, n):
        # Split l into n equal bins and take the item at the middle of each bin.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    # Keep roughly one frame per second; max(1, ...) avoids a zero step.
    sample_fps = max(1, round(vr.get_avg_fps()))
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(path) for path in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
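
The frame selection inside `encode_video` can be studied without a video file. The sketch below reproduces the same index arithmetic in a standalone function; `select_frame_indices` is a hypothetical name, and the `max(1, ...)` guard on the step is an addition for very low frame rates rather than something taken from the model card.

```python
from typing import List


def select_frame_indices(total_frames: int, avg_fps: float, max_frames: int = 16) -> List[int]:
    """Hypothetical helper mirroring the sampling in encode_video.

    Step 1: keep about one frame per second of video.
    Step 2: if that still yields more than max_frames, keep max_frames
            of them, taken from the middle of equally sized bins.
    """
    step = max(1, round(avg_fps))  # guard against a zero step (assumption)
    idx = list(range(0, total_frames, step))
    if len(idx) > max_frames:
        gap = len(idx) / max_frames
        idx = [idx[int(i * gap + gap / 2)] for i in range(max_frames)]
    return idx


# A 3-second clip at 30 fps: one frame per second, below the cap.
print(select_frame_indices(90, 30.0))                 # [0, 30, 60]
# A 20-second clip at 30 fps, capped at 4 frames.
print(select_frame_indices(600, 30.0, max_frames=4))  # [60, 210, 360, 510]
```

Sampling from the middle of each bin, rather than from its start, avoids always including the very first frame and spreads the kept frames evenly over the clip.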

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you find our work useful, please cite it.

@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models}, 
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840}, 
}