mPLUG-Owl3-7B-241101 Open-Source Multimodal Large Model: Efficient Long Image-Sequence Understanding

Developed by mPLUG

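mPLUG-Owl3 is a state-of-the-art multimodal large language model focused on long image-sequence understanding. Its Hyper Attention mechanism significantly increases processing speed and the supported sequence length.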

Image-Text-to-Text

Safetensors

English

Open Source License: Apache-2.0

#HyperAttention #LongSequenceVisualUnderstanding #MultimodalLargeModel

Release Time: 11/26/2024

Model Overview

mPLUG-Owl3 is designed to process long visual sequences. It supports single-image, multi-image, and video tasks and performs strongly on each.

Model Features

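Hyper Attention

Speeds up long visual-sequence understanding in multimodal large language models sixfold and supports visual sequences eight times longer.

Multimodal support

Handles single-image, multi-image, and video tasks while maintaining strong performance.

Optimized media input template

Adds support for image splitting with multi-image input, and uses a unified operation to simplify the attention computation.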

Model Capabilities

Long image-sequence understanding

Multimodal question answering

Video content analysis

Multi-image processing

Use Cases

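Video understanding

Video question answering

Answers questions about video content.

Reaches 82.3% accuracy on the NextQA dataset.

Multi-image understanding

Multi-image reasoning

Reasons jointly over multiple images.

Reaches 92.7% accuracy on the NLVR2 dataset.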

🚀 mPLUG-Owl3

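mPLUG-Owl3 is a state-of-the-art multimodal large language model built to address the challenges of long image-sequence understanding. Its proposed Hyper Attention mechanism speeds up long visual-sequence understanding in multimodal large language models sixfold and allows visual sequences eight times longer to be processed. It also maintains excellent performance on single-image, multi-image, and video tasks.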

🚀 Quick Start

Load mPLUG-Owl3

Currently, attn_implementation only supports the values in ['sdpa', 'flash_attention_2'].

import torch
from modelscope import AutoConfig, AutoModel

# ModelScope ID of the checkpoint this page describes.
model_path = 'iic/mPLUG-Owl3-7B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)

# attn_implementation must be 'sdpa' or 'flash_attention_2'.
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
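The loading snippet above hard-codes `flash_attention_2`, which needs the separate `flash_attn` package. The following is a minimal sketch of falling back to `sdpa` when that package is missing. The helper name `pick_attn_implementation` is hypothetical and not part of mPLUG-Owl3 or ModelScope.

```python
import importlib.util

# The two values the model card lists as supported.
SUPPORTED_ATTN = ("sdpa", "flash_attention_2")

def pick_attn_implementation(preferred="flash_attention_2"):
    """Return `preferred` if it looks usable, otherwise fall back to 'sdpa'.

    Hypothetical helper for illustration only.
    """
    if preferred not in SUPPORTED_ATTN:
        raise ValueError(
            f"attn_implementation must be one of {SUPPORTED_ATTN}, got {preferred!r}"
        )
    # flash_attention_2 relies on the optional `flash_attn` package.
    if preferred == "flash_attention_2" and importlib.util.find_spec("flash_attn") is None:
        return "sdpa"
    return preferred
```

The result can then be passed as `attn_implementation=pick_attn_implementation()` in the `from_pretrained` call.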

Chat with images

from PIL import Image
from modelscope import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# Placeholder input: a solid red 500x500 image.
image = Image.new('RGB', (500, 500), color='red')

# <|image|> marks where the image goes; generation fills the empty assistant turn.
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
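The example above passes one image and one `<|image|>` placeholder. For several images, a small helper can keep the two counts in step. This is a sketch under the assumption that the processor expects one placeholder per image. The helper name `build_image_messages` and the newline-separated layout are illustrative choices, not part of the model's API.

```python
IMAGE_TOKEN = "<|image|>"

def build_image_messages(prompt, num_images):
    """Build a chat with one <|image|> placeholder per image.

    Hypothetical helper: it assumes one placeholder per image and adds the
    empty assistant turn used in the example above.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    placeholders = "\n".join([IMAGE_TOKEN] * num_images)
    return [
        {"role": "user", "content": f"{placeholders}\n{prompt}"},
        {"role": "assistant", "content": ""},
    ]
```

With two PIL images `img_a` and `img_b`, the call would look like `processor(build_image_messages("Compare these images.", 2), images=[img_a, img_b], videos=None)`.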

Chat with a video

from PIL import Image
from modelscope import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

# Example path from the original page; replace it with a local video file.
videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    """Sample about one frame per second, capped at MAX_NUM_FRAMES, as PIL images."""
    def uniform_sample(l, n):
        # Take the midpoint of each of n equal-width bins over l.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    # Step between sampled frames; max() avoids a zero step at very low frame rates.
    sample_fps = max(1, round(vr.get_avg_fps()))
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(path) for path in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
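The frame selection inside `encode_video` can be checked without a video file. The sketch below isolates the same index arithmetic in a standalone function. The name `select_frame_indices` is hypothetical, and the `max(1, ...)` guard is an addition for very low frame rates.

```python
def select_frame_indices(num_frames, avg_fps, max_frames=16):
    """Pick roughly one frame per second, capped at `max_frames`.

    Mirrors the sampling logic of encode_video above, without decord.
    """
    step = max(1, round(avg_fps))          # frames per one-second stride
    candidates = list(range(0, num_frames, step))
    if len(candidates) <= max_frames:
        return candidates
    # Too many candidates: take the midpoint of each of `max_frames` equal bins.
    gap = len(candidates) / max_frames
    return [candidates[int(i * gap + gap / 2)] for i in range(max_frames)]

# A 10-second clip at 30 fps keeps all 10 one-per-second frames.
short_clip = select_frame_indices(num_frames=300, avg_fps=30.0)

# A 100-second clip at 30 fps is reduced to 16 evenly spread frames.
long_clip = select_frame_indices(num_frames=3000, avg_fps=30.0)
```

Sampling bin midpoints rather than bin starts keeps the selected frames centered in the clip instead of always including frame 0.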

Save memory with Liger-Kernel

mPLUG-Owl3 is built on Qwen2 and can be optimized with Liger-Kernel to reduce memory usage.

pip install liger-kernel

def apply_liger_kernel_to_mplug_owl3(
    rms_norm: bool = True,
    swiglu: bool = True,
    model = None,
) -> None:
    """
    Apply Liger kernels to replace the original implementation in the HuggingFace Qwen2 model inside mPLUG-Owl3.

    Args:
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is True.
        swiglu (bool): Whether to apply Liger's SwiGLU MLP. Default is True.
        model (PreTrainedModel): The loaded model instance to patch. Required, although it defaults to None.
    """
    # Underscore-prefixed Liger-Kernel helpers are private and may change between releases.
    from liger_kernel.transformers.monkey_patch import _patch_rms_norm_module
    from liger_kernel.transformers.monkey_patch import _bind_method_to_module
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP

    if model is None:
        raise ValueError("Pass the loaded mPLUG-Owl3 model via `model=`.")

    # The Qwen2 decoder stack inside mPLUG-Owl3.
    base_model = model.language_model.model

    if rms_norm:
        _patch_rms_norm_module(base_model.norm)

    for decoder_layer in base_model.layers:
        if swiglu:
            _bind_method_to_module(
                decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
            )
        if rms_norm:
            _patch_rms_norm_module(decoder_layer.input_layernorm)
            _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
    print("Applied Liger kernels to Qwen2 in mPLUG-Owl3")

import torch
from modelscope import AutoConfig, AutoModel

# ModelScope ID of the checkpoint this page describes.
model_path = 'iic/mPLUG-Owl3-7B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
# Patch the loaded model in place.
apply_liger_kernel_to_mplug_owl3(model=model)

Save memory by setting device_map

If you have multiple GPUs, you can set device_map='auto' to split mPLUG-Owl3 across them. Note that this slows down inference.

model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval()
# Inputs should go to the device that holds the first module in the device map.
first_layer_name = list(model.hf_device_map.keys())[0]
device = model.hf_device_map[first_layer_name]
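With `device_map='auto'`, the value read from `hf_device_map` is typically an integer GPU index rather than a string such as `"cuda"`. The sketch below normalizes such an entry to a device string. The helper name `first_device` and the module names in the sample map are hypothetical.

```python
def first_device(hf_device_map):
    """Return the device of the first module in an hf_device_map-style dict.

    Integer entries are treated as GPU indices; string entries such as
    'cpu' are returned unchanged.
    """
    first_module = next(iter(hf_device_map))
    device = hf_device_map[first_module]
    return f"cuda:{device}" if isinstance(device, int) else str(device)

# Hypothetical device map for illustration; real module names will differ.
example_map = {"vision_model": 0, "language_model": 1}
target = first_device(example_map)
```

After loading, `inputs.to(first_device(model.hf_device_map))` would place the inputs on the same device as the first module.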

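✨ Key Features

Fused Hyper Attention

mPLUG-Owl3 previously had to compute cross-attention and self-attention separately and fuse their outputs through an adaptive gate. It now uses a unified operation, so attention is computed only once.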
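To illustrate why a single attention pass can stand in for two gated passes, the toy sketch below attends one query over text keys and image keys. It is not the model's implementation: the real adaptive gate is learned, whereas here the gate is derived from the softmax normalizers purely to show that one softmax over the concatenated keys already mixes both sources. All function names are hypothetical.

```python
import math

def attend(q, keys, values):
    """Softmax attention for one query; returns (output, softmax normalizer)."""
    scale = math.sqrt(len(q))
    weights = [math.exp(sum(a * b for a, b in zip(q, k)) / scale) for k in keys]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]
    return out, z

def two_pass(q, text_kv, image_kv):
    """Separate self-attention and cross-attention, fused by an explicit gate."""
    out_t, z_t = attend(q, *text_kv)
    out_i, z_i = attend(q, *image_kv)
    gate = z_i / (z_i + z_t)  # illustrative gate, not a learned one
    return [gate * i + (1 - gate) * t for i, t in zip(out_i, out_t)]

def one_pass(q, text_kv, image_kv):
    """A single attention over the concatenated text and image keys."""
    keys = text_kv[0] + image_kv[0]
    values = text_kv[1] + image_kv[1]
    return attend(q, keys, values)[0]

q = [0.5, -1.0]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
image_kv = ([[0.5, 0.5]], [[5.0, 6.0]])
fused = two_pass(q, text_kv, image_kv)
unified = one_pass(q, text_kv, image_kv)
```

Both variants return the same vector here, but the unified version needs one softmax and one weighted sum instead of two of each plus a fusion step.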

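New media input template

The following format represents a high-resolution image after splitting. In addition, when the input contains multiple images, image splitting can now be enabled for better performance. Earlier versions of mPLUG-Owl3 were not trained on this combination.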

<|start_cut|>2*3
<|image|> <|image|> <|image|>
<|image|> <|image|> <|image|>
<|image|><|end_cut|>

The following format represents a video.

<|start_video_frame|><|image|><|image|><|image|><|end_video_frame|>
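The two templates above can be reproduced with plain string formatting, which makes their layout easier to see. The sketch below assumes that `2*3` means rows times columns and that the final `<|image|>` before `<|end_cut|>` stands for the whole image; neither is stated on this page. The function names are hypothetical, and in normal use the processor builds these strings itself.

```python
IMAGE_TOKEN = "<|image|>"

def cut_template(rows, cols):
    """Render the split-image template for a rows x cols grid of crops."""
    grid = "\n".join(" ".join([IMAGE_TOKEN] * cols) for _ in range(rows))
    # Assumption: the trailing placeholder represents the full image.
    return f"<|start_cut|>{rows}*{cols}\n{grid}\n{IMAGE_TOKEN}<|end_cut|>"

def video_template(num_frames):
    """Render the video template with one placeholder per frame."""
    return "<|start_video_frame|>" + IMAGE_TOKEN * num_frames + "<|end_video_frame|>"

cut_text = cut_template(2, 3)
video_text = video_template(3)
```

Printing `cut_text` and `video_text` gives back the two templates shown above.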

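Adjusted media_offset

Previously, media_offset recorded the range of images that each token could see. During training, images from multiple samples are concatenated along the batch dimension, so media_offset had to be modified carefully or it would point to the wrong images. To prevent this, media_offset is now a List[List[int]] that gives, for each image in a sample, its position in that sample's original sequence within the batch. This design also makes computing the cross-attention mask and MI-Rope more efficient and convenient.

All of these changes are handled by the processor, so you do not need to change how you call it.

High performance on video and multi-image scenarios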
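As a rough picture of the List[List[int]] layout described for media_offset, the toy sketch below records, per sample, where each image placeholder sits in that sample's own token sequence. This is an assumption-laden illustration: the real processor works on token IDs and may define the positions differently, and the function name `media_positions` is hypothetical.

```python
from typing import List

def media_positions(batch_tokens: List[List[str]],
                    image_token: str = "<|image|>") -> List[List[int]]:
    """For each sample, list the positions of its image placeholders.

    Positions are relative to the sample's own sequence, so they stay valid
    when images from several samples are concatenated along the batch axis.
    """
    return [
        [i for i, tok in enumerate(tokens) if tok == image_token]
        for tokens in batch_tokens
    ]

batch = [
    ["Compare", "<|image|>", "with", "<|image|>"],
    ["<|image|>", "Describe"],
]
offsets = media_positions(batch)
```

Each inner list belongs to one sample, which is why concatenating images from several samples no longer requires rewriting the offsets.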

Model	NextQA	MVBench	VideoMME w/o sub	LongVideoBench-val	MLVU	LVBench
mPLUG-Owl3-7B-240728	78.6	54.5	53.5	52.1	63.7	-
mPLUG-Owl3-7B-241101	82.3	59.5	59.3	59.7	70.0	43.5

Model	NLVR2	Mantis-Eval	MathVerse-mv	SciVerse-mv	BLINK	Q-Bench2
mPLUG-Owl3-7B-240728	90.8	63.1	65.0	86.2	50.3	74.0
mPLUG-Owl3-7B-241101	92.7	67.3	65.1	82.7	53.8	77.7

Model	VQAv2	OK-VQA	GQA	VizWizQA	TextVQA
mPLUG-Owl3-7B-240728	82.1	60.1	65.0	63.5	69.0
mPLUG-Owl3-7B-241101	83.2	61.4	64.7	62.9	71.4

Model	MMB-EN	MMB-CN	MM-Vet	POPE	AI2D
mPLUG-Owl3-7B-240728	77.6	74.3	40.1	88.2	73.8
mPLUG-Owl3-7B-241101	80.4	79.1	39.8	88.1	77.8

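📄 License

This project is licensed under the Apache-2.0 license.

📚 Documentation

Project links

GitHub: mPLUG-Owl

📚 Citation

If you find our work helpful, please cite the following paper: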

@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models}, 
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840}, 
}