mPLUG-Owl3
mPLUG-Owl3 is a cutting-edge multi-modal large language model designed to tackle the challenges of long image-sequence understanding. It introduces Hyper Attention, which speeds up long visual sequence understanding in multimodal large language models by a factor of six and allows visual sequences that are eight times longer to be processed, while maintaining excellent performance on single-image, multi-image, and video tasks.
Features
Fused Hyper Attention
Previously, mPLUG-Owl3 needed separate calculations for cross-attention and self-attention and fused their outputs via an adaptive gate. Now, a unified operation is used, requiring only one attention computation.
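Conceptually (this is only an illustrative sketch, not the model's actual implementation, and every tensor name below is made up), the fused operation concatenates the visual and textual keys/values so that one attention call serves both the self-attention and the cross-attention path:

```python
# A minimal, illustrative sketch of the fused-attention idea (not the actual
# mPLUG-Owl3 code): visual and textual keys/values are concatenated so that a
# single scaled_dot_product_attention call covers both paths.
import torch
import torch.nn.functional as F

def fused_hyper_attention(q, k_text, v_text, k_img, v_img, text_mask, img_mask):
    # q:              (batch, heads, L_text, head_dim) queries from text tokens
    # k_text, v_text: (batch, heads, L_text, head_dim)
    # k_img, v_img:   (batch, heads, L_img, head_dim) keys/values from visual features
    # text_mask:      (batch, 1, L_text, L_text) boolean, e.g. a causal mask
    # img_mask:       (batch, 1, L_text, L_img) boolean, which image tokens each text token may see
    k = torch.cat([k_img, k_text], dim=2)            # one fused key sequence
    v = torch.cat([v_img, v_text], dim=2)            # one fused value sequence
    mask = torch.cat([img_mask, text_mask], dim=-1)  # fused attention mask
    # A single attention computation replaces separate self- and cross-attention.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```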
New template for media inputs
The following format is used to represent a split high-resolution image. Image splitting can additionally be enabled for multi-image inputs to gain further performance, a combination the previous mPLUG-Owl3 version was not trained for.
```
<|start_cut|>2*3
<|image|> <|image|> <|image|>
<|image|> <|image|> <|image|>
<|image|><|end_cut|>
```
The following format represents video:
```
<|start_video_frame|><|image|><|image|><|image|><|end_video_frame|>
```
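As a rough illustration only (the processor shipped with the model builds these strings internally; the helpers below are hypothetical), the two templates can be assembled like this:

```python
# Hypothetical helpers that reproduce the templates shown above.
def cut_image_template(rows: int, cols: int) -> str:
    grid = "\n".join(" ".join("<|image|>" for _ in range(cols)) for _ in range(rows))
    # The trailing <|image|> before <|end_cut|> mirrors the template above.
    return f"<|start_cut|>{rows}*{cols}\n{grid}\n<|image|><|end_cut|>"

def video_template(num_frames: int) -> str:
    return "<|start_video_frame|>" + "<|image|>" * num_frames + "<|end_video_frame|>"

print(cut_image_template(2, 3))  # the 2*3 split-image example
print(video_template(3))         # a three-frame video
```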
Adjusted media_offset
Previously, media_offset recorded the image range each token could access. During training, as images from multiple samples were concatenated along the batch dimension, media_offset had to be carefully modified to avoid pointing to the wrong image. Now, media_offset is a List[List[int]], indicating each image's position in the original sequence within the batch. This design also streamlines the computation of cross-attention masks and MI-Rope.
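For example (the numbers below are purely illustrative), a batch whose first sample contains two images and whose second sample contains one image would carry a media_offset of the following shape:

```python
# Illustrative values only: each inner list gives the positions of that sample's
# <|image|> tokens within its own original sequence, so no cross-sample
# adjustment is needed when samples are concatenated along the batch dimension.
media_offset = [
    [5, 42],  # sample 0: two images at sequence positions 5 and 42
    [7],      # sample 1: one image at sequence position 7
]
```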
High performance on video and multi-image scenarios
| Model | NextQA | MVBench | VideoMME w/o sub | LongVideoBench-val | MLVU | LVBench |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 78.6 | 54.5 | 53.5 | 52.1 | 63.7 | - |
| mPLUG-Owl3-7B-241101 | 82.3 | 59.5 | 59.3 | 59.7 | 70.0 | 43.5 |
| Model | NLVR2 | Mantis-Eval | MathVerse-mv | SciVerse-mv | BLINK | Q-Bench2 |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 90.8 | 63.1 | 65.0 | 86.2 | 50.3 | 74.0 |
| mPLUG-Owl3-7B-241101 | 92.7 | 67.3 | 65.1 | 82.7 | 53.8 | 77.7 |
| Model | VQAv2 | OK-VQA | GQA | VizWizQA | TextVQA |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 82.1 | 60.1 | 65.0 | 63.5 | 69.0 |
| mPLUG-Owl3-7B-241101 | 83.2 | 61.4 | 64.7 | 62.9 | 71.4 |
| Model | MMB-EN | MMB-CN | MM-Vet | POPE | AI2D |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 77.6 | 74.3 | 40.1 | 88.2 | 73.8 |
| mPLUG-Owl3-7B-241101 | 80.4 | 79.1 | 39.8 | 88.1 | 77.8 |
Quick Start
Load mPLUG-Owl3
We currently only support `attn_implementation` in `['sdpa', 'flash_attention_2']`.
```python
import torch
from modelscope import AutoConfig, AutoModel

model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)

# Load the model in bfloat16 with FlashAttention-2 and move it to the GPU.
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
```
Chat with images
```python
from PIL import Image
from modelscope import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# A solid red image as a placeholder input; replace it with your own image.
image = Image.new('RGB', (500, 500), color='red')

messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
Chat with a video
```python
from PIL import Image
from modelscope import AutoTokenizer
from decord import VideoReader, cpu

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample roughly one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
Save memory with Liger-Kernel
mPLUG-Owl3 is based on Qwen2, which can be optimized via the Liger-Kernel to reduce memory usage.
```bash
pip install liger-kernel
```
```python
import torch
from modelscope import AutoConfig, AutoModel


def apply_liger_kernel_to_mplug_owl3(
    rms_norm: bool = True,
    swiglu: bool = True,
    model=None,
) -> None:
    """
    Apply Liger kernels to replace the original implementation in the HuggingFace Qwen2
    model used by mPLUG-Owl3.

    Args:
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is True.
        swiglu (bool): Whether to apply Liger's SwiGLU MLP. Default is True.
        model (PreTrainedModel): The model instance to apply Liger kernels to, if the model
            has already been loaded. Default is None.
    """
    from liger_kernel.transformers.monkey_patch import _patch_rms_norm_module
    from liger_kernel.transformers.monkey_patch import _bind_method_to_module
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP

    base_model = model.language_model.model
    if rms_norm:
        _patch_rms_norm_module(base_model.norm)
    for decoder_layer in base_model.layers:
        if swiglu:
            _bind_method_to_module(
                decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
            )
        if rms_norm:
            _patch_rms_norm_module(decoder_layer.input_layernorm)
            _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
    print("Applied Liger kernels to Qwen2 in mPLUG-Owl3")


model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"

# Patch RMSNorm and SwiGLU in the loaded model with Liger kernels.
apply_liger_kernel_to_mplug_owl3(model=model)
```
Save memory by setting device_map
When you have multiple GPUs, you can set `device_map='auto'` to distribute mPLUG-Owl3 across them. However, this will slow down the inference speed.
```python
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval()

# Use the device of the first layer as the target device for the inputs.
first_layer_name = list(model.hf_device_map.keys())[0]
device = model.hf_device_map[first_layer_name]
```
License
This project is licensed under the Apache-2.0 license.
Documentation
GitHub: mPLUG-Owl
Citation
If you find our work helpful, feel free to cite us.
```bibtex
@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840},
}
```