mPLUG-Owl3
mPLUG-Owl3 is a cutting-edge multi-modal large language model designed to tackle the challenges of long image-sequence understanding. It introduces Hyper Attention, which speeds up long visual sequence understanding in multimodal large language models by a factor of six and allows visual sequences that are eight times longer to be processed, while maintaining excellent performance on single-image, multi-image, and video tasks.
Features
Fused Hyper Attention
Previously, mPLUG-Owl3 needed separate calculations for cross-attention and self-attention and fused their outputs via an adaptive gate. Now, a unified operation is used, requiring only one attention computation.
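Conceptually (this is only an illustrative sketch, not the model's actual implementation, and every tensor name below is made up), the fused operation concatenates the visual and textual keys/values so that one attention call serves both the self-attention and the cross-attention path:

```python
# A minimal, illustrative sketch of the fused-attention idea (not the actual
# mPLUG-Owl3 code): visual and textual keys/values are concatenated so that a
# single scaled_dot_product_attention call covers both paths.
import torch
import torch.nn.functional as F

def fused_hyper_attention(q, k_text, v_text, k_img, v_img, text_mask, img_mask):
    # q:              (batch, heads, L_text, head_dim) queries from text tokens
    # k_text, v_text: (batch, heads, L_text, head_dim)
    # k_img, v_img:   (batch, heads, L_img, head_dim) keys/values from visual features
    # text_mask:      (batch, 1, L_text, L_text) boolean, e.g. a causal mask
    # img_mask:       (batch, 1, L_text, L_img) boolean, which image tokens each text token may see
    k = torch.cat([k_img, k_text], dim=2)            # one fused key sequence
    v = torch.cat([v_img, v_text], dim=2)            # one fused value sequence
    mask = torch.cat([img_mask, text_mask], dim=-1)  # fused attention mask
    # A single attention computation replaces separate self- and cross-attention.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```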
New template for media inputs
The following format is used to represent a split high-resolution image. Image splitting can additionally be enabled for multi-image inputs to gain further performance, a combination the previous mPLUG-Owl3 version was not trained for.
```
<|start_cut|>2*3
<|image|> <|image|> <|image|>
<|image|> <|image|> <|image|>
<|image|><|end_cut|>
```
The following format represents video:
```
<|start_video_frame|><|image|><|image|><|image|><|end_video_frame|>
```
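As a rough illustration only (the processor shipped with the model builds these strings internally; the helpers below are hypothetical), the two templates can be assembled like this:

```python
# Hypothetical helpers that reproduce the templates shown above.
def cut_image_template(rows: int, cols: int) -> str:
    grid = "\n".join(" ".join("<|image|>" for _ in range(cols)) for _ in range(rows))
    # The trailing <|image|> before <|end_cut|> mirrors the template above.
    return f"<|start_cut|>{rows}*{cols}\n{grid}\n<|image|><|end_cut|>"

def video_template(num_frames: int) -> str:
    return "<|start_video_frame|>" + "<|image|>" * num_frames + "<|end_video_frame|>"

print(cut_image_template(2, 3))  # the 2*3 split-image example
print(video_template(3))         # a three-frame video
```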
Adjusted media_offset
Previously, media_offset recorded the image range each token could access. During training, as images from multiple samples were concatenated along the batch dimension, media_offset had to be carefully modified to avoid pointing to the wrong image. Now, media_offset is a List[List[int]], indicating each image's position in the original sequence within the batch. This design also streamlines the computation of cross-attention masks and MI-Rope.
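For example (the numbers below are purely illustrative), a batch whose first sample contains two images and whose second sample contains one image would carry a media_offset of the following shape:

```python
# Illustrative values only: each inner list gives the positions of that sample's
# <|image|> tokens within its own original sequence, so no cross-sample
# adjustment is needed when samples are concatenated along the batch dimension.
media_offset = [
    [5, 42],  # sample 0: two images at sequence positions 5 and 42
    [7],      # sample 1: one image at sequence position 7
]
```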
High performance on video and multi-image scenarios
| Model | NextQA | MVBench | VideoMME w/o sub | LongVideoBench-val | MLVU | LVBench |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 78.6 | 54.5 | 53.5 | 52.1 | 63.7 | - |
| mPLUG-Owl3-7B-241101 | 82.3 | 59.5 | 59.3 | 59.7 | 70.0 | 43.5 |
| Model | NLVR2 | Mantis-Eval | MathVerse-mv | SciVerse-mv | BLINK | Q-Bench2 |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 90.8 | 63.1 | 65.0 | 86.2 | 50.3 | 74.0 |
| mPLUG-Owl3-7B-241101 | 92.7 | 67.3 | 65.1 | 82.7 | 53.8 | 77.7 |
| Model | VQAv2 | OK-VQA | GQA | VizWizQA | TextVQA |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 82.1 | 60.1 | 65.0 | 63.5 | 69.0 |
| mPLUG-Owl3-7B-241101 | 83.2 | 61.4 | 64.7 | 62.9 | 71.4 |
| Model | MMB-EN | MMB-CN | MM-Vet | POPE | AI2D |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 77.6 | 74.3 | 40.1 | 88.2 | 73.8 |
| mPLUG-Owl3-7B-241101 | 80.4 | 79.1 | 39.8 | 88.1 | 77.8 |
Quick Start
Load mPLUG-Owl3
We currently only support `attn_implementation` in `['sdpa', 'flash_attention_2']`.
```python
import torch
from modelscope import AutoConfig, AutoModel

model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)

# Load the model in bfloat16 with FlashAttention-2 and move it to the GPU.
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
```
Chat with images
```python
from PIL import Image
from modelscope import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# A solid red image as a placeholder input; replace it with your own image.
image = Image.new('RGB', (500, 500), color='red')

messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
Chat with a video
```python
from PIL import Image
from modelscope import AutoTokenizer
from decord import VideoReader, cpu

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample roughly one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
Save memory with Liger-Kernel
mPLUG-Owl3 is based on Qwen2, which can be optimized via the Liger-Kernel to reduce memory usage.
```bash
pip install liger-kernel
```
```python
import torch
from modelscope import AutoConfig, AutoModel


def apply_liger_kernel_to_mplug_owl3(
    rms_norm: bool = True,
    swiglu: bool = True,
    model=None,
) -> None:
    """
    Apply Liger kernels to replace the original implementation in the HuggingFace Qwen2
    model used by mPLUG-Owl3.

    Args:
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is True.
        swiglu (bool): Whether to apply Liger's SwiGLU MLP. Default is True.
        model (PreTrainedModel): The model instance to apply Liger kernels to, if the model
            has already been loaded. Default is None.
    """
    from liger_kernel.transformers.monkey_patch import _patch_rms_norm_module
    from liger_kernel.transformers.monkey_patch import _bind_method_to_module
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP

    base_model = model.language_model.model
    if rms_norm:
        _patch_rms_norm_module(base_model.norm)
    for decoder_layer in base_model.layers:
        if swiglu:
            _bind_method_to_module(
                decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
            )
        if rms_norm:
            _patch_rms_norm_module(decoder_layer.input_layernorm)
            _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
    print("Applied Liger kernels to Qwen2 in mPLUG-Owl3")


model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"

# Patch RMSNorm and SwiGLU in the loaded model with Liger kernels.
apply_liger_kernel_to_mplug_owl3(model=model)
```
Save memory by setting device_map
When you have multiple GPUs, you can set `device_map='auto'` to distribute mPLUG-Owl3 across them. However, this will slow down the inference speed.
```python
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval()

# Use the device of the first layer as the target device for the inputs.
first_layer_name = list(model.hf_device_map.keys())[0]
device = model.hf_device_map[first_layer_name]
```
License
This project is licensed under the Apache-2.0 license.
Documentation
GitHub: mPLUG-Owl
Citation
If you find our work helpful, feel free to cite us.
```bibtex
@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840},
}
```