mPLUG-Owl3-2B-241014: An Open-Source Multimodal Large Model - Quickly Solve the Problem of Understanding Long Image Sequences

Developed by mPLUG

mPLUG-Owl3 is a multimodal large language model built for long image sequence understanding. Its Hyper Attention mechanism speeds up processing and extends the length of the visual sequences the model can handle.

Image-Text-to-Text

Safetensors

English

Open Source License: Apache-2.0

#Long Image Sequence Understanding #Hyper Attention Mechanism #Multimodal Dialogue

Downloads 2,680

Release Time: 10/15/2024

Model Overview

mPLUG-Owl3 is a multimodal large language model designed to handle long image sequence understanding tasks. It enhances processing speed through the Hyper Attention mechanism and can handle longer visual sequences. The model excels in single-image, multi-image, and video tasks.

Model Features

Hyper Attention Mechanism

Hyper Attention speeds up long visual sequence understanding sixfold and lets the model process visual sequences eight times longer.

Multimodal Support

Supports single-image, multi-image, and video tasks, with robust multimodal understanding capabilities.

Efficient Inference

The optimized architecture and implementation ensure high inference efficiency while maintaining high performance.

Model Capabilities

Visual Question Answering

Image Caption Generation

Video Caption Generation

Multimodal Dialogue

Use Cases

Visual Understanding

Image Caption Generation

Input an image, and the model can generate a detailed description.

Generates accurate and detailed image captions.

Video Caption Generation

Input a video, and the model can generate a description of the video content.

Generates coherent and accurate video captions.

Multimodal Dialogue

Dialogue with Images

Users upload an image and engage in dialogue with the model, which can answer questions based on the image content.

Provides accurate answers related to the image content.

Dialogue with Videos

Users upload a video and engage in dialogue with the model, which can answer questions based on the video content.

Provides accurate answers related to the video content.

🚀 mPLUG-Owl3

mPLUG-Owl3 is a state-of-the-art multimodal large language model designed to address the challenges of long image sequence understanding. Its proposed Hyper Attention mechanism speeds up long visual sequence understanding in multimodal large language models sixfold and enables the processing of visual sequences eight times longer, while maintaining excellent performance on single-image, multi-image, and video tasks.

🚀 Quick Start

Load mPLUG-Owl3

Currently, attn_implementation must be one of ['sdpa', 'flash_attention_2'].

import torch
from transformers import AutoConfig, AutoModel

model_path = 'mPLUG/mPLUG-Owl3-2B-241014'

# Optional: inspect the model configuration before loading the weights.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)

# Load the pretrained weights in half precision.
# attn_implementation must be 'sdpa' or 'flash_attention_2'.
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa', torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
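
Since only two attn_implementation values are accepted, it can help to choose between them at runtime. The following is a minimal sketch, not part of the mPLUG-Owl3 code base: pick_attn_implementation is a hypothetical helper, and it assumes that FlashAttention-2 is provided by the flash_attn Python package. It only checks whether that package is importable; it does not verify that the GPU supports it.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return 'flash_attention_2' if flash_attn is importable, otherwise 'sdpa'.

    Hypothetical helper for illustration only.
    """
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_impl = pick_attn_implementation()
print(attn_impl)
```

The returned string can then be passed as attn_implementation=attn_impl to AutoModel.from_pretrained in the loading code above.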

Chat with images

from PIL import Image
from transformers import AutoTokenizer

# Assumes `model` has already been loaded as shown in the loading step above.
model_path = 'mPLUG/mPLUG-Owl3-2B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# A solid red placeholder image; replace it with your own image.
image = Image.new('RGB', (500, 500), color='red')

messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
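
The model is described as supporting multi-image tasks, and the example above pairs one <|image|> placeholder with one entry in the images list. A minimal sketch of how a multi-image prompt might be assembled follows. It rests on an assumption that is not confirmed by this page: that each additional image needs its own <|image|> placeholder, in the same order as the images list. build_image_messages is a hypothetical helper, not part of the model's API.

```python
def build_image_messages(prompt: str, num_images: int) -> list:
    """Build a chat message list with one <|image|> placeholder per image.

    Hypothetical helper; assumes one placeholder per entry in `images`.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    placeholders = "<|image|>\n" * num_images
    return [
        {"role": "user", "content": placeholders + prompt},
        # The empty assistant turn mirrors the single-image example above.
        {"role": "assistant", "content": ""},
    ]


messages = build_image_messages("Compare these images.", 2)
print(messages[0]["content"])
```

Under that assumption, the result would be passed to the processor together with a list of two PIL images, for example processor(messages, images=[image_a, image_b], videos=None), where image_a and image_b are placeholders for your own images.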

Chat with a video

from PIL import Image
from transformers import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

# Assumes `model` has already been loaded as shown in the loading step above.
model_path = 'mPLUG/mPLUG-Owl3-2B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

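# Example path; replace it with the path to your own video file.
videos = ['/nas-mmu-data/examples/car_room.mp4']

# Upper bound on the number of frames passed to the model per video.
MAX_NUM_FRAMES = 16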

def encode_video(video_path):
    """Sample roughly one frame per second, capped at MAX_NUM_FRAMES, as PIL images."""
    def uniform_sample(l, n):
        # Split l into n equal bins and take the item at the middle of each bin.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    # Step between sampled frames: one frame per second of video.
    # max(1, ...) keeps range() valid for videos with a very low frame rate.
    sample_fps = max(1, round(vr.get_avg_fps() / 1))
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(path) for path in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
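
To make the frame sampling in encode_video easier to follow, here is the uniform_sample step run on its own with plain integers, so it needs neither decord nor a video file. The clip parameters (30 fps, 40 seconds) are made-up values for illustration. One candidate frame per second gives 40 candidates, which exceeds the 16-frame cap, so the candidates are thinned to 16 evenly spaced picks.

```python
def uniform_sample(items, n):
    # Split items into n equal bins and take the item at the middle of each bin.
    gap = len(items) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [items[i] for i in idxs]


# Hypothetical clip: 30 fps, 40 seconds long, i.e. 1200 frames in total.
fps, total_frames, max_num_frames = 30, 1200, 16

# One candidate frame index per second of video.
candidates = list(range(0, total_frames, fps))

# Thin the candidates only if there are more than the cap allows.
picked = candidates
if len(candidates) > max_num_frames:
    picked = uniform_sample(candidates, max_num_frames)

print(len(candidates), len(picked), picked)
```

Because each pick comes from the middle of its bin, the selection covers the whole clip rather than clustering at the start, and neither the very first nor the very last candidate is necessarily included.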

📄 License

This project is licensed under the Apache-2.0 license.

📚 Documentation

GitHub: mPLUG-Owl

📖 Citation

If you find our work helpful, please consider citing it.

@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models}, 
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840}, 
}
