mPLUG-Owl3-1B-241014 Open-Source Multimodal Large Model - Rapidly Solve the Challenge of Understanding Long Image Sequences

mPLUG-Owl3-1B-241014

Developed by mPLUG

mPLUG-Owl3 is an advanced multimodal large language model focused on addressing the challenges of long image sequence understanding, significantly improving processing speed and sequence length through the Hyper Attention mechanism.

Image-Text-to-Text

Safetensors

Language: English
Open Source License: Apache-2.0
Tags: #Hyper Attention Mechanism #Long Visual Sequence Understanding #Multimodal Dialogue

Downloads 617

Release Time: 10/15/2024

Model Overview

mPLUG-Owl3 is a multimodal large language model designed to tackle the challenges of long image sequence understanding. It enhances processing speed via the Hyper Attention mechanism and can handle longer visual sequences while maintaining excellent performance in single-image, multi-image, and video tasks.

Model Features

Hyper Attention Mechanism

The Hyper Attention mechanism makes long visual sequence understanding six times faster and lets the model handle visual sequences eight times longer.

Multimodal Support

Supports single-image, multi-image, and video tasks with robust multimodal understanding capabilities.

Efficient Processing

Significantly enhances the efficiency of processing long visual sequences while maintaining high performance.

Model Capabilities

Image Captioning

Video Captioning

Multimodal Dialogue

Long Sequence Visual Understanding

Use Cases

Visual Question Answering

Image Captioning

Users upload an image, and the model generates a description of the image.

Produces accurate and detailed image descriptions.

Video Captioning

Users upload a video, and the model generates a description of the video.

Produces accurate and detailed video descriptions.

Multimodal Dialogue

Dialogue with Images

Users upload an image and engage in a dialogue with the model, which answers questions based on the image content.

Provides accurate answers related to the image content.

Dialogue with Videos

Users upload a video and engage in a dialogue with the model, which answers questions based on the video content.

Provides accurate answers related to the video content.

🚀 mPLUG-Owl3

mPLUG-Owl3 is a cutting-edge multimodal large language model designed to address the difficulties of long image sequence understanding. It introduces Hyper Attention, which makes long visual sequence understanding in multimodal large language models six times faster and enables the processing of visual sequences eight times longer. It also maintains strong performance on single-image, multi-image, and video tasks.

GitHub: mPLUG-Owl

🚀 Quick Start

Load mPLUG-Owl3

We currently support only two values for attn_implementation: 'sdpa' and 'flash_attention_2'.

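import torch
from transformers import AutoConfig, AutoModel

model_path = 'mPLUG/mPLUG-Owl3-1B-241014'

# Inspect the model configuration (the repository ships custom code, hence trust_remote_code).
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)

# Load the pretrained weights in half precision; attn_implementation must be 'sdpa' or 'flash_attention_2'.
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa', torch_dtype=torch.half, trust_remote_code=True)
# Switch to inference mode and move the model to the GPU.
model.eval().cuda()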
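
The loading code above passes attn_implementation='sdpa' directly. As a minimal sketch, the choice between the two supported values could instead be made at runtime by checking whether the flash_attn package is importable. The helper name pick_attn_implementation is hypothetical and not part of the model's code. The assumption that 'flash_attention_2' depends on the flash_attn package comes from general Transformers usage, not from this model card.

```python
import importlib.util

# The two values this model card lists as supported.
SUPPORTED_ATTN = ('sdpa', 'flash_attention_2')

def pick_attn_implementation(prefer_flash=True):
    """Return 'flash_attention_2' when the flash_attn package is importable, else 'sdpa'."""
    has_flash = importlib.util.find_spec('flash_attn') is not None
    return 'flash_attention_2' if (prefer_flash and has_flash) else 'sdpa'

attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string could then be passed as the attn_implementation argument of AutoModel.from_pretrained in place of the hard-coded value.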

Chat with images

from PIL import Image
from transformers import AutoTokenizer

# `model` is the object created in the loading step above.
model_path = 'mPLUG/mPLUG-Owl3-1B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# A solid red placeholder image; replace it with Image.open(...) for a real picture.
image = Image.new('RGB', (500, 500), color='red')

# Each <|image|> placeholder marks where an image is inserted into the prompt.
# The empty assistant turn is where the model writes its reply.
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

# Move the inputs to the GPU and add the generation settings.
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
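
The example above pairs one <|image|> placeholder with one image. Assuming the same one-placeholder-per-image convention extends to multi-image prompts (this model card does not show that case), the message list could be built with a small helper. build_image_messages is a hypothetical name used only for illustration.

```python
def build_image_messages(question, num_images):
    """Build a chat message list with one <|image|> placeholder per image."""
    if num_images < 1:
        raise ValueError('num_images must be at least 1')
    # One placeholder per line, followed by the question text.
    placeholders = '\n'.join(['<|image|>'] * num_images)
    return [
        {"role": "user", "content": f"{placeholders}\n{question}"},
        # The empty assistant turn mirrors the single-image example.
        {"role": "assistant", "content": ""},
    ]

messages = build_image_messages("What differs between these images?", num_images=2)
print(messages[0]["content"])
```

Under that assumption, the list would be passed to the processor together with an images list of the same length, in the same order as the placeholders.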

Chat with a video

from PIL import Image
from transformers import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

# `model` is the object created in the loading step above.
model_path = 'mPLUG/mPLUG-Owl3-1B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# Each <|video|> placeholder marks where a video is inserted into the prompt.
messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    """Decode a video into at most MAX_NUM_FRAMES PIL images, sampled about once per second."""
    def uniform_sample(items, n):
        # Pick n items at the centers of n equal-width bins over the list.
        gap = len(items) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [items[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    # Step of roughly one second of video; at least 1 so range() never gets a zero step.
    sample_fps = max(1, round(vr.get_avg_fps()))
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

# One list of frames per input video.
video_frames = [encode_video(path) for path in videos]
inputs = processor(messages, images=None, videos=video_frames)

# Move the inputs to the GPU and add the generation settings.
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
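
The frame selection inside encode_video can be tried without a video file or a GPU. The sketch below isolates that logic: take roughly one frame per second, then, if that yields more than MAX_NUM_FRAMES candidates, keep the ones at the centers of equal-width bins. select_frame_indices is a hypothetical helper name, the clip dimensions are made up for illustration, and the max(1, ...) guard against a zero step is an addition in this sketch.

```python
MAX_NUM_FRAMES = 16

def uniform_sample(items, n):
    """Pick n items at the centers of n equal-width bins over the list."""
    gap = len(items) / n
    return [items[int(i * gap + gap / 2)] for i in range(n)]

def select_frame_indices(total_frames, avg_fps, max_frames=MAX_NUM_FRAMES):
    """Take about one frame per second, then cap the count at max_frames."""
    step = max(1, round(avg_fps))  # avoid a zero step for very low frame rates
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        indices = uniform_sample(indices, max_frames)
    return indices

# A hypothetical 30-second clip at 30 fps: 900 frames, 30 one-per-second candidates.
indices = select_frame_indices(total_frames=900, avg_fps=30.0)
print(len(indices), indices[0], indices[-1])
```

Sampling bin centers rather than the first MAX_NUM_FRAMES candidates keeps the selected frames spread over the whole clip, so long videos are not truncated to their opening seconds.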

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

If you find our work helpful, please consider citing it.

@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models}, 
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840}, 
}
