# mPLUG-Owl3
mPLUG-Owl3 is a cutting-edge multimodal large language model designed to address the challenges of long image-sequence understanding. By proposing Hyper Attention, it boosts the speed of long visual sequence understanding in multimodal large language models by sixfold and can process visual sequences eight times longer, while maintaining excellent performance on single-image, multi-image, and video tasks.
## Quick Start
### Load the mPLUG-Owl3 model

We currently only support `attn_implementation` in `['sdpa', 'flash_attention_2']`.
```python
import torch
from transformers import AutoConfig, AutoModel

# mPLUGOwl3Config / mPLUGOwl3Model ship with the model repository's remote code,
# so load them through the Auto classes with trust_remote_code=True.
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa',
                                  torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
```
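
If the optional `flash-attn` package is installed, the model can be loaded with FlashAttention-2 instead. A minimal sketch, assuming `flash-attn` is available and a half-precision dtype (which FlashAttention requires):

```python
# Alternative: FlashAttention-2. Assumes the optional `flash-attn` package is
# installed; FlashAttention only runs in fp16/bf16, hence torch_dtype=torch.half.
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2',
                                  torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
```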
### Chat with images
```python
from PIL import Image
from transformers import AutoTokenizer

# Reuses the `model` loaded above.
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

image = Image.new('RGB', (500, 500), color='red')

# Each <|image|> placeholder in the prompt marks where an image is attached.
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
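
The intro above also advertises multi-image tasks. A hedged sketch of multi-image chat, assuming the processor pairs each `<|image|>` placeholder positionally with the `images` list (that pairing is our assumption, not stated above):

```python
# Multi-image sketch: one <|image|> placeholder per image, in order.
# (Assumption: placeholders are matched positionally to the `images` list.)
image_a = Image.new('RGB', (500, 500), color='red')
image_b = Image.new('RGB', (500, 500), color='blue')

messages = [
    {"role": "user", "content": """<|image|><|image|>
What is the difference between these two images?"""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image_a, image_b], videos=None)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
print(model.generate(**inputs))
```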
### Chat with a video
```python
from PIL import Image
from transformers import AutoTokenizer
from decord import VideoReader, cpu

# Reuses the `model` loaded above.
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n indices spread evenly across the list.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(_) for _ in videos]

inputs = processor(messages, images=None, videos=video_frames)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
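
Multi-turn chat is not shown above. A sketch, under the assumption that the full history is replayed through the processor on every turn, with the trailing empty assistant turn acting as the generation slot:

```python
# Multi-turn sketch (assumption: history is re-fed through the processor each
# turn; with decode_text=True the previous call returns decoded text).
reply = g[0] if isinstance(g, list) else g
messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": reply},
    {"role": "user", "content": "What happens at the end?"},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=None, videos=video_frames)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
print(model.generate(**inputs))
```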
## Features
- mPLUG-Owl3 is a state-of-the-art multimodal large language model for long image-sequence understanding.
- It proposes Hyper Attention, which boosts the speed of long visual sequence understanding by sixfold and processes visual sequences eight times longer.
- It maintains excellent performance on single-image, multi-image, and video tasks.
## License

This project is licensed under the Apache-2.0 license.
## Documentation

GitHub: [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl)
## Citation

If you find our work helpful, feel free to cite our paper:
```bibtex
@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840},
}
```