# mPLUG-Owl3
mPLUG-Owl3 is a cutting-edge multimodal large language model designed to address the challenges of long image-sequence understanding. By proposing Hyper Attention, it boosts the speed of long visual sequence understanding in multimodal large language models by sixfold and can process visual sequences eight times longer, while maintaining excellent performance on single-image, multi-image, and video tasks.
## Quick Start
### Load the mPLUG-Owl3 model

We currently only support `attn_implementation` in `['sdpa', 'flash_attention_2']`.
```python
import torch
from transformers import AutoConfig, AutoModel

# mPLUGOwl3Config / mPLUGOwl3Model ship with the model repository's remote code,
# so load them through the Auto classes with trust_remote_code=True.
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa',
                                  torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
```
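
If the optional `flash-attn` package is installed, the model can be loaded with FlashAttention-2 instead. A minimal sketch, assuming `flash-attn` is available and a half-precision dtype (which FlashAttention requires):

```python
# Alternative: FlashAttention-2. Assumes the optional `flash-attn` package is
# installed; FlashAttention only runs in fp16/bf16, hence torch_dtype=torch.half.
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2',
                                  torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
```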
### Chat with images
```python
from PIL import Image
from transformers import AutoTokenizer

# Reuses the `model` loaded above.
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

image = Image.new('RGB', (500, 500), color='red')

# Each <|image|> placeholder in the prompt marks where an image is attached.
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
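
The intro above also advertises multi-image tasks. A hedged sketch of multi-image chat, assuming the processor pairs each `<|image|>` placeholder positionally with the `images` list (that pairing is our assumption, not stated above):

```python
# Multi-image sketch: one <|image|> placeholder per image, in order.
# (Assumption: placeholders are matched positionally to the `images` list.)
image_a = Image.new('RGB', (500, 500), color='red')
image_b = Image.new('RGB', (500, 500), color='blue')

messages = [
    {"role": "user", "content": """<|image|><|image|>
What is the difference between these two images?"""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image_a, image_b], videos=None)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
print(model.generate(**inputs))
```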
### Chat with a video
```python
from PIL import Image
from transformers import AutoTokenizer
from decord import VideoReader, cpu

# Reuses the `model` loaded above.
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n indices spread evenly across the list.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(_) for _ in videos]

inputs = processor(messages, images=None, videos=video_frames)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
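
Multi-turn chat is not shown above. A sketch, under the assumption that the full history is replayed through the processor on every turn, with the trailing empty assistant turn acting as the generation slot:

```python
# Multi-turn sketch (assumption: history is re-fed through the processor each
# turn; with decode_text=True the previous call returns decoded text).
reply = g[0] if isinstance(g, list) else g
messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": reply},
    {"role": "user", "content": "What happens at the end?"},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=None, videos=video_frames)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
print(model.generate(**inputs))
```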
## Features
- mPLUG-Owl3 is a state-of-the-art multimodal large language model for long image-sequence understanding.
- It proposes Hyper Attention, which boosts the speed of long visual sequence understanding by sixfold and processes visual sequences eight times longer.
- It maintains excellent performance on single-image, multi-image, and video tasks.
## License

This project is licensed under the Apache-2.0 license.
## Documentation

GitHub: [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl)
## Citation

If you find our work helpful, feel free to cite our paper:
```bibtex
@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840},
}
```