# 🚀 mPLUG-Owl3

mPLUG-Owl3 is a state-of-the-art multi-modal large language model designed to tackle the challenge of long image-sequence understanding. It introduces Hyper Attention, which speeds up long visual-sequence understanding in multi-modal LLMs by six times and enables processing of visual sequences up to eight times longer, while maintaining strong performance on single-image, multi-image, and video tasks.
## 🚀 Quick Start

### Load mPLUG-Owl3

Currently, only `attn_implementation` values in `['sdpa', 'flash_attention_2']` are supported.
```python
import torch
from transformers import AutoConfig, AutoModel

model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
# trust_remote_code=True is required: the repo ships custom model code.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa',
                                  torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
```
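Loading with `flash_attention_2` follows the same pattern. A minimal sketch, assuming the `flash-attn` package is installed and the GPU supports it:

```python
# Sketch: swap SDPA for FlashAttention-2 (assumes flash-attn is installed).
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2',
                                  torch_dtype=torch.half,  # flash-attn needs fp16/bf16
                                  trust_remote_code=True)
model.eval().cuda()
```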
### Chat with images
```python
from PIL import Image
from transformers import AutoTokenizer

model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)  # model from the loading step above

image = Image.new('RGB', (500, 500), color='red')

# <|image|> marks where the image is spliced into the prompt.
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,  # return decoded text instead of token ids
})

g = model.generate(**inputs)
print(g)
```
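The same interface extends to multi-image turns, one of the settings mPLUG-Owl3 targets. A minimal sketch, under the assumption that each `<|image|>` placeholder is paired in order with one entry of the `images` list, as in the single-image call above:

```python
# Sketch: multi-image chat. Assumption: placeholders and the images list
# are matched positionally.
image1 = Image.new('RGB', (500, 500), color='red')
image2 = Image.new('RGB', (500, 500), color='blue')

messages = [
    {"role": "user", "content": """<|image|> <|image|>
What is the difference between these two images?"""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image1, image2], videos=None)
inputs.to('cuda')
inputs.update({'tokenizer': tokenizer, 'max_new_tokens': 100, 'decode_text': True})
print(model.generate(**inputs))
```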
### Chat with videos
```python
from PIL import Image
from transformers import AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)  # model from the loading step above

# <|video|> marks where the sampled video frames are spliced into the prompt.
messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n indices evenly spread across l.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # stride of one second between frames
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
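The sampler above takes roughly one frame per second (the `/ 1` divisor) and caps the result at `MAX_NUM_FRAMES`. A minimal sketch of making that rate an explicit knob; `encode_video_at_rate` and `fps_divisor` are hypothetical names, not part of the model's API:

```python
# Sketch: configurable frame sampling. fps_divisor=1 reproduces the
# one-frame-per-second stride above; fps_divisor=2 samples two frames
# per second. Note: this truncates to the first max_frames frames
# rather than resampling them uniformly.
def encode_video_at_rate(video_path, fps_divisor=1, max_frames=MAX_NUM_FRAMES):
    vr = VideoReader(video_path, ctx=cpu(0))
    stride = max(1, round(vr.get_avg_fps() / fps_divisor))
    frame_idx = list(range(0, len(vr), stride))[:max_frames]
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(v.astype('uint8')) for v in frames]
```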
## 📄 License

This project is licensed under the Apache-2.0 License.
## 📚 Citation

If you find our work helpful, please cite our paper:
```bibtex
@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840},
}
```
GitHub: [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl)