mPLUG-Owl3-7B-241101: an open-source multimodal large model for efficient long image-sequence understanding

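mPLUG-Owl3-7B-241101

Developed by mPLUG

mPLUG-Owl3 is an advanced multimodal large language model focused on long image-sequence understanding. Its hyper attention mechanism substantially increases processing speed and the sequence length the model can handle.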

Image-Text-to-Text

Safetensors

Language: English · License: Apache-2.0 · #hyper-attention #long-sequence-visual-understanding #multimodal-large-model

Downloads: 302

Release date: 11/26/2024

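Model Overview

mPLUG-Owl3 is designed to process long visual sequences. It supports single-image, multi-image, and video tasks and performs strongly on all three.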

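Model Features

Hyper attention mechanism

Speeds up long visual-sequence understanding in multimodal large language models by a factor of six and supports visual sequences eight times longer.

Multimodal support

Handles single-image, multi-image, and video tasks while maintaining strong performance.

Optimized media input template

Adds support for image splitting with multi-image input and uses a unified operation to simplify the attention computation.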

Model Capabilities

Long image-sequence understanding

Multimodal question answering

Video content analysis

Multi-image processing

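Use Cases

Video understanding

Video question answering

Answers questions about video content

Reaches 82.3% accuracy on the NextQA dataset

Multi-image understanding

Multi-image reasoning

Reasons jointly over multiple images

Reaches 92.7% accuracy on the NLVR2 dataset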

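🚀 mPLUG-Owl3

mPLUG-Owl3 is an advanced multimodal large language model built to address the challenge of long image-sequence understanding. Its proposed hyper attention mechanism makes long visual-sequence understanding in multimodal large language models six times faster and allows visual sequences eight times longer to be processed. At the same time, it maintains strong performance on single-image, multi-image, and video tasks.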

🚀 Quick Start

Load mPLUG-Owl3

Currently, attn_implementation supports only 'sdpa' and 'flash_attention_2'.

import torch
from modelscope import AutoConfig, AutoModel

# ModelScope ID of the 7B checkpoint described on this page.
model_path = 'iic/mPLUG-Owl3-7B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
# attn_implementation must be 'sdpa' or 'flash_attention_2'.
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
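
Because only two attention backends are accepted, it can be convenient to choose one at runtime. The helper below is a minimal sketch, not part of mPLUG-Owl3: `pick_attn_implementation` is a hypothetical name, and it only checks whether the `flash_attn` package can be imported, falling back to 'sdpa' otherwise. It does not check that the GPU itself supports FlashAttention.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return 'flash_attention_2' if flash_attn is importable, else 'sdpa'."""
    # find_spec returns None when the package is not installed.
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if (prefer_flash and has_flash) else "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can be passed as the attn_implementation argument of AutoModel.from_pretrained in the loading code above.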

Chat with an image

from PIL import Image
from modelscope import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# A solid red placeholder image; replace it with your own PIL image.
image = Image.new('RGB', (500, 500), color='red')

# The <|image|> placeholder marks where the image goes in the prompt.
# The empty assistant turn is where the model writes its reply.
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
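
For prompts with more than one image, the messages list can be built programmatically. The following is a sketch under assumptions: `build_image_messages` is a hypothetical helper, and placing one `<|image|>` placeholder per line before the question is a guess that generalizes the single-image example above, not a documented requirement.

```python
IMAGE_PLACEHOLDER = "<|image|>"


def build_image_messages(prompt: str, num_images: int) -> list:
    """Build a user turn with one placeholder per image plus an empty assistant turn."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    placeholders = "\n".join([IMAGE_PLACEHOLDER] * num_images)
    # With no images, the user content is just the prompt text.
    content = f"{placeholders}\n{prompt}" if num_images else prompt
    return [
        {"role": "user", "content": content},
        {"role": "assistant", "content": ""},
    ]


messages = build_image_messages("Describe this image.", 1)
print(messages[0]["content"])
```

With num_images=1 this reproduces the user content of the example above; the number of placeholders should match the length of the images list passed to the processor.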

Chat with a video

from PIL import Image
from modelscope import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# The <|video|> placeholder marks where the video goes in the prompt.
messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    """Decode a video into at most MAX_NUM_FRAMES PIL images."""
    def uniform_sample(l, n):
        # Take the midpoint of each of n equal-width bins over l.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    # Step between sampled frames: roughly one frame per second of video.
    # max(1, ...) avoids a zero step for very low frame rates.
    sample_fps = max(1, round(vr.get_avg_fps() / 1))
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(path) for path in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
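
The frame selection in encode_video is easier to follow on plain numbers. The sketch below copies the uniform_sample logic from the code above and applies it to a hypothetical 60-second clip at 30 fps, so no video file or decord install is needed.

```python
def uniform_sample(items, n):
    """Pick n items by taking the midpoint of each of n equal-width bins."""
    gap = len(items) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [items[i] for i in idxs]


# One candidate frame per second: indices 0, 30, 60, ..., 1770 (60 candidates).
candidates = list(range(0, 1800, 30))
picked = uniform_sample(candidates, 16)
print(len(picked), picked[:3])  # 16 [30, 150, 270]
```

Sampling bin midpoints rather than bin starts keeps the selected frames away from the very first and very last frame, which spreads them more evenly over the clip.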

Save memory with Liger-Kernel

mPLUG-Owl3 is built on Qwen2 and can be optimized with Liger-Kernel to reduce memory usage.

pip install liger-kernel

def apply_liger_kernel_to_mplug_owl3(
    rms_norm: bool = True,
    swiglu: bool = True,
    model = None,
) -> None:
    """
    Apply Liger kernels to replace the original implementation in the HuggingFace Qwen2 model inside mPLUG-Owl3.

    Args:
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is True.
        swiglu (bool): Whether to apply Liger's SwiGLU MLP. Default is True.
        model (PreTrainedModel): The already loaded mPLUG-Owl3 model instance to patch.
        Required, even though the default is None.
    """
    from liger_kernel.transformers.monkey_patch import _patch_rms_norm_module
    from liger_kernel.transformers.monkey_patch import _bind_method_to_module
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP

    if model is None:
        raise ValueError("Pass an already loaded mPLUG-Owl3 model via `model=`.")

    # The Qwen2 decoder stack wrapped by mPLUG-Owl3.
    base_model = model.language_model.model

    if rms_norm:
        _patch_rms_norm_module(base_model.norm)

    for decoder_layer in base_model.layers:
        if swiglu:
            _bind_method_to_module(
                decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
            )
        if rms_norm:
            _patch_rms_norm_module(decoder_layer.input_layernorm)
            _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
    print("Applied Liger kernels to Qwen2 in mPLUG-Owl3")

import torch
from modelscope import AutoConfig, AutoModel

# ModelScope ID of the 7B checkpoint described on this page.
model_path = 'iic/mPLUG-Owl3-7B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
# Patch the loaded model in place.
apply_liger_kernel_to_mplug_owl3(model=model)

Save memory by setting device_map

When you have multiple GPUs, you can set device_map='auto' to split mPLUG-Owl3 across them. This does, however, slow down inference.

model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval()
first_layer_name = list(model.hf_device_map.keys())[0]
device = model.hf_device_map[first_layer_name]
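
The snippet above takes the device of the first entry in hf_device_map so that inputs land on the same device as the model's first module. The sketch below shows that lookup on a plain dictionary. `first_device` is a hypothetical helper and the module names in `example_map` are made up for illustration; the only behaviour relied on is that the map stores GPUs as integer indices and other targets as strings such as "cpu".

```python
def first_device(hf_device_map: dict) -> str:
    """Return the device of the first mapped module as a torch-style string."""
    first_value = next(iter(hf_device_map.values()))
    # Integer entries are GPU indices; string entries are used as they are.
    if isinstance(first_value, int):
        return f"cuda:{first_value}"
    return str(first_value)


# Hypothetical layout of a model split over two GPUs.
example_map = {
    "vision_model": 0,
    "language_model.model.layers.0": 0,
    "language_model.model.layers.27": 1,
}
print(first_device(example_map))  # cuda:0
```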

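✨ Key Features

Fused hyper attention

mPLUG-Owl3 previously had to compute cross-attention and self-attention separately and fuse their outputs through an adaptive gate. It now uses a unified operation that computes attention only once.

New media input template

Use the following format to represent a high-resolution image that has been split into crops. In addition, image splitting can now be enabled when the input contains multiple images, which gives better performance; older versions of mPLUG-Owl3 were not trained on this combination.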
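
To make the difference concrete, here is a toy, pure-Python sketch for a single query vector. It is not the model's code. `gated_two_pass` mirrors the older scheme described above (two attention calls blended by a gate), while `fused_single_pass` shows one possible way to get a single attention call, by joining image and text keys/values into one sequence. That joining step is an assumption made for illustration; the text above does not specify how the unified operation is implemented.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def gated_two_pass(query, text_kv, image_kv, gate):
    """Two attention calls (self and cross) blended by a gate in [0, 1]."""
    self_out = attend(query, *text_kv)
    cross_out = attend(query, *image_kv)
    return [gate * c + (1.0 - gate) * s for c, s in zip(cross_out, self_out)]


def fused_single_pass(query, text_kv, image_kv):
    """One attention call over image and text keys/values joined together."""
    keys = image_kv[0] + text_kv[0]
    values = image_kv[1] + text_kv[1]
    return attend(query, keys, values)


query = [1.0, 0.0]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])  # (keys, values)
image_kv = ([[0.5, 0.5]], [[2.0, 2.0]])                         # (keys, values)
print(gated_two_pass(query, text_kv, image_kv, gate=0.5))
print(fused_single_pass(query, text_kv, image_kv))
```

In the single-pass version the softmax itself decides how much weight goes to image versus text positions, so no separate gate or second attention call is needed.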

<|start_cut|>2*3
<|image|> <|image|> <|image|>
<|image|> <|image|> <|image|>
<|image|><|end_cut|>

Use the following format to represent a video.

<|start_video_frame|><|image|><|image|><|image|><|end_video_frame|>
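
The two templates above can be generated from a grid size or a frame count. The sketch below is illustrative only, since the processor builds these strings for you. `cut_template` and `video_template` are hypothetical helpers, and reading "2*3" as rows*columns is inferred from the example, which has two lines of three placeholders followed by one more `<|image|>`.

```python
def cut_template(rows: int, cols: int) -> str:
    """Render the split-image template for a rows x cols grid of crops."""
    grid = "\n".join(" ".join(["<|image|>"] * cols) for _ in range(rows))
    # The example ends with one extra <|image|> before <|end_cut|>.
    return f"<|start_cut|>{rows}*{cols}\n{grid}\n<|image|><|end_cut|>"


def video_template(num_frames: int) -> str:
    """Render the video template with one <|image|> per frame."""
    return "<|start_video_frame|>" + "<|image|>" * num_frames + "<|end_video_frame|>"


print(cut_template(2, 3))
print(video_template(3))
```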

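Adjusted media_offset

Previously, media_offset recorded the range of images that each token could see. During training, the images of multiple samples are concatenated along the batch dimension, so media_offset had to be adjusted carefully or it would point to the wrong images. To prevent this, media_offset is now a List[List[int]] that gives, for each image in a sample, its position in that sample's original sequence within the batch. This design also makes computing the cross-attention mask and MI-Rope more efficient and convenient.

All of these changes are handled by the processor, so you do not need to change how you call it.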
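
As a rough illustration of the List[List[int]] layout, the toy sketch below records, for each sample in a batch, the positions at which image placeholders occur in that sample's own sequence. It is an assumption-laden simplification: `compute_media_offset` is a hypothetical function, the sequences are word-level strings rather than token IDs, and the real processor may define the positions differently.

```python
from typing import List

IMAGE_TOKEN = "<|image|>"


def compute_media_offset(batch_tokens: List[List[str]]) -> List[List[int]]:
    """For each sample, list the positions of its image placeholders."""
    return [
        [pos for pos, tok in enumerate(tokens) if tok == IMAGE_TOKEN]
        for tokens in batch_tokens
    ]


batch = [
    ["<|image|>", "Describe", "this", "image", "."],
    ["Compare", "<|image|>", "and", "<|image|>", "."],
]
print(compute_media_offset(batch))  # [[0], [1, 3]]
```

Because each inner list refers only to its own sample, adding or removing other samples in the batch does not change it, which is the property the description above relies on.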

Strong performance in video and multi-image scenarios

Model	NextQA	MVBench	VideoMME w/o sub	LongVideoBench-val	MLVU	LVBench
mPLUG-Owl3-7B-240728	78.6	54.5	53.5	52.1	63.7	-
mPLUG-Owl3-7B-241101	82.3	59.5	59.3	59.7	70.0	43.5

Model	NLVR2	Mantis-Eval	MathVerse-mv	SciVerse-mv	BLINK	Q-Bench2
mPLUG-Owl3-7B-240728	90.8	63.1	65.0	86.2	50.3	74.0
mPLUG-Owl3-7B-241101	92.7	67.3	65.1	82.7	53.8	77.7

Model	VQAv2	OK-VQA	GQA	VizWizQA	TextVQA
mPLUG-Owl3-7B-240728	82.1	60.1	65.0	63.5	69.0
mPLUG-Owl3-7B-241101	83.2	61.4	64.7	62.9	71.4

Model	MMB-EN	MMB-CN	MM-Vet	POPE	AI2D
mPLUG-Owl3-7B-240728	77.6	74.3	40.1	88.2	73.8
mPLUG-Owl3-7B-241101	80.4	79.1	39.8	88.1	77.8
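
To see the size of the change between the two releases at a glance, the short script below computes per-benchmark differences for the video table. The scores are copied from the first table above; LVBench is left out because the 240728 release has no reported score there.

```python
benchmarks = ["NextQA", "MVBench", "VideoMME w/o sub", "LongVideoBench-val", "MLVU"]
v240728 = [78.6, 54.5, 53.5, 52.1, 63.7]  # mPLUG-Owl3-7B-240728
v241101 = [82.3, 59.5, 59.3, 59.7, 70.0]  # mPLUG-Owl3-7B-241101

# Difference in points, rounded to one decimal to hide float noise.
deltas = {
    name: round(new - old, 1)
    for name, old, new in zip(benchmarks, v240728, v241101)
}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

The same pattern applies to the other three tables, where a few benchmarks (for example SciVerse-mv and GQA) show a lower score for the newer release.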

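📄 License

This project is released under the Apache-2.0 license.

📚 Documentation

Project links

GitHub: mPLUG-Owl

📚 Citation

If you find our work helpful, please cite the following paper: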

@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models}, 
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840}, 
}