MiniMax-VL-01: an open-source multimodal large model with dynamic-resolution image processing for vision-language tasks

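Developed by MiniMaxAI

MiniMax-VL-01 is a multimodal large language model built on the "ViT-MLP-LLM" framework. It handles images at dynamic resolution and scores well on a range of vision-language tasks.

Image-to-Text

Safetensors

#Dynamic-resolution visual understanding #Multimodal large language model #Complex chart parsing

Downloads: 237

Released: 1/12/2025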

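Model Overview

The model combines a vision transformer (ViT), an MLP projector, and a base large language model. It accepts image inputs at dynamic resolutions from 336×336 to 2016×2016 and reaches top-level performance on multimodal tasks.

Model Features

Dynamic resolution processing

Accepts inputs from 336×336 to 2016×2016, keeps a thumbnail, and encodes the image as separate tiles

Large-scale training

The vision transformer was trained on 694 million image-caption pairs; the four-stage training pipeline processed 512 billion tokens in total

Multimodal capability

Combines visual and language understanding and performs well on complex multimodal tasks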

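Model Capabilities

Image understanding

Visual question answering

Document analysis

Chart understanding

Mathematical reasoning

Scientific question answering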

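Use Cases

Education

Scientific question answering

Answers science questions that involve charts and formulas

Scores 68.5 on MMMU and 52.7 on MMMU-Pro

Document processing

Document question answering

Extracts information from documents and answers questions about them

Scores 96.4 on the DocVQA benchmark

Data analysis

Chart understanding

Analyzes and explains chart data

Scores 91.7 on the ChartQA benchmark (relaxed setting)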

🚀 MiniMax-VL-01

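MiniMax-VL-01 adopts the "ViT-MLP-LLM" framework, a commonly used design for multimodal large language models. The model is initialized and trained from three parts: a vision transformer for visual encoding, an MLP projector for image adaptation, and a base large language model. It supports dynamic resolution and reaches top-level performance on multimodal leaderboards, which demonstrates its strength and reliability on complex multimodal tasks.

🚀 Quick Start

The following simple example loads the processor (which wraps the tokenizer) and the model, then generates a response.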

from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig, QuantoConfig, GenerationConfig
import torch
import json
import os
from PIL import Image

# load hf config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-VL-01", trust_remote_code=True)

# quantization config, int8 is recommended
quantization_config =  QuantoConfig(
            weights="int8",
            modules_to_not_convert=[
                "vision_tower",
                "image_newline",
                "multi_modal_projector",
                "lm_head",
                "embed_tokens",
            ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.text_config.num_hidden_layers)]
            + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.text_config.num_hidden_layers)]
        )

# set device map
# (reads the weight index from a local copy of the model repo under ./MiniMax-VL-01)
model_safetensors_index_path = os.path.join("MiniMax-VL-01", "model.safetensors.index.json")
with open(model_safetensors_index_path, "r") as f:
    model_safetensors_index = json.load(f)
weight_map = model_safetensors_index['weight_map']
vision_map = {}
for key, value in weight_map.items():
    if 'vision_tower' in key or 'image_newline' in key or 'multi_modal_projector' in key:
        new_key = key.replace('.weight','').replace('.bias','')
        if new_key not in vision_map:
            vision_map[new_key] = value
# assume 8 GPUs
world_size = 8
device_map = {
    'language_model.model.embed_tokens': 'cuda:0',
    'language_model.model.norm': f'cuda:{world_size - 1}',
    'language_model.lm_head': f'cuda:{world_size - 1}'
}
for key, value in vision_map.items():
    device_map[key] = f'cuda:0'
device_map['vision_tower.vision_model.post_layernorm'] = f'cuda:0'
layers_per_device = hf_config.text_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'language_model.model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# load processor
processor = AutoProcessor.from_pretrained("MiniMaxAI/MiniMax-VL-01", trust_remote_code=True)
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-VL-01 model."}]},
    {"role": "user", "content": [{"type": "image", "image": "placeholder"},{"type": "text", "text": "Describe this image."}]},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
raw_image = Image.open("figures/image.jpg")
# tokenize and move to device
model_inputs = processor(images=[raw_image], text=prompt, return_tensors='pt').to('cuda').to(torch.bfloat16)

# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-VL-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)
generation_config = GenerationConfig(
    max_new_tokens=100,
    eos_token_id=200020,
    use_cache=True,
)

# generate response
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"response: {response}")
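
The device map above assigns `num_hidden_layers // world_size` consecutive decoder layers to each GPU. A minimal sketch of that layer-placement step as a standalone function follows. `build_layer_device_map` is a hypothetical helper, not part of transformers, and the layer count of 80 in the usage lines is only an illustrative value. Its handling of a layer count that does not divide evenly (the first GPUs take one extra layer each) is an assumption added here, since the original loop would leave the leftover layers unassigned.

```python
def build_layer_device_map(num_layers: int, world_size: int) -> dict[str, str]:
    """Map each decoder layer to a GPU in contiguous blocks.

    Hypothetical helper: mirrors the nested loop in the example above, but
    also places leftover layers when num_layers % world_size != 0.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    base, extra = divmod(num_layers, world_size)
    device_map = {}
    layer = 0
    for gpu in range(world_size):
        # The first `extra` GPUs take one additional layer each.
        count = base + (1 if gpu < extra else 0)
        for _ in range(count):
            device_map[f"language_model.model.layers.{layer}"] = f"cuda:{gpu}"
            layer += 1
    return device_map


# Illustrative values only: 80 layers spread over 8 GPUs, 10 layers each.
layer_map = build_layer_device_map(num_layers=80, world_size=8)
print(layer_map["language_model.model.layers.0"])   # cuda:0
print(layer_map["language_model.model.layers.79"])  # cuda:7
```

The result could be merged into the existing `device_map` with `device_map.update(layer_map)` in place of the nested loop.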

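✨ Key Features

"ViT-MLP-LLM" framework: a commonly used design for multimodal large language models. The model is initialized and trained from three parts: a 303-million-parameter vision transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base large language model.
Dynamic resolution: the input image is resized according to a preset grid, with resolutions ranging from 336×336 to 2016×2016, and a 336×336 thumbnail is kept alongside it. The resized image is split into non-overlapping patches of equal size; the patches and the thumbnail are encoded separately and then combined into the full image representation.
Large-scale training data: the training data includes captions, descriptions, and instruction data. The vision transformer (ViT) is trained from scratch on 694 million image-caption pairs. Across the four stages of the training pipeline, a total of 512 billion tokens are processed.
Top-level performance: the model reaches top-level performance on multimodal leaderboards, demonstrating its strength and reliability on complex multimodal tasks.
Function calling: the model can recognize when an external function needs to be called and output the parameters in a structured JSON format.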
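
The dynamic-resolution step described above can be sketched in a few lines. The sketch assumes that the preset grids are all `cols × rows` layouts of 336×336 tiles with 1 to 6 tiles per side, which matches the stated 336×336 to 2016×2016 range. It also assumes a selection rule (closest aspect ratio first, then closest pixel area). Neither the actual grid list nor the actual selection rule is given in this document, and `choose_grid` and `tile_boxes` are hypothetical names.

```python
import math

TILE = 336          # tile and thumbnail side length, from the text
MAX_PER_SIDE = 6    # 6 * 336 = 2016, the stated maximum side length


def choose_grid(width, height):
    """Pick (cols, rows) for the resize target.

    Assumed rule: closest aspect ratio first, then closest pixel area.
    """
    best = None
    for cols in range(1, MAX_PER_SIDE + 1):
        for rows in range(1, MAX_PER_SIDE + 1):
            aspect_gap = abs(math.log((cols / rows) / (width / height)))
            area_gap = abs(cols * rows * TILE * TILE - width * height)
            key = (aspect_gap, area_gap)
            if best is None or key < best[0]:
                best = (key, (cols, rows))
    return best[1]


def tile_boxes(cols, rows):
    """Return non-overlapping (left, top, right, bottom) crop boxes, row-major."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


cols, rows = choose_grid(1344, 672)
boxes = tile_boxes(cols, rows)
print((cols * TILE, rows * TILE))  # resize target: (1344, 672)
print(len(boxes) + 1)              # 8 tiles plus one 336x336 thumbnail: 9
```

With Pillow, the image could be resized with `Image.resize((cols * TILE, rows * TILE))`, each box passed to `Image.crop`, and the thumbnail produced with `Image.resize((TILE, TILE))`.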

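📦 Installation

The source gives no specific installation steps. The Quick Start example imports transformers, torch, and PIL (Pillow); install these following each library's own instructions.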

📚 Documentation

Evaluation

Tasks	GPT-4o (11-20)	Claude-3.5-Sonnet (10-22)	Gemini-1.5-Pro (002)	Gemini-2.0-Flash (exp)	Qwen2-VL-72B-Inst.	InternVL2.5-78B	LLama-3.2-90B	MiniMax-VL-01
Knowledge
MMMU*								68.5
MMMU-Pro*								52.7
Visual Q&A
ChartQA* (relaxed)								91.7
DocVQA*								96.4
OCRBench								865
Mathematics & Sciences
AI2D*								83.3
MathVista*								68.6
OlympiadBench (full)								24.2
Long Context
M-LongDoc (acc)								32.5
Comprehensive
MEGA-Bench (macro)								47.4
User Experience
In-house Benchmark								56.6

* Evaluated following a 0-shot CoT setting.

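Deployment Guide

For production deployment, we recommend serving MiniMax-VL-01 with vLLM. vLLM performs well when serving large language models and offers the following:

🔥 Outstanding serving throughput
⚡ Efficient and intelligent memory management
📦 Powerful batch request processing
⚙️ Deeply optimized low-level performance

For detailed deployment instructions, see the vLLM Deployment Guide.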
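
Once a vLLM server is running, clients usually talk to it through vLLM's OpenAI-compatible HTTP API. The sketch below only builds the JSON body of a chat-completions request with one image and one question. The endpoint address, the served model name, and the helper name `build_chat_request` are assumptions for illustration, and whether a given vLLM version serves this model with image inputs should be checked against the vLLM deployment guide.

```python
import base64
import json


def build_chat_request(image_bytes, question, model="MiniMaxAI/MiniMax-VL-01"):
    """Build an OpenAI-style chat-completions body with an inline JPEG image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # assumed to match the name the server was started with
        "max_tokens": 100,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


# Placeholder bytes stand in for the contents of a real JPEG file.
body = build_chat_request(b"\xff\xd8\xff", "Describe this image.")
payload = json.dumps(body)
# POST `payload` to http://localhost:8000/v1/chat/completions (assumed address)
# with the header "Content-Type: application/json".
```

Building the body separately from sending it keeps the request easy to inspect and to reuse with any HTTP client.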

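Function Calling

MiniMax-VL-01 supports function calling: the model can recognize when an external function needs to be called and output the parameters in a structured JSON format. With function calling, you can:

Let the model identify function-call needs that are implicit in a user request.
Receive structured parameter output for seamless integration into applications.
Use a variety of complex parameter types, including nested objects and arrays.

Function calling supports the standard OpenAI-compatible format for definitions and integrates with the Transformers library. For detailed usage instructions, see the Function Call Guide or its Chinese-language version.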
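
As an illustration of the OpenAI-compatible definition format mentioned above, the sketch below defines one tool schema, then parses and dispatches a structured call. The `get_current_weather` tool, the sample output string, and the `dispatch` helper are all hypothetical. The exact wrapper the model places around its JSON output is described in the function-calling guide and is not reproduced here.

```python
import json

# Tool definition in the OpenAI-compatible "tools" format (hypothetical tool).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]


def get_current_weather(location, units="celsius"):
    # Stub implementation; a real application would query a weather service.
    return {"location": location, "units": units, "temperature": None}


REGISTRY = {"get_current_weather": get_current_weather}


def dispatch(call_json):
    """Parse a JSON function call and invoke the matching Python function."""
    call = json.loads(call_json)
    func = REGISTRY.get(call["name"])
    if func is None:
        raise KeyError(f"unknown function: {call['name']}")
    return func(**call["arguments"])


# Assumed shape of the model's structured output, for illustration only.
sample_output = '{"name": "get_current_weather", "arguments": {"location": "Shanghai"}}'
result = dispatch(sample_output)
print(result["location"])  # Shanghai
```

Looking functions up in an explicit registry means that only functions the application has chosen to expose can be invoked from model output.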

Citation

@misc{minimax2025minimax01scalingfoundationmodels,
      title={MiniMax-01: Scaling Foundation Models with Lightning Attention}, 
      author={MiniMax and Aonian Li and Bangwei Gong and Bo Yang and Boji Shan and Chang Liu and Cheng Zhu and Chunhao Zhang and Congchao Guo and Da Chen and Dong Li and Enwei Jiao and Gengxin Li and Guojun Zhang and Haohai Sun and Houze Dong and Jiadai Zhu and Jiaqi Zhuang and Jiayuan Song and Jin Zhu and Jingtao Han and Jingyang Li and Junbin Xie and Junhao Xu and Junjie Yan and Kaishun Zhang and Kecheng Xiao and Kexi Kang and Le Han and Leyang Wang and Lianfei Yu and Liheng Feng and Lin Zheng and Linbo Chai and Long Xing and Meizhi Ju and Mingyuan Chi and Mozhi Zhang and Peikai Huang and Pengcheng Niu and Pengfei Li and Pengyu Zhao and Qi Yang and Qidi Xu and Qiexiang Wang and Qin Wang and Qiuhui Li and Ruitao Leng and Shengmin Shi and Shuqi Yu and Sichen Li and Songquan Zhu and Tao Huang and Tianrun Liang and Weigao Sun and Weixuan Sun and Weiyu Cheng and Wenkai Li and Xiangjun Song and Xiao Su and Xiaodong Han and Xinjie Zhang and Xinzhu Hou and Xu Min and Xun Zou and Xuyang Shen and Yan Gong and Yingjie Zhu and Yipeng Zhou and Yiran Zhong and Yongyi Hu and Yuanxiang Fan and Yue Yu and Yufeng Yang and Yuhao Li and Yunan Huang and Yunji Li and Yunpeng Huang and Yunzhi Xu and Yuxin Mao and Zehan Li and Zekang Li and Zewei Tao and Zewen Ying and Zhaoyang Cong and Zhen Qin and Zhenhua Fan and Zhihang Yu and Zhuo Jiang and Zijia Wu},
      year={2025},
      eprint={2501.08313},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.08313}, 
}