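MiniMax-VL-01

Developed by MiniMaxAI

MiniMax-VL-01 is a powerful multimodal large language model that adopts the "ViT-MLP-LLM" framework, supports dynamic-resolution image processing, and performs strongly on a wide range of vision-language tasks.

Image-Text-to-Text

Safetensors

#DynamicResolutionVisualUnderstanding #MultimodalLargeLanguageModel #ComplexChartAnalysis

Downloads: 237

Release date: 1/12/2025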

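Model Overview

The model combines a Vision Transformer (ViT), an MLP projector, and a base large language model. It accepts dynamic-resolution image inputs from 336×336 to 2016×2016 and delivers top-tier performance on multimodal tasks.

Model Features

Dynamic resolution processing

Supports dynamic-resolution inputs from 336×336 to 2016×2016, splitting the resized image into patches for encoding while keeping a thumbnail.

Large-scale training

The Vision Transformer is trained on 694 million image-caption pairs, and the training pipeline as a whole processes 512 billion tokens.

Multimodal capability

Combines visual and language understanding to perform well on complex multimodal tasks.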

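Model Capabilities

Image understanding

Visual question answering

Document analysis

Chart understanding

Mathematical reasoning

Scientific question answering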

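Use Cases

Education

Scientific question answering

Answers scientific questions that involve charts and formulas

Scores 68.5 on MMMU and 52.7 on MMMU-Pro

Document processing

Document question answering

Extracts information from documents to answer questions

Achieves 96.4% accuracy on the DocVQA benchmark

Data analysis

Chart understanding

Analyzes and interprets chart data

Achieves 91.7% relaxed accuracy on the ChartQA benchmark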


MiniMax-VL-01

1. Introduction

We are delighted to introduce MiniMax-VL-01. It adopts the "ViT-MLP-LLM" framework, a design commonly used in multimodal large language models. The model is initialized and trained with three key parts: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base LLM.

MiniMax-VL-01 has a notable dynamic resolution feature. Input images are resized according to a pre-set grid, with resolutions from 336×336 to 2016×2016, and a 336×336 thumbnail is kept. The resized image is split into non-overlapping patches of the same size. These patches and the thumbnail are encoded separately and then combined into a full image representation.

The training data for MiniMax-VL-01 consists of caption, description, and instruction data. The ViT is trained from scratch on 694 million image-caption pairs. Across the four distinct stages of the training pipeline, a total of 512 billion tokens are processed. As a result, MiniMax-VL-01 reaches top-level performance on multimodal leaderboards, demonstrating its strength and reliability in complex multimodal tasks.
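
The dynamic-resolution scheme described above can be sketched in a few lines. This is a minimal illustration, not the model's actual preprocessing: it assumes the patch side equals the 336-pixel thumbnail side, that the grid allows 1 to 6 tiles per side (336 to 2016 pixels), and it uses a hypothetical round-up rule (`choose_grid`) to pick the grid. The real grid-selection logic lives in the model's processor and may differ.

```python
from typing import List, Tuple

TILE = 336      # thumbnail side from the text; ASSUMED to also be the patch side
MAX_TILES = 6   # 2016 / 336, the largest supported side measured in tiles

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def choose_grid(width: int, height: int) -> Tuple[int, int]:
    """Hypothetical rule: cover the image with tiles, clamped to 1..6 per side."""
    cols = min(MAX_TILES, max(1, -(-width // TILE)))   # ceiling division
    rows = min(MAX_TILES, max(1, -(-height // TILE)))
    return cols, rows


def patch_boxes(cols: int, rows: int) -> List[Box]:
    """Non-overlapping crop boxes, row by row, for an image resized to the grid."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


def plan(width: int, height: int) -> dict:
    """Describe the resize target, the thumbnail size, and the patch boxes."""
    cols, rows = choose_grid(width, height)
    return {
        "resize_to": (cols * TILE, rows * TILE),
        "thumbnail": (TILE, TILE),
        "patches": patch_boxes(cols, rows),
    }


if __name__ == "__main__":
    p = plan(1000, 700)
    print(p["resize_to"], len(p["patches"]))
```

Under these assumptions, a 1000×700 input is resized to 1008×1008 and yields nine patches plus the thumbnail, each of which would then be encoded separately.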

2. Evaluation

Tasks	GPT-4o (11-20)	Claude-3.5-Sonnet (10-22)	Gemini-1.5-Pro (002)	Gemini-2.0-Flash (exp)	Qwen2-VL-72B-Inst.	InternVL2.5-78B	Llama-3.2-90B	MiniMax-VL-01
Knowledge
MMMU^*	63.5	72.0	68.4	70.6	64.5	66.5	62.1	68.5
MMMU-Pro^*	54.5	54.7	50.9	57.0	43.2	47.3	36.0	52.7
Visual Q&A
ChartQA^*_relaxed	88.1	90.8	88.7	88.3	91.2	91.5	85.5	91.7
DocVQA^*	91.1	94.2	91.5	92.9	97.1	96.1	90.1	96.4
OCRBench	806	790	800	846	856	847	805	865
Mathematics & Sciences
AI2D^*	83.1	82.0	80.9	85.1	84.4	86.8	78.9	83.3
MathVista^*	62.1	65.4	70.6	73.1	69.6	68.4	57.3	68.6
OlympiadBench_full	25.2	28.4	32.1	46.1	21.9	25.1	19.3	24.2
Long Context
M-LongDoc_acc	41.4	31.4	26.2	31.4	11.6	19.7	13.9	32.5
Comprehensive
MEGA-Bench_macro	49.4	51.4	45.9	53.9	46.8	45.3	19.9	47.4
User Experience
In-house Benchmark	62.3	47.0	49.2	72.1	40.6	34.8	13.6	56.6

^* Evaluated following a 0-shot CoT setting.
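
To read the table at a glance, the short script below ranks MiniMax-VL-01 against the seven other models on a few rows. The scores are copied from the table above; the helper name `rank_of` is only illustrative, and higher is assumed to be better for every benchmark shown.

```python
from typing import Dict, List

# Column order of the evaluation table; MiniMax-VL-01 is the last column.
MODELS: List[str] = [
    "GPT-4o (11-20)", "Claude-3.5-Sonnet (10-22)", "Gemini-1.5-Pro (002)",
    "Gemini-2.0-Flash (exp)", "Qwen2-VL-72B-Inst.", "InternVL2.5-78B",
    "Llama-3.2-90B", "MiniMax-VL-01",
]

# A subset of rows, copied verbatim from the table.
SCORES: Dict[str, List[float]] = {
    "MMMU": [63.5, 72.0, 68.4, 70.6, 64.5, 66.5, 62.1, 68.5],
    "ChartQA (relaxed)": [88.1, 90.8, 88.7, 88.3, 91.2, 91.5, 85.5, 91.7],
    "DocVQA": [91.1, 94.2, 91.5, 92.9, 97.1, 96.1, 90.1, 96.4],
    "OCRBench": [806, 790, 800, 846, 856, 847, 805, 865],
    "M-LongDoc (acc)": [41.4, 31.4, 26.2, 31.4, 11.6, 19.7, 13.9, 32.5],
}


def rank_of(model: str, row: List[float]) -> int:
    """1-based rank of `model` in `row`, assuming higher scores are better."""
    own = row[MODELS.index(model)]
    return 1 + sum(1 for score in row if score > own)


if __name__ == "__main__":
    for task, row in SCORES.items():
        print(f"{task}: rank {rank_of('MiniMax-VL-01', row)} of {len(row)}")
```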

3. Quickstart

Here we provide a simple example of loading the processor and model to generate content.

from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig, QuantoConfig, GenerationConfig
import torch
import json
import os
from PIL import Image

# load hf config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-VL-01", trust_remote_code=True)

# quantization config, int8 is recommended
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "vision_tower",
        "image_newline",
        "multi_modal_projector",
        "lm_head",
        "embed_tokens",
    ]
    + [f"model.layers.{i}.coefficient" for i in range(hf_config.text_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.text_config.num_hidden_layers)],
)

# set device map (reads the weight index from a local copy of the checkpoint in ./MiniMax-VL-01)
model_safetensors_index_path = os.path.join("MiniMax-VL-01", "model.safetensors.index.json")
with open(model_safetensors_index_path, "r") as f:
    model_safetensors_index = json.load(f)
weight_map = model_safetensors_index['weight_map']
vision_map = {}
for key, value in weight_map.items():
    if 'vision_tower' in key or 'image_newline' in key or 'multi_modal_projector' in key:
        new_key = key.replace('.weight','').replace('.bias','')
        if new_key not in vision_map:
            vision_map[new_key] = value
# assume 8 GPUs
world_size = 8
device_map = {
    'language_model.model.embed_tokens': 'cuda:0',
    'language_model.model.norm': f'cuda:{world_size - 1}',
    'language_model.lm_head': f'cuda:{world_size - 1}'
}
# keep all vision components on the first GPU
for key in vision_map:
    device_map[key] = 'cuda:0'
device_map['vision_tower.vision_model.post_layernorm'] = 'cuda:0'
layers_per_device = hf_config.text_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'language_model.model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# load processor
processor = AutoProcessor.from_pretrained("MiniMaxAI/MiniMax-VL-01", trust_remote_code=True)
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-VL-01 model."}]},
    {"role": "user", "content": [{"type": "image", "image": "placeholder"},{"type": "text", "text": "Describe this image."}]},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
raw_image = Image.open("figures/image.jpg")
# tokenize and move to device
model_inputs = processor(images=[raw_image], text=prompt, return_tensors='pt').to('cuda').to(torch.bfloat16)

# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-VL-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)
generation_config = GenerationConfig(
    max_new_tokens=100,
    eos_token_id=200020,
    use_cache=True,
)

# generate response
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")
# drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
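
The device-map loop in the Quickstart code uses integer division, so it only places every decoder layer when `num_hidden_layers` is a multiple of `world_size`. The sketch below is one possible way to handle other GPU counts by giving the first few devices one extra layer. The helper name `build_language_device_map` is hypothetical; the module key names are copied from the Quickstart code, and the vision entries are left out for brevity.

```python
from typing import Dict


def build_language_device_map(num_layers: int, world_size: int) -> Dict[str, str]:
    """Spread decoder layers over GPUs, covering the remainder when not divisible."""
    if num_layers < 1 or world_size < 1:
        raise ValueError("num_layers and world_size must be positive")

    # Same placement as the Quickstart: embeddings first, norm and head last.
    device_map = {
        "language_model.model.embed_tokens": "cuda:0",
        "language_model.model.norm": f"cuda:{world_size - 1}",
        "language_model.lm_head": f"cuda:{world_size - 1}",
    }

    base, extra = divmod(num_layers, world_size)
    layer = 0
    for rank in range(world_size):
        # The first `extra` devices each take one additional layer.
        count = base + (1 if rank < extra else 0)
        for _ in range(count):
            device_map[f"language_model.model.layers.{layer}"] = f"cuda:{rank}"
            layer += 1
    return device_map


if __name__ == "__main__":
    dm = build_language_device_map(num_layers=10, world_size=4)
    for name, device in dm.items():
        print(name, "->", device)
```

When the layer count divides evenly, this produces the same layer placement as the original loop; otherwise the leftover layers are assigned instead of being silently skipped.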

4. Deployment Guide

For production deployment, we recommend using vLLM to serve MiniMax-VL-01. vLLM provides excellent performance for serving large language models with the following features:

🔥 Outstanding service throughput performance
⚡ Efficient and intelligent memory management
📦 Powerful batch request processing capability
⚙️ Deeply optimized underlying performance

For detailed deployment instructions, please refer to our vLLM Deployment Guide.
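
Once a vLLM server is running, a client can talk to it through vLLM's OpenAI-compatible HTTP API. The sketch below is a hedged illustration rather than part of the official guide: it assumes the server listens on `http://localhost:8000`, that the model is served under the name `MiniMaxAI/MiniMax-VL-01`, and that the deployed build accepts image inputs for this model. The helper names `build_chat_payload` and `send` are hypothetical.

```python
import base64
import json
import urllib.request


def build_chat_payload(image_bytes: bytes, question: str,
                       model: str = "MiniMaxAI/MiniMax-VL-01",
                       max_tokens: int = 100) -> dict:
    """Build an OpenAI-style chat request with one inline JPEG image."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # ASSUMED served model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the chat completions endpoint (ASSUMED address)."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical call would read the image with `open("figures/image.jpg", "rb").read()`, pass the bytes to `build_chat_payload`, and read the answer from `send(payload)["choices"][0]["message"]["content"]`.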

5. Function Calling

MiniMax-VL-01 supports Function Calling capability, enabling the model to intelligently identify when external functions need to be called and output parameters in structured JSON format. With Function Calling, you can:

Let the model recognize implicit function call needs in user requests
Receive structured parameter outputs for seamless application integration
Support various complex parameter types, including nested objects and arrays

Function Calling supports standard OpenAI-compatible format definitions and integrates seamlessly with the Transformers library. For detailed usage instructions, please refer to our Function Call Guide or Chinese Guide.
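
As a rough illustration of the flow described above, the sketch below defines one tool in the OpenAI-style format, with a nested object and an array parameter, and dispatches a structured call to a local Python function. The `get_weather` tool is hypothetical, and the call shape `{"name": ..., "arguments": {...}}` is an assumption; the exact output format of the model is described in the Function Call Guide.

```python
import json
from typing import Any, Callable, Dict, List, Optional

# OpenAI-style tool definition. The tool itself is hypothetical.
TOOLS: List[Dict[str, Any]] = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for one or more cities.",
        "parameters": {
            "type": "object",
            "properties": {
                "cities": {"type": "array", "items": {"type": "string"}},
                "options": {
                    "type": "object",
                    "properties": {
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                },
            },
            "required": ["cities"],
        },
    },
}]


def get_weather(cities: List[str], options: Optional[dict] = None) -> Dict[str, str]:
    """Stub implementation; a real tool would query a weather service."""
    unit = (options or {}).get("unit", "celsius")
    return {city: f"n/a ({unit})" for city in cities}


REGISTRY: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch(call_json: str, registry: Dict[str, Callable[..., Any]] = REGISTRY) -> Any:
    """Parse a call of the ASSUMED shape and invoke the matching function."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in registry:
        raise KeyError(f"unknown function: {name}")
    args = call.get("arguments", {})
    if isinstance(args, str):  # some APIs return arguments as a JSON string
        args = json.loads(args)
    return registry[name](**args)


if __name__ == "__main__":
    example = '{"name": "get_weather", "arguments": {"cities": ["Tokyo"]}}'
    print(dispatch(example))
```

Keeping the tool schema and the registry side by side makes it easy to check that every function the model may name has a local implementation.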

6. Citation

@misc{minimax2025minimax01scalingfoundationmodels,
      title={MiniMax-01: Scaling Foundation Models with Lightning Attention}, 
      author={MiniMax and Aonian Li and Bangwei Gong and Bo Yang and Boji Shan and Chang Liu and Cheng Zhu and Chunhao Zhang and Congchao Guo and Da Chen and Dong Li and Enwei Jiao and Gengxin Li and Guojun Zhang and Haohai Sun and Houze Dong and Jiadai Zhu and Jiaqi Zhuang and Jiayuan Song and Jin Zhu and Jingtao Han and Jingyang Li and Junbin Xie and Junhao Xu and Junjie Yan and Kaishun Zhang and Kecheng Xiao and Kexi Kang and Le Han and Leyang Wang and Lianfei Yu and Liheng Feng and Lin Zheng and Linbo Chai and Long Xing and Meizhi Ju and Mingyuan Chi and Mozhi Zhang and Peikai Huang and Pengcheng Niu and Pengfei Li and Pengyu Zhao and Qi Yang and Qidi Xu and Qiexiang Wang and Qin Wang and Qiuhui Li and Ruitao Leng and Shengmin Shi and Shuqi Yu and Sichen Li and Songquan Zhu and Tao Huang and Tianrun Liang and Weigao Sun and Weixuan Sun and Weiyu Cheng and Wenkai Li and Xiangjun Song and Xiao Su and Xiaodong Han and Xinjie Zhang and Xinzhu Hou and Xu Min and Xun Zou and Xuyang Shen and Yan Gong and Yingjie Zhu and Yipeng Zhou and Yiran Zhong and Yongyi Hu and Yuanxiang Fan and Yue Yu and Yufeng Yang and Yuhao Li and Yunan Huang and Yunji Li and Yunpeng Huang and Yunzhi Xu and Yuxin Mao and Zehan Li and Zekang Li and Zewei Tao and Zewen Ying and Zhaoyang Cong and Zhen Qin and Zhenhua Fan and Zhihang Yu and Zhuo Jiang and Zijia Wu},
      year={2025},
      eprint={2501.08313},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.08313}, 
}

7. Chatbot & API

For general use and evaluation, we provide a Chatbot with online search capabilities and an online API for developers. We also provide the MiniMax MCP Server, which offers video generation, image generation, speech synthesis, and voice cloning.