# MiMo-7B: Unlocking the Reasoning Potential of Language Model
This project presents the MiMo-7B series of models, trained from scratch for reasoning tasks. These models show extraordinary reasoning potential and deliver superior performance on mathematics and code reasoning tasks.
## Quick Start
This README provides an overview of the MiMo-7B series of models, including their pre-training and post-training strategies, evaluation results, and deployment methods. The models are available on HuggingFace and ModelScope; follow the deployment instructions below to use them.
## Features

### Pre-Training: Base Model Born for Reasoning
- Optimize the data preprocessing pipeline, enhance text extraction toolkits, and apply multi-dimensional data filtering to increase the density of reasoning patterns in pre-training data. Generate massive diverse synthetic reasoning data using multiple strategies.
- Adopt a three-stage data mixture strategy for pre-training. MiMo-7B-Base is pre-trained on approximately 25 trillion tokens.
- Incorporate Multiple-Token Prediction (MTP) as an additional training objective to enhance model performance and accelerate inference (a toy sketch of such an objective follows this list).
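For intuition, here is a minimal PyTorch sketch of what a multiple-token prediction objective can look like. It assumes a single extra linear head predicting the token two positions ahead; MiMo's actual MTP layers are more elaborate, so treat this as an illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mtp_loss(hidden, lm_head, mtp_head, labels):
    """Toy MTP objective: each position predicts both the next token
    (standard LM loss) and the token after it (extra MTP loss)."""
    # Next-token loss: hidden state at position t predicts token t+1.
    ntp_logits = lm_head(hidden[:, :-1])
    ntp = F.cross_entropy(ntp_logits.flatten(0, 1), labels[:, 1:].flatten())
    # MTP loss: the same hidden state also predicts token t+2.
    mtp_logits = mtp_head(hidden[:, :-2])
    mtp = F.cross_entropy(mtp_logits.flatten(0, 1), labels[:, 2:].flatten())
    return ntp + mtp  # the relative weighting is a free design choice

# Toy demo with random tensors.
B, S, D, V = 2, 8, 16, 100
loss = mtp_loss(torch.randn(B, S, D), nn.Linear(D, V), nn.Linear(D, V),
                torch.randint(0, V, (B, S)))
print(loss.item())
```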
### Post-Training Recipe: Pioneering Reasoning Model
- Curate 130K mathematics and code problems as RL training data, verifiable by rule-based verifiers. Each problem is carefully cleaned and its difficulty assessed. Only rule-based accuracy rewards are used, to avoid potential reward hacking.
- Introduce a test-difficulty-driven code reward to mitigate the sparse-reward issue on challenging code problems: test cases of varying difficulty receive fine-grained scores so the policy can be optimized effectively (see the sketch after this list).
- Implement a data re-sampling strategy for easy problems to enhance rollout sampling efficiency and stabilize policy updates, especially in the later phases of RL training.
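To make the test-difficulty-driven reward concrete, below is a minimal sketch in which each test case carries a difficulty weight and partial credit accrues per passed test. The weighting scheme is an assumption for illustration; the exact scores MiMo assigns per difficulty level are not reproduced here.

```python
def code_reward(passed: list[bool], difficulty: list[float]) -> float:
    """Toy difficulty-weighted code reward.

    Instead of an all-or-nothing signal (1 only if every test passes),
    each passed test contributes credit proportional to its difficulty,
    giving the policy a dense learning signal on hard problems.
    """
    total = sum(difficulty)
    earned = sum(w for ok, w in zip(passed, difficulty) if ok)
    return earned / total  # in [0, 1]

# A solution that passes the three easier tests but fails the hardest one
# still receives a graded reward instead of zero.
print(code_reward([True, True, True, False], [1.0, 1.0, 2.0, 4.0]))  # 0.5
```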
### RL Infrastructure
- Develop a Seamless Rollout Engine to accelerate RL training and validation. It integrates continuous rollout, asynchronous reward computation, and early termination to minimize GPU idle time, achieving $2.29\times$ faster training and $1.96\times$ faster validation (a simplified sketch follows this list).
- Support MTP in vLLM and enhance the robustness of the inference engine in the RL system.
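The sketch below shows the core idea of overlapping rollout generation with asynchronous reward computation so that the generator never blocks on verification. All function names are hypothetical stand-ins; the real engine additionally implements continuous rollout and early termination, which are omitted here.

```python
import asyncio

async def generate_rollout(prompt: str) -> str:
    await asyncio.sleep(0.1)   # stand-in for a call to the inference engine
    return f"<completion for {prompt}>"

async def compute_reward(completion: str) -> float:
    await asyncio.sleep(0.05)  # stand-in for a rule-based verifier
    return 1.0 if len(completion) % 2 == 0 else 0.0

async def main() -> None:
    reward_tasks = []
    for prompt in (f"problem-{i}" for i in range(4)):
        completion = await generate_rollout(prompt)
        # Schedule reward computation without awaiting it, so the next
        # rollout starts immediately and GPU idle time is minimized.
        reward_tasks.append(asyncio.create_task(compute_reward(completion)))
    rewards = await asyncio.gather(*reward_tasks)
    print(rewards)

asyncio.run(main())
```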
## Installation
### SGLang Inference

```bash
python3 -m uv pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git/@main#egg=sglang&subdirectory=python"
```
### vLLM Inference

- [Recommended] Install our fork of vLLM (note: a `/tree/...` URL cannot be cloned directly; check out the branch instead):

```bash
git clone -b feat_mimo_mtp_stable_073 https://github.com/XiaomiMiMo/vllm.git
```

- For registering a vLLM loader without MTP parameters, clone this repository:

```bash
git clone https://github.com/XiaomiMiMo/MiMo.git
```
### HuggingFace Inference

```bash
pip install transformers
```
## Usage Examples
### SGLang Inference

With SGLang installed (see Installation above), launch a server:

```bash
python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL-Zero --host 0.0.0.0 --trust-remote-code
```

Detailed usage can be found in the SGLang documentation. MTP will also be supported within 24 hours.
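Once the server is up, you can query it over HTTP. A minimal sketch, assuming the server listens on SGLang's default port 30000 and exposes its native `/generate` endpoint (adjust host and port to your deployment):

```python
import requests

# Query the SGLang server launched above (default port assumed).
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Prove that the square root of 2 is irrational.",
        "sampling_params": {"temperature": 0.6, "max_new_tokens": 512},
    },
)
print(response.json()["text"])
```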
### vLLM Inference

#### Basic Usage

```python
from vllm import LLM, SamplingParams

model_path = "/path/to/MiMo"

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    num_speculative_tokens=1,  # enables MTP speculative decoding (MiMo vLLM fork)
    disable_log_stats=False,
)
sampling_params = SamplingParams(temperature=0.6)

conversation = [
    {
        "role": "system",
        "content": "",  # empty system prompt
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

print("=" * 80)
```
#### Advanced Usage

For stock vLLM without MTP, first register the MiMo loader shipped in this repository:

```python
import register_mimo_in_vllm  # registry script from the MiMo repository

from vllm import LLM, SamplingParams

model_path = "/path/to/MiMo"

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    # num_speculative_tokens is omitted: MTP is not used on this path
    disable_log_stats=False,
)
sampling_params = SamplingParams(temperature=0.6)

# Building the conversation and calling llm.chat() proceed exactly as in the
# basic usage example above.
```
### HuggingFace Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-7B-RL-Zero"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(["Today is"], return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)  # max_new_tokens is an illustrative choice
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Documentation

### Model Details
The MTP layers of MiMo-7B are tuned during pre-training and SFT and frozen during RL. With one MTP layer used for speculative decoding, the acceptance rate is about 90%.
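As a back-of-the-envelope illustration of what that acceptance rate buys (an estimate, not a measured benchmark):

```python
# With one speculative token accepted with probability p, a decoding step
# emits 2 tokens on acceptance and 1 on rejection: 1 + p tokens on average.
p = 0.90
print(f"expected tokens per target-model step: {1 + p:.2f}")  # ~1.9
```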
| Model | Description | Download (HuggingFace) | Download (ModelScope) |
| --- | --- | --- | --- |
| MiMo-7B-Base | Base model with extraordinary reasoning potential | [XiaomiMiMo/MiMo-7B-Base](https://huggingface.co/XiaomiMiMo/MiMo-7B-Base) | [XiaomiMiMo/MiMo-7B-Base](https://www.modelscope.cn/models/XiaomiMiMo/MiMo-7B-Base) |
| MiMo-7B-RL-Zero | RL model trained from base model | [XiaomiMiMo/MiMo-7B-RL-Zero](https://huggingface.co/XiaomiMiMo/MiMo-7B-RL-Zero) | [XiaomiMiMo/MiMo-7B-RL-Zero](https://www.modelscope.cn/models/XiaomiMiMo/MiMo-7B-RL-Zero) |
| MiMo-7B-SFT | SFT model trained from base model | [XiaomiMiMo/MiMo-7B-SFT](https://huggingface.co/XiaomiMiMo/MiMo-7B-SFT) | [XiaomiMiMo/MiMo-7B-SFT](https://www.modelscope.cn/models/XiaomiMiMo/MiMo-7B-SFT) |
| MiMo-7B-RL | RL model trained from SFT model, superior performance matching OpenAI o1-mini | [XiaomiMiMo/MiMo-7B-RL](https://huggingface.co/XiaomiMiMo/MiMo-7B-RL) | [XiaomiMiMo/MiMo-7B-RL](https://www.modelscope.cn/models/XiaomiMiMo/MiMo-7B-RL) |
### Evaluation Results
| Benchmark | GPT-4o-0513 | Claude-3.5-Sonnet-1022 | OpenAI o1-mini | QwQ-32B-Preview | R1-Distill-Qwen-14B | R1-Distill-Qwen-7B | MiMo-7B-RL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **General** | | | | | | | |
| GPQA Diamond (Pass@1) | 49.9 | 65.0 | 60.0 | 54.5 | 59.1 | 49.1 | 54.4 |
| SuperGPQA (Pass@1) | 42.4 | 48.2 | 45.2 | 43.6 | 40.6 | 28.9 | 40.5 |
| DROP (3-shot F1) | 83.7 | 88.3 | 83.9 | 71.2 | 85.5 | 77.0 | 78.7 |
| MMLU-Pro (EM) | 72.6 | 78.0 | 80.3 | 52.0 | 68.8 | 53.5 | 58.6 |
| IF-Eval (Prompt Strict) | 84.3 | 86.5 | 84.8 | 40.4 | 78.3 | 60.5 | 61.0 |
| **Mathematics** | | | | | | | |
| MATH-500 (Pass@1) | 74.6 | 78.3 | 90.0 | 90.6 | 93.9 | 92.8 | 95.8 |
| AIME 2024 (Pass@1) | 9.3 | 16.0 | 63.6 | 50.0 | 69.7 | 55.5 | 68.2 |
| AIME 2025 (Pass@1) | 11.6 | 7.4 | 50.7 | 32.4 | 48.2 | 38.8 | 55.4 |
| **Code** | | | | | | | |
| LiveCodeBench v5 (Pass@1) | 32.9 | 38.9 | 53.8 | 41.9 | 53.1 | 37.6 | 57.8 |
| LiveCodeBench v6 (Pass@1) | 30.9 | 37.2 | 46.8 | 39.1 | 31.9 | 23.9 | 49.3 |
### MiMo-7B Series
| Benchmark | MiMo-7B-Base | MiMo-7B-RL-Zero | MiMo-7B-SFT | MiMo-7B-RL |
| --- | --- | --- | --- | --- |
| **Mathematics** | | | | |
| MATH500 (Pass@1) | 37.4 | 93.6 | 93.0 | 95.8 |
| AIME 2024 (Pass@1) | 32.9 | 56.4 | 58.7 | 68.2 |
| AIME 2025 (Pass@1) | 24.3 | 46.3 | 44.3 | 55.4 |
| **Code** | | | | |
| LiveCodeBench v5 (Pass@1) | 32.9 | 49.1 | 52.3 | 57.8 |
| LiveCodeBench v6 (Pass@1) | 29.1 | 42.9 | 45.5 | 49.3 |
## Important Note

The evaluations are conducted with `temperature=0.6`.

AIME24 and AIME25 scores are averaged over 32 repetitions. LiveCodeBench v5 (20240801-20250201), LiveCodeBench v6 (20250201-20250501), GPQA-Diamond, and IF-Eval scores are averaged over 8 repetitions. MATH500 and SuperGPQA are single runs.
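For reference, "averaged over N repetitions" means the Pass@1 score is computed for each run and then averaged across runs. A minimal sketch (this helper is illustrative, not the official evaluation harness):

```python
def averaged_pass_at_1(run_results: list[list[bool]]) -> float:
    """run_results[i][j] = whether problem j was solved on run i."""
    per_run = [100.0 * sum(run) / len(run) for run in run_results]
    return sum(per_run) / len(per_run)

# e.g. two runs over a three-problem set:
print(averaged_pass_at_1([[True, False, True], [True, True, False]]))  # ~66.7
```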
## License
This model repository is licensed under the MIT License.