MiMo: Unlocking the Reasoning Potential of Language Model
This project aims to unlock the reasoning potential of language models from pretraining to posttraining, offering high-performance models for mathematics and code reasoning tasks.
Quick Start
The MiMo-7B series models are now open-source. You can download the checkpoints of the base model, the SFT model, the RL model trained from the base model, and the RL model trained from the SFT model. For detailed deployment methods, please refer to the "Deployment" section below.
Features
Highlights
- Pre-Training: Base Model Born for Reasoning
  - Optimize the data preprocessing pipeline, enhance text extraction toolkits, and apply multi-dimensional data filtering to increase the density of reasoning patterns in pre-training data. Use multiple strategies to generate a large amount of diverse synthetic reasoning data.
  - Adopt a three-stage data mixture strategy for pre-training. MiMo-7B-Base is pre-trained on approximately 25 trillion tokens.
  - Incorporate Multiple-Token Prediction (MTP) as an additional training objective to enhance model performance and accelerate inference.
- Post-Training Recipe: Pioneering Reasoning Model
  - Curate 130K mathematics and code problems as RL training data, all verifiable by rule-based verifiers. Each problem is carefully cleaned and its difficulty assessed to ensure quality. Only rule-based accuracy rewards are used to avoid potential reward hacking.
  - Introduce a test-difficulty-driven code reward to mitigate the sparse-reward issue on challenging code problems. By assigning fine-grained scores to test cases of varying difficulty, the policy can be optimized more effectively via dense reward signals (a minimal sketch follows this list).
  - Implement a data re-sampling strategy for easy problems to improve rollout sampling efficiency and stabilize policy updates, especially in the later phases of RL training.
- RL Infrastructure
  - Develop a Seamless Rollout Engine to accelerate RL training and validation. The design integrates continuous rollout, asynchronous reward computation, and early termination to minimize GPU idle time, achieving $2.29\times$ faster training and $1.96\times$ faster validation.
  - Support MTP in vLLM and enhance the robustness of the inference engine in the RL system.
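To make the test-difficulty-driven code reward concrete, here is a minimal sketch. The tier weighting scheme and the function signature are illustrative assumptions for exposition, not the exact reward used in training:

```python
# Minimal sketch of a test-difficulty-driven code reward (illustrative only;
# the tier weights and grouping are assumptions, not the actual training recipe).

def code_reward(passed: list[bool], difficulty: list[int]) -> float:
    """Return a dense reward in [0, 1] from per-test-case pass results.

    passed[i]     -- whether the solution passed test case i
    difficulty[i] -- difficulty tier of test case i (higher = harder)
    """
    if not passed:
        return 0.0
    # Weight each test case by its difficulty tier so that partial progress
    # on hard problems still yields a non-zero (dense) reward signal.
    weights = [float(d) for d in difficulty]
    total = sum(weights)
    score = sum(w for w, ok in zip(weights, passed) if ok)
    return score / total if total > 0 else 0.0


# Example: a solution that only passes the easy tests still receives a graded
# reward instead of the sparse all-or-nothing 0/1 signal.
print(code_reward(passed=[True, True, False], difficulty=[1, 1, 3]))  # 0.4
```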
Installation
This section covers model deployment rather than a traditional installation procedure. You can choose the inference method that suits your needs:
SGLang Inference
Thanks to contributions from the SGLang team, MiMo was supported in SGLang's main branch within 24 hours of release, with MTP support coming soon.
Example Script
# Install the latest SGLang from the main branch
python3 -m uv pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git/@main#egg=sglang&subdirectory=python"

# Launch the SGLang server
python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code
Detailed usage can be found in the SGLang documentation; MTP support is expected to land shortly.
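Once the server is running, it exposes an OpenAI-compatible API, so any standard client can query MiMo. A minimal sketch, assuming SGLang's default port of 30000 (adjust if you pass --port):

```python
# Query the SGLang server through its OpenAI-compatible endpoint.
# Assumes the default SGLang port (30000); adjust if launched with --port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-7B-RL",
    messages=[
        {"role": "system", "content": ""},  # an empty system prompt is recommended
        {"role": "user", "content": "Solve: what is 17 * 24?"},
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)
```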
vLLM inference
- [Recommended] We officially support inference with MiMo-MTP using our fork of vLLM.
Example script
from vllm import LLM, SamplingParams

model_path = "/path/to/MiMo"

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    num_speculative_tokens=1,
    disable_log_stats=False,
)
sampling_params = SamplingParams(temperature=0.6)

conversation = [
    {
        "role": "system",
        "content": "",
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

print("=" * 80)
- Or, you can register a vLLM loader for MiMo without loading MTP parameters.
You can copy `registry/register_mimo_in_vllm.py` into your working directory and import it:

import register_mimo_in_vllm

from vllm import LLM, SamplingParams

model_path = "/path/to/MiMo"

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    # num_speculative_tokens=1,
    disable_log_stats=False,
)
sampling_params = SamplingParams(temperature=0.6)
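After importing the registry, generation works the same way as in the MTP example above, e.g.:

```python
# Chat generation, identical to the MTP example above.
conversation = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write an essay about the importance of higher education."},
]
outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)
print(outputs[0].outputs[0].text)
```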
HuggingFace inference
Example script
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-7B-RL-0530"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(["Today is"], return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output.tolist()[0]))
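Since the RL checkpoints are chat models, you will typically want to run them through the tokenizer's chat template rather than raw text completion. A minimal sketch, assuming the tokenizer ships a chat template (the empty system prompt follows the recommendation below):

```python
# Chat-style generation via the tokenizer's chat template (illustrative sketch).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-7B-RL-0530"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": ""},  # an empty system prompt is recommended
    {"role": "user", "content": "Write an essay about the importance of higher education."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```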
Documentation
Updates
[2025.05.30] By continuously extending the RL training window size from 32K to 48K, the performance of MiMo-7B-RL-0530 on AIME24 improves steadily and eventually surpasses that of DeepSeek R1.
| Benchmark | MiMo-7B-RL | MiMo-7B-RL-0530 |
|---|---|---|
| Mathematics | | |
| MATH500 (Pass@1) | 95.8 | 97.2 |
| AIME 2024 (Pass@1) | 68.2 | 80.1 |
| AIME 2025 (Pass@1) | 55.4 | 70.2 |
| Code | | |
| LiveCodeBench v5 (Pass@1) | 57.8 | 60.9 |
| LiveCodeBench v6 (Pass@1) | 49.3 | 52.2 |
| STEM | | |
| GPQA-Diamond (Pass@1) | 54.4 | 60.6 |
| General | | |
| AlignBench 1.1 (Evaluated by GPT-4.1) | 6.9 | 7.4 |
Model Details
The MTP layers of MiMo-7B are tuned during pre-training and SFT and frozen during RL. With one MTP layer for speculative decoding, the acceptance rate is approximately 90%.
| Model | Description | Download (HuggingFace) | Download (ModelScope) |
|---|---|---|---|
| MiMo-7B-Base | Base model with extraordinary reasoning potential | [HuggingFace](https://huggingface.co/XiaomiMiMo/MiMo-7B-Base) | [ModelScope](https://www.modelscope.cn/models/XiaomiMiMo/MiMo-7B-Base) |
| MiMo-7B-RL-Zero | RL model trained from the base model | [HuggingFace](https://huggingface.co/XiaomiMiMo/MiMo-7B-RL-Zero) | [ModelScope](https://www.modelscope.cn/models/XiaomiMiMo/MiMo-7B-RL-Zero) |
| MiMo-7B-SFT | SFT model trained from the base model | [HuggingFace](https://huggingface.co/XiaomiMiMo/MiMo-7B-SFT) | [ModelScope](https://www.modelscope.cn/models/XiaomiMiMo/MiMo-7B-SFT) |
| MiMo-7B-RL | RL model trained from the SFT model, superior performance matching OpenAI o1-mini | [HuggingFace](https://huggingface.co/XiaomiMiMo/MiMo-7B-RL) | [ModelScope](https://www.modelscope.cn/models/XiaomiMiMo/MiMo-7B-RL) |
| MiMo-7B-RL-0530 | Advanced RL model with extended length | [HuggingFace](https://huggingface.co/XiaomiMiMo/MiMo-7B-RL-0530) | [ModelScope](https://www.modelscope.cn/models/XiaomiMiMo/MiMo-7B-RL-0530) |
Evaluation Results
| Benchmark | GPT-4o-0513 | Claude-3.5-Sonnet-1022 | OpenAI o1-mini | QwQ-32B-Preview | R1-Distill-Qwen-14B | R1-Distill-Qwen-7B | MiMo-7B-RL |
|---|---|---|---|---|---|---|---|
| General | | | | | | | |
| GPQA Diamond (Pass@1) | 49.9 | 65.0 | 60.0 | 54.5 | 59.1 | 49.1 | 54.4 |
| SuperGPQA (Pass@1) | 42.4 | 48.2 | 45.2 | 43.6 | 40.6 | 28.9 | 40.5 |
| DROP (3-shot F1) | 83.7 | 88.3 | 83.9 | 71.2 | 85.5 | 77.0 | 78.7 |
| MMLU-Pro (EM) | 72.6 | 78.0 | 80.3 | 52.0 | 68.8 | 53.5 | 58.6 |
| IF-Eval (Prompt Strict) | 84.3 | 86.5 | 84.8 | 40.4 | 78.3 | 60.5 | 61.0 |
| Mathematics | | | | | | | |
| MATH-500 (Pass@1) | 74.6 | 78.3 | 90.0 | 90.6 | 93.9 | 92.8 | 95.8 |
| AIME 2024 (Pass@1) | 9.3 | 16.0 | 63.6 | 50.0 | 69.7 | 55.5 | 68.2 |
| AIME 2025 (Pass@1) | 11.6 | 7.4 | 50.7 | 32.4 | 48.2 | 38.8 | 55.4 |
| Code | | | | | | | |
| LiveCodeBench v5 (Pass@1) | 32.9 | 38.9 | 53.8 | 41.9 | 53.1 | 37.6 | 57.8 |
| LiveCodeBench v6 (Pass@1) | 30.9 | 37.2 | 46.8 | 39.1 | 31.9 | 23.9 | 49.3 |
MiMo-7B series

| Benchmark | MiMo-7B-Base | MiMo-7B-RL-Zero | MiMo-7B-SFT | MiMo-7B-RL | MiMo-7B-RL-0530 |
|---|---|---|---|---|---|
| Mathematics | | | | | |
| MATH500 (Pass@1) | 37.4 | 93.6 | 93.0 | 95.8 | 97.2 |
| AIME 2024 (Pass@1) | 32.9 | 56.4 | 58.7 | 68.2 | 80.1 |
| AIME 2025 (Pass@1) | 24.3 | 46.3 | 44.3 | 55.4 | 70.2 |
| Code | | | | | |
| LiveCodeBench v5 (Pass@1) | 32.9 | 49.1 | 52.3 | 57.8 | 60.9 |
| LiveCodeBench v6 (Pass@1) | 29.1 | 42.9 | 45.5 | 49.3 | 52.2 |
Important Note
The evaluations are conducted with temperature = 0.6. AIME24 and AIME25 scores are averaged over 32 repetitions. LiveCodeBench v5 (20240801-20250201), LiveCodeBench v6 (20250201-20250501), GPQA-Diamond, and IF-Eval scores are averaged over 8 repetitions. MATH500 and SuperGPQA are single runs.
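Equivalently, for the repeated benchmarks the reported number is the average Pass@1 over $k$ independent runs:

$$
\text{score} = \frac{1}{k} \sum_{i=1}^{k} \text{Pass@1}^{(i)}, \qquad k = 32 \text{ (AIME)},\ k = 8 \text{ (LiveCodeBench, GPQA-Diamond, IF-Eval)},\ k = 1 \text{ otherwise.}
$$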
Deployment
SGLang Inference
MiMo is supported in SGLang's main branch, with MTP support coming soon. See the example script in the Installation section above; detailed usage can be found in the SGLang documentation.
vLLM inference
- [Recommended] We officially support inference with MiMo-MTP using our fork of vLLM.
- Or, you can register a vLLM loader for MiMo without loading MTP parameters.
HuggingFace inference
You can use the provided example script to perform inference.
Recommended environment and prompts
- We recommend using our fork of vLLM, which is based on vLLM 0.7.3.
- We recommend using an empty system prompt.
Usage Tip
We haven't verified MiMo with other inference engines and welcome contributions based on the model definition in the Huggingface repo.
Technical Details
Introduction
Currently, most successful RL work, including open-source research, relies on relatively large base models, e.g., 32B models, especially for enhancing code reasoning capabilities. Moreover, it has been widely believed that achieving uniform and simultaneous improvements in both mathematical and code capabilities within a small model is challenging.
Nonetheless, we believe that the effectiveness of an RL-trained reasoning model relies on the inherent reasoning potential of the base model. To fully unlock the reasoning potential of language models, efforts must focus not only on post-training but also on pre-training strategies tailored to reasoning.
In this work, we present MiMo-7B, a series of models trained from scratch and born for reasoning tasks. Our RL experiments on MiMo-7B-Base show that the model possesses extraordinary reasoning potential, even surpassing much larger 32B models. Additionally, we perform RL training on a cold-started SFT model, resulting in MiMo-7B-RL, which demonstrates superior performance on both mathematics and code reasoning tasks, matching the performance of OpenAI o1-mini.
Model Architecture
The MTP layers of MiMo-7B are tuned during pre-training and SFT and frozen during RL. With one MTP layer for speculative decoding, the acceptance rate is approximately 90%.
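Conceptually, the MTP layer acts as a cheap draft head for speculative decoding: it proposes the next token, and the main model verifies it in the forward pass it would run anyway. A minimal greedy-decoding sketch, where `mtp_draft_next_token` and `main_model_forward` are hypothetical placeholder helpers, not part of the released code:

```python
# Illustrative sketch of speculative decoding with a single MTP draft head.
# `mtp_draft_next_token` and `main_model_forward` are hypothetical placeholders.
# With ~90% acceptance, most steps emit two tokens per main-model forward pass.

def speculative_step(context: list[int],
                     mtp_draft_next_token,
                     main_model_forward) -> list[int]:
    # The MTP head cheaply drafts a candidate for the next position.
    draft = mtp_draft_next_token(context)
    # One main-model forward pass over context + [draft] returns its own
    # prediction at the draft position and the token following the draft.
    verify_tok, next_tok = main_model_forward(context + [draft])
    if verify_tok == draft:
        return [draft, next_tok]   # accepted: two tokens from one forward pass
    return [verify_tok]            # rejected: keep the main model's token only
```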
RL Infrastructure
We develop a Seamless Rollout Engine to accelerate RL training and validation. Our design integrates continuous rollout, asynchronous reward computation, and early termination to minimize GPU idle time, achieving $2.29\times$ faster training and $1.96\times$ faster validation. We also support MTP in vLLM and enhance the robustness of the inference engine in the RL system.
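A toy sketch of the idea behind asynchronous reward computation: completed rollouts are handed to reward workers immediately, so verification overlaps with ongoing generation instead of blocking it. This is illustrative only; `generate_rollout`, `verify`, and the worker count are assumptions, not the actual engine:

```python
# Toy sketch of overlapping rollout generation with asynchronous reward
# computation (illustrative only; generate_rollout / verify are placeholders).
from concurrent.futures import ThreadPoolExecutor

def rollout_with_async_rewards(prompts, generate_rollout, verify, num_reward_workers=8):
    rewards = {}
    with ThreadPoolExecutor(max_workers=num_reward_workers) as pool:
        futures = {}
        for prompt in prompts:
            completion = generate_rollout(prompt)               # GPU-bound generation
            futures[prompt] = pool.submit(verify, completion)   # verification overlaps
        for prompt, fut in futures.items():
            rewards[prompt] = fut.result()
    return rewards
```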
License
This project is licensed under the MIT license.

