đ Open Reasoner Zero
An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
đ Quick Start
Data
We release all of curated high-quality training data in the data
folder:
- curated 129k data:
- original 57k, collected from various sources, including AIME (up to 2023), MATH, Numina-Math collection and Tulu3 MATH.
- extended 72k, mainly cleaned from OpenR1-Math-220k.
- hard 13k, mined from the first stage of ORZ-32B training.
The details for how to collect data are described in our paper.
Installation & Training Scripts
We release our Dockerfile in docker folder to facilitate the reproducibility of our training.
To install the package, run:
pip install -e .
Start ORZ-32B PPO Training
Here are the starting commands in 16 nodes.
First on master node, run:
ray start --head
then on all other nodes, run:
ray start --address='<master-node-ip>:<master-node-port>'
finally on master node, just run:
python -m playground.orz_32b_ppo
Your training log will be shown in the master node terminal.
Start ORZ-0.5B PPO Training
You can start the ORZ-0.5B PPO training in single A800/H800 node:
python -m playground.orz_0p5b_ppo
You can even run in a single A800/H800 gpu:
python -m playground.orz_0p5b_ppo_1gpu
note: since we are not in multi-node setting, no ray start
like logics are needed.
Start ORZ-7B PPO Training
Multi-node Training on 4 nodes:
ray start --head
ray start --address='<master-node-ip>:<master-node-port>'
python -m playground.orz_7b_ppo
Your training log will be shown in the master node terminal.
Start ORZ-1.5B PPO Training
Multi-node Training on 2 nodes:
ray start --head
ray start --address='<master-node-ip>:<master-node-port>'
python -m playground.orz_1p5b_ppo
Debug Settings
In the code, we leave an environment variable DEBUG_MODE
to run in debug setting for researcher to iterate. (Thought for now, we recommend using python -m playground.orz_0p5b_ppo_1gpu
for debugging.)
The debug running command examples:
DEBUG_MODE=True python -m playground.orz_14m_ppo_mini
DEBUG_MODE=True python -m playground.orz_7b_ppo
How to Use the Model
Policy Model
Policy models can be used in the same way as any chat model in transformers and vllm, since we have put the chat template jinja in the tokenizer.
Critic Model
Critic models can be loaded the same way like in the training code.
⨠Features
- Adopt single controller trainer design, flexible and researcher-friendly.
- Colocate training and generation in the same GPUs to maximize GPU utilization.
đĻ Releases
[2025/03/31]
We announce a major milestone for Open-Reasoner-Zero
:
[2025/02/18]
We release Open-Reasoner-Zero
.
As part of this release, we open-source:
đ Main Results
Figure 1 | Evaluation performance of Open-Reasoner-Zero-{7B, 32B}. Evaluation performance of Open-Reasoner-Zero-{7B, 32B} on benchmarks (averaged on 16 responses) during training. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, Open-Reasoner-Zero-32B achieves superior performance on AIME2024, MATH500, and GPQA Diamond benchmark-requiring only a tenth of the training steps.
Figure 2 | Train-time Scale up on Train Reward and Response Length of Open-Reasoner-Zero (ORZ) - {0.5B, 1.5B, 7B, 32B}. Train Reward and Response Length increase steadily, demonstrating consistent scalability across model sizes. Interestingly, the ORZ-32B Response Length exhibits fluctuations without negatively impacting training stability, highlighting the robustness of our minimalist recipe.
đ Overview
We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility.
To enable broader participation in this pivotal moment we witnessed and accelerate research towards artificial general intelligence (AGI),
we release our source code, parameter settings, training data, and model weights.
Please refer to our paper for more insights across various model sizes.
Let the Reasoner-Zero tide rise!
đ Acknowledgements
đŖ Advertisement Time
We are hiring talented researchers and engineers to join our team. If you are interested in our project and would like to contribute to the reasoner scale-up all the way to AGI, please feel free to reach out to us at hanqer@stepfun.com

đē Community Discussions
We have several wechat groups to help discussions and sharing, you can scan the QR code below to join the latest group.
đ License
This project is released under the MIT license.
đ Citation
@misc{hu2025openreasonerzeroopensourceapproach,
title={Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model},
author={Jingcheng Hu and Yinmin Zhang and Qi Han and Daxin Jiang and Xiangyu Zhang and Heung-Yeung Shum},
year={2025},
eprint={2503.24290},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2503.24290},
}