Open-Reasoner-Zero-32B Open-Source Model - Freely Implement Large-Scale Reasoning-Oriented Reinforcement Learning, Easy to Use and Scalable

Open Reasoner Zero 32B

Developed by Open-Reasoner-Zero

The first open-source implementation of large-scale reasoning-oriented reinforcement learning focusing on scalability, simplicity, and ease of use

Large Language Model

Transformers

Open Source License:MIT #Mathematical Reasoning Enhancement #Multi-scale Training #Open-source Reinforcement Learning

Downloads 498

Release Time : 2/18/2025

Model Overview

Open Reasoner Zero is an open-source solution for reinforcement learning based on foundational model scaling, focusing on enhancing reasoning capabilities, suitable for high-difficulty tasks such as mathematical reasoning.

Model Features

Scalable Reinforcement Learning

Supports model training from 500M to 32B parameters, demonstrating consistent scaling capabilities

Efficient Training

Achieves or surpasses the performance of similar models with only one-tenth of the training steps

Complete Open-source

Publicly available source code, parameter settings, training data, and model weights

Resource Optimization

Provides single-GPU training solutions, lowering the research barrier

Model Capabilities

Mathematical problem solving

Complex reasoning

Multi-step problem answering

High-difficulty competition problem solving

Use Cases

Education

Math Competition Problem Solving

Solving math competition problems such as AIME

Achieved 48% accuracy on AIME2024

Math Learning Assistance

Provides step-by-step math problem solving

Research

Reinforcement Learning Research

Serves as a benchmark model for scalable reinforcement learning

🚀 Open Reasoner Zero

An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Paper PDF Link [WIP]👁️

🚀 Quick Start

Data

We release all of curated high-quality training data in the data folder:

curated 129k data:
- original 57k, collected from various sources, including AIME (up to 2023), MATH, Numina-Math collection and Tulu3 MATH.
- extended 72k, mainly cleaned from OpenR1-Math-220k.
hard 13k, mined from the first stage of ORZ-32B training.

The details for how to collect data are described in our paper.

Installation & Training Scripts

We release our Dockerfile in docker folder to facilitate the reproducibility of our training.

To install the package, run:

pip install -e .

Start ORZ-32B PPO Training

Here are the starting commands in 16 nodes.

First on master node, run:

ray start --head
# you will see logging like:
# Next steps
#  To add another node to this Ray cluster, run
#    ray start --address='<master-node-ip>:<master-node-port>'

then on all other nodes, run:

ray start --address='<master-node-ip>:<master-node-port>' # <master-node-ip> and <master-node-port> are from above loggings!

finally on master node, just run:

python -m playground.orz_32b_ppo

Your training log will be shown in the master node terminal.

Start ORZ-0.5B PPO Training

You can start the ORZ-0.5B PPO training in single A800/H800 node:

python -m playground.orz_0p5b_ppo

You can even run in a single A800/H800 gpu:

python -m playground.orz_0p5b_ppo_1gpu

note: since we are not in multi-node setting, no ray start like logics are needed.

Start ORZ-7B PPO Training

Multi-node Training on 4 nodes:

# set up for multi-node training
ray start --head # on master node
ray start --address='<master-node-ip>:<master-node-port>' # then on other nodes

# then on master node, run:
python -m playground.orz_7b_ppo

Your training log will be shown in the master node terminal.

Start ORZ-1.5B PPO Training

Multi-node Training on 2 nodes:

# set up for multi-node training
ray start --head # on master node
ray start --address='<master-node-ip>:<master-node-port>' # then on other nodes
# then on master node, run:
python -m playground.orz_1p5b_ppo

Debug Settings

In the code, we leave an environment variable DEBUG_MODE to run in debug setting for researcher to iterate. (Thought for now, we recommend using python -m playground.orz_0p5b_ppo_1gpu for debugging.)

The debug running command examples:

# NOTE: just for debug, not final setting!

## Debug command in a single GPU with `EleutherAI/pythia-14m`
DEBUG_MODE=True python -m playground.orz_14m_ppo_mini
## Debug command in a single node (8 GPUs) with `Qwen/Qwen2.5-7B`
DEBUG_MODE=True python -m playground.orz_7b_ppo

How to Use the Model

Policy Model

Policy models can be used in the same way as any chat model in transformers and vllm, since we have put the chat template jinja in the tokenizer.

Critic Model

Critic models can be loaded the same way like in the training code.

✨ Features

Adopt single controller trainer design, flexible and researcher-friendly.
Colocate training and generation in the same GPUs to maximize GPU utilization.

📦 Releases

[2025/03/31]

We announce a major milestone for Open-Reasoner-Zero:

🌊 Updated Paper with new results.
🔭 Easy-to-use Training Scripts:
- ORZ-1.5B training scripts and ORZ-0.5B training scripts (main results in Figure 2).
- Minimal resource training scripts: ORZ-0.5B can be run on a single A800/H800 gpu!
🤩 Updated Curated Datasets:
- 129k data in total:
  - original 57k data.
  - extended 72k data.
- 13k hard data mined from the above 129k data.
  - used in the "annealing" stage of ORZ-32B training: AIME2024 from ~41% to ~48%!
🤗 More HF Models:
- Updated HF Models: Open-Reasoner-Zero-7B and Open-Reasoner-Zero-32B.
- Released HF Models: Open-Reasoner-Zero-1.5B and Open-Reasoner-Zero-0.5B.
🚀 Full Suite of Critic Models for in-depth research: Open-Reasoner-Zero-Critic-{0.5B, 1.5B, 7B, 32B}.

[2025/02/18]

We release Open-Reasoner-Zero.

As part of this release, we open-source:

🌊 Paper(WIP) on our comprehensive analysis and insights in Reasoner-Zero training
🤗 HF Model Open-Reasoner-Zero-7B and Open-Reasoner-Zero-32B
🎁 Our curated 57k training data
📄 Training Scripts to enjoy your own Reasoner-Zero journey!

🏆 Main Results

Figure 1 | Evaluation performance of Open-Reasoner-Zero-{7B, 32B}. Evaluation performance of Open-Reasoner-Zero-{7B, 32B} on benchmarks (averaged on 16 responses) during training. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, Open-Reasoner-Zero-32B achieves superior performance on AIME2024, MATH500, and GPQA Diamond benchmark-requiring only a tenth of the training steps.

Figure 2 | Train-time Scale up on Train Reward and Response Length of Open-Reasoner-Zero (ORZ) - {0.5B, 1.5B, 7B, 32B}. Train Reward and Response Length increase steadily, demonstrating consistent scalability across model sizes. Interestingly, the ORZ-32B Response Length exhibits fluctuations without negatively impacting training stability, highlighting the robustness of our minimalist recipe.

🌊 Overview

We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility.

To enable broader participation in this pivotal moment we witnessed and accelerate research towards artificial general intelligence (AGI), we release our source code, parameter settings, training data, and model weights. Please refer to our paper for more insights across various model sizes.

Let the Reasoner-Zero tide rise!

💖 Acknowledgements

This work was supported by computing resources and valuable feedback provided by StepFun and Tsinghua University.
Our training framework is built on OpenRLHF, vllm, DeepSpeed and ray.
Our model is based on Qwen2.5 Series of base models, including Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen2.5-7B and Qwen2.5-32B.
We thank Project Numina, Tulu3 and OpenR1-Math-220k for their collected open sourced data.

📣 Advertisement Time

We are hiring talented researchers and engineers to join our team. If you are interested in our project and would like to contribute to the reasoner scale-up all the way to AGI, please feel free to reach out to us at hanqer@stepfun.com

🍺 Community Discussions

We have several wechat groups to help discussions and sharing, you can scan the QR code below to join the latest group.

📄 License

This project is released under the MIT license.

📚 Citation

@misc{hu2025openreasonerzeroopensourceapproach,
      title={Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model}, 
      author={Jingcheng Hu and Yinmin Zhang and Qi Han and Daxin Jiang and Xiangyu Zhang and Heung-Yeung Shum},
      year={2025},
      eprint={2503.24290},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.24290}, 
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご