Open Reasoner Zero
An open source approach to scaling up reinforcement learning on the base model, focusing on scalability, simplicity, and accessibility.
Quick Start
Data
We've released all our curated high-quality training data in the data folder:
- Curated 129k data:
  - Original 57k, collected from various sources such as AIME (up to 2023), MATH, Numina-Math collection, and Tulu3 MATH.
  - Extended 72k, mainly cleaned from OpenR1-Math-220k.
- Hard 13k, mined from the first stage of ORZ-32B training.
For details on data collection, refer to our paper.
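To get a quick look at the released data, the snippet below loads one of the files with Python's standard library; it assumes the files are plain JSON, and the filename here is illustrative, so check the data folder for the actual file names.

```python
import json
from pathlib import Path

# Illustrative filename; check the data folder for the actual file names.
data_path = Path("data") / "orz_math_57k_collected.json"

with data_path.open() as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} samples")
print(samples[0])  # inspect one record to see the prompt/answer schema
```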
Installation & Training Scripts
We've provided a Dockerfile in the docker folder to ensure reproducibility of our training.
To install the package, run:
pip install -e .
Start ORZ-32B PPO Training
Here are the commands to start training on 16 nodes:
- On the master node, run:
ray start --head
# You'll see logging like:
# Next steps
# To add another node to this Ray cluster, run
# ray start --address='<master-node-ip>:<master-node-port>'
- On all other nodes, run:
ray start --address='<master-node-ip>:<master-node-port>' # use the <master-node-ip> and <master-node-port> from the master node's output above
- Finally, on the master node, run:
python -m playground.orz_32b_ppo
The training log will be displayed in the master node terminal.
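Before kicking off the 16-node run, it can be worth double-checking that every node actually joined the cluster. The snippet below is a simple sanity check using Ray's public Python API; it is not part of the ORZ training scripts.

```python
import ray

# Connect to the cluster started above with `ray start --head` / `ray start --address=...`.
ray.init(address="auto")

# For a 16-node setup you should see 16 alive entries, each reporting its GPUs.
alive_nodes = [node for node in ray.nodes() if node["Alive"]]
for node in alive_nodes:
    gpu_count = node["Resources"].get("GPU", 0)
    print(f"{node['NodeManagerAddress']}: {gpu_count} GPUs")

print(f"Total alive nodes: {len(alive_nodes)}")
```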
Start ORZ-0.5B PPO Training
You can start ORZ-0.5B PPO training on a single A800/H800 node:
python -m playground.orz_0p5b_ppo
You can even run it on a single A800/H800 GPU:
python -m playground.orz_0p5b_ppo_1gpu
Note: Since this is not a multi-node setting, there is no need for the ray start commands.
Start ORZ-7B PPO Training
For multi-node training on 4 nodes:
# Set up for multi-node training
ray start --head # on master node
ray start --address='<master-node-ip>:<master-node-port>' # then on other nodes
# Then on the master node, run:
python -m playground.orz_7b_ppo
The training log will be shown in the master node terminal.
Start ORZ-1.5B PPO Training
For multi-node training on 2 nodes:
# Set up for multi-node training
ray start --head # on master node
ray start --address='<master-node-ip>:<master-node-port>' # then on other nodes
# Then on the master node, run:
python -m playground.orz_1p5b_ppo
Debug Settings
In the code, we've left an environment variable DEBUG_MODE for researchers to run in debug mode. (Currently, we recommend using python -m playground.orz_0p5b_ppo_1gpu for debugging.)
Here are some debug running command examples:
# NOTE: just for debug, not final setting!
## Debug command on a single GPU with `EleutherAI/pythia-14m`
DEBUG_MODE=True python -m playground.orz_14m_ppo_mini
## Debug command on a single node (8 GPUs) with `Qwen/Qwen2.5-7B`
DEBUG_MODE=True python -m playground.orz_7b_ppo
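If you want to gate your own experiments on the same flag, an environment variable like DEBUG_MODE is usually read as in the generic sketch below; the exact parsing inside the playground scripts may differ.

```python
import os

# Generic pattern for a boolean environment flag; the playground scripts
# may parse DEBUG_MODE differently.
DEBUG_MODE = os.environ.get("DEBUG_MODE", "False").lower() in ("1", "true", "yes")

if DEBUG_MODE:
    # e.g. shrink batch sizes, rollout counts, and logging intervals for fast iteration
    print("Running in debug mode")
```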
How to Use the Model
Policy Model
Policy models can be used in the same way as any chat model in transformers and vllm, as we've included the chat template Jinja in the tokenizer.
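As a concrete illustration, the sketch below loads a policy model with transformers and applies the bundled chat template. The repo id is inferred from the model names in this README, so verify it against the actual Hugging Face model cards; device_map="auto" additionally requires the accelerate package.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id inferred from the model names in this README; verify on the Hub.
model_id = "Open-Reasoner-Zero/Open-Reasoner-Zero-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map needs `accelerate`
)

# The chat template shipped with the tokenizer formats the prompt for the policy model.
messages = [{"role": "user", "content": "What is 17 * 24?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

The same checkpoints can also be served with vllm in the usual way, since the chat template is stored in the tokenizer.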
Critic Model
Critic models can be loaded in the same way as in the training code.
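For context, the sketch below shows the generic PPO-critic pattern (a causal-LM backbone with a scalar value head) that such critics typically follow. It does not restore the released value-head weights (use the training code's loading path for that, as noted above), and the backbone id is just one of the Qwen2.5 base models mentioned later in this README.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Conceptual sketch only: transformer backbone + randomly initialized value head.
backbone_id = "Qwen/Qwen2.5-0.5B"

tokenizer = AutoTokenizer.from_pretrained(backbone_id)
backbone = AutoModel.from_pretrained(backbone_id)
value_head = torch.nn.Linear(backbone.config.hidden_size, 1)

inputs = tokenizer("1 + 1 = 2", return_tensors="pt")
hidden = backbone(**inputs).last_hidden_state   # (batch, seq_len, hidden_size)
values = value_head(hidden).squeeze(-1)         # per-token value estimates
print(values.shape)
```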
Features
- Single Controller Trainer Design: Adopts a single controller trainer design, which is flexible and researcher-friendly.
- Maximized GPU Utilization: Colocates training and generation on the same GPUs to maximize GPU utilization.
Installation
We've provided a Dockerfile in the docker folder. To install the package, run:
pip install -e .
Documentation
Overview
We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training, emphasizing scalability, simplicity, and accessibility.
Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark. It also demonstrates remarkable efficiency, requiring only a tenth of the training steps compared to the DeepSeek-R1-Zero pipeline.
To encourage broader participation and accelerate research towards artificial general intelligence (AGI), we've released our source code, parameter settings, training data, and model weights. For more insights across various model sizes, please refer to our paper.
Let the Reasoner-Zero tide rise!
Main Results
Figure 1 | Evaluation performance of Open-Reasoner-Zero-{7B, 32B} on benchmarks (averaged over 16 responses) during training. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, Open-Reasoner-Zero-32B achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark, requiring only a tenth of the training steps.
Figure 2 | Train-time scale-up on Train Reward and Response Length of Open-Reasoner-Zero (ORZ)-{0.5B, 1.5B, 7B, 32B}. Train Reward and Response Length increase steadily, demonstrating consistent scalability across model sizes. Interestingly, the ORZ-32B Response Length exhibits fluctuations without negatively impacting training stability, highlighting the robustness of our minimalist recipe.
Releases
[2025/03/31]
We've reached a major milestone for Open-Reasoner-Zero:
- Updated Paper with new results.
- Easy-to-use Training Scripts:
  - ORZ-1.5B training scripts and ORZ-0.5B training scripts (main results in Figure 2).
  - Minimal-resource training scripts: ORZ-0.5B can be run on a single A800/H800 GPU!
- Updated Curated Datasets:
  - 129k data in total.
  - 13k hard data mined from the above 129k data, used in the "annealing" stage of ORZ-32B training, boosting AIME2024 from ~41% to ~48%!
- More HF Models:
  - Updated HF Models: Open-Reasoner-Zero-7B and Open-Reasoner-Zero-32B.
  - Released HF Models: Open-Reasoner-Zero-1.5B and Open-Reasoner-Zero-0.5B.
- Full Suite of Critic Models for in-depth research: Open-Reasoner-Zero-Critic-{0.5B, 1.5B, 7B, 32B}.
[2025/02/18]
We released Open-Reasoner-Zero. As part of this release, we open-sourced:
- Paper on our comprehensive analysis and insights in Reasoner-Zero training.
- HF Models: Open-Reasoner-Zero-7B and Open-Reasoner-Zero-32B.
- Our curated 57k training data.
- Training Scripts for you to start your own Reasoner-Zero journey!
Acknowledgements
- This work was supported by the computing resources and valuable feedback from StepFun and Tsinghua University.
- Our training framework is built on OpenRLHF, vllm, DeepSpeed, and ray.
- Our model is based on the Qwen2.5 series of base models, including Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen2.5-7B, and Qwen2.5-32B.
- We thank Project Numina, Tulu3, and OpenR1-Math-220k for their collected open-sourced data.
Advertisement Time
We're hiring talented researchers and engineers to join our team. If you're interested in our project and want to contribute to the reasoner scale-up towards AGI, please contact us at hanqer@stepfun.com.
[Star History Chart](https://star-history.com/#Open-Reasoner-Zero/Open-Reasoner-Zero&Timeline)
Community Discussions
We have several WeChat groups for discussions and sharing. You can scan the QR code below to join the latest group.
License
The project is licensed under the MIT license.
Citation
@misc{hu2025openreasonerzeroopensourceapproach,
title={Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model},
author={Jingcheng Hu and Yinmin Zhang and Qi Han and Daxin Jiang and Xiangyu Zhang and Heung-Yeung Shum},
year={2025},
eprint={2503.24290},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2503.24290},
}

