🚀 AReaL: Ant Reasoning Reinforcement Learning for LLMs
AReaL is an open-source, fully asynchronous reinforcement learning training system for large reasoning models. It aims to help users easily and affordably build their own AI agents, and we release all the details needed to reproduce our results.
🚀 Quick Start
Local Training
Train Qwen3 1.7B locally:
```bash
bash examples/run_async_ppo.sh
```
Evaluation
```bash
cd evaluation

# Math benchmarks (AIME 2024/2025)
python eval_and_aggregate.py \
    --model_path ${MODEL_PATH} \
    --output_path ${OUTPUT_PATH} \
    --data_names aime24,aime25 \
    --max_gen_tokens 32768

# Coding benchmarks (Codeforces, LiveCodeBench v5)
python eval_and_aggregate.py \
    --model_path ${MODEL_PATH} \
    --output_path ${OUTPUT_PATH} \
    --data_names codeforces,lcb_v5 \
    --prompt_type qwen3-think-pure \
    --temperature 1.0
```
✨ Features
[NEW] Asynchronous RL
With algorithm-system co-design, AReaL supports fully asynchronous RL for the fastest training! It also provides experimental support for multi-turn agentic RL.
Open & Reproducible
We continuously release all code, datasets, and training recipes for RL training of LLMs.
Scalability
AReaL can seamlessly adapt to different computational resource settings, from a single node to 1K GPUs.
Cutting-Edge Performance
AReaL can produce models with cutting - edge reasoning capabilities in math and coding. The team is also actively working on agentic tasks.
📚 Documentation
Release Highlights (AReaL-boba²)
- Fully Asynchronous RL Training Pipeline: achieves over a 2.77x speedup with no performance drop. Benchmark scripts and instructions are provided.
- SOTA Code Generation Model: based on Qwen3, achieves SOTA results on the LiveCodeBench, Codeforces, and CodeContests benchmarks.
| Model (8B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces (rating/percentile) | CodeContests |
| --- | --- | --- | --- |
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| [AReaL-boba²-8B-Open](https://huggingface.co/inclusionAI/AReaL-boba-2-8B-subset) | 62.0 | 1933/97.2% | 41.4 |
| [AReaL-boba²-8B](https://huggingface.co/inclusionAI/AReaL-boba-2-8B) | 63.0 | 1962/97.5% | 40.8 |

| Model (14B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces (rating/percentile) | CodeContests |
| --- | --- | --- | --- |
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| [AReaL-boba²-14B-Open](https://huggingface.co/inclusionAI/AReaL-boba-2-14B-subset) | 67.3 | 1990/97.8% | 46.2 |
| [AReaL-boba²-14B](https://huggingface.co/inclusionAI/AReaL-boba-2-14B) | 69.1 | 2044/98.2% | 46.1 |

| Larger Models | LiveCodeBench v5 (2024.10-2025.2) | Codeforces (rating) | CodeContests |
| --- | --- | --- | --- |
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |
- Experimental Support for Multi-turn Agentic RL Training: a complete example is provided.
Overview of Asynchronous RL Training
Synchronous RL training is inefficient because every device must wait for the longest sequence in a batch to finish generating before training can proceed. AReaL adopts a fully asynchronous RL training framework that decouples generation from training: rollout workers keep generating while trainer workers update the model as soon as enough data is available, improving training throughput and reducing GPU memory fragmentation.
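Conceptually, this decoupling can be pictured as a producer/consumer loop with a bound on sample staleness. The sketch below only illustrates the idea and is not AReaL's actual API; `generate`, `train_step`, `push_weights`, `get_current_version`, and `MAX_STALENESS` are hypothetical placeholders.

```python
# A minimal, hypothetical sketch of asynchronous rollout/training decoupling.
# None of these names come from AReaL's codebase.

MAX_STALENESS = 4  # assumed bound: how many policy versions a sample may lag behind


class RolloutSample:
    def __init__(self, prompt, response, policy_version):
        self.prompt = prompt
        self.response = response
        self.policy_version = policy_version  # weights version that generated the response


def rollout_worker(prompt_stream, sample_queue, generate, get_current_version):
    """Producer: generates continuously and never blocks on the trainer."""
    for prompt in prompt_stream:
        version = get_current_version()
        response = generate(prompt)  # call into the inference engine
        sample_queue.put(RolloutSample(prompt, response, version))


def trainer_loop(sample_queue, batch_size, train_step, get_current_version, push_weights):
    """Consumer: trains on samples as they arrive, discarding overly stale ones."""
    batch = []
    while True:
        sample = sample_queue.get()
        if get_current_version() - sample.policy_version > MAX_STALENESS:
            continue  # sample was generated by weights that are now too old
        batch.append(sample)
        if len(batch) == batch_size:
            train_step(batch)   # e.g. one PPO update
            push_weights()      # asynchronously sync new weights to rollout workers
            batch.clear()
```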
SOTA Code Generation Model by AReaL - boba²
After asynchronous RL training with Qwen3 as the base model, AReaL-boba² achieves SOTA results on multiple coding benchmarks. Key features for asynchronous training are highlighted in the tutorials and code walkthroughs.
RL Training for Multi - turn Agent
AReaL-boba² allows independent customization of the dataset, rollout behavior, and training algorithm without modifying system-level code. A simple example of a multi-turn math agent for RL training is provided; a rough sketch of the idea is shown below.
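To give a flavor of what such a customization involves, here is a hypothetical sketch of a multi-turn math rollout that lets the agent retry after a wrong answer and assigns a terminal reward. The `generate` and `check_answer` callables, the `Trajectory` container, and the retry prompt are assumptions made for illustration and do not reflect AReaL's actual interfaces; see the repository example for the real implementation.

```python
from dataclasses import dataclass, field


# Hypothetical container; AReaL's actual data structures differ.
@dataclass
class Trajectory:
    turns: list = field(default_factory=list)  # list of (prompt, response) pairs
    reward: float = 0.0


def multi_turn_math_rollout(problem, generate, check_answer, max_turns=3):
    """Roll out a math agent that may retry after an incorrect answer."""
    traj = Trajectory()
    history = problem["question"]
    for _ in range(max_turns):
        response = generate(history)  # one model turn
        traj.turns.append((history, response))
        if check_answer(response, problem["answer"]):  # rule-based verifier as the reward
            traj.reward = 1.0
            break
        history += response + "\nYour answer is incorrect. Please try again.\n"
    return traj
```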
🔧 Technical Details
System Design
AReaL follows a system-algorithm co-design principle. On the system side, it efficiently syncs model parameters and controls sample staleness. On the algorithm side, it improves the PPO objective so that training remains stable on asynchronous, slightly off-policy data; one way to write such an objective is sketched below.
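As an illustration only (this is a common decoupled formulation, not necessarily the exact loss AReaL uses), one can separate the behavior policy $\pi_{\text{behav}}$ that generated a sample from a proximal policy $\pi_{\text{prox}}$ used as the clipping anchor, and correct for the mismatch with an importance weight:

$$
\mathcal{J}(\theta)=\mathbb{E}_{(s,a)\sim\pi_{\text{behav}}}\!\left[\frac{\pi_{\text{prox}}(a\mid s)}{\pi_{\text{behav}}(a\mid s)}\,\min\!\Big(r_\theta\,\hat{A},\ \operatorname{clip}\big(r_\theta,\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}\Big)\right],\qquad r_\theta=\frac{\pi_\theta(a\mid s)}{\pi_{\text{prox}}(a\mid s)}.
$$

Bounding staleness on the system side keeps $\pi_{\text{behav}}$ close to $\pi_{\text{prox}}$, which keeps these importance weights well behaved.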
Scalability
A comparison of asynchronous RL (AReaL-boba²) with classical synchronous RL (veRL) shows that AReaL scales better in training throughput as more GPUs are added.
📦 Resources
Quickstart
Benchmark and Reproduction
- Reproduce boba² Code Models
- Model weights: [8B-code](https://huggingface.co/inclusionAI/AReaL-boba-2-8B), [14B-code](https://huggingface.co/inclusionAI/AReaL-boba-2-14B), [8B-code-open](https://huggingface.co/inclusionAI/AReaL-boba-2-8B-subset), [14B-code-open](https://huggingface.co/inclusionAI/AReaL-boba-2-14B-subset)
- Evaluation Guide
- [Training configs](https://github.com/inclusionAI/AReaL/tree/main/examples/configs/v0.3-qwen3-code) and instructions
- Scripts for Benchmark Training Throughput
Customization Guide
System Code Walkthrough
📄 Future Plan
System Development
- [x] Support for SGLang
- [x] RL training with coding problems
- [x] Asynchronous generation and RL training
- [ ] Optimizations for distributed training: expert parallelism for MoE and zero-bubble pipelining
- [ ] RL for vision-language models (VLMs)
- [x] Multi-turn agentic RL
- [ ] Function calling and tool use
Algorithm Development
- [x] RL training recipes for 1.5B and 7B models
- [x] A complete RL training recipe for 32B models
- [ ] Sample - efficient multi - task RL algorithms
- [ ] Agentic capabilities with end - to - end RL
- [ ] Stable RL training for larger MoE models
📄 License
This project is licensed under the Apache 2.0 license.
Acknowledgement
Major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University. The team also thanks the Data Intelligence Lab at Ant Research and the Super Computing Technology (SCT) team at Ant Group. We appreciate pioneering work from the community, such as [ReaLHF](https://github.com/openpsi-project/ReaLHF) and others.
Citation
```bibtex
@inproceedings{mei2025real,
  author    = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
  title     = {ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation},
  booktitle = {Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025},
  publisher = {mlsys.org},
  year      = {2025},
}

@misc{fu2025areal,
  title         = {AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
  author        = {Wei Fu and Jiaxuan Gao and Xujie Shen and Chen Zhu and Zhiyu Mei and Chuyi He and Shusheng Xu and Guo Wei and Jun Mei and Jiashu Wang and Tongkai Yang and Binhang Yuan and Yi Wu},
  year          = {2025},
  eprint        = {2505.24298},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2505.24298},
}
```