🚀 DeepCoder-1.5B-Preview
DeepCoder-1.5B-Preview is a code reasoning LLM fine - tuned from DeepSeek - R1 - Distilled - Qwen - 1.5B, aiming to democratize reinforcement learning for LLMs and scale up to long context lengths.
🚀 Quick Start
DeepCoder-1.5B-Preview is a code reasoning LLM fine - tuned from DeepSeek - R1 - Distilled - Qwen - 1.5B using distributed reinforcement learning (RL) to scale up to long context lengths.
✨ Features
- Long - Context Reasoning: Capable of handling long context lengths through distributed reinforcement learning.
- Improved Training Algorithm: Utilizes an enhanced version of GRPO (GRPO+) and iterative context lengthening.
📦 Installation
No installation steps are provided in the original README.
💻 Usage Examples
No code examples are provided in the original README.
📚 Documentation
DeepCoder Overview
DeepCoder-1.5B-Preview is a code reasoning LLM fine - tuned from DeepSeek - R1 - Distilled - Qwen - 1.5B using distributed reinforcement learning (RL) to scale up to long context lengths.
Data
Our training dataset consists of approximately 24K unique problem - tests pairs compiled from:
- Taco - Verified
- PrimeIntellect SYNTHETIC - 1
- LiveCodeBench v5 (5/1/23 - 7/31/24)
Training Recipe
Our training recipe relies on an improved version of GRPO (GRPO+) and iterative context lengthening, introduced in DeepScaleR.
GRPO+
We enhance the original GRPO algorithm with insights from DAPO to enable more stable training:
- Offline Difficulty Filtering: Instead of DAPO's online dynamic sampling with significant runtime overhead, we perform offline difficulty filtering on a subset of coding problems to ensure the training dataset remains within a suitable difficulty range.
- No Entropy Loss: We eliminate the entropy loss entirely to avoid instability caused by exponential entropy growth.
- No KL Loss: Eliminating KL loss prevents the LLM from staying within the trust region of the original SFT model and accelerates training.
- Overlong Filtering (from DAPO): To preserve long - context reasoning, we mask the loss for truncated sequences, enabling DeepCoder to generalize to 64K - context inference despite being trained with a 32K context.
- Clip High (from DAPO): By increasing the upper bound in GRPO/PPO’s surrogate loss, we encourage more exploration and more stable entropy.
Iterative Context Lengthening
Our original Deepscaler - 1.5B - Preview
scaled long context training from 8K→16K→24K, achieving 33→38→43% on AIME respectively. Similarly, Deepcoder - 14B - Preview
is trained on 16K→32K, achieving 54→58% on LiveCodeBench (v5). DeepCoder - 14B - Preview
successfully generalizes to longer contexts when evaluated at 64K context, reaching 60.6%.
DeepCoder generalizes better to long contexts than the base distilled model, due to DAPO's overlong filtering. However, its longer responses are often truncated when the max length is capped at 16K, which can lower its scores.
Model |
16K |
32K |
64K |
DeepCoder - 14B - Preview |
45.6 |
57.9 |
60.6 |
DeepSeek - R1 - Distill - Qwen - 14B |
50.2 |
53.0 |
53.0 |
A more detailed description of the training recipe can be found in our blog post.
Evaluation
We evaluate Deepcoder - 1.5B - Preview
on various coding benchmarks, including LiveCodeBench (LCBv5), Codeforces, and HumanEval+.
Model |
LCB (v5)(8/1/24 - 2/1/25) |
Codeforces Rating |
Codeforces Percentile |
HumanEval+ |
DeepCoder - 1.5B - Preview |
25.1 |
963 |
28.5 |
73.0 |
Deepseek - R1 - Distill - Qwen - 1.5B |
16.9 |
615 |
1.9 |
58.3 |
Serving DeepCoder
Our model can be served using popular high - performance inference systems:
- vLLM
- Hugging Face Text Generation Inference (TGI)
- SGLang
- TensorRT - LLM
All these systems support the OpenAI Chat Completions API format.
🔧 Technical Details
The training recipe and the improvements in GRPO+ contribute to the model's ability to handle long - context reasoning and achieve better performance on coding benchmarks.
📄 License
This project is released under the MIT License, reflecting our commitment to open and accessible AI development. We believe in democratizing AI technology by making our work freely available for anyone to use, modify, and build upon. This permissive license ensures that researchers, developers, and enthusiasts worldwide can leverage and extend our work without restrictions, fostering innovation and collaboration in the AI community.
Acknowledgement
- Our training experiments are powered by our heavily modified fork of Verl, an open - source post - training library.
- Notably, we train 1.5B with [verl pipeline](https://github.com/agentica-project/verl - pipeline), an extension of the original verl.
- Our model is trained on top of [
DeepSeek - R1 - Distill - Qwen - 1.5B
](https://huggingface.co/deepseek - ai/DeepSeek - R1 - Distill - Qwen - 1.5B).
- Our work is done as part of Berkeley Sky Computing Lab and Berkeley AI Research.
Citation
@misc{deepcoder2025,
title={DeepCoder: A Fully Open - Source 14B Coder at O3 - mini Level},
author={Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica},
howpublished={\url{https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51}},
note={Notion Blog},
year={2025}
}