🚀 OREAL-32B-SFT
OREAL-32B-SFT belongs to the OREAL series of mathematical reasoning models trained with Outcome REwArd-based reinforcement Learning (OREAL). The series performs strongly on mathematical reasoning tasks: a 7B model trained with OREAL reaches 94.0 pass@1 accuracy on MATH-500, and the OREAL-32B model reaches 95.0 pass@1, surpassing previous distillation-trained 32B models.
Basic Information
| Attribute | Details |
| --- | --- |
| Base model | Qwen/Qwen2.5-32B |
| License | apache-2.0 |
| Library | transformers |
| Task type | question answering |
⚠️ Important Note
This model is the initial policy for OREAL RL training.
🔗 Links
✨ Key Features
We introduce OREAL-7B and OREAL-32B, a series of mathematical reasoning models trained with Outcome REwArd-based reinforcement Learning (OREAL), a new RL framework designed for tasks where only binary outcome rewards are available.
- High performance: a 7B model trained with OREAL reaches 94.0 pass@1 accuracy on MATH-500, matching the performance of previous 32B models. OREAL-32B goes further, surpassing previous distillation-trained 32B models with 95.0 pass@1 accuracy on MATH-500.
- Novel method: our approach uses best-of-N (BoN) sampling for behavior cloning and reshapes the rewards of negative samples to keep gradients consistent. To address the challenge of sparse rewards in long chain-of-thought reasoning, we also introduce an on-policy token-level reward model that identifies the key tokens in a reasoning trajectory for importance sampling (see the illustrative sketch below). For more details, please refer to our paper.
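To make the objective concrete, here is a minimal, illustrative PyTorch sketch of a token-weighted loss in this spirit. It is not the authors' implementation; the function name, tensor shapes, and the `neg_scale` reshaping factor are all assumptions:

```python
import torch
import torch.nn.functional as F

def oreal_style_loss(logits, labels, token_weights, is_positive, neg_scale=0.5):
    """Toy token-weighted objective: behavior cloning on positive (best-of-N)
    samples, with a reshaped, sign-flipped term for negative samples so that
    gradients stay consistent across both cases.

    logits: (T, V) model outputs; labels: (T,) sampled token ids;
    token_weights: (T,) importance weights from a token-level reward model.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (T,)
    weighted = token_weights * token_logp
    if is_positive:
        # Positive trajectory: maximize the weighted log-likelihood (behavior cloning).
        return -weighted.mean()
    # Negative trajectory: penalize likelihood, with the reward reshaped by
    # neg_scale so the gradient magnitude stays comparable to the positive case.
    return neg_scale * weighted.mean()
```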
📊 Evaluation Results
| Model | MATH-500 | AIME2024 | AIME2025-I | LiveMath | Olympiad |
| --- | --- | --- | --- | --- | --- |
| **API Models** | | | | | |
| GPT-4o-1120 | 72.8 | 16.7 | 13.3 | 44.8 | 33.7 |
| Claude-3.5-Sonnet-1022 | 78.3 | 13.3 | 3.3 | 46.7 | 35.4 |
| OpenAI-o1-preview | 85.5 | 44.6 | 40.0 | 71.0 | 43.6 |
| OpenAI-o1-mini | 90.0 | 56.6 | 46.7 | 74.4 | 46.3 |
| **7B Models** | | | | | |
| Qwen2.5-Instruct-7B | 76.6 | 13.3 | 0.0 | 37.0 | 29.1 |
| Qwen2.5-Math-Instruct-7B | 81.8 | 20.0 | 13.3 | 44.1 | 31.1 |
| rStar-Math-7B | 78.4* | 26.7* | - | - | 47.1* |
| Qwen2.5-7B-SimpleRL | 82.4* | 26.7* | - | - | 37.6* |
| Eurus-2-7B-PRIME | 79.2* | 26.7* | - | - | 42.1* |
| DeepSeek-R1-Distill-Qwen-7B | 92.8* | 55.5* | 40.0 | 65.6 | 64.1 |
| OREAL-7B | 91.0 | 33.3 | 33.3 | 62.6 | 59.9 |
| OREAL-DSR1-Distill-Qwen-7B | 94.0 | 50.0 | 40.0 | 65.6 | 66.1 |
| **32B Models** | | | | | |
| Qwen2.5-Instruct-32B | 80.6 | 20.0 | 13.3 | 50.8 | 40.4 |
| QwQ-32B-Preview | 90.6 | 50.0 | 40.0 | 72.7 | 58.5 |
| DeepSeek-R1-Distill-Qwen-32B | 94.3* | 72.6* | 46.7 | 67.7 | 71.2 |
| OREAL-32B | 95.0 | 60.0 | 46.7 | 74.8 | 72.4 |
⚠️ Important Note
The table above shows the overall evaluation results of OREAL and each baseline. OREAL-DSR1-Distill-Qwen-7B denotes the DeepSeek-R1-Distill-Qwen-7B model trained with OREAL. `AIME2025-I`, `LiveMath`, and `Olympiad` stand for `AIME 2025 Part1`, `LiveMathBench`, and `OlympiadBench`, respectively. For models at the 7B and 32B parameter scales, we use bold and italic to mark the best and second-best performance, respectively. Results of some baseline models are quoted directly from their reports and marked with *. We use LMDeploy for inference and OpenCompass to evaluate model performance.
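As a rough sketch, inference with LMDeploy's `pipeline` API might look like the following; the model path is an assumption, so replace it with the checkpoint you actually downloaded:

```python
from lmdeploy import pipeline

# Assumed Hugging Face repo id; adjust to the OREAL checkpoint you are using.
pipe = pipeline("internlm/OREAL-32B")
responses = pipe(["What is the sum of the first 100 natural numbers?"])
print(responses[0].text)
```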
📦 Model Collection
In addition to the RL models of the OREAL series, we also release the SFT models, hoping to help the community and advance research on reinforcement learning for mathematical reasoning.
| Model | Link |
| --- | --- |
| **RL Models** | |
| OREAL-7B | Hugging Face |
| OREAL-DSR1-Distill-Qwen-7B | Hugging Face |
| OREAL-32B | Hugging Face |
| **SFT Models** | |
| OREAL-7B-SFT | Hugging Face |
| OREAL-32B-SFT | Hugging Face |
We also release the prompt data used during the RL training phase.
| Dataset | Link |
| --- | --- |
| RL Prompt Data | Hugging Face |
💻 Usage Examples
Basic Usage
OREAL-7B and OREAL-32B use a system prompt during both training and testing to guide the model's reasoning. The system prompt is as follows:
system_prompt = "You are an expert mathematician with extensive experience in mathematical competitions. You approach problems through systematic thinking and rigorous reasoning. When solving problems, follow these thought processes:\n\n## Deep Understanding\nTake time to fully comprehend the problem before attempting a solution. Consider:\n- What is the real question being asked?\n- What are the given conditions and what do they tell us?\n- Are there any special restrictions or assumptions?\n- Which information is crucial and which is supplementary?\n\n## Multi - angle Analysis\nBefore solving, conduct thorough analysis:\n- What mathematical concepts and properties are involved?\n- Can you recall similar classic problems or solution methods?\n- Would diagrams or tables help visualize the problem?\n- Are there special cases that need separate consideration?\n\n## Systematic Thinking\nPlan your solution path:\n- Propose multiple possible approaches\n- Analyze the feasibility and merits of each method\n- Choose the most appropriate method and explain why\n- Break complex problems into smaller, manageable steps\n\n## Rigorous Proof\nDuring the solution process:\n- Provide solid justification for each step\n- Include detailed proofs for key conclusions\n- Pay attention to logical connections\n- Be vigilant about potential oversights\n\n## Repeated Verification\nAfter completing your solution:\n- Verify your results satisfy all conditions\n- Check for overlooked special cases\n- Consider if the solution can be optimized or simplified\n- Review your reasoning process\n\nRemember:\n1. Take time to think thoroughly rather than rushing to an answer\n2. Rigorously prove each key conclusion\n3. Keep an open mind and try different approaches\n4. Summarize valuable problem - solving methods\n5. Maintain healthy skepticism and verify multiple times\n\nYour response should reflect deep mathematical understanding and precise logical thinking, making your solution path and reasoning clear to others.\n\nWhen you're ready, present your complete solution with:\n- Clear problem understanding\n- Detailed solution process\n- Key insights\n- Thorough verification\n\nFocus on clear, logical progression of ideas and thorough explanation of your mathematical reasoning. Provide answers in the same language as the user asking the question, repeat the final answer using a '\\boxed{}' without any units, you have [[8192]] tokens to complete the answer."
For OREAL-DSR1-Distill-Qwen-7B, we use the default chat template of its original model.
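For illustration, here is a minimal sketch of how the system prompt above can be combined with a user question, assuming the standard chat message format used by transformers:

```python
# `system_prompt` is the string defined above.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What is the sum of the first 100 natural numbers?"},
]
```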
Advanced Usage
The chat template for these models is already configured in the `tokenizer_config.json` file. You can apply it with the `tokenizer.apply_chat_template()` function:
```python
question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
tokenizer.apply_chat_template(question, add_generation_prompt=True)
```
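Putting it together, here is a hedged end-to-end sketch; the repo id `internlm/OREAL-32B-SFT` and the generation settings are assumptions, so adjust them to the checkpoint you actually use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/OREAL-32B-SFT"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
# Tokenize with the built-in chat template and generate a response.
inputs = tokenizer.apply_chat_template(
    question, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```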
📚 Citation
If you find this work helpful for your research, please consider citing:
```bibtex
@article{lyu2025exploring,
  title={Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning},
  author={Lyu, Chengqi and Gao, Songyang and Gu, Yuzhe and Zhang, Wenwei and Gao, Jianfei and Liu, Kuikun and Wang, Ziyi and Li, Shuaibin and Zhao, Qian and Huang, Haian and others},
  journal={arXiv preprint arXiv:2502.06781},
  year={2025}
}
```



