# 🚀 OREAL-32B-SFT

OREAL-32B-SFT belongs to a series of mathematical reasoning models trained with Outcome REwArd-based reinforcement Learning (OREAL). The series performs strongly on mathematical reasoning tasks: a 7B model trained with OREAL reaches 94.0 pass@1 accuracy on MATH-500, and OREAL-32B reaches 95.0 pass@1, surpassing previous distillation-trained 32B models.
## Basic Information

| Attribute | Details |
|---|---|
| Base model | Qwen/Qwen2.5-32B |
| License | apache-2.0 |
| Library | transformers |
| Task type | question-answering |
⚠️ **Important note**

This model is the initial policy for OREAL RL training.
## ✨ Key Features

We introduce OREAL-7B and OREAL-32B, a series of mathematical reasoning models trained with Outcome REwArd-based reinforcement Learning (OREAL), a new RL framework designed for tasks where only binary outcome rewards are available.

- **Strong performance**: the 7B model trained with OREAL reaches 94.0 pass@1 accuracy on MATH-500, on par with previous 32B models. OREAL-32B goes further, surpassing previous distillation-trained 32B models with 95.0 pass@1 on MATH-500.
- **Novel method**: our approach uses best-of-N (BoN) sampling for behavior cloning and reshapes the rewards of negative samples to keep gradients consistent. In addition, to address the challenge of sparse rewards in long chain-of-thought reasoning, we introduce an on-policy token-level reward model that identifies the key tokens in a reasoning trajectory for importance sampling; a conceptual sketch follows this list. For more details, please refer to our paper.
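To make these ingredients concrete, here is a toy sketch, written for this card rather than taken from the released training code, of how a per-trajectory loss could combine behavior cloning on best-of-N positive samples, reshaped negative-sample gradients, and token-level importance weights. All names and the shaping coefficient are illustrative assumptions, not the OREAL objective itself.

```python
import torch

def toy_oreal_style_loss(token_logprobs: torch.Tensor,
                         token_weights: torch.Tensor,
                         outcome_reward: int,
                         neg_shaping_coef: float = 0.1) -> torch.Tensor:
    """Toy per-trajectory loss, illustrative only.

    token_logprobs: per-token log-probs of the sampled trajectory under the policy.
    token_weights: importance weights from a token-level reward model,
                   highlighting the key tokens of the reasoning trace.
    outcome_reward: binary outcome reward (1 = correct final answer, 0 = wrong).
    """
    weighted = token_weights * token_logprobs
    if outcome_reward == 1:
        # Behavior cloning on a best-of-N positive sample: maximize the
        # (token-weighted) likelihood of the successful trajectory.
        return -weighted.sum()
    # Negative sample: penalize its likelihood, with the reward reshaped
    # (scaled down here) so its gradient stays consistent with the
    # positive-sample term instead of dominating it.
    return neg_shaping_coef * weighted.sum()
```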
## 📊 Evaluation Results

| Model | MATH-500 | AIME2024 | AIME2025-I | LiveMath | Olympiad |
|---|---|---|---|---|---|
| **API Models** | | | | | |
| GPT-4o-1120 | 72.8 | 16.7 | 13.3 | 44.8 | 33.7 |
| Claude-3.5-Sonnet-1022 | 78.3 | 13.3 | 3.3 | 46.7 | 35.4 |
| OpenAI-o1-preview | 85.5 | 44.6 | 40.0 | 71.0 | 43.6 |
| OpenAI-o1-mini | 90.0 | 56.6 | 46.7 | 74.4 | 46.3 |
| **7B Models** | | | | | |
| Qwen2.5-Instruct-7B | 76.6 | 13.3 | 0.0 | 37.0 | 29.1 |
| Qwen2.5-Math-Instruct-7B | 81.8 | 20.0 | 13.3 | 44.1 | 31.1 |
| rStar-Math-7B | 78.4* | 26.7* | - | - | 47.1* |
| Qwen2.5-7B-SimpleRL | 82.4* | 26.7* | - | - | 37.6* |
| Eurus-2-7B-PRIME | 79.2* | 26.7* | - | - | 42.1* |
| DeepSeek-R1-Distill-Qwen-7B | 92.8* | 55.5* | 40.0 | 65.6 | 64.1 |
| OREAL-7B | 91.0 | 33.3 | 33.3 | 62.6 | 59.9 |
| OREAL-DSR1-Distill-Qwen-7B | 94.0 | 50.0 | 40.0 | 65.6 | 66.1 |
| **32B Models** | | | | | |
| Qwen2.5-Instruct-32B | 80.6 | 20.0 | 13.3 | 50.8 | 40.4 |
| QwQ-32B-Preview | 90.6 | 50.0 | 40.0 | 72.7 | 58.5 |
| DeepSeek-R1-Distill-Qwen-32B | 94.3* | 72.6* | 46.7 | 67.7 | 71.2 |
| OREAL-32B | 95.0 | 60.0 | 46.7 | 74.8 | 72.4 |
⚠️ **Important note**

The table above shows the overall evaluation results of OREAL and each baseline. OREAL-DSR1-Distill-Qwen-7B denotes the DeepSeek-R1-Distill-Qwen-7B model trained with OREAL. `AIME2025-I`, `LiveMath`, and `Olympiad` stand for `AIME 2025 Part1`, `LiveMathBench`, and `OlympiadBench`, respectively. In the original report, bold and italics mark the best and second-best performance among models at the 7B and 32B parameter scales. Results of some baselines are quoted directly from their reports and marked with *. We use LMDeploy for inference and OpenCompass to evaluate model performance.
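As a reference for the inference setup mentioned above, here is a minimal sketch using LMDeploy's `pipeline` API. The model ID and generation settings are assumptions for illustration, not the exact evaluation configuration.

```python
from lmdeploy import GenerationConfig, pipeline

# Model ID assumed from this model card; greedy decoding chosen for illustration.
pipe = pipeline("internlm/OREAL-32B-SFT")
gen_config = GenerationConfig(max_new_tokens=1024, top_k=1)

responses = pipe(["What is the sum of the first 100 natural numbers?"],
                 gen_config=gen_config)
print(responses[0].text)
```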
## 📦 Model Collection

In addition to the RL models of the OREAL series, we also release the SFT models, hoping to help the community and advance research on reinforcement learning for mathematical reasoning.
| Model | Link |
|---|---|
| **RL models** | |
| OREAL-7B | Hugging Face |
| OREAL-DSR1-Distill-Qwen-7B | Hugging Face |
| OREAL-32B | Hugging Face |
| **SFT models** | |
| OREAL-7B-SFT | Hugging Face |
| OREAL-32B-SFT | Hugging Face |
We also release the prompt data used during the RL training stage.

| Dataset | Link |
|---|---|
| RL prompt data | Hugging Face |
## 💻 Usage Examples

### Basic Usage

OREAL-7B and OREAL-32B use a system prompt during both training and testing to guide the model's reasoning. The system prompt is as follows:
system_prompt = "You are an expert mathematician with extensive experience in mathematical competitions. You approach problems through systematic thinking and rigorous reasoning. When solving problems, follow these thought processes:\n\n## Deep Understanding\nTake time to fully comprehend the problem before attempting a solution. Consider:\n- What is the real question being asked?\n- What are the given conditions and what do they tell us?\n- Are there any special restrictions or assumptions?\n- Which information is crucial and which is supplementary?\n\n## Multi - angle Analysis\nBefore solving, conduct thorough analysis:\n- What mathematical concepts and properties are involved?\n- Can you recall similar classic problems or solution methods?\n- Would diagrams or tables help visualize the problem?\n- Are there special cases that need separate consideration?\n\n## Systematic Thinking\nPlan your solution path:\n- Propose multiple possible approaches\n- Analyze the feasibility and merits of each method\n- Choose the most appropriate method and explain why\n- Break complex problems into smaller, manageable steps\n\n## Rigorous Proof\nDuring the solution process:\n- Provide solid justification for each step\n- Include detailed proofs for key conclusions\n- Pay attention to logical connections\n- Be vigilant about potential oversights\n\n## Repeated Verification\nAfter completing your solution:\n- Verify your results satisfy all conditions\n- Check for overlooked special cases\n- Consider if the solution can be optimized or simplified\n- Review your reasoning process\n\nRemember:\n1. Take time to think thoroughly rather than rushing to an answer\n2. Rigorously prove each key conclusion\n3. Keep an open mind and try different approaches\n4. Summarize valuable problem - solving methods\n5. Maintain healthy skepticism and verify multiple times\n\nYour response should reflect deep mathematical understanding and precise logical thinking, making your solution path and reasoning clear to others.\n\nWhen you're ready, present your complete solution with:\n- Clear problem understanding\n- Detailed solution process\n- Key insights\n- Thorough verification\n\nFocus on clear, logical progression of ideas and thorough explanation of your mathematical reasoning. Provide answers in the same language as the user asking the question, repeat the final answer using a '\\boxed{}' without any units, you have [[8192]] tokens to complete the answer."
For OREAL-DSR1-Distill-Qwen-7B, we use the default chat template of its original model.
### Advanced Usage

The chat templates for these models are already set in the `tokenizer_config.json` file and can be applied with the `tokenizer.apply_chat_template()` function:
```python
from transformers import AutoTokenizer

# Model ID assumed from this model card; adjust to the checkpoint you use.
tokenizer = AutoTokenizer.from_pretrained("internlm/OREAL-32B-SFT")
question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
tokenizer.apply_chat_template(question, add_generation_prompt=True)
```
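Building on this, a minimal end-to-end generation sketch with `transformers` might look as follows (the model ID, dtype, and generation settings are assumptions; adjust them for your hardware):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/OREAL-32B-SFT"  # assumed repo ID for this model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
input_ids = tokenizer.apply_chat_template(
    question, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding for a deterministic answer; trim the prompt before decoding.
output = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```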
## 📚 Citation

If you find this work helpful for your research, please consider citing:
```bibtex
@article{lyu2025exploring,
  title={Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning},
  author={Lyu, Chengqi and Gao, Songyang and Gu, Yuzhe and Zhang, Wenwei and Gao, Jianfei and Liu, Kuikun and Wang, Ziyi and Li, Shuaibin and Zhao, Qian and Huang, Haian and others},
  journal={arXiv preprint arXiv:2502.06781},
  year={2025}
}
```



