# 🚀 OREAL-32B-SFT

OREAL-32B-SFT belongs to a series of mathematical reasoning models trained with Outcome REwArd-based reinforcement Learning (OREAL). The series performs strongly on mathematical reasoning tasks: a 7B model trained with OREAL reaches 94.0 pass@1 accuracy on MATH-500, and OREAL-32B reaches 95.0 pass@1, surpassing previous distillation-trained 32B models.
## Basic Information

| Attribute | Details |
|---|---|
| Base model | Qwen/Qwen2.5-32B |
| License | apache-2.0 |
| Library | transformers |
| Task type | question-answering |
⚠️ **Important note**

This model is the initial policy for OREAL RL training.
## ✨ Key Features

We introduce OREAL-7B and OREAL-32B, a series of mathematical reasoning models trained with Outcome REwArd-based reinforcement Learning (OREAL), a new RL framework designed for tasks where only binary outcome rewards are available.

- **Strong performance**: the 7B model trained with OREAL reaches 94.0 pass@1 accuracy on MATH-500, on par with previous 32B models. OREAL-32B goes further, surpassing previous distillation-trained 32B models with 95.0 pass@1 on MATH-500.
- **Novel method**: our approach uses best-of-N (BoN) sampling for behavior cloning and reshapes the rewards of negative samples to keep gradients consistent. In addition, to address the challenge of sparse rewards in long chain-of-thought reasoning, we introduce an on-policy token-level reward model that identifies the key tokens in a reasoning trajectory for importance sampling; a conceptual sketch follows this list. For more details, please refer to our paper.
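To make these ingredients concrete, here is a toy sketch, written for this card rather than taken from the released training code, of how a per-trajectory loss could combine behavior cloning on best-of-N positive samples, reshaped negative-sample gradients, and token-level importance weights. All names and the shaping coefficient are illustrative assumptions, not the OREAL objective itself.

```python
import torch

def toy_oreal_style_loss(token_logprobs: torch.Tensor,
                         token_weights: torch.Tensor,
                         outcome_reward: int,
                         neg_shaping_coef: float = 0.1) -> torch.Tensor:
    """Toy per-trajectory loss, illustrative only.

    token_logprobs: per-token log-probs of the sampled trajectory under the policy.
    token_weights: importance weights from a token-level reward model,
                   highlighting the key tokens of the reasoning trace.
    outcome_reward: binary outcome reward (1 = correct final answer, 0 = wrong).
    """
    weighted = token_weights * token_logprobs
    if outcome_reward == 1:
        # Behavior cloning on a best-of-N positive sample: maximize the
        # (token-weighted) likelihood of the successful trajectory.
        return -weighted.sum()
    # Negative sample: penalize its likelihood, with the reward reshaped
    # (scaled down here) so its gradient stays consistent with the
    # positive-sample term instead of dominating it.
    return neg_shaping_coef * weighted.sum()
```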
## 📊 Evaluation Results

| Model | MATH-500 | AIME2024 | AIME2025-I | LiveMath | Olympiad |
|---|---|---|---|---|---|
| **API Models** | | | | | |
| GPT-4o-1120 | 72.8 | 16.7 | 13.3 | 44.8 | 33.7 |
| Claude-3.5-Sonnet-1022 | 78.3 | 13.3 | 3.3 | 46.7 | 35.4 |
| OpenAI-o1-preview | 85.5 | 44.6 | 40.0 | 71.0 | 43.6 |
| OpenAI-o1-mini | 90.0 | 56.6 | 46.7 | 74.4 | 46.3 |
| **7B Models** | | | | | |
| Qwen2.5-Instruct-7B | 76.6 | 13.3 | 0.0 | 37.0 | 29.1 |
| Qwen2.5-Math-Instruct-7B | 81.8 | 20.0 | 13.3 | 44.1 | 31.1 |
| rStar-Math-7B | 78.4* | 26.7* | - | - | 47.1* |
| Qwen2.5-7B-SimpleRL | 82.4* | 26.7* | - | - | 37.6* |
| Eurus-2-7B-PRIME | 79.2* | 26.7* | - | - | 42.1* |
| DeepSeek-R1-Distill-Qwen-7B | 92.8* | 55.5* | 40.0 | 65.6 | 64.1 |
| OREAL-7B | 91.0 | 33.3 | 33.3 | 62.6 | 59.9 |
| OREAL-DSR1-Distill-Qwen-7B | 94.0 | 50.0 | 40.0 | 65.6 | 66.1 |
| **32B Models** | | | | | |
| Qwen2.5-Instruct-32B | 80.6 | 20.0 | 13.3 | 50.8 | 40.4 |
| QwQ-32B-Preview | 90.6 | 50.0 | 40.0 | 72.7 | 58.5 |
| DeepSeek-R1-Distill-Qwen-32B | 94.3* | 72.6* | 46.7 | 67.7 | 71.2 |
| OREAL-32B | 95.0 | 60.0 | 46.7 | 74.8 | 72.4 |
⚠️ **Important note**

The table above shows the overall evaluation results of OREAL and each baseline. OREAL-DSR1-Distill-Qwen-7B denotes the DeepSeek-R1-Distill-Qwen-7B model trained with OREAL. `AIME2025-I`, `LiveMath`, and `Olympiad` stand for `AIME 2025 Part1`, `LiveMathBench`, and `OlympiadBench`, respectively. In the original report, bold and italics mark the best and second-best performance among models at the 7B and 32B parameter scales. Results of some baselines are quoted directly from their reports and marked with *. We use LMDeploy for inference and OpenCompass to evaluate model performance.
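As a reference for the inference setup mentioned above, here is a minimal sketch using LMDeploy's `pipeline` API. The model ID and generation settings are assumptions for illustration, not the exact evaluation configuration.

```python
from lmdeploy import GenerationConfig, pipeline

# Model ID assumed from this model card; greedy decoding chosen for illustration.
pipe = pipeline("internlm/OREAL-32B-SFT")
gen_config = GenerationConfig(max_new_tokens=1024, top_k=1)

responses = pipe(["What is the sum of the first 100 natural numbers?"],
                 gen_config=gen_config)
print(responses[0].text)
```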
## 📦 Model Collection

In addition to the RL models of the OREAL series, we also release the SFT models, hoping to help the community and advance research on reinforcement learning for mathematical reasoning.
| Model | Link |
|---|---|
| **RL models** | |
| OREAL-7B | Hugging Face |
| OREAL-DSR1-Distill-Qwen-7B | Hugging Face |
| OREAL-32B | Hugging Face |
| **SFT models** | |
| OREAL-7B-SFT | Hugging Face |
| OREAL-32B-SFT | Hugging Face |
We also release the prompt data used during the RL training stage.

| Dataset | Link |
|---|---|
| RL prompt data | Hugging Face |
## 💻 Usage Examples

### Basic Usage

OREAL-7B and OREAL-32B use a system prompt during both training and testing to guide the model's reasoning. The system prompt is as follows:
system_prompt = "You are an expert mathematician with extensive experience in mathematical competitions. You approach problems through systematic thinking and rigorous reasoning. When solving problems, follow these thought processes:\n\n## Deep Understanding\nTake time to fully comprehend the problem before attempting a solution. Consider:\n- What is the real question being asked?\n- What are the given conditions and what do they tell us?\n- Are there any special restrictions or assumptions?\n- Which information is crucial and which is supplementary?\n\n## Multi - angle Analysis\nBefore solving, conduct thorough analysis:\n- What mathematical concepts and properties are involved?\n- Can you recall similar classic problems or solution methods?\n- Would diagrams or tables help visualize the problem?\n- Are there special cases that need separate consideration?\n\n## Systematic Thinking\nPlan your solution path:\n- Propose multiple possible approaches\n- Analyze the feasibility and merits of each method\n- Choose the most appropriate method and explain why\n- Break complex problems into smaller, manageable steps\n\n## Rigorous Proof\nDuring the solution process:\n- Provide solid justification for each step\n- Include detailed proofs for key conclusions\n- Pay attention to logical connections\n- Be vigilant about potential oversights\n\n## Repeated Verification\nAfter completing your solution:\n- Verify your results satisfy all conditions\n- Check for overlooked special cases\n- Consider if the solution can be optimized or simplified\n- Review your reasoning process\n\nRemember:\n1. Take time to think thoroughly rather than rushing to an answer\n2. Rigorously prove each key conclusion\n3. Keep an open mind and try different approaches\n4. Summarize valuable problem - solving methods\n5. Maintain healthy skepticism and verify multiple times\n\nYour response should reflect deep mathematical understanding and precise logical thinking, making your solution path and reasoning clear to others.\n\nWhen you're ready, present your complete solution with:\n- Clear problem understanding\n- Detailed solution process\n- Key insights\n- Thorough verification\n\nFocus on clear, logical progression of ideas and thorough explanation of your mathematical reasoning. Provide answers in the same language as the user asking the question, repeat the final answer using a '\\boxed{}' without any units, you have [[8192]] tokens to complete the answer."
For OREAL-DSR1-Distill-Qwen-7B, we use the default chat template of its original model.
### Advanced Usage

The chat templates for these models are already set in the `tokenizer_config.json` file and can be applied with the `tokenizer.apply_chat_template()` function:
```python
from transformers import AutoTokenizer

# Model ID assumed from this model card; adjust to the checkpoint you use.
tokenizer = AutoTokenizer.from_pretrained("internlm/OREAL-32B-SFT")
question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
tokenizer.apply_chat_template(question, add_generation_prompt=True)
```
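Building on this, a minimal end-to-end generation sketch with `transformers` might look as follows (the model ID, dtype, and generation settings are assumptions; adjust them for your hardware):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/OREAL-32B-SFT"  # assumed repo ID for this model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
input_ids = tokenizer.apply_chat_template(
    question, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding for a deterministic answer; trim the prompt before decoding.
output = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```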
## 📚 Citation

If you find this work helpful for your research, please consider citing:
```bibtex
@article{lyu2025exploring,
  title={Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning},
  author={Lyu, Chengqi and Gao, Songyang and Gu, Yuzhe and Zhang, Wenwei and Gao, Jianfei and Liu, Kuikun and Wang, Ziyi and Li, Shuaibin and Zhao, Qian and Huang, Haian and others},
  journal={arXiv preprint arXiv:2502.06781},
  year={2025}
}
```



