🚀 OREAL-32B-SFT
OREAL-32B-SFT belongs to the OREAL series of mathematical reasoning models trained with Outcome REwArd-based reinforcement Learning (OREAL). The series performs strongly on mathematical reasoning tasks: a 7B model trained with OREAL reaches 94.0 pass@1 accuracy on MATH-500, and the OREAL-32B model reaches 95.0 pass@1, surpassing previous distillation-trained 32B models.
Basic Information
| Attribute | Details |
| --- | --- |
| Base model | Qwen/Qwen2.5-32B |
| License | apache-2.0 |
| Library | transformers |
| Task type | question answering |
⚠️ Important Note
This model is the initial policy for OREAL RL training.
🔗 Links
✨ Key Features
We introduce OREAL-7B and OREAL-32B, a series of mathematical reasoning models trained with Outcome REwArd-based reinforcement Learning (OREAL), a new RL framework designed for tasks where only binary outcome rewards are available.
- High performance: a 7B model trained with OREAL reaches 94.0 pass@1 accuracy on MATH-500, matching the performance of previous 32B models. OREAL-32B goes further, surpassing previous distillation-trained 32B models with 95.0 pass@1 accuracy on MATH-500.
- Novel method: our approach uses best-of-N (BoN) sampling for behavior cloning and reshapes the rewards of negative samples to keep gradients consistent. To address the challenge of sparse rewards in long chain-of-thought reasoning, we also introduce an on-policy token-level reward model that identifies the key tokens in a reasoning trajectory for importance sampling (see the illustrative sketch below). For more details, please refer to our paper.
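To make the objective concrete, here is a minimal, illustrative PyTorch sketch of a token-weighted loss in this spirit. It is not the authors' implementation; the function name, tensor shapes, and the `neg_scale` reshaping factor are all assumptions:

```python
import torch
import torch.nn.functional as F

def oreal_style_loss(logits, labels, token_weights, is_positive, neg_scale=0.5):
    """Toy token-weighted objective: behavior cloning on positive (best-of-N)
    samples, with a reshaped, sign-flipped term for negative samples so that
    gradients stay consistent across both cases.

    logits: (T, V) model outputs; labels: (T,) sampled token ids;
    token_weights: (T,) importance weights from a token-level reward model.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (T,)
    weighted = token_weights * token_logp
    if is_positive:
        # Positive trajectory: maximize the weighted log-likelihood (behavior cloning).
        return -weighted.mean()
    # Negative trajectory: penalize likelihood, with the reward reshaped by
    # neg_scale so the gradient magnitude stays comparable to the positive case.
    return neg_scale * weighted.mean()
```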
📊 Evaluation Results
| Model | MATH-500 | AIME2024 | AIME2025-I | LiveMath | Olympiad |
| --- | --- | --- | --- | --- | --- |
| **API Models** | | | | | |
| GPT-4o-1120 | 72.8 | 16.7 | 13.3 | 44.8 | 33.7 |
| Claude-3.5-Sonnet-1022 | 78.3 | 13.3 | 3.3 | 46.7 | 35.4 |
| OpenAI-o1-preview | 85.5 | 44.6 | 40.0 | 71.0 | 43.6 |
| OpenAI-o1-mini | 90.0 | 56.6 | 46.7 | 74.4 | 46.3 |
| **7B Models** | | | | | |
| Qwen2.5-Instruct-7B | 76.6 | 13.3 | 0.0 | 37.0 | 29.1 |
| Qwen2.5-Math-Instruct-7B | 81.8 | 20.0 | 13.3 | 44.1 | 31.1 |
| rStar-Math-7B | 78.4* | 26.7* | - | - | 47.1* |
| Qwen2.5-7B-SimpleRL | 82.4* | 26.7* | - | - | 37.6* |
| Eurus-2-7B-PRIME | 79.2* | 26.7* | - | - | 42.1* |
| DeepSeek-R1-Distill-Qwen-7B | 92.8* | 55.5* | 40.0 | 65.6 | 64.1 |
| OREAL-7B | 91.0 | 33.3 | 33.3 | 62.6 | 59.9 |
| OREAL-DSR1-Distill-Qwen-7B | 94.0 | 50.0 | 40.0 | 65.6 | 66.1 |
| **32B Models** | | | | | |
| Qwen2.5-Instruct-32B | 80.6 | 20.0 | 13.3 | 50.8 | 40.4 |
| QwQ-32B-Preview | 90.6 | 50.0 | 40.0 | 72.7 | 58.5 |
| DeepSeek-R1-Distill-Qwen-32B | 94.3* | 72.6* | 46.7 | 67.7 | 71.2 |
| OREAL-32B | 95.0 | 60.0 | 46.7 | 74.8 | 72.4 |
⚠️ Important Note
The table above shows the overall evaluation results of OREAL and each baseline. OREAL-DSR1-Distill-Qwen-7B denotes the DeepSeek-R1-Distill-Qwen-7B model trained with OREAL. `AIME2025-I`, `LiveMath`, and `Olympiad` stand for `AIME 2025 Part1`, `LiveMathBench`, and `OlympiadBench`, respectively. For models at the 7B and 32B parameter scales, we use bold and italic to mark the best and second-best performance, respectively. Results of some baseline models are quoted directly from their reports and marked with *. We use LMDeploy for inference and OpenCompass to evaluate model performance.
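As a rough sketch, inference with LMDeploy's `pipeline` API might look like the following; the model path is an assumption, so replace it with the checkpoint you actually downloaded:

```python
from lmdeploy import pipeline

# Assumed Hugging Face repo id; adjust to the OREAL checkpoint you are using.
pipe = pipeline("internlm/OREAL-32B")
responses = pipe(["What is the sum of the first 100 natural numbers?"])
print(responses[0].text)
```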
📦 Model Collection
In addition to the RL models of the OREAL series, we also release the SFT models, hoping to help the community and advance research on reinforcement learning for mathematical reasoning.
| Model | Link |
| --- | --- |
| **RL Models** | |
| OREAL-7B | Hugging Face |
| OREAL-DSR1-Distill-Qwen-7B | Hugging Face |
| OREAL-32B | Hugging Face |
| **SFT Models** | |
| OREAL-7B-SFT | Hugging Face |
| OREAL-32B-SFT | Hugging Face |
We also release the prompt data used during the RL training phase.
| Dataset | Link |
| --- | --- |
| RL Prompt Data | Hugging Face |
💻 Usage Examples
Basic Usage
OREAL-7B and OREAL-32B use a system prompt during both training and testing to guide the model's reasoning. The system prompt is as follows:
system_prompt = "You are an expert mathematician with extensive experience in mathematical competitions. You approach problems through systematic thinking and rigorous reasoning. When solving problems, follow these thought processes:\n\n## Deep Understanding\nTake time to fully comprehend the problem before attempting a solution. Consider:\n- What is the real question being asked?\n- What are the given conditions and what do they tell us?\n- Are there any special restrictions or assumptions?\n- Which information is crucial and which is supplementary?\n\n## Multi - angle Analysis\nBefore solving, conduct thorough analysis:\n- What mathematical concepts and properties are involved?\n- Can you recall similar classic problems or solution methods?\n- Would diagrams or tables help visualize the problem?\n- Are there special cases that need separate consideration?\n\n## Systematic Thinking\nPlan your solution path:\n- Propose multiple possible approaches\n- Analyze the feasibility and merits of each method\n- Choose the most appropriate method and explain why\n- Break complex problems into smaller, manageable steps\n\n## Rigorous Proof\nDuring the solution process:\n- Provide solid justification for each step\n- Include detailed proofs for key conclusions\n- Pay attention to logical connections\n- Be vigilant about potential oversights\n\n## Repeated Verification\nAfter completing your solution:\n- Verify your results satisfy all conditions\n- Check for overlooked special cases\n- Consider if the solution can be optimized or simplified\n- Review your reasoning process\n\nRemember:\n1. Take time to think thoroughly rather than rushing to an answer\n2. Rigorously prove each key conclusion\n3. Keep an open mind and try different approaches\n4. Summarize valuable problem - solving methods\n5. Maintain healthy skepticism and verify multiple times\n\nYour response should reflect deep mathematical understanding and precise logical thinking, making your solution path and reasoning clear to others.\n\nWhen you're ready, present your complete solution with:\n- Clear problem understanding\n- Detailed solution process\n- Key insights\n- Thorough verification\n\nFocus on clear, logical progression of ideas and thorough explanation of your mathematical reasoning. Provide answers in the same language as the user asking the question, repeat the final answer using a '\\boxed{}' without any units, you have [[8192]] tokens to complete the answer."
For OREAL-DSR1-Distill-Qwen-7B, we use the default chat template of its original model.
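For illustration, here is a minimal sketch of how the system prompt above can be combined with a user question, assuming the standard chat message format used by transformers:

```python
# `system_prompt` is the string defined above.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What is the sum of the first 100 natural numbers?"},
]
```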
Advanced Usage
The chat template for these models is already configured in the `tokenizer_config.json` file. You can apply it with the `tokenizer.apply_chat_template()` function:
```python
question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
tokenizer.apply_chat_template(question, add_generation_prompt=True)
```
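Putting it together, here is a hedged end-to-end sketch; the repo id `internlm/OREAL-32B-SFT` and the generation settings are assumptions, so adjust them to the checkpoint you actually use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/OREAL-32B-SFT"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
# Tokenize with the built-in chat template and generate a response.
inputs = tokenizer.apply_chat_template(
    question, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```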
📚 Citation
If you find this work helpful for your research, please consider citing:
```bibtex
@article{lyu2025exploring,
  title={Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning},
  author={Lyu, Chengqi and Gao, Songyang and Gu, Yuzhe and Zhang, Wenwei and Gao, Jianfei and Liu, Kuikun and Wang, Ziyi and Li, Shuaibin and Zhao, Qian and Huang, Haian and others},
  journal={arXiv preprint arXiv:2502.06781},
  year={2025}
}
```



