OREAL-32B-SFTオープンソースの数学推理AIモデル - 複雑な数学問題の解決を無料でサポート

ホーム

OREAL 32B SFT

internlmによって開発

OREAL-32B-SFTはQwen2.5-32Bをベースとした教師あり微調整モデルで、数学推論タスク専用に設計されており、OREA強化学習フレームワークの初期方策モデルです。

大規模言語モデル

Transformers

オープンソースライセンス:Apache-2.0 #数学推論強化学習 #高精度問題解決 #競技レベルの数学能力

ダウンロード数 18

リリース時間 : 2/10/2025

モデル概要

このモデルはOREALシリーズの32Bパラメータ規模の教師あり微調整バージョンで、主に数学推論タスクに使用され、強化学習トレーニングの出発点として機能します。

モデル特徴

数学推論最適化

数学推論タスクに特化して最適化されており、複雑な数学問題を処理可能

強化学習基盤

OREA強化学習フレームワークの初期方策モデルとして、後続の強化学習トレーニングの基盤を提供

高品質な教師あり微調整

注意深く設計された教師あり微調整プロセスにより、モデルが良好な初期性能を備えることを保証

モデル能力

数学問題解答

論理的推論

多段階問題解決

数学的証明生成

使用事例

教育

数学競技指導

学生が数学競技の問題を解決するのを支援し、段階的な解答を提供

数学学習支援

学生に数学問題の詳細な解答と説明を提供

研究

強化学習研究

強化学習トレーニングの初期方策モデルとして

🚀 OREAL-32B-SFT

OREAL-32B-SFTは、結果報酬ベースの強化学習（OREAL）を用いて訓練された数学的推論モデルです。このモデルは、二値の結果報酬のみが利用可能なタスクに特化した新しい強化学習フレームワークを使用しています。

🚀 クイックスタート

このモデルは、数学的推論タスクにおいて高い精度を達成しています。以下のリンクを通じて、モデルや関連データにアクセスできます。

✨ 主な機能

高精度な数学的推論

OREAL-7B は、MATH-500で94.0 pass@1の精度を達成し、以前の32Bモデルと同等の性能を発揮します。
OREAL-32B は、MATH-500で95.0 pass@1の精度を達成し、以前の蒸留訓練された32Bモデルを上回ります。

新しい強化学習フレームワーク

Outcome Reward-based Reinforcement Learning (OREAL) を使用して訓練されています。
二値の結果報酬のみが利用可能なタスクに特化したフレームワークです。

📚 ドキュメント

概要

我々は、OREAL-7B と OREAL-32B という数学的推論モデルシリーズを導入します。これらのモデルは、二値の結果報酬のみが利用可能なタスクに特化した新しい強化学習フレームワークである Outcome Reward-based Reinforcement Learning (OREAL) を使用して訓練されています。

main_fig

我々の方法は、行動クローニングのためのBest-of-N (BoN) サンプリングを利用し、負のサンプル報酬を再構築して勾配の一貫性を確保します。また、長い思考連鎖推論における希薄な報酬の問題を解決するために、推論軌跡の重要なトークンを識別するオンポリシートークンレベルの報酬モデルを組み込んでいます。詳細については、論文を参照してください。

評価結果

モデル	MATH-500	AIME2024	AIME2025-I	LiveMath	Olympiad
APIモデル
GPT-4o-1120	72.8	16.7	13.3	44.8	33.7
Claude-3.5-Sonnet-1022	78.3	13.3	3.3	46.7	35.4
OpenAI-o1-preview	85.5	44.6	40.0	71.0	43.6
OpenAI-o1-mini	90.0	56.6	46.7	74.4	46.3
7Bモデル
Qwen2.5-Instrust-7B	76.6	13.3	0.0	37.0	29.1
Qwen2.5-Math-Instrust-7B	81.8	20.0	13.3	44.1	31.1
rStar-Math-7B	78.4*	26.7*	-	-	47.1*
Qwen2.5-7B-SimpleRL	82.4*	26.7*	-	-	37.6*
Eurus-2-7B-PRIME	79.2*	26.7*	-	-	42.1*
DeepSeek-R1-Distill-Qwen-7B	92.8*	55.5*	40.0	65.6	64.1
OREAL-7B	91.0	33.3	33.3	62.6	59.9
OREAL-DSR1-Distill-Qwen-7B	94.0	50.0	40.0	65.6	66.1
32Bモデル
Qwen2.5-Instrust-32B	80.6	20.0	13.3	50.8	40.4
QwQ-32B-Preview	90.6	50.0	40.0	72.7	58.5
DeepSeek-R1-Distill-Qwen-32B	94.3*	72.6*	46.7	67.7	71.2
OREAL-32B	95.0	60.0	46.7	74.8	72.4

注: OREAL と各ベースラインの全体的な評価結果です。 OREAL-DSR1-Distill-Qwen-7B は、OREAL で訓練されたDeepSeek-R1-Distill-Qwen-7Bを表します。 AIME2025-I、LiveMath、Olympiad はそれぞれ AIME 2025 Part1、LiveMathBench、OlympiadBench を表します。 7Bと32Bのパラメータ規模のモデルについて、我々はそれぞれ太字と斜体を使用して、最高と2番目に高い性能を表しています。一部のベースラインについては、彼らの報告から直接結果を使用し、* でマークしています。我々は LMDeploy を使用して推論を行い、OpenCompass を使用してモデルの性能を評価しています。

モデルコレクション

我々は、OREAL シリーズの RL モデルだけでなく、SFT モデルもリリースしています。これがコミュニティに役立ち、数学的推論の強化学習の研究に貢献することを願っています。

モデル	リンク
RLモデル
OREAL-7B	Hugging Face
OREAL-DSR1-Distill-Qwen-7B	Hugging Face
OREAL-32B	Hugging Face
SFTモデル
OREAL-7B-SFT	Hugging Face
OREAL-32B-SFT	Hugging Face

我々はまた、RL訓練フェーズで使用したプロンプトもリリースしています。

データセット	リンク
RL Prompts	Hugging Face

💻 使用例

基本的な使用法

OREAL-7BとOREAL-32Bは、訓練とテスト時にモデルを推論させるためにシステムプロンプトを使用します。システムプロンプトは以下の通りです。

system_prompt = "You are an expert mathematician with extensive experience in mathematical competitions. You approach problems through systematic thinking and rigorous reasoning. When solving problems, follow these thought processes:\n\n## Deep Understanding\nTake time to fully comprehend the problem before attempting a solution. Consider:\n- What is the real question being asked?\n- What are the given conditions and what do they tell us?\n- Are there any special restrictions or assumptions?\n- Which information is crucial and which is supplementary?\n\n## Multi-angle Analysis\nBefore solving, conduct thorough analysis:\n- What mathematical concepts and properties are involved?\n- Can you recall similar classic problems or solution methods?\n- Would diagrams or tables help visualize the problem?\n- Are there special cases that need separate consideration?\n\n## Systematic Thinking\nPlan your solution path:\n- Propose multiple possible approaches\n- Analyze the feasibility and merits of each method\n- Choose the most appropriate method and explain why\n- Break complex problems into smaller, manageable steps\n\n## Rigorous Proof\nDuring the solution process:\n- Provide solid justification for each step\n- Include detailed proofs for key conclusions\n- Pay attention to logical connections\n- Be vigilant about potential oversights\n\n## Repeated Verification\nAfter completing your solution:\n- Verify your results satisfy all conditions\n- Check for overlooked special cases\n- Consider if the solution can be optimized or simplified\n- Review your reasoning process\n\nRemember:\n1. Take time to think thoroughly rather than rushing to an answer\n2. Rigorously prove each key conclusion\n3. Keep an open mind and try different approaches\n4. Summarize valuable problem-solving methods\n5. Maintain healthy skepticism and verify multiple times\n\nYour response should reflect deep mathematical understanding and precise logical thinking, making your solution path and reasoning clear to others.\n\nWhen you're ready, present your complete solution with:\n- Clear problem understanding\n- Detailed solution process\n- Key insights\n- Thorough verification\n\nFocus on clear, logical progression of ideas and thorough explanation of your mathematical reasoning. Provide answers in the same language as the user asking the question, repeat the final answer using a '\\boxed{}' without any units, you have [[8192]] tokens to complete the answer."

高度な使用法

OREAL-DSR1-Distill-Qwen-7Bについては、元のモデルのデフォルトのチャットテンプレートを使用します。

これらのモデルのチャットテンプレートは、tokenizer_config.jsonファイルにすでに設定されています。tokenizer.apply_chat_template() 関数を使用してチャットテンプレートを適用します。

question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
tokenizer.apply_chat_template(question, add_generation_prompt=True)

引用

もしあなたがこの研究が役に立ったと感じた場合は、以下のように引用を考慮してください。

@article{lyu2025exploring,
  title={Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning},
  author={Lyu, Chengqi and Gao, Songyang and Gu, Yuzhe and Zhang, Wenwei and Gao, Jianfei and Liu, Kuikun and Wang, Ziyi and Li, Shuaibin and Zhao, Qian and Huang, Haian and others},
  journal={arXiv preprint arXiv:2502.06781},
  year={2025}
}