Phi-4-reasoning-GGUF: An Open-Source Reasoning Model for Math, Science, and Coding, Free to Use

Phi 4 Reasoning GGUF

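Developed by unsloth

Phi-4-reasoning is an advanced reasoning model fine-tuned from Phi-4. Through supervised fine-tuning and reinforcement learning, it delivers strong reasoning performance in math, science, and coding.

Large language model

Transformers

License: MIT #Mathematical reasoning #Scientific problem solving #Long-text reasoning

Downloads: 6,046

Released: 5/1/2025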

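Model overview

Phi-4-reasoning is a language model focused on math, science, and coding reasoning, suited to scenarios that place high demands on reasoning and logic.

Model features

Advanced reasoning

Through supervised fine-tuning and reinforcement learning, the model delivers strong reasoning performance in math, science, and coding.

Efficient performance

Performs well across multiple reasoning tasks and general-capability benchmarks, outperforming many open-weight models with more parameters.

Broad applicability

Suited to scenarios with demanding reasoning and logic requirements, such as memory/compute-constrained environments and latency-bound scenarios.

Safety post-training

Uses a robust safety post-training approach based on supervised fine-tuning (SFT), aimed at safe and ethical model behavior.

Model capabilities

Mathematical reasoning

Scientific question answering

Code generation

Complex problem solving

Logical reasoning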

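Use cases

Education

Math Olympiad problem solving

Solves complex problems from math Olympiad competitions such as AIME.

Achieves 62.9% accuracy on AIME 2025

Graduate-level scientific question answering

Answers complex graduate-level science questions such as those in GPQA-Diamond.

Achieves 65.8% accuracy on GPQA-Diamond

Programming

Competition code generation

Generates competition-level code solutions.

Achieves 53.8% accuracy on LiveCodeBench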

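🚀 The Phi-4-reasoning model

Phi-4-reasoning is an advanced reasoning model fine-tuned from Phi-4. Through supervised fine-tuning and reinforcement learning, it delivers strong reasoning performance in math, science, and coding, and is suited to scenarios that place high demands on reasoning and logic.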

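🚀 Quick start

Inference parameters

For inference, temperature=0.8 and top_p=0.95 with do_sample=True are recommended. For more complex queries, the maximum number of tokens can be set to 32k to allow for a longer chain of thought (CoT).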
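The recommended settings can be collected in a small helper. The following is a minimal sketch, not part of the model card: the function name `generation_kwargs` and the constant `CONTEXT_WINDOW` are hypothetical, "32k" is assumed to mean 32,768 tokens, and the prompt and the generated tokens are assumed to share the 32k-token context length listed in the model overview. The 4096-token default mirrors the `transformers` example in this document.

```python
# Assumption: "32k" is taken to mean 32 * 1024 = 32,768 tokens.
CONTEXT_WINDOW = 32 * 1024


def generation_kwargs(prompt_tokens, complex_query=False, default_new_tokens=4096):
    """Return sampling settings for model.generate(), capped to the context window.

    prompt_tokens: number of tokens already used by the prompt.
    complex_query: if True, allow the chain of thought to use the whole window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    budget = CONTEXT_WINDOW if complex_query else default_new_tokens
    # Never request more new tokens than still fit after the prompt.
    max_new_tokens = max(0, min(budget, CONTEXT_WINDOW - prompt_tokens))
    return {
        "temperature": 0.8,
        "top_p": 0.95,
        "do_sample": True,
        "max_new_tokens": max_new_tokens,
    }


if __name__ == "__main__":
    print(generation_kwargs(prompt_tokens=300))
    print(generation_kwargs(prompt_tokens=300, complex_query=True))
```

The returned dictionary can be unpacked into a `generate` call, for example `model.generate(inputs, **generation_kwargs(inputs.shape[-1]))`.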

Input format

Given the nature of the training data, always use the ChatML template with the following system prompt for inference:

<|im_start|>system<|im_sep|>
You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:<|im_end|>
<|im_start|>user<|im_sep|>
What is the derivative of x^2?<|im_end|>
<|im_start|>assistant<|im_sep|>
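To make the layout above concrete, here is a minimal sketch that assembles the same kind of prompt string from a list of messages. The function name `build_chatml_prompt` is hypothetical. Whether the line breaks in the listing above are part of the real template is not stated here, so the separator is a parameter; for actual inference, `tokenizer.apply_chat_template` remains the authoritative source.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_chatml_prompt(messages, add_generation_prompt=True, sep="\n"):
    """Render messages in the <|im_start|>role<|im_sep|>content<|im_end|> layout.

    messages: list of {"role": ..., "content": ...} dictionaries.
    sep: text placed after <|im_sep|> and after <|im_end|> (assumption: newline,
         mirroring the listing; the tokenizer's own template may differ).
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|im_start|>{role}<|im_sep|>{sep}{message['content']}<|im_end|>{sep}")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append(f"<|im_start|>assistant<|im_sep|>{sep}")
    return "".join(parts)


if __name__ == "__main__":
    demo = [
        {"role": "system", "content": "(system prompt goes here)"},
        {"role": "user", "content": "What is the derivative of x^2?"},
    ]
    print(build_chatml_prompt(demo))
```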

Using the `transformers` library

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-reasoning", device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:"},
    {"role": "user", "content": "What is the derivative of x^2?"},
]
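# Render the messages with the model's chat template and open the assistant turn
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Sample with the recommended settings; raise max_new_tokens for longer chains of thought
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)
# The decoded text contains the prompt followed by the generated response
print(tokenizer.decode(outputs[0]))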
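Because the system prompt asks for a `<think> ... </think>` block followed by the solution, the generated text can be split into its two parts. The sketch below is one way to do that; the function name `split_reasoning` is hypothetical. It should be given only the newly generated text, since the system prompt itself contains the literal tags and would otherwise be matched first.

```python
def split_reasoning(generated_text):
    """Split generated text into (thought, solution).

    Returns an empty thought if no complete <think>...</think> block is found.
    """
    open_tag, close_tag = "<think>", "</think>"
    # Drop the end-of-turn marker if the decoder kept special tokens.
    text = generated_text.replace("<|im_end|>", "")
    start = text.find(open_tag)
    end = text.find(close_tag, start + len(open_tag)) if start != -1 else -1
    if start == -1 or end == -1:
        return "", text.strip()
    thought = text[start + len(open_tag):end].strip()
    solution = text[end + len(close_tag):].strip()
    return thought, solution


if __name__ == "__main__":
    sample = "<think>Apply the power rule to x^2.</think>The derivative is 2x.<|im_end|>"
    print(split_reasoning(sample))
```

With the example above, one way to obtain only the new tokens is to slice off the prompt before decoding, e.g. `tokenizer.decode(outputs[0][inputs.shape[-1]:])`, assuming `inputs` is the tensor of prompt token ids.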

Using the `vllm` library

vllm serve microsoft/Phi-4-reasoning --enable-reasoning --reasoning-parser deepseek_r1
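Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal sketch using only the Python standard library. Several details are assumptions rather than statements from the model card: the server is assumed to listen on `http://localhost:8000`, the reasoning parser is assumed to return the chain of thought in a `reasoning_content` field of the message (the field name may differ between vLLM versions), and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default local vLLM address
MODEL = "microsoft/Phi-4-reasoning"


def build_chat_request(question, system_prompt, max_tokens=4096):
    """Build a chat-completions payload with the recommended sampling settings."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def extract_answer(response):
    """Return (reasoning, answer) from a chat-completions response dictionary."""
    message = response["choices"][0]["message"]
    # Assumption: the reasoning parser stores the thought under "reasoning_content".
    return message.get("reasoning_content") or "", message.get("content") or ""


def ask(question, system_prompt):
    """Send one question to the running server and return (reasoning, answer)."""
    payload = json.dumps(build_chat_request(question, system_prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_answer(json.load(reply))
```

Calling `ask("What is the derivative of x^2?", system_prompt)` requires the `vllm serve` process above to be running; `system_prompt` would be the system prompt given in the input-format section.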

Phi-4-reasoning can also be used directly in Ollama, llama.cpp, and any framework compatible with Phi-4.

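✨ Key features

Advanced reasoning: Built on supervised fine-tuning and reinforcement learning, the model delivers strong reasoning performance in math, science, and coding.
Efficient performance: Performs well across multiple reasoning tasks and general-capability benchmarks, outperforming many open-weight models with more parameters.
Broad applicability: Suited to scenarios with demanding reasoning and logic requirements, such as memory/compute-constrained environments and latency-bound scenarios.

📦 Installation guide

The documentation does not give specific installation steps. Refer to the official documentation of the relevant frameworks (such as transformers and vllm) for installation.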

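📚 Detailed documentation

Model overview

Property	Details
Developer	Microsoft Research
Description	Phi-4-reasoning is an advanced open-weight reasoning model, fine-tuned from Phi-4 with supervised fine-tuning and reinforcement learning. The supervised fine-tuning dataset includes synthetic prompts and high-quality filtered data from public-domain websites, focused on math, science, and coding skills as well as alignment data for safety and responsible AI.
Architecture	The base model is the same as the previously released Phi-4: a 14B-parameter, dense decoder-only Transformer model
Inputs	Text, best suited to prompts in chat format
Context length	32k tokens
GPUs	32 H100-80G
Training time	2.5 days
Training data	16B tokens, about 8.3B unique tokens
Outputs	Generated text in response to the input. Model responses have two sections: a reasoning chain-of-thought block and a summarization block
Dates	January 2025 to April 2025
Status	Static model trained on offline datasets, with a cutoff of March 2025 and earlier for publicly available data
Release date	April 30, 2025
License	MIT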
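The training figures in the table allow a few back-of-the-envelope derived quantities. The sketch below is illustrative arithmetic only. It assumes that the 16B tokens are the total processed during the 2.5 days on all 32 GPUs, which the table does not state explicitly; the function name `training_summary` is hypothetical.

```python
def training_summary(gpus=32, days=2.5, total_tokens=16e9, unique_tokens=8.3e9):
    """Derive GPU-hours, average passes over unique data, and per-GPU throughput."""
    gpu_hours = gpus * days * 24
    # Average number of times each unique token is seen during training.
    avg_passes = total_tokens / unique_tokens
    # Assumption: all tokens were processed within the stated wall-clock time.
    tokens_per_gpu_second = total_tokens / (gpu_hours * 3600)
    return {
        "gpu_hours": gpu_hours,
        "avg_passes": avg_passes,
        "tokens_per_gpu_second": tokens_per_gpu_second,
    }


if __name__ == "__main__":
    for name, value in training_summary().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the run amounts to 1,920 GPU-hours, with each unique token seen a little under twice on average.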

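Intended use

Use type	Details
Primary use cases	The model is intended to accelerate research on language models and to serve as a building block for generative AI features. It is suited to general-purpose AI systems and applications (primarily in English) that require memory/compute-constrained environments, latency-bound scenarios, and strong reasoning and logic.
Out-of-scope use cases	This model is designed and tested for math reasoning only. It is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models when selecting use cases, and evaluate and mitigate for accuracy, safety, and fairness before using the model within a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy and trade compliance laws) relevant to their use case, including the model's focus on English. Refer to the Responsible AI Considerations section below for further guidance when choosing a use case. Nothing contained in this model card should be interpreted as or deemed a restriction or modification to the license the model is released under.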
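For memory-constrained deployments, a rough estimate of the weight footprint helps when choosing a precision or quantization level. The sketch below is illustrative arithmetic based on the 14B parameter count in the model overview; the function name `weight_memory_gib` is hypothetical. It covers weights only: the KV cache, activations, and the per-block scale overhead of real quantization formats are not included, so actual usage will be higher.

```python
PARAMS = 14e9  # parameter count from the model overview table


def weight_memory_gib(bits_per_weight, params=PARAMS):
    """Approximate memory needed for the weights alone, in GiB."""
    if bits_per_weight <= 0:
        raise ValueError("bits_per_weight must be positive")
    total_bytes = params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: about {weight_memory_gib(bits):.1f} GiB")
```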

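Data overview

Training datasets

The training data is a mixture of Q&A and chat-format data in math, science, and coding. The chat prompts are sourced from filtered high-quality web data and optionally rewritten and processed through a synthetic data generation pipeline. Data to improve truthfulness and safety is also included.

Benchmark datasets

Phi-4-reasoning was evaluated using the open-source Eureka evaluation suite and internal benchmarks. The specific evaluation tasks are:

Reasoning tasks: AIME 2025, 2024, 2023, and 2022 math Olympiad questions; GPQA-Diamond, complex graduate-level science questions; OmniMath, a collection of over 4,000 Olympiad-level math problems; LiveCodeBench, a code generation benchmark drawn from competitive coding contests; 3SAT and TSP, algorithmic problem solving; BA Calendar, planning; Maze and SpatialMap, spatial understanding.
General-purpose benchmarks: Kitab, information retrieval; IFEval and ArenaHard, instruction following; PhiBench, an internal benchmark; FlenQA, the effect of prompt length on model performance; HumanEvalPlus, functional code generation; MMLU-Pro, a popular aggregated dataset for multi-task language understanding.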

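Safety

Approach

Phi-4-reasoning adopts a robust safety post-training approach based on supervised fine-tuning (SFT). The approach draws on a variety of open-source and in-house generated synthetic prompts, with LLM-generated responses that adhere to rigorous Microsoft safety guidelines, for example: user understanding and clarity; security and ethical guidelines; limitations, disclaimers, and knowledge scope; handling complex and sensitive topics; safety and respectful engagement; confidentiality of guidelines; and confidentiality of the chain of thought.

Safety evaluation and red-teaming

Prior to release, Phi-4-reasoning followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools that use adversarial conversation simulation. For qualitative safety evaluation, the team collaborated with the independent AI Red Team (AIRT) at Microsoft to assess the safety risks posed by Phi-4-reasoning in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. In the adversarial user scenario, a wide range of techniques aimed at intentionally subverting the model's safety training were tested, covering groundedness, jailbreaks, harmful content (such as hate and unfairness, violence, sexual content, or self-harm), and copyright violations for protected material. The model was also evaluated on the Toxigen benchmark, which is designed to measure bias and toxicity targeted at minority groups.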

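Model quality

A high-level overview of model quality on representative benchmarks. In the tables below, higher numbers indicate better performance:

Model	AIME 24	AIME 25	OmniMath	GPQA-D	LiveCodeBench (8/1/24–2/1/25)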
Phi-4-reasoning	75.3	62.9	76.6	65.8	53.8
Phi-4-reasoning-plus	81.3	78.0	81.9	68.9	53.1
OpenThinker2-32B	58.0	58.0	—	64.1	—
QwQ 32B	79.5	65.8	—	59.5	63.4
EXAONE-Deep-32B	72.1	65.8	—	66.1	59.5
DeepSeek-R1-Distill-70B	69.3	51.5	63.4	66.2	57.5
DeepSeek-R1	78.7	70.4	85.0	73.0	62.8
o1-mini	63.6	54.8	—	60.0	53.8
o1	74.6	75.3	67.5	76.7	71.0
o3-mini	88.0	78.0	74.6	77.7	69.5
Claude-3.7-Sonnet	55.3	58.7	54.6	76.8	—
Gemini-2.5-Pro	92.0	86.7	61.1	84.0	69.2

Model	FlenQA [3K-token subset]	IFEval Strict	ArenaHard	HumanEvalPlus	MMLUPro	Kitab (no context precision, with context precision, no context recall, with context recall)	Toxigen Discriminative (toxic category, neutral category)	PhiBench 2.21
Phi-4	82.0	62.3	68.1	83.5	71.5	19.3 88.5 8.2 68.1	72.6 90.0	58.2
Phi-4-reasoning	97.7	83.4	73.3	92.9	74.3	23.2 91.5 4.9 74.8	86.7 84.7	70.6
Phi-4-reasoning-plus	97.9	84.9	79.0	92.3	76.0	27.6 93.6 6.3 75.4	77.3 90.5	74.2
o3-mini	96.8	91.5	81.9	94.0	79.4	37.9 94.0 4.2 76.1	85.4 88.7	78.0
GPT-4o	90.8	81.8	75.6	88.0	73.0	53.7 84.7 20.3 69.2	87.6 85.1	72.4

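Overall, with only 14B parameters, Phi-4-reasoning performs well across a wide range of reasoning tasks, significantly outperforming many larger open-weight models such as the DeepSeek-R1 distilled 70B model, and approaching the performance level of the full DeepSeek R1 model. The model also shows strong generalization on several new reasoning benchmarks, including 3SAT, TSP, and BA-Calendar. In addition, on standard general-capability benchmarks such as instruction following and non-reasoning tasks, the new model improves significantly over Phi-4, even though post-training focused mainly on reasoning skills in specific domains.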
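The comparison with the DeepSeek models can be read directly off the first table. The sketch below computes per-benchmark score gaps from the rows reported above; the dictionary `SCORES` and the function `score_gaps` are hypothetical names, and the numbers are copied from the table rather than measured independently.

```python
# Scores copied from the first benchmark table in this document.
SCORES = {
    "Phi-4-reasoning": {"AIME 24": 75.3, "AIME 25": 62.9, "OmniMath": 76.6, "GPQA-D": 65.8, "LiveCodeBench": 53.8},
    "DeepSeek-R1-Distill-70B": {"AIME 24": 69.3, "AIME 25": 51.5, "OmniMath": 63.4, "GPQA-D": 66.2, "LiveCodeBench": 57.5},
    "DeepSeek-R1": {"AIME 24": 78.7, "AIME 25": 70.4, "OmniMath": 85.0, "GPQA-D": 73.0, "LiveCodeBench": 62.8},
}


def score_gaps(model, baseline):
    """Return model minus baseline for each benchmark, rounded to one decimal."""
    return {
        bench: round(SCORES[model][bench] - SCORES[baseline][bench], 1)
        for bench in SCORES[model]
    }


if __name__ == "__main__":
    for baseline in ("DeepSeek-R1-Distill-70B", "DeepSeek-R1"):
        print(baseline, score_gaps("Phi-4-reasoning", baseline))
```

On these five benchmarks, Phi-4-reasoning leads the distilled 70B model on AIME 24, AIME 25, and OmniMath, and trails it by 0.4 points on GPQA-D and 3.7 points on LiveCodeBench.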

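Responsible AI considerations

Like other language models, Phi-4-reasoning can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

Quality of service: The model is trained primarily on English text, so performance in languages other than English is worse. English language varieties with less representation in the training data may perform worse than standard American English. Phi-4-reasoning does not support multilingual use.
Representation of harms and perpetuation of stereotypes: These models can over- or under-represent groups of people and erase the representation of some groups.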

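🔧 Technical details

Phi-4-reasoning is fine-tuned from Phi-4 with supervised fine-tuning and reinforcement learning. The supervised fine-tuning dataset includes synthetic prompts and high-quality filtered data from public-domain websites, focused on math, science, and coding skills as well as alignment data for safety and responsible AI. The model adopts a robust safety post-training approach based on supervised fine-tuning (SFT), drawing on a variety of open-source and in-house generated synthetic prompts, with LLM-generated responses that adhere to rigorous Microsoft safety guidelines.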