Phi-4-reasoning-plus: a free, open-source AI model for math, science, and coding reasoning

Phi 4 Reasoning Plus

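Published by unsloth

Phi-4-reasoning-plus is a 14-billion-parameter open-weight reasoning model developed by Microsoft Research. It is optimized through supervised fine-tuning and reinforcement learning, and focuses on advanced reasoning in math, science, and coding.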

Large language model

Transformers

Language: primarily English · License: MIT · #MathReasoning #LongContextReasoning #CodeGeneration

Downloads: 189

Published: 5/1/2025

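Model overview

An enhanced reasoning model based on Phi-4. Trained on high-quality datasets and with reinforcement learning, it performs well on mathematical reasoning, code generation, and scientific problem solving, and supports a 32k-token context length.

Model features

Stronger reasoning

Supervised fine-tuning on chain-of-thought traces, followed by reinforcement learning, improves accuracy on complex reasoning tasks

Long-context handling

Supports a 32k-token context length, enough for deep multi-step reasoning tasks

Efficient architecture

A small model with only 14 billion parameters that approaches the performance of much larger models

Safety alignment

Safety post-training aligns the model with responsible AI guidelines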

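Model capabilities

Math problem solving

Scientific reasoning

Code generation

Algorithmic problem solving

Logical reasoning

Multi-turn dialogue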

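Use cases

Education

Math olympiad tutoring

Solves complex problems from math competitions such as AIME

78% accuracy on AIME 2025

Research

Scientific problem analysis

Answers graduate-level science questions

68.9% accuracy on the GPQA-Diamond benchmark

Software development

Competition-level code generation

Solves competitive programming problems

53.1% accuracy on the LiveCodeBench benchmark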

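🚀 Phi-4-reasoning-plus model card

Phi-4-reasoning-plus is a state-of-the-art reasoning model fine-tuned from Phi-4, using supervised fine-tuning on a dataset of chain-of-thought traces followed by reinforcement learning. It performs well on reasoning tasks and is particularly suited to scenarios that demand strong reasoning and logic.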

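🚀 Quick start

Inference parameters

For inference, set temperature=0.8 and top_p=0.95, with do_sample set to True. For more complex queries, raise the maximum number of tokens to 32k to allow for a longer chain of thought (CoT).

Phi-4-reasoning-plus performs well on reasoning-intensive tasks. In our experiments we extended its maximum number of tokens to 64k, and it handled the longer sequences with promising results, staying coherent and logically consistent over extended inputs. This makes it a good fit for tasks that require deep, multi-step reasoning or extensive context.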
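To make the recommended settings concrete, the following is a minimal sketch of what temperature scaling and top-p (nucleus) filtering do to a next-token distribution. It uses plain Python on a toy four-token vocabulary; the function names `top_p_filter` and `sample` and the logits are illustrative assumptions, not part of the `transformers` API, and real implementations may differ in edge cases.

```python
import math
import random


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and nucleus filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the kept tokens so the probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample(filtered, rng=random):
    """Draw one token index from the filtered distribution."""
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = [4.0, 3.0, 1.0, -2.0]  # hypothetical scores for a 4-token vocabulary
dist = top_p_filter(toy_logits)
print(dist)          # only the two most likely tokens survive the filter
print(sample(dist))
```

With these toy logits the two most likely tokens already cover more than 95% of the probability mass, so the low-probability tail is never sampled; that is the effect `top_p=0.95` is meant to have during long chains of thought.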

Input format

Given the nature of the training data, always use the ChatML template with the following system prompt for inference:

<|im_start|>system<|im_sep|>
You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:<|im_end|>
<|im_start|>user<|im_sep|>
What is the derivative of x^2?<|im_end|>
<|im_start|>assistant<|im_sep|>
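If a prompt has to be assembled by hand, the layout above can be sketched as a small helper. This is an illustration only: `build_chatml_prompt` is a hypothetical name, the system prompt is abbreviated for space, and the whitespace simply mirrors the template as printed here. When a tokenizer is available, `tokenizer.apply_chat_template` remains the safer way to produce the exact format.

```python
def build_chatml_prompt(system_prompt, user_message):
    """Assemble a single-turn prompt in the ChatML layout printed above."""
    return (
        f"<|im_start|>system<|im_sep|>\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user<|im_sep|>\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant<|im_sep|>"
    )


# Abbreviated system prompt; use the full text from the template in practice.
SYSTEM_PROMPT = "You are Phi, a language model trained by Microsoft to help users."

prompt = build_chatml_prompt(SYSTEM_PROMPT, "What is the derivative of x^2?")
print(prompt)
```

The prompt ends with the open assistant header so that generation continues as the assistant turn.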

Using the `transformers` library

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning-plus")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-reasoning-plus", device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:"},
    {"role": "user", "content": "What is the derivative of x^2?"},
]
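# Apply the chat template and tokenize; add_generation_prompt=True appends the assistant header.
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Sample with the recommended settings.
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)
# Decode the full sequence (prompt plus generated tokens).
print(tokenizer.decode(outputs[0]))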
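Because the response is structured as a Thought section followed by a Solution section, it is often useful to separate the two. The helper below is a minimal sketch, assuming it receives only the newly generated text (for example, decoded from the slice `outputs[0][inputs.shape[-1]:]`) so that the `<think>` tags in the system prompt are not matched. `split_reasoning` is a hypothetical name, not a library function.

```python
def split_reasoning(generated_text):
    """Split generated text into (thought, solution) around the <think> tags."""
    open_tag, close_tag, end_tag = "<think>", "</think>", "<|im_end|>"
    text = generated_text.replace(end_tag, "")
    end = text.find(close_tag)
    if end == -1:
        # No closing tag, e.g. generation stopped at max_new_tokens mid-thought.
        return text.replace(open_tag, "", 1).strip(), ""
    thought = text[:end].replace(open_tag, "", 1).strip()
    solution = text[end + len(close_tag):].strip()
    return thought, solution


example = "<think>Power rule: d/dx x^n = n*x^(n-1), so n = 2.</think>The derivative is 2x.<|im_end|>"
thought, solution = split_reasoning(example)
print(solution)
```

An empty solution string is a cheap signal that the token budget ran out before the model finished thinking, in which case a larger `max_new_tokens` may help.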

Using the `vllm` library

vllm serve microsoft/Phi-4-reasoning-plus --enable-reasoning --reasoning-parser deepseek_r1
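Once the server is running, it can be queried over HTTP. The sketch below uses only the standard library and assumes vLLM's OpenAI-compatible endpoint at its default address (`http://localhost:8000/v1`) and a `reasoning_content` field in the response when a reasoning parser is enabled; both are assumptions about the local setup, so adjust them to match your deployment. `build_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port


def build_request(question, system_prompt, max_tokens=4096):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": "microsoft/Phi-4-reasoning-plus",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def ask(question, system_prompt):
    """Send the request; only call this while the server is running."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(question, system_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        message = json.load(resp)["choices"][0]["message"]
    # .get() keeps this safe if the server does not return a separate reasoning field.
    return message.get("reasoning_content"), message.get("content")


body = build_request("What is the derivative of x^2?", "You are Phi, a language model trained by Microsoft to help users.")
print(json.dumps(body, indent=2))
```

Calling `ask(...)` returns a `(reasoning, answer)` pair; the system prompt shown here is abbreviated and should be replaced with the full prompt from the input-format section.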

Phi-4-reasoning-plus is also supported by Ollama, llama.cpp, and any Phi-4-compatible framework.

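✨ Key features

Advanced reasoning: supervised fine-tuning and reinforcement learning on top of Phi-4 give strong results on reasoning tasks.
Efficient context handling: a 32k-token context length, extended to 64k tokens in experiments, for long input sequences.
Broad task coverage: good results on mathematical reasoning, code generation, planning, and other tasks.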

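📚 Detailed documentation

Model summary

Attribute	Details
Developers	Microsoft Research
Description	Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model fine-tuned from Phi-4, using supervised fine-tuning on a dataset of chain-of-thought traces followed by reinforcement learning. The supervised fine-tuning dataset includes a blend of synthetic prompts and high-quality filtered data from public-domain websites, focused on math, science, and coding skills, as well as alignment data for safety and responsible AI. The goal of this approach is to train a small, capable model on data focused on high quality and advanced reasoning. Phi-4-reasoning-plus was additionally trained with reinforcement learning, so it is more accurate but generates on average 50% more tokens and therefore has higher latency.
Architecture	Same base model as the previously released Phi-4: a dense decoder-only Transformer with 14B parameters
Inputs	Text, best suited to prompts in chat format
Context length	32k tokens
GPUs	32 H100-80G
Training time	2.5 days
Training data	16B tokens, about 8.3B unique tokens
Outputs	Generated text in response to the input. Model responses have two sections: a chain-of-thought reasoning block followed by a summarization block
Dates	January 2025 to April 2025
Status	Static model trained on an offline dataset; the cutoff for publicly available data is March 2025 and earlier
Release date	April 30, 2025
License	MIT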
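The figures in the table above support a couple of back-of-the-envelope estimates. The sketch below works out the training budget in GPU-hours and the memory needed just to hold the weights at common precisions. It is arithmetic only: the per-parameter byte sizes are standard, but the result ignores the KV cache, activations, and framework overhead, so real memory use will be higher.

```python
PARAMS = 14e9         # 14B parameters (from the table)
GPUS = 32             # H100-80G GPUs (from the table)
TRAINING_DAYS = 2.5   # from the table

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(dtype):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 1e9


gpu_hours = GPUS * TRAINING_DAYS * 24
print(f"Training budget: {gpu_hours:.0f} GPU-hours")
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(dtype):.0f} GB of weights")
```

At bfloat16 the weights alone take about 28 GB, which is why a single 80 GB accelerator is comfortable while smaller cards typically need quantization or sharding.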

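Intended use

Use type	Details
Primary use cases	Our model is designed to accelerate research on language models, for use as a building block for generative AI features. It is intended for general-purpose AI systems and applications (primarily in English) that require: 1. memory/compute-constrained environments; 2. latency-bound scenarios; 3. reasoning and logic.
Out-of-scope use cases	This model is designed and tested for math reasoning only. Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate accuracy, safety, and fairness issues before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model's focus on English. When choosing a use case, see the Responsible AI considerations section below for further guidance. Nothing in this model card should be interpreted as or deemed a restriction or modification of the license the model is released under.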

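Data overview

Training datasets

Our training data is a mixture of Q&A and chat-format data in math, science, and coding. The chat prompts are sourced from filtered high-quality web data and are optionally rewritten and processed through a synthetic data generation pipeline. We also include data to improve truthfulness and safety.

Benchmark datasets

We evaluated Phi-4-reasoning-plus using the open-source Eureka evaluation suite and our own internal benchmarks to understand the model's capabilities. Specifically, we evaluate the model on the following tasks:

Reasoning tasks:
- AIME 2025, 2024, 2023, and 2022: math olympiad questions.
- GPQA-Diamond: complex, graduate-level science questions.
- OmniMath: a collection of over 4000 olympiad-level math problems with human annotation.
- LiveCodeBench: a code generation benchmark gathered from competitive coding contests.
- 3SAT (3-literal Satisfiability Problem) and TSP (Traveling Salesman Problem): algorithmic problem solving.
- BA Calendar: planning.
- Maze and SpatialMap: spatial understanding.
General-purpose benchmarks:
- Kitab: information retrieval.
- IFEval and ArenaHard: instruction following.
- PhiBench: internal benchmark.
- FlenQA: impact of prompt length on model performance.
- HumanEvalPlus: functional code generation.
- MMLU-Pro: a popular aggregated dataset for multi-task language understanding.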

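Safety

Approach

Phi-4-reasoning-plus adopts a robust safety post-training approach via supervised fine-tuning (SFT). This approach uses a variety of open-source and in-house generated synthetic prompts, with LLM-generated responses that adhere to rigorous Microsoft safety guidelines, for example: user understanding and clarity; security and ethical guidelines; limitations, disclaimers, and knowledge scope; handling complex and sensitive topics; safety and respectful engagement; confidentiality of guidelines; and confidentiality of chain-of-thought.

Safety evaluation and red-teaming

Prior to release, Phi-4-reasoning-plus went through a multi-faceted evaluation. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools that use adversarial conversation simulation. For qualitative safety evaluation, we collaborated with Microsoft's independent AI Red Team (AIRT) to assess the safety risks posed by Phi-4-reasoning-plus in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a range of techniques aimed at intentionally subverting the model's safety training, covering groundedness, jailbreaks, harmful content (such as hate and unfairness, violence, sexual content, or self-harm), and copyright violations for protected material. We also evaluate the model on Toxigen, a benchmark designed to measure bias and toxicity targeted at minority groups.

See the technical report for more details on safety alignment.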

Model quality

The tables below give a high-level overview of model quality on representative benchmarks. Higher numbers indicate better performance; "-" marks a result that is not reported:

Model	AIME 24	AIME 25	OmniMath	GPQA-D	LiveCodeBench (8/1/24–2/1/25)
Phi-4-reasoning	75.3	62.9	76.6	65.8	53.8
Phi-4-reasoning-plus	81.3	78.0	81.9	68.9	53.1
OpenThinker2-32B	58.0	58.0	-	64.1	-
QwQ 32B	79.5	65.8	-	59.5	63.4
EXAONE-Deep-32B	72.1	65.8	-	66.1	59.5
DeepSeek-R1-Distill-70B	69.3	51.5	63.4	66.2	57.5
DeepSeek-R1	78.7	70.4	85.0	73.0	62.8
o1-mini	63.6	54.8	-	60.0	53.8
o1	74.6	75.3	67.5	76.7	71.0
o3-mini	88.0	78.0	74.6	77.7	69.5
Claude-3.7-Sonnet	55.3	58.7	54.6	76.8	-
Gemini-2.5-Pro	92.0	86.7	61.1	84.0	69.2
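To see what the additional reinforcement learning stage contributes, the two Phi-4-reasoning rows above can be compared directly. The snippet below is a small sketch that copies those two rows from the table and prints the per-benchmark difference in points; the `gains` helper is a hypothetical name introduced for illustration.

```python
BENCHMARKS = ["AIME 24", "AIME 25", "OmniMath", "GPQA-D", "LiveCodeBench"]

# Scores copied from the table above.
SCORES = {
    "Phi-4-reasoning": [75.3, 62.9, 76.6, 65.8, 53.8],
    "Phi-4-reasoning-plus": [81.3, 78.0, 81.9, 68.9, 53.1],
}


def gains(base, improved):
    """Per-benchmark difference (improved minus base), rounded to one decimal."""
    return {
        name: round(after - before, 1)
        for name, before, after in zip(BENCHMARKS, SCORES[base], SCORES[improved])
    }


deltas = gains("Phi-4-reasoning", "Phi-4-reasoning-plus")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

The largest gain is on AIME 25 (+15.1 points), while LiveCodeBench is slightly lower (-0.7), so the improvement is concentrated in math rather than uniform across tasks.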

Model	FlenQA [3K-token subset]	IFEval Strict	ArenaHard	HumanEvalPlus	MMLUPro	Kitab (no-context precision / with-context precision / no-context recall / with-context recall)	Toxigen Discriminative (toxic category / neutral category)	PhiBench 2.21
Phi-4	82.0	62.3	68.1	83.5	71.5	19.3 / 88.5 / 8.2 / 68.1	72.6 / 90.0	58.2
Phi-4-reasoning	97.7	83.4	73.3	92.9	74.3	23.2 / 91.5 / 4.9 / 74.8	86.7 / 84.7	70.6
Phi-4-reasoning-plus	97.9	84.9	79.0	92.3	76.0	27.6 / 93.6 / 6.3 / 75.4	77.3 / 90.5	74.2
o3-mini	96.8	91.5	81.9	94.0	79.4	37.9 / 94.0 / 4.2 / 76.1	85.4 / 88.7	78.0
GPT-4o	90.8	81.8	75.6	88.0	73.0	53.7 / 84.7 / 20.3 / 69.2	87.6 / 85.1	72.4

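Overall, Phi-4-reasoning and Phi-4-reasoning-plus, with only 14B parameters, perform well across a wide range of reasoning tasks, significantly outperforming larger open-weight models such as the DeepSeek-R1 distilled 70B model and approaching the performance of the full DeepSeek-R1 model. We also tested the models on several new reasoning benchmarks for algorithmic problem solving and planning, including 3SAT, TSP, and BA-Calendar. These tasks are nominally out-of-domain, because the training process did not intentionally target these skills, yet the models still generalize strongly to them. In addition, when evaluated on standard general-ability benchmarks such as instruction following or non-reasoning tasks, the new models improve significantly over Phi-4, even though post-training focused mainly on reasoning skills in specific domains.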
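The comparison with the two DeepSeek models can be checked against the first quality table. The sketch below copies the relevant rows and lists the benchmarks on which Phi-4-reasoning-plus scores higher; the `wins` helper is a hypothetical name, and the snippet only restates numbers already reported above.

```python
# Rows copied from the reasoning benchmark table.
PLUS = {"AIME 24": 81.3, "AIME 25": 78.0, "OmniMath": 81.9, "GPQA-D": 68.9, "LiveCodeBench": 53.1}
RIVALS = {
    "DeepSeek-R1-Distill-70B": {"AIME 24": 69.3, "AIME 25": 51.5, "OmniMath": 63.4, "GPQA-D": 66.2, "LiveCodeBench": 57.5},
    "DeepSeek-R1": {"AIME 24": 78.7, "AIME 25": 70.4, "OmniMath": 85.0, "GPQA-D": 73.0, "LiveCodeBench": 62.8},
}


def wins(rival):
    """Benchmarks where Phi-4-reasoning-plus scores strictly higher than the rival."""
    return [name for name, score in PLUS.items() if score > RIVALS[rival][name]]


for rival in RIVALS:
    won = wins(rival)
    print(f"vs {rival}: ahead on {len(won)} of {len(PLUS)} ({', '.join(won)})")
```

On these five benchmarks the model is ahead of the distilled 70B model on four (all but LiveCodeBench) and ahead of the full DeepSeek-R1 on the two AIME sets.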

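Responsible AI considerations

Like other language models, Phi-4-reasoning-plus can behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

Quality of service: The model is trained primarily on English text. Languages other than English will see worse performance. English language varieties with less representation in the training data may perform worse than standard American English. Phi-4-reasoning-plus is not intended to support multilingual use.
Representation of harms and perpetuation of stereotypes: These models can over-represent or under-represent groups of people, erase the representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present because of differing levels of representation of different groups, or because the prevalence of negative stereotypes in the training data reflects real-world patterns and societal biases.
Inappropriate or offensive content: These models may produce other types of inappropriate or offensive content, which may make them unsuitable for deployment in sensitive contexts without additional mitigations specific to the use case.
Information reliability: Language models can generate nonsensical content or fabricate content that sounds reasonable but is inaccurate or outdated.
Election information reliability: The model has an elevated defect rate when responding to election-critical queries, which may result in incorrect or unauthoritative election-critical information being presented. We are working to improve the model's performance in this area. Users should verify election-related information with the election authority in their region.
Limited scope for code: Most of the Phi-4-reasoning-plus training data is based on Python and uses common packages such as typing, math, random, collections, datetime, and itertools. If the model generates Python scripts that use other packages, or scripts in other languages, we strongly recommend that users manually verify all API uses.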
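As a first step toward the manual verification recommended in the last item, generated Python can be scanned for imports outside the packages named there. The following is a minimal sketch using the standard `ast` module; `unfamiliar_imports` is a hypothetical helper, and flagging an import only marks it for review, it does not say the generated code is wrong.

```python
import ast

# Packages the model card names as common in the training data.
COMMON_PACKAGES = {"typing", "math", "random", "collections", "datetime", "itertools"}


def unfamiliar_imports(source):
    """Return top-level modules imported by `source` that are outside COMMON_PACKAGES."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return sorted(found - COMMON_PACKAGES)


# Hypothetical model output to check.
generated = (
    "import math\n"
    "import numpy as np\n"
    "from collections import Counter\n"
    "from scipy.optimize import minimize\n"
)
flagged = unfamiliar_imports(generated)
print(flagged)  # ['numpy', 'scipy']
```

Anything in the returned list is a candidate for a closer look at function names and signatures before the script is run.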

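Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (such as privacy and trade). Using safety services with advanced guardrails, such as Azure AI Content Safety, is highly recommended. Important areas for consideration include:

Allocation: Without further assessment and additional debiasing techniques, models may not be suitable for scenarios that could have a consequential impact on legal status or on the allocation of resources or life opportunities (such as housing, employment, or credit).
High-risk scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable, or offensive outputs could be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (such as legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end users that they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines that ground responses in use-case-specific contextual information, a technique known as Retrieval-Augmented Generation (RAG).
Generation of harmful content: Developers should assess outputs in their context and use available safety classifiers or custom solutions appropriate for their use case.
Misuse: Other forms of misuse, such as fraud, spam, or malware production, may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.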
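The grounding idea mentioned under misinformation (RAG) can be illustrated with a toy example. The sketch below ranks a few documents by word overlap with the question and places the best matches in the prompt. The word-overlap scoring is a deliberately simple stand-in for a real retriever (such as an embedding index), and `retrieve` and `build_grounded_prompt` are hypothetical names.

```python
def retrieve(query, documents, k=2):
    """Rank documents by how many whitespace-separated words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(query, documents, k=2):
    """Build a user message that asks the model to answer from retrieved context only."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


docs = [
    "The derivative of x^n is n*x^(n-1)",
    "Paris is the capital of France",
    "The integral of x is x^2/2",
]
grounded = build_grounded_prompt("derivative of x^2", docs, k=1)
print(grounded)
```

The resulting string would be sent as the user message, so the model's answer can be traced back to the supplied context rather than to its parametric memory alone.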