Phi-4-reasoning-unsloth-bnb-4bit Open-Source Reasoning Model - Improve Math, Science, and Coding Reasoning for Free

Phi 4 Reasoning Unsloth Bnb 4bit

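Developed by unsloth

Phi-4-reasoning is an advanced reasoning model developed by Microsoft. It is fine-tuned from Phi-4 and focuses on improving reasoning in mathematics, science, and coding.

Large language model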

Transformers

Multilingual support | License: MIT | #Math reasoning optimization #Long-context processing #Scientific problem solving

Downloads: 1,969

Released: 5/1/2025

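Model overview

Phi-4-reasoning is an open-weight reasoning model trained with supervised fine-tuning and reinforcement learning, suited to tasks that require complex reasoning.

Model features

Advanced reasoning

Supervised fine-tuning and reinforcement learning target stronger reasoning in mathematics, science, and coding.

Efficient architecture

Built on the Phi-4 base model: a 14B-parameter dense decoder-only Transformer.

Long-context processing

Supports a context length of 32k tokens, enough for long and complex inputs.

Extensive evaluation

Evaluated on multiple open-source and internal benchmarks, with strong results.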

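Model capabilities

Mathematical reasoning

Scientific question answering

Code generation

Algorithmic problem solving

Complex input handling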

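Use cases

Education

Math olympiad problem solving

Solves difficult math olympiad problems.

Scores 62.9 on the AIME 2025 benchmark.

Scientific question answering

Answers complex scientific questions.

Scores 65.8 on the GPQA-Diamond benchmark.

Programming

Code generation

Generates functional code.

Scores 92.9 on the HumanEvalPlus benchmark.

Algorithmic problem solving

Solves algorithmic problems such as 3SAT and TSP.

Scores 53.8 on the LiveCodeBench benchmark.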

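🚀 Phi-4-reasoning Model Card

Phi-4-reasoning is an advanced reasoning model fine-tuned from Phi-4. It is trained on specific datasets with supervised fine-tuning and reinforcement learning, focuses on improving reasoning in mathematics, science, and coding, and serves as a strong building block for generative AI features.

🚀 Quick start

The model is intended to accelerate language model research and to serve as a building block for generative AI features. To run inference with it, follow the guidance below.

Inference parameters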

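For inference, temperature=0.8, top_p=0.95, and do_sample=True are recommended. For more complex queries, the maximum number of tokens can be raised to 32k to allow a longer chain of thought (CoT).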
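
To make the two sampling parameters concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering on a toy four-token distribution. The function names and the toy logits are illustrative assumptions; this is not how `transformers` implements sampling internally, only the same idea in a few lines.

```python
import math
import random

def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token: probability} after temperature scaling and top-p filtering."""
    # Dividing by a temperature below 1 sharpens the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {tok: p / mass for tok, p in kept.items()}

def sample_token(dist, rng):
    """Draw one token from the filtered distribution (do_sample=True)."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

# Toy logits for the next token (made up for illustration).
toy_logits = {"2x": 4.0, "x": 2.0, "2": 1.0, "banana": -3.0}
filtered = top_p_filter(toy_logits)
print(filtered)
print(sample_token(filtered, random.Random(0)))
```

With these toy numbers, only the two most likely tokens survive the filter, so unlikely continuations are never sampled even though sampling stays stochastic.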

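Input format

Given the nature of the training data, always use the ChatML template at inference time, together with the following system prompt: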

<|im_start|>system<|im_sep|>
Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:<|im_end|>
<|im_start|>user<|im_sep|>
What is the derivative of x^2?<|im_end|>
<|im_start|>assistant<|im_sep|>

Using the `transformers` library

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; device_map="auto" places the weights on the available devices
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-reasoning", device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:"},
    {"role": "user", "content": "What is the derivative of x^2?"},
]
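# Render the messages with the chat template and open the assistant turn
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Sample with the recommended inference parameters
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)
# The decoded text contains the prompt followed by the generated response
print(tokenizer.decode(outputs[0]))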
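
Because the response is structured as a `<think>` block followed by the solution, it is often useful to separate the two. The helper below is a minimal sketch with a hypothetical name (`split_response`). It assumes the text passed in contains only the newly generated tokens (for example, decoding `outputs[0][inputs.shape[-1]:]` rather than `outputs[0]`), since the system prompt itself contains the literal `<think>` and `</think>` tags.

```python
import re

# Matches the first closed <think>...</think> block, across line breaks.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_response(text):
    """Return (thought, solution); thought is None if no closed block is found."""
    # Drop the end-of-turn marker if it was decoded along with the text.
    text = text.replace("<|im_end|>", "")
    match = THINK_BLOCK.search(text)
    if match is None:
        return None, text.strip()
    thought = match.group(1).strip()
    solution = text[match.end():].strip()
    return thought, solution

# Hand-written stand-in for a generated response (not real model output).
generated = (
    "<think>Power rule: d/dx x^n = n * x^(n-1), with n = 2.</think>"
    "The derivative is 2x.<|im_end|>"
)
thought, solution = split_response(generated)
print("Thought:", thought)
print("Solution:", solution)
```

If generation stops at the token limit before `</think>` is emitted, the helper returns `None` for the thought, which is a simple signal that `max_new_tokens` may need to be raised.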

Using the `vllm` library

vllm serve microsoft/Phi-4-reasoning --enable-reasoning --reasoning-parser deepseek_r1
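
Once the server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the standard library and makes several assumptions: the default address `http://localhost:8000/v1`, a placeholder in place of the full system prompt, and hypothetical helper names (`build_chat_request`, `send_request`). The name of the field that carries the parsed reasoning has varied across vLLM versions, so treat `reasoning_content` as an assumption to verify.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port

def build_chat_request(question, system_prompt, max_tokens=4096):
    """Build a chat-completions payload with the recommended sampling settings."""
    return {
        "model": "microsoft/Phi-4-reasoning",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }

def send_request(payload, base_url=BASE_URL):
    """POST the payload to the server and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Placeholder: substitute the full system prompt from the input-format section.
payload = build_chat_request("What is the derivative of x^2?", "(system prompt)")
print(json.dumps(payload, indent=2))

# With the server running, the call would look like this:
# reply = send_request(payload)
# message = reply["choices"][0]["message"]
# print(message.get("reasoning_content"))  # parsed thinking block, if present
# print(message["content"])                # final solution
```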

Phi-4-reasoning can also be used directly in Ollama, llama.cpp, and any framework compatible with Phi-4.
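
When a framework takes a raw prompt string instead of a message list, the ChatML layout from the input-format section has to be rendered by hand. The following sketch mirrors that layout as it is printed there, including the line breaks. The helper name `build_prompt` and the placeholder system prompt are assumptions, and the tokenizer's own chat template remains the authoritative source for exact whitespace, so prefer `tokenizer.apply_chat_template` whenever it is available.

```python
def build_prompt(messages):
    """Render a list of {"role", "content"} dicts in the ChatML layout shown earlier."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}<|im_sep|>\n"
            f"{message['content']}<|im_end|>\n"
        )
    # Open the assistant turn so the model continues from this point.
    parts.append("<|im_start|>assistant<|im_sep|>\n")
    return "".join(parts)

messages = [
    # Placeholder: substitute the full system prompt from the input-format section.
    {"role": "system", "content": "(system prompt)"},
    {"role": "user", "content": "What is the derivative of x^2?"},
]
prompt = build_prompt(messages)
print(prompt)
```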

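✨ Key features

Advanced reasoning: Phi-4-reasoning is an advanced open-weight reasoning model, trained with supervised fine-tuning on a dataset of chain-of-thought traces and with reinforcement learning, and focused on improving reasoning in mathematics, science, and coding.
Efficient architecture: Based on the previously released Phi-4 base model, with 14B parameters and a dense decoder-only Transformer architecture.
Long-context processing: Supports a context length of 32k tokens, enough for long and complex inputs.
Extensive evaluation: Evaluated on multiple open-source and internal benchmarks, including math olympiad problems, code generation, and algorithmic problem solving, with strong results.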
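
The 14B parameter count, together with the 4-bit quantization named in this page's title, gives a rough idea of how much memory the weights need. The arithmetic below is a back-of-the-envelope sketch that counts weights only: it ignores quantization metadata, layers that may stay in higher precision, the KV cache, and activations, so real usage will be higher.

```python
PARAMS = 14e9  # parameter count stated in the feature list

def weight_gigabytes(bits_per_param, params=PARAMS):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params * bits_per_param / 8 / 1e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: ~{weight_gigabytes(bits):.0f} GB")
# prints:
# 16-bit: ~28 GB
#  8-bit: ~14 GB
#  4-bit: ~7 GB
```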

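📚 Detailed documentation

Model overview

Attribute	Details
Developer	Microsoft Research
Description	Phi-4-reasoning is an advanced open-weight reasoning model, fine-tuned from Phi-4 with supervised fine-tuning and reinforcement learning. The supervised fine-tuning dataset includes synthetic prompts and high-quality filtered data from public-domain websites, focused on math, science, and coding skills, as well as alignment data for safety and responsible AI.
Architecture	The base model is the same as the previously released Phi-4: 14B parameters, dense decoder-only Transformer architecture.
Input	Text, best suited to prompts in chat format.
Context length	32k tokens
GPUs	32 H100-80G GPUs
Training time	2.5 days
Training data	16B tokens, of which about 8.3B are unique
Output	Generated text in response to the input. Model responses have two sections: a chain-of-thought reasoning block and a summarization block.
Dates	January 2025 - April 2025
Status	Static model trained on an offline dataset, with a cutoff of March 2025 and earlier for publicly available data.
Release date	April 30, 2025
License	MIT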
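
The training figures in the table imply a couple of derived quantities. The short calculation below is a sketch under two stated assumptions: that all 32 GPUs were in use for the full 2.5 days, and that the ratio of total to unique tokens can be read as an average number of passes over the unique data. Neither is confirmed by the table itself.

```python
GPUS = 32               # from the table
TRAINING_DAYS = 2.5     # from the table
TOTAL_TOKENS = 16e9     # from the table
UNIQUE_TOKENS = 8.3e9   # from the table (approximate)

# Assumption: every GPU was busy for the whole training period.
gpu_hours = GPUS * TRAINING_DAYS * 24

# Assumption: total / unique approximates the average number of passes.
average_passes = TOTAL_TOKENS / UNIQUE_TOKENS

print(f"GPU-hours: {gpu_hours:.0f}")            # prints "GPU-hours: 1920"
print(f"Average passes: {average_passes:.2f}")  # prints "Average passes: 1.93"
```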

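Intended use

Use type	Details
Primary use cases	The model is designed to accelerate research on language models and to serve as a building block for generative AI features. It targets general-purpose AI systems and applications (primarily in English) that require memory- or compute-constrained environments, low-latency scenarios, and reasoning and logic.
Out-of-scope use cases	The model is designed and tested for math reasoning only. It is not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate accuracy, safety, and fairness concerns before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should also comply with applicable laws and regulations and refer to the "Responsible AI considerations" section for further guidance.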

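Data overview

Training datasets

The training data consists of question-answer and chat-format data in mathematics, science, and coding. Chat prompts are drawn from filtered high-quality web data and may be rewritten and processed through a synthetic data generation pipeline. Data that improves truthfulness and safety is also included.

Benchmark datasets

Phi-4-reasoning was evaluated with the open-source Eureka evaluation suite and internal benchmarks, specifically:

Reasoning tasks: AIME 2025, 2024, 2023, and 2022 (math olympiad questions); GPQA-Diamond (complex scientific questions); OmniMath (olympiad-level math problems); LiveCodeBench (code generation); 3SAT and TSP (algorithmic problem solving); BA Calendar (planning); Maze and SpatialMap (spatial understanding).
General-purpose benchmarks: Kitab (information retrieval); IFEval and ArenaHard (instruction following); PhiBench (internal benchmark); FlenQA (impact of prompt length on model performance); HumanEvalPlus (functional code generation); MMLU-Pro (aggregated multi-task language understanding dataset).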

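Safety

Approach

Phi-4-reasoning adopts a robust safety post-training approach based on supervised fine-tuning (SFT). It uses a variety of open-source and in-house generated synthetic prompts, with LLM-generated responses that adhere to rigorous Microsoft safety guidelines.

Safety evaluation and red-teaming

Before release, Phi-4-reasoning went through a multi-faceted evaluation. Quantitative evaluation used multiple open-source safety benchmarks and in-house tools based on adversarial conversation simulation. Qualitative safety evaluation was carried out with Microsoft's independent AI Red Team (AIRT), assessing safety risks in both average and adversarial user scenarios. The model was also evaluated for bias and toxicity on the Toxigen benchmark.

Model quality

An overview of model quality on representative benchmarks: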

Model	AIME 24	AIME 25	OmniMath	GPQA-D	LiveCodeBench (8/1/24–2/1/25)
Phi-4-reasoning	75.3	62.9	76.6	65.8	53.8
Phi-4-reasoning-plus	81.3	78.0	81.9	68.9	53.1
OpenThinker2-32B	58.0	58.0	—	64.1	—
QwQ 32B	79.5	65.8	—	59.5	63.4
EXAONE-Deep-32B	72.1	65.8	—	66.1	59.5
DeepSeek-R1-Distill-70B	69.3	51.5	63.4	66.2	57.5
DeepSeek-R1	78.7	70.4	85.0	73.0	62.8
o1-mini	63.6	54.8	—	60.0	53.8
o1	74.6	75.3	67.5	76.7	71.0
o3-mini	88.0	78.0	74.6	77.7	69.5
Claude-3.7-Sonnet	55.3	58.7	54.6	76.8	—
Gemini-2.5-Pro	92.0	86.7	61.1	84.0	69.2

Model	FlenQA [3K-token subset]	IFEval Strict	ArenaHard	HumanEvalPlus	MMLUPro	Kitab No Context Precision	Kitab With Context Precision	Kitab No Context Recall	Kitab With Context Recall	Toxigen Discriminative Toxic	Toxigen Discriminative Neutral	PhiBench 2.21
Phi-4	n/a	62.3	68.1	83.5	71.5	19.3	88.5	8.2	68.1	72.6	90.0	58.2
Phi-4-reasoning	n/a	83.4	73.3	92.9	74.3	23.2	91.5	4.9	74.8	86.7	84.7	70.6
Phi-4-reasoning-plus	n/a	84.9	79.0	92.3	76.0	27.6	93.6	6.3	75.4	77.3	90.5	74.2
o3-mini	n/a	91.5	81.9	94.0	79.4	37.9	94.0	4.2	76.1	85.4	88.7	78.0
GPT-4o	n/a	81.8	75.6	88.0	73.0	53.7	84.7	20.3	69.2	87.6	85.1	72.4

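Overall, with only 14B parameters, Phi-4-reasoning performs strongly across a wide range of reasoning tasks, significantly outperforming much larger open-weight models such as the DeepSeek-R1 distilled 70B model and approaching the performance of the full DeepSeek R1 model.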
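
The comparison with the distilled 70B model can be read directly off the first table. The snippet below copies the two relevant rows and computes the per-benchmark difference; the variable names are illustrative, and the scores are taken as printed in the table.

```python
BENCHMARKS = ["AIME 24", "AIME 25", "OmniMath", "GPQA-D", "LiveCodeBench"]

# Scores copied from the first model-quality table.
phi4_reasoning = [75.3, 62.9, 76.6, 65.8, 53.8]
r1_distill_70b = [69.3, 51.5, 63.4, 66.2, 57.5]

# Positive values mean Phi-4-reasoning scores higher.
deltas = {
    name: round(ours - theirs, 1)
    for name, ours, theirs in zip(BENCHMARKS, phi4_reasoning, r1_distill_70b)
}
for name, delta in deltas.items():
    print(f"{name:>13}: {delta:+.1f}")
```

On these numbers the advantage is largest on the math benchmarks (AIME 24, AIME 25, OmniMath), while GPQA-D is roughly level and LiveCodeBench is a few points lower.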

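Responsible AI considerations

Like other language models, Phi-4-reasoning can behave in ways that are unfair, unreliable, or offensive. Developers should apply responsible AI best practices and ensure that a specific use case complies with relevant laws and regulations. Using safety services such as Azure AI Content Safety is recommended.

Limiting behaviors

Quality of service: The model is trained primarily on English text, so performance in languages other than English will be worse. English varieties with less representation in the training data may perform worse than standard American English. Phi-4-reasoning is not intended to support multilingual use.
Representation of harms and perpetuation of stereotypes: The model can over- or under-represent groups of people, erase the representation of some groups, or reinforce demeaning or negative stereotypes.
Inappropriate or offensive content: The model may produce other types of inappropriate or offensive content, so deployment in sensitive contexts requires additional mitigations.
Information reliability: Language models can generate nonsensical content or fabricate content that sounds reasonable but is inaccurate or outdated.
Election information reliability: The model has an elevated defect rate when responding to election-critical queries, which may result in incorrect or unauthoritative election-critical information being presented.
Limited scope for code: Most of Phi-4-reasoning's training data is based on Python and uses common packages. If the model generates Python scripts that use other packages, or scripts in other languages, users are advised to manually verify all API uses.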
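
One lightweight way to decide which generated scripts need that manual check is to list the modules they import and compare them against a set of packages considered common. The sketch below uses the standard `ast` module; the allowlist and the function names are hypothetical choices for illustration, not a list taken from the model card.

```python
import ast

# Hypothetical allowlist; adjust to whatever counts as "common" in your setting.
COMMON_MODULES = {"typing", "math", "random", "collections", "datetime", "itertools"}

def imported_modules(source):
    """Return the top-level module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules

def modules_to_review(source, allowlist=COMMON_MODULES):
    """Return imported modules outside the allowlist, sorted by name."""
    return sorted(imported_modules(source) - set(allowlist))

# Hand-written stand-in for a model-generated script.
generated = (
    "import math\n"
    "import requests\n"
    "from collections import Counter\n"
    "from scipy.optimize import minimize\n"
)
print(modules_to_review(generated))  # prints ['requests', 'scipy']
```

The check only flags where to look; it parses the script without running it and says nothing about whether the flagged API calls are correct.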