Phi-4-reasoning-plus-GGUF: a free, open-source reasoning model for math, science, and coding

Phi 4 Reasoning Plus GGUF

GGUF build published by unsloth

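Phi-4-reasoning-plus is an open-source reasoning model developed by Microsoft Research, focused on advanced reasoning in math, science, and coding.

Large language model | Multilingual | License: MIT | #math-reasoning-optimized #RL-enhanced #built-for-science-and-coding

Downloads: 109.62k

Release date: 5/1/2025

Model Overview

Phi-4-reasoning-plus is an advanced reasoning model built on Phi-4. It is trained with supervised fine-tuning on a dataset of chain-of-thought traces and with reinforcement learning, and it focuses on math, science, and coding skills.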

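Model Features

Advanced reasoning

Targets advanced reasoning tasks in math, science, and coding, optimized through supervised fine-tuning and reinforcement learning.

Long context support

Supports a context length of up to 32k tokens, suitable for complex tasks.

Strong performance

Performs well across multiple reasoning benchmarks; on AIME 2025 it scores 78.0%, ahead of the larger open-weight models in the comparison table further down.

Safety alignment

Uses rigorous safety post-training so that the model operates within safety and ethical guidelines.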

Model Capabilities

Math problem solving

Science question answering

Coding problem solving

Chain-of-thought reasoning

Text generation

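Use Cases

Education

Math olympiad problem solving

Solves difficult math olympiad problems such as those in AIME and OmniMath.

Reaches 78.0% accuracy on AIME 2025.

Graduate-level science question answering

Answers complex, graduate-level science questions such as those in GPQA-Diamond.

Reaches 68.9% accuracy on GPQA-D.

Coding

Competitive programming

Solves code generation problems from programming competitions, such as those in LiveCodeBench.

Reaches 53.1% accuracy on LiveCodeBench.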

🚀 Phi-4-reasoning-plus

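Phi-4-reasoning-plus is an advanced reasoning model fine-tuned from Phi-4 with supervised fine-tuning and reinforcement learning. It performs well on reasoning tasks in math, science, and coding, and suits memory-constrained, compute-constrained, and latency-bound scenarios.

🚀 Quick Start

You can run Phi-4-reasoning-plus with the transformers library. Example code: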

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; device_map="auto" places the weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning-plus")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-reasoning-plus", device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:"},
    {"role": "user", "content": "What is the derivative of x^2?"},
]
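# Render the chat template and tokenize it into a tensor of input IDs.
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Sample with the recommended settings.
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)
# The decoded text contains the prompt followed by the generated response.
print(tokenizer.decode(outputs[0]))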

You can also serve the model with vLLM:

vllm serve microsoft/Phi-4-reasoning-plus --enable-reasoning --reasoning-parser deepseek_r1
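
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below is a minimal client that uses only the Python standard library. It assumes the server listens on the default address http://localhost:8000 and that the reasoning parser returns the thought section in a separate `reasoning_content` field; both are assumptions about your deployment, and the helper names `build_payload` and `ask` are hypothetical. The long system prompt is omitted for brevity and can be added as the first message if needed.

```python
import json
import urllib.request

# Assumption: `vllm serve` is listening on its default host and port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "microsoft/Phi-4-reasoning-plus"


def build_payload(question, max_tokens=4096):
    """Build a chat-completions request body with the sampling settings from the quick start."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def ask(question, base_url=BASE_URL):
    """POST the request and return (reasoning, answer). Requires a running server."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        message = json.load(response)["choices"][0]["message"]
    # Assumption: the reasoning parser exposes the thought section separately.
    return message.get("reasoning_content"), message.get("content")
```

With the server up, `reasoning, answer = ask("What is the derivative of x^2?")` would return the two parts of the response; if the field name differs in your vLLM version, `reasoning` comes back as `None`.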

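⚠️ Important

With llama.cpp you must pass --jinja to enable reasoning. Otherwise the <think> tags will not be emitted.

💡 Usage tips

For inference, temperature=0.8, top_p=0.95, and do_sample=True work best. For more complex queries, set the maximum number of new tokens to 32k to allow a longer chain of thought (CoT).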
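
The recommendation above can be captured in a small helper that returns keyword arguments for `model.generate()`. This is only a sketch: the helper name `generation_kwargs` is hypothetical, "32k" is assumed to mean 32 * 1024 = 32768 tokens, and 4096 is the value used in the quick-start example.

```python
# Assumption: "32k" is taken to mean 32 * 1024 = 32768 tokens.
LONG_COT_TOKENS = 32 * 1024
DEFAULT_TOKENS = 4096  # value used in the quick-start example


def generation_kwargs(complex_query=False):
    """Return keyword arguments for model.generate() with the recommended sampling."""
    return {
        "do_sample": True,
        "temperature": 0.8,
        "top_p": 0.95,
        "max_new_tokens": LONG_COT_TOKENS if complex_query else DEFAULT_TOKENS,
    }
```

For a harder problem the call would then look like `model.generate(inputs.to(model.device), **generation_kwargs(complex_query=True))`.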

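✨ Key Features

Advanced reasoning: trained with supervised fine-tuning and reinforcement learning; performs well on reasoning tasks in math, science, and coding.
Efficient architecture: a 14B-parameter dense decoder-only Transformer, suited to memory- and compute-constrained environments.
Long-context handling: supports a 32k-token context length for complex inputs.
Broad applicability: suited to latency-bound scenarios and to general-purpose AI systems and applications that require reasoning and logic.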
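
Because a decoder-only model shares its context window between the prompt and the generated tokens, a long prompt leaves less room for the chain of thought. The sketch below clamps the requested generation budget accordingly. It assumes that "32k" means 32768 tokens; the function name `clamp_new_tokens` is hypothetical.

```python
CONTEXT_WINDOW = 32 * 1024  # assumption: "32k" means 32768 tokens


def clamp_new_tokens(prompt_tokens, requested_new_tokens, context_window=CONTEXT_WINDOW):
    """Limit the generation budget so that prompt + output fit in the context window."""
    if prompt_tokens >= context_window:
        raise ValueError("prompt alone exceeds the context window")
    return min(requested_new_tokens, context_window - prompt_tokens)
```

With the quick-start code, `prompt_tokens` could be taken from `inputs.shape[-1]` and the result passed as `max_new_tokens`.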

📦 Installation

The documentation does not give specific installation steps; see the code examples in the Quick Start section above.

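📚 Documentation

Model Summary

Property	Details
Developer	Microsoft Research
Description	Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model fine-tuned from Phi-4 with supervised fine-tuning and reinforcement learning. The supervised fine-tuning dataset combines synthetic prompts and high-quality filtered data from public-domain websites, with a focus on math, science, and coding skills, plus alignment data for safety and responsible AI. Because of the additional reinforcement learning, the model is more accurate but generates about 50% more tokens on average, which increases latency.
Architecture	Same base model as the previously released Phi-4: 14B parameters, dense decoder-only Transformer
Inputs	Text, best suited to prompts in chat format
Context length	32k tokens
GPUs	32 H100-80G
Training time	2.5 days
Training data	16B tokens, about 8.3B unique tokens
Outputs	Generated text in response to the input; each response has two parts, a chain-of-thought block followed by a summary block
Training dates	January 2025 to April 2025
Status	Static model trained on an offline dataset; the cutoff for publicly available data is March 2025 and earlier
Release date	April 30, 2025
License	MIT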
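
Since each response consists of a chain-of-thought block followed by a summary block, downstream code usually wants the two parts separately. The sketch below splits generated text around the `<think>` and `</think>` tags named in the system prompt; the function name `split_response` is hypothetical, and the fallback behavior for missing tags is an assumption rather than documented model behavior.

```python
def split_response(text):
    """Split generated text into (thought, solution) around the <think> tags."""
    open_tag, close_tag = "<think>", "</think>"
    end = text.find(close_tag)
    if end == -1:
        # No closing tag: treat everything as the solution.
        return "", text.strip()
    start = text.find(open_tag)
    # If the opening tag is missing, the thought starts at the beginning of the text.
    thought_start = start + len(open_tag) if start != -1 and start < end else 0
    thought = text[thought_start:end].strip()
    solution = text[end + len(close_tag):].strip()
    return thought, solution
```

Apply it only to the newly generated tokens, for example `tokenizer.decode(outputs[0][inputs.shape[-1]:])`, because the system prompt itself contains a `<think> ... </think>` template that would otherwise be matched first.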

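Intended Use

Use type	Details
Primary use cases	The model is intended to accelerate research on language models and to serve as a building block for generative AI features. It targets general-purpose AI systems and applications (primarily in English) that involve memory- or compute-constrained environments, latency-bound scenarios, or reasoning and logic.
Out-of-scope use cases	The model is designed and tested for math reasoning only; it is not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate accuracy, safety, and fairness issues before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should comply with applicable laws and regulations (including privacy and trade compliance laws).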

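Data Overview

Training datasets

The training data is a mixture of Q&A and chat-format data in math, science, and coding. The chat prompts come from filtered high-quality web data and are optionally rewritten and processed through a synthetic data generation pipeline. Data that improves truthfulness and safety is also included.

Benchmark datasets

Phi-4-reasoning-plus was evaluated with the open-source Eureka evaluation suite and internal benchmarks. The evaluation tasks are:

Reasoning tasks: AIME 2025, 2024, 2023, and 2022 (math olympiad questions); GPQA-Diamond (complex science questions); OmniMath (olympiad-level math problems); LiveCodeBench (code generation); 3SAT and TSP (algorithmic problem solving); BA Calendar (planning); Maze and SpatialMap (spatial understanding).
General-purpose benchmarks: Kitab (information retrieval); IFEval and ArenaHard (instruction following); PhiBench (internal benchmark); FlenQA (effect of prompt length on model performance); HumanEvalPlus (functional code generation); MMLU-Pro (aggregated multitask language understanding dataset).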

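Safety

Approach

Phi-4-reasoning-plus uses a robust safety post-training approach based on supervised fine-tuning (SFT). It draws on a variety of open-source and in-house synthetic prompts, with LLM-generated responses that follow Microsoft's strict safety guidelines.

Safety evaluation and red-teaming

Before release, Phi-4-reasoning-plus went through a multi-faceted evaluation. Quantitative evaluation used multiple open-source safety benchmarks and in-house tools based on adversarial conversation simulation. For qualitative safety evaluation, the team worked with Microsoft's independent AI Red Team (AIRT) to assess the safety risks the model poses in both average and adversarial user scenarios. The model was also evaluated on the Toxigen benchmark, which measures bias and toxicity targeted at minority groups.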

Model Quality

Model quality on representative benchmarks. In the tables below, higher numbers indicate better performance:

Model	AIME 24	AIME 25	OmniMath	GPQA-D	LiveCodeBench (8/1/24–2/1/25)
Phi-4-reasoning	75.3	62.9	76.6	65.8	53.8
Phi-4-reasoning-plus	81.3	78.0	81.9	68.9	53.1
OpenThinker2-32B	58.0	58.0	—	64.1	—
QwQ 32B	79.5	65.8	—	59.5	63.4
EXAONE-Deep-32B	72.1	65.8	—	66.1	59.5
DeepSeek-R1-Distill-70B	69.3	51.5	63.4	66.2	57.5
DeepSeek-R1	78.7	70.4	85.0	73.0	62.8
o1-mini	63.6	54.8	—	60.0	53.8
o1	74.6	75.3	67.5	76.7	71.0
o3-mini	88.0	78.0	74.6	77.7	69.5
Claude-3.7-Sonnet	55.3	58.7	54.6	76.8	—
Gemini-2.5-Pro	92.0	86.7	61.1	84.0	69.2
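
To make the effect of the additional reinforcement learning stage easier to read off the table, the sketch below computes the per-benchmark difference between Phi-4-reasoning-plus and Phi-4-reasoning. The scores are copied from the two rows above; the variable and function names are hypothetical.

```python
# Scores copied from the table above (column order matches the table header).
BENCHMARKS = ["AIME 24", "AIME 25", "OmniMath", "GPQA-D", "LiveCodeBench"]
PHI4_REASONING = [75.3, 62.9, 76.6, 65.8, 53.8]
PHI4_REASONING_PLUS = [81.3, 78.0, 81.9, 68.9, 53.1]


def score_deltas(base, improved):
    """Return per-benchmark differences (improved - base), rounded to one decimal."""
    return [round(b - a, 1) for a, b in zip(base, improved)]


for name, delta in zip(BENCHMARKS, score_deltas(PHI4_REASONING, PHI4_REASONING_PLUS)):
    print(f"{name}: {delta:+.1f}")
```

The largest gain is on AIME 25 (+15.1 points), while LiveCodeBench is slightly lower (-0.7).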

Benchmark	Phi-4	Phi-4-reasoning	Phi-4-reasoning-plus	o3-mini	GPT-4o
FlenQA [3K-token subset]	82.0	97.7	97.9	96.8	90.8
IFEval Strict	62.3	83.4	84.9	91.5	81.8
ArenaHard	68.1	73.3	79.0	81.9	75.6
HumanEvalPlus	83.5	92.9	92.3	94.0	88.0
MMLUPro	71.5	74.3	76.0	79.4	73.0
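Kitab, no context, precision	19.3	23.2	27.6	37.9	53.7
Kitab, with context, precision	88.5	91.5	93.6	94.0	84.7
Kitab, no context, recall	8.2	4.9	6.3	4.2	20.3
Kitab, with context, recall	68.1	74.8	75.4	76.1	69.2
Toxigen Discriminative, toxic category	72.6	86.7	77.3	85.4	87.6
Toxigen Discriminative, neutral category	90.0	84.7	90.5	88.7	85.1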
PhiBench 2.21	58.2	70.6	74.2	78.0	72.4