Phi-4-reasoning-GGUF: A Free, Open-Source Reasoning Model for Math, Science, and Coding

Phi 4 Reasoning GGUF

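GGUF release published by unsloth (the underlying model was developed by Microsoft Research)

Phi-4-reasoning is a reasoning model fine-tuned from Phi-4 with supervised fine-tuning and reinforcement learning, targeting math, science, and coding.

Large language model

Transformers

License: MIT #math-reasoning #scientific-problem-solving #long-text-reasoning

Downloads: 6,046

Released: 5/1/2025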

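Model Introduction

Phi-4-reasoning is a language model focused on math, science, and coding reasoning, intended for applications that demand strong reasoning and logic.

Model Features

Advanced reasoning

Trained with supervised fine-tuning and reinforcement learning to reason through math, science, and coding problems.

Efficient performance

Scores well on reasoning and general-capability benchmarks, outperforming many open-weight models with more parameters.

Broad applicability

Suited to memory- or compute-constrained environments, latency-bound scenarios, and applications that require strong reasoning and logic.

Safety post-training

Safety post-training through supervised fine-tuning (SFT) aligns the model's responses with safety and ethical guidelines.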

Model Capabilities

Mathematical reasoning

Scientific question answering

Code generation

Complex problem solving

Logical reasoning

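Use Cases

Education

Math olympiad problem solving

Solves complex problems from math olympiad competitions such as AIME.

62.9% accuracy on AIME 2025

Graduate-level science question answering

Answers complex graduate-level science questions such as those in GPQA-Diamond.

65.8% accuracy on GPQA-Diamond

Programming

Competitive code generation

Generates competition-level code solutions.

53.8% accuracy on LiveCodeBench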

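🚀 Quick Start

Inference parameters

For inference, use temperature=0.8 and top_p=0.95 with do_sample=True. For more complex queries, raise the maximum number of tokens to 32k to leave room for a longer chain of thought (CoT).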
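
To make these settings concrete, here is a minimal pure-Python sketch of what temperature scaling and top-p (nucleus) filtering do to a next-token distribution. It is illustrative only: the function name `nucleus_filter` and the toy logits are assumptions made for this example, and inference frameworks implement the same idea on tensors rather than lists.

```python
import math


def nucleus_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and top-p filtering.

    Hypothetical helper for illustration; not part of any library.
    """
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise the surviving tokens so they sum to 1.
    return {i: probs[i] / mass for i in kept}
```

With do_sample=True, the next token is then drawn at random from the surviving tokens according to these renormalised probabilities.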

Input format

Given the nature of the training data, always use the ChatML template with the following system prompt at inference time:

<|im_start|>system<|im_sep|>
You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:<|im_end|>
<|im_start|>user<|im_sep|>
What is the derivative of x^2?<|im_end|>
<|im_start|>assistant<|im_sep|>
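
The layout above can be assembled by hand, as in the sketch below. The helper name `build_chatml_prompt` is hypothetical, and the line breaks simply mirror the template as printed here. The tokenizer's own chat template (applied through `apply_chat_template`) is authoritative and may differ in whitespace, so prefer it whenever a tokenizer is available.

```python
def build_chatml_prompt(messages):
    """Render [{'role': ..., 'content': ...}, ...] in the layout shown above.

    Hypothetical helper for illustration; whitespace follows the printed template.
    """
    parts = []
    for message in messages:
        # Each turn: start marker, role, separator, content, end marker.
        parts.append(
            f"<|im_start|>{message['role']}<|im_sep|>\n"
            f"{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant<|im_sep|>\n")
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": "(system prompt from above)"},
    {"role": "user", "content": "What is the derivative of x^2?"},
])
```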

Using the `transformers` library

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; device_map="auto" places the weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-reasoning", device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:"},
    {"role": "user", "content": "What is the derivative of x^2?"},
]
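# Render the messages with the model's chat template and append the assistant turn marker.
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Sample with the recommended settings; raise max_new_tokens (up to 32k) for harder queries.
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)
# The decoded text contains the prompt followed by the generated response.
print(tokenizer.decode(outputs[0]))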

Using the `vllm` library

vllm serve microsoft/Phi-4-reasoning --enable-reasoning --reasoning-parser deepseek_r1

Phi-4-reasoning can also be used directly in Ollama, llama.cpp, and any framework compatible with Phi-4.
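
Once the vLLM server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below rests on several assumptions that are not stated in this document: that the server listens on vLLM's default address `http://localhost:8000`, and that the reasoning parser returns the chain of thought in a `reasoning_content` field next to `content`. The helper names `build_chat_request` and `send_chat_request` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(question, system_prompt, max_tokens=4096):
    """Build a chat-completions payload using the recommended sampling settings."""
    return {
        "model": "microsoft/Phi-4-reasoning",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload and return (reasoning, answer). base_url is an assumption."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    message = body["choices"][0]["message"]
    # reasoning_content is assumed to be filled in by the reasoning parser.
    return message.get("reasoning_content"), message.get("content")


payload = build_chat_request(
    "What is the derivative of x^2?",
    system_prompt="(system prompt from the Input format section)",
)
```

Calling `send_chat_request(payload)` requires a running server; building the payload does not.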

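📦 Installation

The model card does not give installation steps. Install the framework you plan to use (for example transformers or vllm) by following its official documentation.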

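📚 Detailed Documentation

Model overview

Property	Details
Developer	Microsoft Research
Description	Phi-4-reasoning is an open-weight reasoning model fine-tuned from Phi-4 using supervised fine-tuning and reinforcement learning. The supervised fine-tuning dataset combines synthetic prompts with high-quality filtered data from public-domain websites, focusing on math, science, and coding skills as well as alignment data for safety and responsible AI.
Architecture	Same base model as the previously released Phi-4: a 14B-parameter, dense, decoder-only Transformer
Input	Text, best suited to chat-format prompts
Context length	32k tokens
GPUs	32 H100-80G
Training time	2.5 days
Training data	16B tokens, about 8.3B unique tokens
Output	Generated text in response to the input. Responses have two parts: a chain-of-thought reasoning block followed by a summary block
Dates	January 2025 to April 2025
Status	Static model trained on an offline dataset; publicly available data has a cutoff of March 2025 and earlier
Release date	April 30, 2025
License	MIT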
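
Because each response consists of a reasoning block followed by a summary, downstream code usually needs to separate the two. The following is a minimal sketch, assuming the response wraps its reasoning in `<think>` and `</think>` tags as the system prompt requests. The helper name `split_reasoning` is hypothetical, and special tokens such as `<|im_end|>` are assumed to have been stripped beforehand.

```python
def split_reasoning(text):
    """Split a response into (thought, solution).

    Hypothetical helper: falls back to ("", text) when the tags are missing
    or out of order, so callers always get a usable solution string.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", text.strip()
    thought = text[start + len(open_tag):end].strip()
    solution = text[end + len(close_tag):].strip()
    return thought, solution


thought, solution = split_reasoning(
    "<think>Apply the power rule: d/dx x^n = n x^(n-1).</think>The derivative is 2x."
)
```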

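Intended use

Use type	Details
Primary use cases	The model is designed to accelerate research on language models and to serve as a building block for generative AI features. It suits general-purpose AI systems and applications (primarily in English) that need to run in memory- or compute-constrained environments, are latency-bound, or require strong reasoning and logic.
Out-of-scope use cases	The model is designed and tested for math reasoning only; it is not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate accuracy, safety, and fairness issues before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and adhere to the laws and regulations that apply to their use case (including privacy and trade compliance laws), and should take into account the model's focus on English. When selecting use cases, refer to the Responsible AI considerations section below for further guidance. Nothing in this model card should be interpreted as or deemed a restriction or modification of the license under which the model is released.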

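Data overview

Training datasets

The training data is a mixture of question-answer and chat-format data in math, science, and coding. The chat prompts come from filtered high-quality web data and are optionally rewritten and processed through a synthetic data generation pipeline. Data that improves truthfulness and safety is also included.

Benchmark datasets

Phi-4-reasoning was evaluated with the open-source Eureka evaluation suite and internal benchmarks. The evaluation tasks are:

Reasoning tasks: AIME 2025, 2024, 2023, and 2022 (math olympiad questions); GPQA-Diamond (complex graduate-level science questions); OmniMath (a collection of more than 4,000 olympiad-level math problems); LiveCodeBench (a code generation benchmark drawn from competitive coding contests); 3SAT and TSP (algorithmic problem solving); BA Calendar (planning); Maze and SpatialMap (spatial understanding).
General-purpose benchmarks: Kitab (information retrieval); IFEval and ArenaHard (instruction following); PhiBench (internal benchmark); FlenQA (impact of prompt length on model performance); HumanEvalPlus (functional code generation); MMLU-Pro (a popular aggregated dataset for multi-task language understanding).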

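Safety

Approach

Phi-4-reasoning applies a safety post-training approach based on supervised fine-tuning (SFT). It draws on a variety of open-source and in-house generated synthetic prompts, paired with LLM-generated responses that adhere to Microsoft's strict safety guidelines. Those guidelines cover user understanding and clarity, security and ethical guidelines, limitations, disclaimers and knowledge scope, handling of complex and sensitive topics, safety and respectful engagement, confidentiality of the guidelines, and confidentiality of the chain of thought.

Safety evaluation and red-teaming

Before release, Phi-4-reasoning went through a multi-faceted evaluation. Quantitative evaluation used multiple open-source safety benchmarks and in-house tools based on adversarial conversation simulation. For qualitative safety evaluation, the team worked with Microsoft's independent AI Red Team (AIRT) to assess the safety risks posed by Phi-4-reasoning in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. In the adversarial user scenario, AIRT tested a range of techniques aimed at intentionally subverting the model's safety training, covering groundedness, jailbreaks, harmful content (such as hate and unfairness, violence, sexual content, or self-harm), and copyright violations for protected material. The model was also evaluated on the Toxigen benchmark, which measures bias and toxicity targeted at minority groups.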

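Model quality

The tables below give a high-level overview of model quality on representative benchmarks. Higher numbers indicate better performance:

Model	AIME 24	AIME 25	OmniMath	GPQA-D	LiveCodeBench (8/1/24–2/1/25)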
Phi-4-reasoning	75.3	62.9	76.6	65.8	53.8
Phi-4-reasoning-plus	81.3	78.0	81.9	68.9	53.1
OpenThinker2-32B	58.0	58.0	—	64.1	—
QwQ 32B	79.5	65.8	—	59.5	63.4
EXAONE-Deep-32B	72.1	65.8	—	66.1	59.5
DeepSeek-R1-Distill-70B	69.3	51.5	63.4	66.2	57.5
DeepSeek-R1	78.7	70.4	85.0	73.0	62.8
o1-mini	63.6	54.8	—	60.0	53.8
o1	74.6	75.3	67.5	76.7	71.0
o3-mini	88.0	78.0	74.6	77.7	69.5
Claude-3.7-Sonnet	55.3	58.7	54.6	76.8	—
Gemini-2.5-Pro	92.0	86.7	61.1	84.0	69.2

Model	FlenQA [3K-token subset]	IFEval Strict	ArenaHard	HumanEvalPlus	MMLUPro	Kitab (no context precision, with context precision, no context recall, with context recall)	Toxigen Discriminative (toxic category, neutral category)	PhiBench 2.21
Phi-4	82.0	62.3	68.1	83.5	71.5	19.3 88.5 8.2 68.1	72.6 90.0	58.2
Phi-4-reasoning	97.7	83.4	73.3	92.9	74.3	23.2 91.5 4.9 74.8	86.7 84.7	70.6
Phi-4-reasoning-plus	97.9	84.9	79.0	92.3	76.0	27.6 93.6 6.3 75.4	77.3 90.5	74.2
o3-mini	96.8	91.5	81.9	94.0	79.4	37.9 94.0 4.2 76.1	85.4 88.7	78.0
GPT-4o	90.8	81.8	75.6	88.0	73.0	53.7 84.7 20.3 69.2	87.6 85.1	72.4

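Overall, with only 14B parameters, Phi-4-reasoning performs well across a wide range of reasoning tasks. It outperforms significantly larger open-weight models such as the DeepSeek-R1 distilled 70B model on AIME and OmniMath in the table above, and it approaches the performance of the full DeepSeek-R1 model. The model also generalizes well to several newer reasoning benchmarks, including 3SAT, TSP, and BA-Calendar. On standard general-capability benchmarks, such as instruction following and non-reasoning tasks, it improves substantially over Phi-4, even though post-training focused mainly on reasoning skills in specific domains.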
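
The comparison with the DeepSeek models can be checked directly against the first table. The sketch below copies the relevant rows verbatim and computes per-benchmark gaps; the names `BENCHMARKS`, `SCORES`, and `score_gaps` are introduced here for illustration only.

```python
# Scores copied from the first benchmark table above.
BENCHMARKS = ["AIME 24", "AIME 25", "OmniMath", "GPQA-D", "LiveCodeBench"]
SCORES = {
    "Phi-4-reasoning": [75.3, 62.9, 76.6, 65.8, 53.8],
    "DeepSeek-R1-Distill-70B": [69.3, 51.5, 63.4, 66.2, 57.5],
    "DeepSeek-R1": [78.7, 70.4, 85.0, 73.0, 62.8],
}


def score_gaps(model, baseline):
    """Return {benchmark: model score minus baseline score}, rounded to 0.1."""
    return {
        name: round(a - b, 1)
        for name, a, b in zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    }


gaps_vs_distill = score_gaps("Phi-4-reasoning", "DeepSeek-R1-Distill-70B")
gaps_vs_r1 = score_gaps("Phi-4-reasoning", "DeepSeek-R1")
```

On these numbers, Phi-4-reasoning leads the distilled 70B model on AIME 24, AIME 25, and OmniMath, while trailing it slightly on GPQA-D and LiveCodeBench. It trails the full DeepSeek-R1 on all five benchmarks.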

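Responsible AI considerations

Like other language models, Phi-4-reasoning can behave in ways that are unfair, unreliable, or offensive. Limiting behaviors to be aware of include:

Quality of service: The model is trained primarily on English text, so performance is worse in other languages. English varieties with less representation in the training data may perform worse than standard American English. Phi-4-reasoning does not support multilingual use.
Representation of harms and perpetuation of stereotypes: These models can over- or under-represent groups of people, or erase the representation of some groups.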
