🚀 Phi-4-mini-reasoning
Phi-4-mini-reasoning is a lightweight open model focused on high-quality, reasoning-dense data and fine-tuned for advanced math reasoning.
🚀 Quick Start
Phi-4-mini-reasoning is a powerful model for mathematical reasoning. To get started, you can refer to the usage section below for details on tokenization, input formats, and inference.
✨ Features
- Optimized for Math Reasoning: Designed for multi-step, logic-intensive mathematical problem-solving tasks, especially in memory- or compute-constrained environments and latency-bound scenarios.
- High-Quality Output: Capable of maintaining context across steps, applying structured logic, and delivering accurate solutions in mathematical reasoning domains.
- Compact Size: Balances reasoning ability with efficiency, suitable for educational applications, embedded tutoring, and lightweight deployment on edge or mobile systems.
📦 Installation
Phi-4-mini-reasoning has been integrated in version 4.51.3 of transformers. You can verify the installed transformers version with:
pip list | grep transformers
Python 3.8 and 3.10 work best. The required packages are as follows:
flash_attn==2.7.4.post1
torch==2.5.1
transformers==4.51.3
accelerate==1.3.0
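As a rough sketch, the pinned versions above can also be checked from Python instead of reading the pip list output by hand. The helper name check_requirements is hypothetical, and the sketch assumes the distribution names match the names in the list (for example flash_attn) and that a local build suffix such as +cu121 can be ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the requirements list above.
REQUIRED = {
    "flash_attn": "2.7.4.post1",
    "torch": "2.5.1",
    "transformers": "4.51.3",
    "accelerate": "1.3.0",
}


def check_requirements(required=REQUIRED):
    """Return {package: (installed_or_None, wanted)} for every mismatch."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            # Drop a local build suffix such as "+cu121" before comparing.
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (installed, wanted) in check_requirements().items():
        print(f"{name}: installed={installed}, expected={wanted}")
```

An empty result means every listed package is installed at the pinned version.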
💻 Usage Examples
Basic Usage
Tokenizer
Phi-4-mini-reasoning supports a vocabulary size of up to 200064 tokens. The tokenizer files already provide placeholder tokens for downstream fine-tuning, and can be extended up to the model's vocabulary size.
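Before extending the tokenizer, it can help to confirm that the new tokens still fit within the 200064-token limit. The check below is a minimal sketch; can_add_tokens is a hypothetical helper, and it is deliberately conservative because it counts every unique new token even if some already exist in the vocabulary.

```python
MODEL_VOCAB_SIZE = 200064  # vocabulary size stated above


def can_add_tokens(current_size, new_tokens, model_vocab_size=MODEL_VOCAB_SIZE):
    """Return True if adding the unique new tokens stays within the model vocabulary."""
    return current_size + len(set(new_tokens)) <= model_vocab_size


# With a Hugging Face tokenizer, the current size is len(tokenizer), and
# tokens are added with tokenizer.add_tokens(new_tokens), for example:
#   if can_add_tokens(len(tokenizer), new_tokens):
#       tokenizer.add_tokens(new_tokens)
```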
Input Formats
The Phi-4-mini-reasoning model is best suited for prompts that follow a specific format. The primary format is:
Chat format
This format is used for general conversation and instructions:
<|system|>Your name is Phi, an AI math expert developed by Microsoft.<|end|><|user|>How to solve 3*x^2+4*x+5=1?<|end|><|assistant|>
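To make the layout of this format explicit, here is a small sketch that assembles the same string from a list of messages. The function build_chat_prompt is a hypothetical illustration only; in practice the tokenizer's apply_chat_template method produces the prompt, and its output should be treated as authoritative.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_chat_prompt(messages):
    """Join messages as <|role|>content<|end|> and open the assistant turn."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>{message['content']}<|end|>")
    # The trailing tag asks the model to generate the assistant reply.
    return "".join(parts) + "<|assistant|>"


prompt = build_chat_prompt([
    {"role": "system", "content": "Your name is Phi, an AI math expert developed by Microsoft."},
    {"role": "user", "content": "How to solve 3*x^2+4*x+5=1?"},
])
print(prompt)
```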
Inference
After obtaining the Phi-4-mini-reasoning model checkpoints, you can use the following sample code for inference:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fix the seed so sampling is repeatable across runs.
torch.random.manual_seed(0)

model_id = "microsoft/Phi-4-mini-reasoning"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": "How to solve 3*x^2+4*x+5=1?"
}]

# Render the messages in the chat format and tokenize them.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=32768,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
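Because sampled output can vary, it may be useful to compare the model's answer to the sample question against an independently computed reference. The sketch below solves the same equation with the quadratic formula; solve_quadratic is a hypothetical helper and is not part of the model or of transformers.

```python
import cmath


def solve_quadratic(a, b, c):
    """Return both roots of a*x^2 + b*x + c = 0, allowing complex results."""
    disc = cmath.sqrt(b * b - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)


# 3*x^2 + 4*x + 5 = 1 rearranges to 3*x^2 + 4*x + 4 = 0.
# The discriminant is 16 - 48 = -32, so the roots are complex:
# x = -2/3 ± (2*sqrt(2)/3) i
roots = solve_quadratic(3, 4, 4)
for root in roots:
    print(root)
```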
📚 Documentation
Intended Uses
Primary Use Cases
Phi-4-mini-reasoning is designed for multi-step, logic-intensive mathematical problem-solving tasks in memory- or compute-constrained environments and latency-bound scenarios. Use cases include formal proof generation, symbolic computation, advanced word problems, and a wide range of mathematical reasoning scenarios.
Use Case Considerations
This model is designed and tested for math reasoning only. Developers should consider the common limitations of language models and the performance differences across languages, and should evaluate and mitigate for accuracy, safety, and fairness before using the model in a specific downstream use case, especially in high-risk scenarios. They should also adhere to applicable laws and regulations.
Release Notes
This release of Phi-4-mini-reasoning addresses user feedback and market demand for a compact reasoning model. It is optimized for mathematical reasoning, fine-tuned with synthetic math data, and balances reasoning ability with efficiency.
Model Quality
The 3.8B-parameter Phi-4-mini-reasoning model was compared with a set of models over a variety of reasoning benchmarks:
| Model | AIME | MATH-500 | GPQA Diamond |
|---|---|---|---|
| o1-mini* | 63.6 | 90.0 | 60.0 |
| DeepSeek-R1-Distill-Qwen-7B | 53.3 | 91.4 | 49.5 |
| DeepSeek-R1-Distill-Llama-8B | 43.3 | 86.9 | 47.3 |
| Bespoke-Stratos-7B* | 20.0 | 82.0 | 37.8 |
| OpenThinker-7B* | 31.3 | 83.0 | 42.4 |
| Llama-3.2-3B-Instruct | 6.7 | 44.4 | 25.3 |
| Phi-4-Mini (base model, 3.8B) | 10.0 | 71.8 | 36.9 |
| Phi-4-mini-reasoning (3.8B) | 57.5 | 94.6 | 52.0 |
Overall, the 3.8B-parameter model achieves a similar level of multilingual language understanding and reasoning ability as much larger models, but has limitations due to its size.
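To put the benchmark table in perspective, the short sketch below computes the point gain of Phi-4-mini-reasoning over the Phi-4-Mini base model on each benchmark. The scores are copied from the table; the gains helper is a hypothetical illustration.

```python
# Scores copied from the benchmark table above.
BASE = {"AIME": 10.0, "MATH-500": 71.8, "GPQA Diamond": 36.9}
REASONING = {"AIME": 57.5, "MATH-500": 94.6, "GPQA Diamond": 52.0}


def gains(base, tuned):
    """Return the per-benchmark difference in points, rounded to one decimal."""
    return {name: round(tuned[name] - base[name], 1) for name in base}


for name, delta in gains(BASE, REASONING).items():
    print(f"{name}: +{delta} points")
```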
Training
Model
| Property | Details |
|---|---|
| Architecture | Shares the same architecture as Phi-4-Mini, a dense decoder-only Transformer model with 3.8B parameters. Major changes compared to Phi-3.5-Mini are 200K vocabulary, grouped-query attention, and shared input and output embedding. |
| Inputs | Text, best suited for prompts in chat format. |
| Context length | 128K tokens |
| GPUs | 128 H100-80G |
| Training time | 2 days |
| Training data | 150B tokens |
| Outputs | Generated text |
| Dates | Trained in February 2024 |
| Status | A static model trained on offline datasets with a cutoff date of February 2025 for publicly available data. |
| Supported languages | English |
| Release date | April 2025 |
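The GPU count and training time above imply a rough compute budget. The arithmetic below is a back-of-the-envelope sketch only: it assumes all 128 GPUs ran for the full 2 days and that the 150B figure is the total number of tokens processed, neither of which the table states explicitly. Both function names are hypothetical.

```python
def gpu_hours(num_gpus, days):
    """Total GPU-hours, assuming every GPU runs for the whole period."""
    return num_gpus * days * 24


def tokens_per_gpu_second(total_tokens, num_gpus, days):
    """Average tokens processed per GPU per second under the same assumption."""
    return total_tokens / (num_gpus * days * 24 * 3600)


print(gpu_hours(128, 2))                             # 128 * 2 * 24 GPU-hours
print(round(tokens_per_gpu_second(150e9, 128, 2)))   # approximate throughput
```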
Training Datasets
The training data consists of synthetic mathematical content generated by DeepSeek-R1. It includes over one million diverse math problems, and about 30 billion tokens of math content after verification. The dataset integrates three primary components: curated high-quality math questions, synthetic math data generated by DeepSeek-R1, and preference data for enhancing reasoning capabilities.
Software
Hardware
The Phi-4-mini-reasoning model uses flash attention by default, which requires certain types of GPU hardware. It has been tested on NVIDIA A100 and NVIDIA H100. If you want to run the model on NVIDIA V100 or earlier generation GPUs, call AutoModelForCausalLM.from_pretrained() with attn_implementation="eager".
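One way to apply this advice automatically is to choose the attention implementation from the GPU's compute capability. The sketch below assumes that flash attention needs compute capability 8.0 or higher (A100 is 8.0, H100 is 9.0, V100 is 7.0); that threshold and both helper names are assumptions for illustration, not part of the model card.

```python
def pick_attn_implementation(compute_capability_major):
    """Use flash attention on 8.x or newer GPUs, eager attention otherwise."""
    return "flash_attention_2" if compute_capability_major >= 8 else "eager"


def attn_for_current_gpu():
    """Inspect the current CUDA device and pick an implementation for it."""
    import torch  # imported lazily so the helper above works without torch

    if not torch.cuda.is_available():
        return "eager"
    major, _minor = torch.cuda.get_device_capability()
    return pick_attn_implementation(major)


# Example use when loading the model:
#   model = AutoModelForCausalLM.from_pretrained(
#       model_id,
#       attn_implementation=attn_for_current_gpu(),
#   )
```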
Safety Evaluation and Red-Teaming
The Phi-4 family of models adopts a robust safety post-training approach using a variety of datasets. Phi-4-mini-reasoning was developed in accordance with Microsoft's responsible AI principles, and its safety risks were assessed using the Azure AI Foundry's Risk and Safety Evaluation framework.
Responsible AI Considerations
The Phi family of models has potential limitations such as unfairness, unreliability, or offensiveness. Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks according to their specific use cases and contexts.
License
The model is licensed under the MIT license.
Trademarks
This project may contain trademarks or logos. Authorized use of Microsoft trademarks or logos must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks). Use of third-party trademarks or logos is subject to their policies.
Appendix A: Benchmark Methodology
We aim to ensure an apples-to-apples comparison in benchmarks by using the same generation configuration for every model. The model is evaluated on three popular benchmarks: MATH-500, AIME 2024, and GPQA Diamond.
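The idea of a shared generation configuration can be sketched as a single immutable settings object passed unchanged to every model under test. The default values below are borrowed from the inference example in this card and are only an assumption; the settings actually used for the reported benchmarks are not given here. GenerationSettings and evaluate_all are hypothetical names, and each model is represented by a plain callable.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """One set of sampling parameters shared by all evaluated models."""
    max_new_tokens: int = 32768
    temperature: float = 0.8
    top_p: float = 0.95
    do_sample: bool = True


def evaluate_all(models, prompts, settings):
    """Run every model on every prompt with the same settings."""
    kwargs = asdict(settings)
    return {
        name: [generate(prompt, **kwargs) for prompt in prompts]
        for name, generate in models.items()
    }
```

Freezing the dataclass prevents the settings from being changed between models by accident during a run.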
⚠️ Important Note
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
💡 Usage Tip
The model has an elevated defect rate when responding to election-critical queries. Users should verify election-related information with the election authority in their region. Performance may also be worse for non-English languages, so developers should test and customize the model as needed.