🚀 Phi-4-mini-reasoning
Phi-4-mini-reasoning is a lightweight open model built on high-quality, reasoning-dense data and fine-tuned for advanced math reasoning. It supports a 128K-token context length and delivers efficient performance on math-related tasks.
🚀 Quick Start
To quickly start using Phi-4-mini-reasoning, you need to set up the necessary environment. First, ensure you have the required packages installed:
flash_attn==2.7.4.post1
torch==2.5.1
transformers==4.51.3
accelerate==1.3.0
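If you want to confirm that your environment matches these versions before running inference, a quick check along the following lines can help (a minimal sketch; the package names and versions are simply the ones listed above):
# Optional sanity check (illustrative, not part of the official setup): confirm the
# installed versions match the ones listed above.
from importlib.metadata import PackageNotFoundError, version

expected = {
    "flash-attn": "2.7.4.post1",
    "torch": "2.5.1",
    "transformers": "4.51.3",
    "accelerate": "1.3.0",
}
for package, wanted in expected.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        installed = "not installed"
    print(f"{package}: {installed} (expected {wanted})")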
After obtaining the model checkpoints, you can use the following sample code for inference:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model_id = "microsoft/Phi-4-mini-reasoning"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a chat-format prompt and tokenize it.
messages = [{
    "role": "user",
    "content": "How to solve 3*x^2+4*x+5=1?",
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

# Sample a step-by-step solution; reasoning traces can be long, so allow many new tokens.
outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=32768,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens (strip the prompt).
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
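The pipeline helper imported above offers a higher-level alternative. The sketch below reuses the model, tokenizer, and messages objects from the snippet above and is illustrative rather than part of the official example:
# Alternative: run the same chat-style generation through the text-generation pipeline,
# reusing the model, tokenizer, and messages defined above.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
result = pipe(
    messages,
    max_new_tokens=32768,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)
# For chat input, generated_text holds the full conversation; the last entry is the reply.
print(result[0]["generated_text"][-1]["content"])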
✨ Features
- Lightweight and Efficient: With only 3.8B parameters, it achieves a level of multilingual language understanding and reasoning ability similar to much larger models, making it suitable for memory/compute-constrained environments.
- Advanced Math Reasoning: Trained on high-quality, reasoning-dense data and fine-tuned for advanced math reasoning, it excels at multi-step, logic-intensive mathematical problem-solving.
- Large Context Length: Supports a 128K-token context length, maintaining context across steps in long problem-solving chains.
📚 Documentation
Model Summary
Phi-4-mini-reasoning is a lightweight open model built upon synthetic data with a focus on high-quality, reasoning-dense data. It is further fine-tuned for more advanced math reasoning capabilities. The model belongs to the Phi-4 model family and supports a 128K-token context length.
Intended Uses
Primary Use Cases
Phi-4-mini-reasoning is designed for multi-step, logic-intensive mathematical problem-solving in memory/compute-constrained environments and latency-bound scenarios. Use cases include formal proof generation, symbolic computation, advanced word problems, and a wide range of mathematical reasoning scenarios.
Use Case Considerations
This model is designed and tested for math reasoning only. Developers should consider the common limitations of language models and performance differences across languages, and should evaluate and mitigate for accuracy, safety, and fairness before using the model in specific downstream use cases, especially high-risk scenarios. They should also adhere to applicable laws and regulations.
Release Notes
This release of Phi-4-mini-reasoning addresses user feedback and market demand for a compact reasoning model. It is a transformer-based language model optimized for mathematical reasoning, delivering high-quality, step-by-step problem solving in constrained environments.
Model Quality
| Model | AIME | MATH-500 | GPQA Diamond |
| --- | --- | --- | --- |
| o1-mini* | 63.6 | 90.0 | 60.0 |
| DeepSeek-R1-Distill-Qwen-7B | 53.3 | 91.4 | 49.5 |
| DeepSeek-R1-Distill-Llama-8B | 43.3 | 86.9 | 47.3 |
| Bespoke-Stratos-7B* | 20.0 | 82.0 | 37.8 |
| OpenThinker-7B* | 31.3 | 83.0 | 42.4 |
| Llama-3.2-3B-Instruct | 6.7 | 44.4 | 25.3 |
| Phi-4-Mini (base model, 3.8B) | 10.0 | 71.8 | 36.9 |
| Phi-4-mini-reasoning (3.8B) | 57.5 | 94.6 | 52.0 |
Overall, the 3.8B-parameter model achieves a level of multilingual language understanding and reasoning ability similar to that of much larger models. However, it is limited by its size for certain tasks, and users may encounter factual inaccuracies. This weakness may be mitigated by augmenting the model with a search engine in a retrieval-augmented generation (RAG) setting.
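As a rough illustration of that RAG-style augmentation, the sketch below simply prepends retrieved reference text to the user question; retrieve is a hypothetical placeholder for whatever search engine or vector store you use, not something shipped with this model.
# Hypothetical RAG-style augmentation: prepend retrieved reference text to the question.
# `retrieve` is an illustrative placeholder, not part of this model or release.
def retrieve(query: str, k: int = 3) -> list[str]:
    # A real implementation would call a search engine or vector store here.
    return []

question = "How to solve 3*x^2+4*x+5=1?"
context = "\n".join(retrieve(question))
messages = [{
    "role": "user",
    "content": f"Reference material:\n{context}\n\nQuestion: {question}",
}]
# `messages` can then be passed to apply_chat_template() and generate() as in Quick Start.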
Usage
Tokenizer
Phi-4-mini-reasoning supports a vocabulary size of up to 200,064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, and the vocabulary can be extended up to the model's full vocabulary size.
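If you extend the vocabulary for fine-tuning, the usual transformers pattern is to add tokens and resize the embeddings, roughly as sketched below (the token names here are hypothetical examples):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical task-specific tokens; any new names follow the same pattern.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|my_tool_call|>", "<|my_tool_result|>"]}
)

# Keep the embedding matrix in sync with the tokenizer, within the model's vocabulary size.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))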
Input Formats
The Phi-4-mini-reasoning model is best suited for prompts using the following chat format:
<|system|>Your name is Phi, an AI math expert developed by Microsoft.<|end|><|user|>How to solve 3*x^2+4*x+5=1?<|end|><|assistant|>
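This string can also be produced through the tokenizer's chat template rather than written by hand. The sketch below loads the tokenizer as in the Quick Start section; the rendered prompt should closely match the format shown above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")

messages = [
    {"role": "system", "content": "Your name is Phi, an AI math expert developed by Microsoft."},
    {"role": "user", "content": "How to solve 3*x^2+4*x+5=1?"},
]
# Render the template to a string instead of token IDs; the output should closely match
# the <|system|>...<|end|><|user|>...<|end|><|assistant|> format shown above.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)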
Training
Model
- Architecture: Shares the same architecture as Phi-4-Mini, a 3.8B-parameter dense decoder-only Transformer model. The major changes compared to Phi-3.5-Mini are a 200K vocabulary, grouped-query attention, and shared input and output embeddings.
- Inputs: Text, best suited for chat-format prompts.
- Context length: 128K tokens
- GPUs: 128 H100-80G
- Training time: 2 days
- Training data: 150B tokens
- Outputs: Generated text
- Dates: Trained in February 2025
- Status: A static model trained on offline datasets with a cutoff date of February 2025 for publicly available data.
- Supported languages: English
- Release date: April 2025
Training Datasets
The training data consists of synthetic mathematical content generated by DeepSeek-R1. The synthetic dataset contains over one million diverse math problems, with about 30 billion tokens of math content retained after verification. The dataset integrates three components:
- High-quality, publicly available math questions and part of the SFT data used for the base Phi-4-Mini model.
- Synthetic math data generated by DeepSeek-R1, used for supervised fine-tuning and model distillation.
- Preference data with correct and incorrect answers, used to enhance reasoning capabilities.
Software
See the required package versions listed in the Quick Start section (flash_attn 2.7.4.post1, torch 2.5.1, transformers 4.51.3, accelerate 1.3.0).
Hardware
The Phi-4-mini-reasoning model uses flash attention by default, which requires specific GPU hardware. It has been tested on NVIDIA A100 and NVIDIA H100. To run the model on NVIDIA V100 or earlier-generation GPUs, call AutoModelForCausalLM.from_pretrained() with attn_implementation="eager".
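For example, the Quick Start load call could be adjusted roughly as follows for a V100-class GPU (an illustrative variant, not a separate official setup):
from transformers import AutoModelForCausalLM

# Load without flash attention for V100-class or older GPUs (variant of the Quick Start call).
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-reasoning",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="eager",
)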
Safety Evaluation and Red-Teaming
The Phi-4 family of models uses a robust safety post-training approach, combining SFT, DPO, and RLHF with human-labeled and synthetic English-language datasets. Phi-4-mini-reasoning was developed according to Microsoft's responsible AI principles, and its safety was assessed using the Azure AI Foundry framework.
Responsible AI Considerations
Developers should be aware of potential limitations such as unfairness, unreliability, offensive content, and information inaccuracy. They should apply responsible AI best practices, fine-tune the model for their specific use cases, and implement appropriate safeguards.
Appendix A: Benchmark Methodology
We aim to ensure fair comparisons in benchmarks by using the same generation configuration for every model. The model is evaluated on three popular reasoning benchmarks:
- MATH-500: Consists of 500 challenging math problems that require complex reasoning and problem-solving.
- AIME 2024: Problems from a highly regarded math competition, used to assess advanced mathematical skill.
- GPQA Diamond: A set of challenging, graduate-level, expert-written science questions used to assess reasoning beyond pure math.
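The exact evaluation harness is not reproduced here. As a hedged sketch of applying "the same generation configuration" to every prompt, something along these lines could be used, reusing the model and tokenizer from the Quick Start section and borrowing its sampling values purely for illustration:
# Illustrative only: one shared generation configuration applied to every benchmark prompt.
# Sampling values are borrowed from the Quick Start example, not the official evaluation
# settings; `model` and `tokenizer` are assumed to be loaded as in the Quick Start section.
GENERATION_CONFIG = dict(
    max_new_tokens=32768,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)

def answer(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    )
    outputs = model.generate(**inputs.to(model.device), **GENERATION_CONFIG)
    return tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]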
📄 License
The model is licensed under the MIT license.
Trademarks
This project may contain trademarks or logos. Authorized use of Microsoft trademarks or logos must follow Microsoft’s Trademark & Brand Guidelines. Use of third-party trademarks or logos is subject to those parties’ policies.