🚀 Phi-3.5-MoE
Phi-3.5-MoE is a lightweight, state-of-the-art open model built on high-quality data. It supports multilingual use, has a 128K context length, and is optimized for a variety of scenarios, especially those with memory and latency constraints.
🚀 Quick Start
Phi-3.5-MoE-instruct is integrated into the official transformers releases starting from version 4.46.0. You can check the currently installed transformers version with `pip list | grep transformers`.
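If you prefer to check from Python instead of pip, the installed version can be read from the package itself (a minimal sketch):

```python
import transformers

# Should print 4.46.0 or newer for Phi-3.5-MoE-instruct support
print(transformers.__version__)
```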
The model is also available in Azure AI Studio.
✨ Features
- Lightweight and High-Performance: Built on high-quality datasets, it delivers strong performance in memory- and compute-constrained environments.
- Multilingual Support: Covers a wide range of languages, including Arabic, Chinese, and English (see the full list under Training below).
- Long Context Length: A 128K context length lets it handle long-document tasks effectively.
- Rigorous Optimization: Post-trained with supervised fine-tuning, proximal policy optimization, and direct preference optimization for precise instruction adherence and strong safety measures.
📦 Installation
Requirements
Examples of required packages:
```
flash_attn==2.5.8
torch==2.3.1
accelerate==0.31.0
transformers==4.46.0
```
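Assuming a CUDA-capable environment, the pinned versions above can be installed with pip; this is a sketch rather than an exhaustive setup guide, and flash_attn needs a compatible CUDA toolchain to build:

```bash
pip install torch==2.3.1 accelerate==0.31.0 transformers==4.46.0
pip install flash_attn==2.5.8
```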
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-MoE-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-MoE-instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding: deterministic output, up to 500 new tokens
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])
```
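The example above uses greedy decoding (do_sample=False), which is deterministic. The same pipeline also accepts standard sampling arguments if you want more varied responses; a sketch, with illustrative (not tuned) values:

```python
# Sampling-based generation; temperature and top_p values are illustrative
sampling_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}

output = pipe(messages, **sampling_args)
print(output[0]["generated_text"])
```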
📚 Documentation
Intended Uses
Primary Use Cases
The model is intended for commercial and research use in multiple languages. It's suitable for general-purpose AI systems and applications that involve:
- Memory/compute-constrained environments
- Latency-bound scenarios
- Strong reasoning (especially code, math, and logic)
It's designed to accelerate research on language and multimodal models and to serve as a building block for generative AI features.
Use Case Considerations
Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models, and evaluate and mitigate for accuracy, safety, and fairness before using the model in a specific downstream use case, especially in high-risk scenarios. They should also adhere to applicable laws and regulations.
Tokenizer
Phi-3.5-MoE-instruct supports a vocabulary size of up to 32,064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, and the vocabulary can be extended up to the model's full vocabulary size.
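As an illustration of extending the vocabulary for fine-tuning, the sketch below adds special tokens and keeps the embedding table in sync. It assumes `model` and `tokenizer` are loaded as in the usage example below, and the token names are hypothetical placeholders, not part of the released tokenizer:

```python
# Hypothetical special tokens for a downstream fine-tuning task
new_tokens = ["<|tool_call|>", "<|tool_result|>"]

num_added = tokenizer.add_tokens(new_tokens, special_tokens=True)

# Resizing is only needed if the vocabulary grows beyond the embedding size
if num_added > 0 and len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))
```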
Input Formats
Given the nature of the training data, the Phi-3.5-MoE-instruct model works best with prompts in the following chat format:
<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
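In practice you rarely need to assemble this prompt string by hand: the tokenizer ships with a chat template that renders it from a list of messages. A minimal sketch, assuming the tokenizer from the usage example above:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
]

# Render the <|system|> ... <|end|> <|assistant|> prompt without tokenizing
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```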
🔧 Technical Details
Benchmarks
To understand its capabilities, we compared Phi-3.5-MoE with a set of models over a variety of benchmarks using our internal benchmark platform.
High-level overview of representative benchmarks:
| Category | Benchmark | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
|----------|-----------|----------------------|--------------------------------|-----------------------|---------------|------------------|-------------------------------|
| Popular aggregated benchmark | Arena Hard | 37.9 | 39.4 | 25.7 | 42.0 | 55.2 | 75.0 |
| | BigBench Hard CoT (0-shot) | 79.1 | 60.2 | 63.4 | 63.5 | 66.7 | 80.4 |
| | MMLU (5-shot) | 78.9 | 67.2 | 68.1 | 71.3 | 78.7 | 77.2 |
| | MMLU-Pro (0-shot, CoT) | 54.3 | 40.7 | 44.0 | 50.1 | 57.2 | 62.8 |
| Reasoning | ARC Challenge (10-shot) | 91.0 | 84.8 | 83.1 | 89.8 | 92.8 | 93.5 |
| | BoolQ (2-shot) | 84.6 | 82.5 | 82.8 | 85.7 | 85.8 | 88.7 |
| | GPQA (0-shot, CoT) | 36.8 | 28.6 | 26.3 | 29.2 | 37.5 | 41.1 |
| | HellaSwag (5-shot) | 83.8 | 76.7 | 73.5 | 80.9 | 67.5 | 87.1 |
| | OpenBookQA (10-shot) | 89.6 | 84.4 | 84.8 | 89.6 | 89.0 | 90.0 |
| | PIQA (5-shot) | 88.6 | 83.5 | 81.2 | 83.7 | 87.5 | 88.7 |
| | Social IQA (5-shot) | 78.0 | 75.3 | 71.8 | 74.7 | 77.8 | 82.9 |
| | TruthfulQA (MC2) (10-shot) | 77.5 | 68.1 | 69.2 | 76.6 | 76.6 | 78.2 |
| | WinoGrande (5-shot) | 81.3 | 70.4 | 64.7 | 74.0 | 74.7 | 76.9 |
| Multilingual | Multilingual MMLU (5-shot) | 69.9 | 58.9 | 56.2 | 63.8 | 77.2 | 72.9 |
| | MGSM (0-shot, CoT) | 58.7 | 63.3 | 56.7 | 75.1 | 75.8 | 81.7 |
| Math | GSM8K (8-shot, CoT) | 88.7 | 84.2 | 82.4 | 84.9 | 82.4 | 91.3 |
| | MATH (0-shot, CoT) | 59.5 | 31.2 | 47.6 | 50.9 | 38.0 | 70.2 |
| Long context | Qasper | 40.0 | 30.7 | 37.2 | 13.9 | 43.5 | 39.8 |
| | SQuALITY | 24.1 | 25.8 | 26.2 | 0.0 | 23.5 | 23.8 |
| Code generation | HumanEval (0-shot) | 70.7 | 63.4 | 66.5 | 61.0 | 74.4 | 86.6 |
| | MBPP (3-shot) | 80.8 | 68.1 | 69.4 | 69.3 | 77.5 | 84.1 |
| Average | | 69.2 | 61.3 | 61.0 | 63.3 | 68.5 | 74.9 |
Different categories across 80 public benchmark datasets
| Category | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
|----------|----------------------|--------------------------------|-----------------------|---------------|------------------|-------------------------------|
| Popular aggregated benchmark | 62.6 | 51.9 | 50.3 | 56.7 | 64.5 | 73.9 |
| Reasoning | 78.7 | 72.2 | 70.5 | 75.4 | 77.7 | 80.0 |
| Language understanding | 71.8 | 67.0 | 62.9 | 72.8 | 66.6 | 76.8 |
| Robustness | 75.6 | 65.2 | 59.8 | 64.7 | 68.9 | 77.5 |
| Long context | 25.5 | 24.5 | 25.5 | 0.0 | 27.0 | 25.4 |
| Math | 74.1 | 57.7 | 65.0 | 67.9 | 60.2 | 80.8 |
| Code generation | 68.3 | 56.9 | 65.8 | 58.3 | 66.8 | 69.9 |
| Multilingual | 65.8 | 55.3 | 47.5 | 59.6 | 64.3 | 76.6 |
Overall, with only 6.6B active parameters, Phi-3.5-MoE achieves a level of language understanding and math performance similar to that of much larger models. It also outperforms bigger models in reasoning capability and trails only GPT-4o-mini. However, its size still limits it on certain tasks.
Multilingual
The table below highlights the multilingual capability of Phi-3.5-MoE on the multilingual MMLU, MEGA, and multilingual MMLU-Pro datasets.
| Category | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
|----------|----------------------|--------------------------------|-----------------------|---------------|------------------|-------------------------------|
| Multilingual MMLU | 69.9 | 58.9 | 56.2 | 63.8 | 77.2 | 72.9 |
| Multilingual MMLU-Pro | 45.3 | 34.0 | 21.4 | 43.0 | 57.9 | 53.2 |
| MGSM | 58.7 | 63.3 | 56.7 | 75.1 | 75.8 | 81.7 |
| MEGA MLQA | 65.3 | 61.2 | 45.2 | 54.4 | 61.6 | 70.0 |
| MEGA TyDi QA | 67.1 | 63.7 | 54.5 | 65.6 | 63.6 | 81.8 |
| MEGA UDPOS | 60.4 | 58.2 | 54.1 | 56.6 | 62.4 | 66.0 |
| MEGA XCOPA | 76.6 | 10.8 | 21.1 | 31.2 | 95.0 | 90.3 |
| MEGA XStoryCloze | 82.8 | 92.3 | 71.0 | 87.0 | 20.7 | 96.6 |
| Average | 65.8 | 55.3 | 47.5 | 59.6 | 64.3 | 76.6 |
Long Context
Phi-3.5-MoE supports a 128K context length and is therefore capable of long-document/meeting summarization, long-document QA, and multilingual context retrieval.
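If you plan to push toward the full window, it can help to verify that a document actually fits before sending it; a rough sketch using the tokenizer from the usage example above (the file path and limit constant are illustrative):

```python
MAX_CONTEXT_TOKENS = 128 * 1024  # nominal 128K-token window

with open("long_report.txt") as f:  # placeholder path
    document = f.read()

n_tokens = len(tokenizer.encode(document))
print(f"{n_tokens} tokens; fits in context: {n_tokens < MAX_CONTEXT_TOKENS}")
```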
| Benchmark | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
|-----------|----------------------|--------------------------------|-----------------------|------------------|-------------------------------|
| GovReport | 26.4 | 25.6 | 25.1 | 27.8 | 24.8 |
| QMSum | 19.9 | 22.1 | 21.6 | 24.0 | 21.7 |
| Qasper | 40.0 | 30.7 | 37.2 | 43.5 | 39.8 |
| SQuALITY | 24.1 | 25.8 | 26.2 | 23.5 | 23.8 |
| SummScreenFD | 16.9 | 18.2 | 17.6 | 16.3 | 17.0 |
| Average | 25.5 | 24.5 | 25.5 | 27.0 | 25.4 |
RULER: a retrieval-based benchmark for long context understanding
| Model | 4K | 8K | 16K | 32K | 64K | 128K | Average |
|-------|----|----|-----|-----|-----|------|---------|
| Phi-3.5-MoE-instruct | 94.8 | 93.0 | 93.2 | 91.6 | 85.7 | 64.2 | 87.1 |
| Llama-3.1-8B-instruct | 95.5 | 93.8 | 91.6 | 87.4 | 84.7 | 77.0 | 88.3 |
| Mistral-Nemo-12B-instruct-2407 | 87.8 | 87.2 | 87.7 | 69.0 | 46.8 | 19.0 | 66.2 |
RepoQA: a benchmark for long context code understanding
| Model | Python | C++ | Rust | Java | TypeScript | Average |
|-------|--------|-----|------|------|------------|---------|
| Phi-3.5-MoE-instruct | 89 | 74 | 81 | 88 | 95 | 85 |
| Llama-3.1-8B-instruct | 80 | 65 | 73 | 76 | 63 | 71 |
| Mistral-7B-instruct-v0.3 | 61 | 57 | 51 | 61 | 80 | 62 |
Training
Model
| Property | Details |
|----------|---------|
| Architecture | Phi-3.5-MoE has 16x3.8B parameters with 6.6B active parameters when using 2 experts. It is a mixture-of-experts decoder-only Transformer model using a tokenizer with a vocabulary size of 32,064. |
| Inputs | Text. Best suited for prompts using chat format. |
| Context length | 128K tokens |
| GPUs | 512 H100-80G |
| Training time | 23 days |
| Training data | 4.9T tokens |
| Outputs | Generated text in response to the input |
| Dates | Trained between April and August 2024 |
| Status | This is a static model trained on an offline dataset with a cutoff date of October 2023 for publicly available data. Future versions of the tuned models may be released as we improve them. |
| Supported languages | Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian |
| Release date | August 2024 |
Training Datasets
Our training data includes a wide variety of sources, totaling 4.9 trillion tokens (including 10% multilingual data), and is a combination of:
- Publicly available documents rigorously filtered for quality, selected high-quality educational data, and code.
- Newly created synthetic, "textbook-like" data for teaching math, coding, common-sense reasoning, and general knowledge of the world.
- High-quality chat-format supervised data covering various topics to reflect human preferences.
📄 License
The model is released under the MIT license.