🚀 Phi-3.5-MoE
Phi-3.5-MoE is a lightweight, state-of-the-art open model built on high-quality data. It supports multilingual use, has a 128K context length, and is optimized for a variety of scenarios, especially those with memory and latency constraints.
🚀 Quick Start
Phi-3.5-MoE-instruct is integrated into the official transformers releases starting from version 4.46.0. You can check the currently installed transformers version with `pip list | grep transformers`.
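If you prefer to check from Python instead of pip, the installed version can be read from the package itself (a minimal sketch):

```python
import transformers

# Should print 4.46.0 or newer for Phi-3.5-MoE-instruct support
print(transformers.__version__)
```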
The model is also available in Azure AI Studio.
✨ Features
- Lightweight and High-Performance: Built on high-quality datasets, it delivers strong performance in memory- and compute-constrained environments.
- Multilingual Support: Covers a wide range of languages, including Arabic, Chinese, and English (see the full list under Training below).
- Long Context Length: A 128K context length lets it handle long-document tasks effectively.
- Rigorous Optimization: Post-trained with supervised fine-tuning, proximal policy optimization, and direct preference optimization for precise instruction adherence and strong safety measures.
📦 Installation
Requirements
Examples of required packages:
```
flash_attn==2.5.8
torch==2.3.1
accelerate==0.31.0
transformers==4.46.0
```
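Assuming a CUDA-capable environment, the pinned versions above can be installed with pip; this is a sketch rather than an exhaustive setup guide, and flash_attn needs a compatible CUDA toolchain to build:

```bash
pip install torch==2.3.1 accelerate==0.31.0 transformers==4.46.0
pip install flash_attn==2.5.8
```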
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-MoE-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-MoE-instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding: deterministic output, up to 500 new tokens
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])
```
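The example above uses greedy decoding (do_sample=False), which is deterministic. The same pipeline also accepts standard sampling arguments if you want more varied responses; a sketch, with illustrative (not tuned) values:

```python
# Sampling-based generation; temperature and top_p values are illustrative
sampling_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}

output = pipe(messages, **sampling_args)
print(output[0]["generated_text"])
```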
📚 Documentation
Intended Uses
Primary Use Cases
The model is intended for commercial and research use in multiple languages. It's suitable for general-purpose AI systems and applications that involve:
- Memory/compute-constrained environments
- Latency-bound scenarios
- Strong reasoning (especially code, math, and logic)
It's designed to accelerate research on language and multimodal models and to serve as a building block for generative AI features.
Use Case Considerations
Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models, and evaluate and mitigate for accuracy, safety, and fairness before using the model in a specific downstream use case, especially in high-risk scenarios. They should also adhere to applicable laws and regulations.
Tokenizer
Phi-3.5-MoE-instruct supports a vocabulary size of up to 32,064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, and the vocabulary can be extended up to the model's full vocabulary size.
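As an illustration of extending the vocabulary for fine-tuning, the sketch below adds special tokens and keeps the embedding table in sync. It assumes `model` and `tokenizer` are loaded as in the usage example below, and the token names are hypothetical placeholders, not part of the released tokenizer:

```python
# Hypothetical special tokens for a downstream fine-tuning task
new_tokens = ["<|tool_call|>", "<|tool_result|>"]

num_added = tokenizer.add_tokens(new_tokens, special_tokens=True)

# Resizing is only needed if the vocabulary grows beyond the embedding size
if num_added > 0 and len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))
```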
Input Formats
Given the nature of the training data, the Phi-3.5-MoE-instruct model works best with prompts in the following chat format:
<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
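In practice you rarely need to assemble this prompt string by hand: the tokenizer ships with a chat template that renders it from a list of messages. A minimal sketch, assuming the tokenizer from the usage example above:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
]

# Render the <|system|> ... <|end|> <|assistant|> prompt without tokenizing
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```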
🔧 Technical Details
Benchmarks
To understand its capabilities, we compared Phi-3.5-MoE with a set of models over a variety of benchmarks using our internal benchmark platform.
High-level overview of representative benchmarks:
| Category | Benchmark | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
|----------|-----------|----------------------|--------------------------------|-----------------------|---------------|------------------|-------------------------------|
| Popular aggregated benchmark | Arena Hard | 37.9 | 39.4 | 25.7 | 42.0 | 55.2 | 75.0 |
| | BigBench Hard CoT (0-shot) | 79.1 | 60.2 | 63.4 | 63.5 | 66.7 | 80.4 |
| | MMLU (5-shot) | 78.9 | 67.2 | 68.1 | 71.3 | 78.7 | 77.2 |
| | MMLU-Pro (0-shot, CoT) | 54.3 | 40.7 | 44.0 | 50.1 | 57.2 | 62.8 |
| Reasoning | ARC Challenge (10-shot) | 91.0 | 84.8 | 83.1 | 89.8 | 92.8 | 93.5 |
| | BoolQ (2-shot) | 84.6 | 82.5 | 82.8 | 85.7 | 85.8 | 88.7 |
| | GPQA (0-shot, CoT) | 36.8 | 28.6 | 26.3 | 29.2 | 37.5 | 41.1 |
| | HellaSwag (5-shot) | 83.8 | 76.7 | 73.5 | 80.9 | 67.5 | 87.1 |
| | OpenBookQA (10-shot) | 89.6 | 84.4 | 84.8 | 89.6 | 89.0 | 90.0 |
| | PIQA (5-shot) | 88.6 | 83.5 | 81.2 | 83.7 | 87.5 | 88.7 |
| | Social IQA (5-shot) | 78.0 | 75.3 | 71.8 | 74.7 | 77.8 | 82.9 |
| | TruthfulQA (MC2) (10-shot) | 77.5 | 68.1 | 69.2 | 76.6 | 76.6 | 78.2 |
| | WinoGrande (5-shot) | 81.3 | 70.4 | 64.7 | 74.0 | 74.7 | 76.9 |
| Multilingual | Multilingual MMLU (5-shot) | 69.9 | 58.9 | 56.2 | 63.8 | 77.2 | 72.9 |
| | MGSM (0-shot, CoT) | 58.7 | 63.3 | 56.7 | 75.1 | 75.8 | 81.7 |
| Math | GSM8K (8-shot, CoT) | 88.7 | 84.2 | 82.4 | 84.9 | 82.4 | 91.3 |
| | MATH (0-shot, CoT) | 59.5 | 31.2 | 47.6 | 50.9 | 38.0 | 70.2 |
| Long context | Qasper | 40.0 | 30.7 | 37.2 | 13.9 | 43.5 | 39.8 |
| | SQuALITY | 24.1 | 25.8 | 26.2 | 0.0 | 23.5 | 23.8 |
| Code generation | HumanEval (0-shot) | 70.7 | 63.4 | 66.5 | 61.0 | 74.4 | 86.6 |
| | MBPP (3-shot) | 80.8 | 68.1 | 69.4 | 69.3 | 77.5 | 84.1 |
| Average | | 69.2 | 61.3 | 61.0 | 63.3 | 68.5 | 74.9 |
Different categories across 80 public benchmark datasets
| Category | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
|----------|----------------------|--------------------------------|-----------------------|---------------|------------------|-------------------------------|
| Popular aggregated benchmark | 62.6 | 51.9 | 50.3 | 56.7 | 64.5 | 73.9 |
| Reasoning | 78.7 | 72.2 | 70.5 | 75.4 | 77.7 | 80.0 |
| Language understanding | 71.8 | 67.0 | 62.9 | 72.8 | 66.6 | 76.8 |
| Robustness | 75.6 | 65.2 | 59.8 | 64.7 | 68.9 | 77.5 |
| Long context | 25.5 | 24.5 | 25.5 | 0.0 | 27.0 | 25.4 |
| Math | 74.1 | 57.7 | 65.0 | 67.9 | 60.2 | 80.8 |
| Code generation | 68.3 | 56.9 | 65.8 | 58.3 | 66.8 | 69.9 |
| Multilingual | 65.8 | 55.3 | 47.5 | 59.6 | 64.3 | 76.6 |
Overall, with only 6.6B active parameters, Phi-3.5-MoE achieves a level of language understanding and math performance similar to that of much larger models. It also outperforms bigger models in reasoning capability and trails only GPT-4o-mini. However, its size still limits it on certain tasks.
Multilingual
The table below highlights the multilingual capability of Phi-3.5-MoE on the multilingual MMLU, MEGA, and multilingual MMLU-Pro datasets.
| Category | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
|----------|----------------------|--------------------------------|-----------------------|---------------|------------------|-------------------------------|
| Multilingual MMLU | 69.9 | 58.9 | 56.2 | 63.8 | 77.2 | 72.9 |
| Multilingual MMLU-Pro | 45.3 | 34.0 | 21.4 | 43.0 | 57.9 | 53.2 |
| MGSM | 58.7 | 63.3 | 56.7 | 75.1 | 75.8 | 81.7 |
| MEGA MLQA | 65.3 | 61.2 | 45.2 | 54.4 | 61.6 | 70.0 |
| MEGA TyDi QA | 67.1 | 63.7 | 54.5 | 65.6 | 63.6 | 81.8 |
| MEGA UDPOS | 60.4 | 58.2 | 54.1 | 56.6 | 62.4 | 66.0 |
| MEGA XCOPA | 76.6 | 10.8 | 21.1 | 31.2 | 95.0 | 90.3 |
| MEGA XStoryCloze | 82.8 | 92.3 | 71.0 | 87.0 | 20.7 | 96.6 |
| Average | 65.8 | 55.3 | 47.5 | 59.6 | 64.3 | 76.6 |
Long Context
Phi-3.5-MoE supports a 128K context length and is therefore capable of long-document/meeting summarization, long-document QA, and multilingual context retrieval.
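If you plan to push toward the full window, it can help to verify that a document actually fits before sending it; a rough sketch using the tokenizer from the usage example above (the file path and limit constant are illustrative):

```python
MAX_CONTEXT_TOKENS = 128 * 1024  # nominal 128K-token window

with open("long_report.txt") as f:  # placeholder path
    document = f.read()

n_tokens = len(tokenizer.encode(document))
print(f"{n_tokens} tokens; fits in context: {n_tokens < MAX_CONTEXT_TOKENS}")
```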
| Benchmark | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
|-----------|----------------------|--------------------------------|-----------------------|------------------|-------------------------------|
| GovReport | 26.4 | 25.6 | 25.1 | 27.8 | 24.8 |
| QMSum | 19.9 | 22.1 | 21.6 | 24.0 | 21.7 |
| Qasper | 40.0 | 30.7 | 37.2 | 43.5 | 39.8 |
| SQuALITY | 24.1 | 25.8 | 26.2 | 23.5 | 23.8 |
| SummScreenFD | 16.9 | 18.2 | 17.6 | 16.3 | 17.0 |
| Average | 25.5 | 24.5 | 25.5 | 27.0 | 25.4 |
RULER: a retrieval-based benchmark for long context understanding
| Model | 4K | 8K | 16K | 32K | 64K | 128K | Average |
|-------|----|----|-----|-----|-----|------|---------|
| Phi-3.5-MoE-instruct | 94.8 | 93.0 | 93.2 | 91.6 | 85.7 | 64.2 | 87.1 |
| Llama-3.1-8B-instruct | 95.5 | 93.8 | 91.6 | 87.4 | 84.7 | 77.0 | 88.3 |
| Mistral-Nemo-12B-instruct-2407 | 87.8 | 87.2 | 87.7 | 69.0 | 46.8 | 19.0 | 66.2 |
RepoQA: a benchmark for long context code understanding
| Model | Python | C++ | Rust | Java | TypeScript | Average |
|-------|--------|-----|------|------|------------|---------|
| Phi-3.5-MoE-instruct | 89 | 74 | 81 | 88 | 95 | 85 |
| Llama-3.1-8B-instruct | 80 | 65 | 73 | 76 | 63 | 71 |
| Mistral-7B-instruct-v0.3 | 61 | 57 | 51 | 61 | 80 | 62 |
Training
Model
| Property | Details |
|----------|---------|
| Architecture | Phi-3.5-MoE has 16x3.8B parameters with 6.6B active parameters when using 2 experts. It is a mixture-of-experts decoder-only Transformer model using a tokenizer with a vocabulary size of 32,064. |
| Inputs | Text. Best suited for prompts using chat format. |
| Context length | 128K tokens |
| GPUs | 512 H100-80G |
| Training time | 23 days |
| Training data | 4.9T tokens |
| Outputs | Generated text in response to the input |
| Dates | Trained between April and August 2024 |
| Status | This is a static model trained on an offline dataset with a cutoff date of October 2023 for publicly available data. Future versions of the tuned models may be released as we improve them. |
| Supported languages | Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian |
| Release date | August 2024 |
Training Datasets
Our training data includes a wide variety of sources, totaling 4.9 trillion tokens (including 10% multilingual data), and is a combination of:
- Publicly available documents rigorously filtered for quality, selected high-quality educational data, and code.
- Newly created synthetic, "textbook-like" data for teaching math, coding, common-sense reasoning, and general knowledge of the world.
- High-quality chat-format supervised data covering various topics to reflect human preferences.
📄 License
The model is released under the MIT license.