Phi-3 Mini 128K Instruct open-source model: lightweight, strong reasoning, 128K context support

Phi 3 Mini 128k Instruct

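Developed by Microsoft

Phi-3 Mini 128K Instruct is a lightweight open-source model with 3.8 billion parameters that focuses on reasoning and supports a 128K-token context length.

Large language model

Transformers

Supports multiple languages | Open-source license: MIT | #lightweight-reasoning #long-context-processing #code-generation

Downloads: 399.68k

Release date: 4/22/2024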

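Model overview

The model is part of the Phi-3 family. It has been trained and optimized for instruction following and safety, and performs well on common sense, language understanding, math, coding, and logical reasoning.

Model features

Long-context support

Supports a context length of 128K tokens, which suits long documents and complex conversations.

Lightweight and efficient

With only 3.8 billion parameters, it delivers strong inference performance in resource-constrained environments.

Multi-domain capability

Performs especially well in areas that demand strong reasoning, such as code, math, and logic.

Safety optimization

Specifically trained to follow safety measures and responsible AI guidelines.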

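Model capabilities

Text generation

Code understanding and generation

Math problem solving

Logical reasoning

Long-document processing

Multi-turn dialogue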

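Use cases

Development assistance

Code generation and explanation

Helps developers generate code snippets or explain complex code logic

Improves development efficiency and lowers the learning curve

Education

Math problem answering

Answers a range of math problems and provides solution steps

Supports learning and teaching mathematics

Business analysis

Long-document summarization

Processes and analyzes long business documents

Quickly extracts key information to speed up decision-making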

🚀 Phi-4

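Phi-4 is available in several versions, including a multimodal instruct version and a mini instruct version, each with a corresponding ONNX version:

[Multimodal instruct | ONNX]
[Mini instruct | ONNX]

Phi-4 is a capable model suited to a variety of natural language processing tasks, including text generation.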

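🚀 Quick start

Phi-3-Mini-128K-Instruct has been integrated into the development version (4.41.3) of transformers. Until the official version is released through pip, you can do the following:

- When loading the model, make sure to pass trust_remote_code=True to the from_pretrained() function.
- Update your local transformers to the development version: pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers. This command is an alternative to cloning and installing from source.

You can verify the current transformers version with pip list | grep transformers.

Example code

The following code shows how to quickly run the model on a GPU: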

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the random seed for reproducibility.
torch.random.manual_seed(0)

# Load the model onto the GPU; torch_dtype="auto" picks the dtype from the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

# Multi-turn conversation; the model answers the final user message.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# do_sample=False selects greedy decoding, so temperature is not used;
# return_full_text=False returns only the newly generated text.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Note: to use flash attention, pass attn_implementation="flash_attention_2" when calling AutoModelForCausalLM.from_pretrained().
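
The note above can be combined with a fallback for machines where flash attention is not installed. The following is a minimal sketch, not part of the original instructions: the helper names pick_attn_implementation and build_load_kwargs are hypothetical, and the check only tests whether the flash_attn package is importable, not whether the GPU supports it. The fallback value "eager" is the standard non-flash attention implementation in transformers.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "eager".

    Hypothetical helper: it only checks that the package is installed.
    """
    has_flash = importlib.util.find_spec("flash_attn") is not None
    if prefer_flash and has_flash:
        return "flash_attention_2"
    return "eager"


def build_load_kwargs(prefer_flash: bool = True) -> dict:
    """Keyword arguments for AutoModelForCausalLM.from_pretrained()."""
    return {
        "device_map": "cuda",
        "torch_dtype": "auto",
        "trust_remote_code": True,
        "attn_implementation": pick_attn_implementation(prefer_flash),
    }
```

The resulting dictionary can be unpacked into the loading call, for example AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-128k-instruct", **build_load_kwargs()).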

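✨ Key features

Model overview

Phi-3-Mini-128K-Instruct is a lightweight, state-of-the-art open model with 3.8 billion parameters, trained on the Phi-3 datasets. These datasets include both synthetic data and filtered publicly available website data, with an emphasis on high-quality, reasoning-dense properties.

The model belongs to the Mini version of the Phi-3 family and comes in two variants, 4K and 128K, which indicate the context length (in tokens) each supports.

After initial training, the model underwent a post-training process that included supervised fine-tuning (SFT) and direct preference optimization (DPO) to strengthen instruction following and adherence to safety measures. On benchmarks covering common sense, language understanding, math, coding, long context, and logical reasoning, Phi-3 Mini-128K-Instruct showed strong performance among models with fewer than 13 billion parameters.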

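Resources and technical documentation

Intended uses

Primary use cases: the model is intended for commercial and research use in English. It suits applications that require:
- Memory/compute-constrained environments
- Latency-bound scenarios
- Strong reasoning (especially code, math, and logic)

Use case considerations: our models are not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate accuracy, safety, and fairness issues before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) relevant to their use case.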

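Release notes

This update improves on the original instruction-tuned Phi-3-mini release, based on valuable customer feedback. The model used additional post-training data, leading to substantial gains in long-context understanding, instruction following, and structured output. We also improved multi-turn conversation quality, added explicit support for the <|system|> tag, and significantly improved reasoning capability.

We believe most use cases will benefit from this release, but we encourage users to test it in their particular AI applications. We appreciate the enthusiastic adoption of the Phi-3 model family and continue to welcome all feedback from the community.

The tables below show the improvements of the new release in instruction following, structured output, reasoning, and long-context understanding on public and internal benchmark datasets: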

Benchmark	Original	June 2024 update
Instruction Extra Hard	5.7	5.9
Instruction Hard	5.0	5.2
JSON Structure Output	1.9	60.1
XML Structure Output	47.8	52.9
GPQA	25.9	29.7
MMLU	68.1	69.7
Average	25.7	37.3

RULER: a retrieval-based benchmark for long-context understanding

Model	4K	8K	16K	32K	64K	128K	Average
Original	86.7	78.1	75.6	70.3	58.9	43.3	68.8
June 2024 update	92.4	91.1	90.8	87.9	79.8	65.6	84.6
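
The RULER figures above can be checked with a few lines of arithmetic. The sketch below copies the scores from the table, recomputes the two averages, and lists the gain at each context length. The variable and function names are illustrative only.

```python
# RULER scores copied from the table above.
CONTEXT_LENGTHS = ["4K", "8K", "16K", "32K", "64K", "128K"]
ORIGINAL = [86.7, 78.1, 75.6, 70.3, 58.9, 43.3]
JUNE_2024_UPDATE = [92.4, 91.1, 90.8, 87.9, 79.8, 65.6]


def mean(values):
    """Arithmetic mean of a list of scores."""
    return sum(values) / len(values)


def gains_by_length(old, new, labels):
    """Score difference (new minus old) per context length, rounded to 0.1."""
    return {label: round(n - o, 1) for label, o, n in zip(labels, old, new)}


if __name__ == "__main__":
    print("original average:", round(mean(ORIGINAL), 1))
    print("updated average:", round(mean(JUNE_2024_UPDATE), 1))
    print(gains_by_length(ORIGINAL, JUNE_2024_UPDATE, CONTEXT_LENGTHS))
```

The recomputed averages round to 68.8 and 84.6, matching the table, and the gain grows with context length, from 5.7 points at 4K to 22.3 points at 128K.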

RepoQA: a benchmark for long-context code understanding

Model	Python	C++	Rust	Java	TypeScript	Average
Original	27	29	40	33	33	32.4
June 2024 update	85	63	72	93	72	77

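Note: to check out the previous version, use the git commit ID bb5bf1e4001277a606e11debca0ef80323e5f824. For model conversion (e.g. GGUF and other formats), we invite the community to experiment with various approaches and share their feedback.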
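
One way to use that commit ID from Python is the revision argument of from_pretrained(), which pins a load to a specific commit of the model repository. The sketch below is an assumption about how a user might apply the note, not a procedure from the model authors. The names pinned_load_kwargs and load_pinned are hypothetical, and transformers is imported lazily so the helpers can be inspected without it.

```python
MODEL_ID = "microsoft/Phi-3-mini-128k-instruct"
# Commit ID of the previous version, as given in the note above.
PREVIOUS_REVISION = "bb5bf1e4001277a606e11debca0ef80323e5f824"


def pinned_load_kwargs(revision: str = PREVIOUS_REVISION) -> dict:
    """Keyword arguments that pin from_pretrained() to one repository commit."""
    return {
        "revision": revision,
        "torch_dtype": "auto",
        "trust_remote_code": True,
    }


def load_pinned(revision: str = PREVIOUS_REVISION):
    """Download and load the model and tokenizer at the given commit."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **pinned_load_kwargs(revision))
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=revision)
    return model, tokenizer
```

Calling load_pinned() downloads the weights, so it is left out of the block itself.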

📦 Installation guide

Example required packages

flash_attn==2.5.8
torch==2.3.1
accelerate==0.31.0
transformers==4.41.2

💻 Usage examples

Advanced usage

To use flash attention, pass attn_implementation="flash_attention_2" when calling AutoModelForCausalLM.from_pretrained():

import torch 
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline 

torch.random.manual_seed(0) 
model = AutoModelForCausalLM.from_pretrained( 
    "microsoft/Phi-3-mini-128k-instruct",  
    device_map="cuda",  
    torch_dtype="auto",  
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # use FlashAttention-2 kernels; requires the flash_attn package
) 

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct") 

messages = [ 
    {"role": "system", "content": "You are a helpful AI assistant."}, 
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"}, 
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."}, 
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}, 
] 

pipe = pipeline( 
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
) 

generation_args = { 
    "max_new_tokens": 500, 
    "return_full_text": False, 
    "temperature": 0.0, 
    "do_sample": False, 
} 

output = pipe(messages, **generation_args) 
print(output[0]['generated_text'])

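📚 Detailed documentation

Tokenizer

Phi-3 Mini-128K-Instruct supports a vocabulary size of up to 32064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, and they can also be extended up to the model's vocabulary size.

Chat format

Given the nature of the training data, the Phi-3 Mini-128K-Instruct model is best suited to prompts in the following chat format. You can provide the prompt as a question using this generic template: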

<|system|>
You are a helpful assistant.<|end|>
<|user|>
Question?<|end|>
<|assistant|>

For example:

<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>

The model generates text after <|assistant|>. For few-shot prompting, the prompt can be formatted as follows:

<|system|>
You are a helpful travel assistant.<|end|>
<|user|>
I am going to Paris, what should I see?<|end|>
<|assistant|>
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.<|end|>
<|user|>
What is so great about #1?<|end|>
<|assistant|>
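
The chat format shown above can be produced from a list of role/content messages with a small helper. This is a minimal sketch that mirrors the templates in this section; format_chat is a hypothetical name, and in practice the tokenizer's own apply_chat_template() method is the more reliable route because it uses the template shipped with the model.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def format_chat(messages, add_generation_prompt=True):
    """Render messages in the <|role|> ... <|end|> layout shown above.

    messages: list of {"role": ..., "content": ...} dictionaries.
    add_generation_prompt: append a trailing <|assistant|> tag so the
    model continues with its reply.
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        parts.append("<|assistant|>\n")
    return "".join(parts)


if __name__ == "__main__":
    print(format_chat([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How to explain Internet for a medieval knight?"},
    ]))
```

For few-shot prompting, earlier user and assistant turns are simply included in the messages list before the final user question.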

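🔧 Technical details

Model

Architecture: Phi-3 Mini-128K-Instruct has 3.8 billion parameters and is a dense decoder-only Transformer model. It is fine-tuned with supervised fine-tuning (SFT) and direct preference optimization (DPO) to align with human preferences and safety guidelines.
Inputs: text. Best suited to prompts in the chat format.
Context length: 128K tokens
GPUs: 512 H100-80G
Training time: 10 days
Training data: 4.9T tokens
Outputs: generated text in response to the input
Dates: our models were trained between May and June 2024
Status: this is a static model trained on an offline dataset with a cutoff date of October 2023. Future versions of the tuned models may be released as we improve the models.
Release date: June 2024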
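
As a rough reading of the figures above, the GPU count, training time, and token count imply an overall training throughput. The sketch below is back-of-the-envelope arithmetic only, and it assumes that all 512 GPUs were busy for the full 10 days, which the source does not state.

```python
# Figures copied from the list above.
NUM_GPUS = 512
TRAINING_DAYS = 10
TRAINING_TOKENS = 4.9e12  # 4.9T tokens


def total_gpu_hours(num_gpus=NUM_GPUS, days=TRAINING_DAYS):
    """GPU-hours, assuming every GPU runs for the whole period."""
    return num_gpus * days * 24


def tokens_per_gpu_second(tokens=TRAINING_TOKENS, num_gpus=NUM_GPUS, days=TRAINING_DAYS):
    """Average tokens processed per GPU per second under the same assumption."""
    return tokens / (total_gpu_hours(num_gpus, days) * 3600)


if __name__ == "__main__":
    print("GPU-hours:", total_gpu_hours())
    print("tokens per GPU-second:", round(tokens_per_gpu_second()))
```

Under that assumption the run comes to 122,880 GPU-hours, or roughly 11 thousand tokens per GPU per second.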

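Datasets

Our training data comes from a variety of sources, totals 4.9 trillion tokens, and combines:

- Publicly available documents rigorously filtered for quality, selected high-quality educational data, and code.
- Newly created synthetic "textbook-like" data for teaching math, coding, common-sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, etc.).
- High-quality chat-format supervised data covering various topics, reflecting human preferences on aspects such as instruction following, truthfulness, honesty, and helpfulness.

We focus on the quality of data that could improve the model's reasoning ability, and we filter publicly available documents so that they contain the right level of knowledge. For example, the result of a Premier League game on a particular day might be good training data for frontier models, but for small models we remove such information to leave more model capacity for reasoning. More details about the data can be found in the Phi-3 Technical Report.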

Fine-tuning

A basic example of multi-GPU supervised fine-tuning (SFT) using the TRL and Accelerate modules is available.

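Benchmarks

We report results for Phi-3-Mini-128K-Instruct in completion format on standard open-source benchmarks that measure the model's reasoning ability (both common-sense and logical reasoning). We compare it with Mistral-7b-v0.1, Mixtral-8x7b, Gemma 7B, Llama-3-8B-Instruct, and GPT-3.5.

All reported numbers were produced with the exact same pipeline so that they are comparable. They may differ from other published numbers because of slightly different choices in the evaluation.

As is now standard, we use few-shot prompts to evaluate the models at temperature 0. The prompts and the number of shots are part of a Microsoft internal tool for evaluating language models, and in particular we did no optimization to the pipeline for Phi-3. More specifically, we did not change prompts, pick different few-shot examples, change the prompt format, or do any other form of optimization for the model.

The number of k-shot examples is listed per benchmark.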

Category	Benchmark	Phi-3-Mini-128K-Ins	Gemma-7B	Mistral-7B	Mixtral-8x7B	Llama-3-8B-Ins	GPT3.5-Turbo-1106
Popular aggregated benchmark	AGI Eval 5-shot	39.5	42.1	35.1	45.2	42	48.4
	MMLU 5-shot	69.7	63.6	61.7	70.5	66.5	71.4
	BigBench Hard 3-shot	72.1	59.6	57.3	69.7	51.5	68.3
Language understanding	ANLI 7-shot	52.3	48.7	47.1	55.2	57.3	58.1
	HellaSwag 5-shot	70.5	49.8	58.5	70.4	71.1	78.8
Reasoning	ARC Challenge 10-shot	85.5	78.3	78.6	87.3	82.8	87.4
	BoolQ 0-shot	77.1	66	72.2	76.6	80.9	79.1
	MedQA 2-shot	56.4	49.6	50	62.2	60.5	63.4
	OpenBookQA 10-shot	78.8	78.6	79.8	85.8	82.6	86
	PIQA 5-shot	80.1	78.1	77.7	86	75.7	86.6
	GPQA 0-shot	29.7	2.9	15	6.9	32.4	29.9
	Social IQA 5-shot	74.7	65.5	74.6	75.9	73.9	68.3
	TruthfulQA (MC2) 10-shot	64.8	52.1	53	60.1	63.2	67.7
	WinoGrande 5-shot	71.0	55.6	54.2	62	65	68.8
Factual knowledge	TriviaQA 5-shot	57.8	72.3	75.2	82.2	67.7	85.8
Math	GSM8K CoT 8-shot	85.3	59.8	46.4	64.7	77.4	78.1
Code generation	HumanEval 0-shot	60.4	34.1	28.0	37.8	60.4	62.2
	MBPP 3-shot	70.0	51.5	50.8	60.2	67.7	77.8
Average		66.4	56.0	56.4	64.4	65.5	70.3

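Long context: Phi-3 Mini-128K-Instruct supports a 128K context length, so the model can handle several long-context tasks, including long document/meeting summarization and long-document QA.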

Benchmark	Phi-3 Mini-128K-Instruct	Mistral-7B	Mixtral 8x7B	LLaMA-3-8B-Instruct
GovReport	25.3	4.9	20.3	10.3
QMSum	21.9	15.5	20.6	2.9
Qasper	41.6	23.5	26.6	8.1
SQuALITY	24.1	14.7	16.2	25
SummScreenFD	16.8	9.3	11.3	5.1
Average	25.9	13.6	19.0	10.3

The table below takes a closer look at the different categories across 100 public benchmark datasets:

Category	Phi-3-Mini-128K-Instruct	Gemma-7B	Mistral-7B	Mixtral 8x7B	Llama-3-8B-Instruct	GPT-3.5-Turbo
Popular aggregated benchmark	60.6	59.4	56.5	66.2	59.9	67.0
Reasoning	69.4	60.3	62.8	68.1	69.6	71.7
Language understanding	57.5	57.6	52.5	66.1	63.2	67.7
Code generation	61.0	45.6	42.9	52.7	56.4	70.4
Math	51.6	35.8	25.4	40.3	41.1	52.8
Factual knowledge	35.8	46.7	49.8	58.6	43.1	63.4
Multilingual	56.4	66.5	57.4	66.7	66.6	71.0
Robustness	61.1	38.4	40.6	51.0	64.5	69.3