Phi-3 Mini 128K Instruct open-source model: lightweight, strong reasoning, 128K context support

Phi 3 Mini 128k Instruct

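Developed by Microsoft

Phi-3 Mini 128K Instruct is a lightweight open-source model with 3.8 billion parameters that focuses on reasoning and supports a 128K context length.

Large language model

Transformers

Supports multiple languages | License: MIT | #LightweightReasoning #LongContextProcessing #CodeGeneration

Downloads: 399.68k

Released: 4/22/2024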

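Model overview

The model is part of the Phi-3 family. It was post-trained to improve instruction following and safety, and it performs well on common sense, language understanding, math, coding, and logical reasoning.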

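Model features

Long context support

Supports a context length of 128K tokens, which suits long documents and complex conversations.

Lightweight and efficient

With only 3.8 billion parameters, it still delivers strong inference performance in resource-constrained environments.

Multi-domain capability

Performs particularly well in areas that require strong reasoning, such as code, math, and logic.

Safety tuning

Specifically trained to follow safety measures and responsible AI guidelines.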

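Model capabilities

Text generation

Code understanding and generation

Math problem solving

Logical reasoning

Long document processing

Multi-turn conversation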

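Use cases

Development assistance

Code generation and explanation

Helps developers generate code snippets or explain complex code logic

Improves development efficiency and flattens the learning curve

Education

Math problem solving

Answers a range of math problems and provides the solution steps

Supports math learning and teaching

Business analysis

Long document summarization

Processes and analyzes long business documents

Extracts key information quickly to speed up decision making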

🚀 Phi-4

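Phi-4 is available in several variants: a multimodal instruct version and a mini instruct version, each with a corresponding ONNX version:

[multimodal-instruct | onnx]
[mini-instruct | onnx]

Phi-4 is a capable model for a variety of natural language processing tasks, including efficient and accurate text generation.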

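🚀 Quick start

Phi-3-Mini-128K-Instruct has been integrated into the development version (4.41.3) of transformers. Until the official version is released through pip, do the following:

- When loading the model, pass trust_remote_code=True to from_pretrained().
- Update your local transformers to the development version: pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers. This command is an alternative to cloning the repository and installing from source.

You can verify the current transformers version with pip list | grep transformers.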

Sample code

The following code shows how to quickly run the model on a GPU:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the random seed for reproducibility.
torch.random.manual_seed(0)

# Load the model weights onto the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

# Multi-turn conversation; the last user message is the one to answer.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding: with do_sample=False the temperature setting has no effect.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

# Print only the newly generated assistant reply.
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Note: to use flash attention, pass attn_implementation="flash_attention_2" when calling AutoModelForCausalLM.from_pretrained().

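✨ Key features

Model summary

Phi-3-Mini-128K-Instruct is a lightweight, state-of-the-art open model with 3.8 billion parameters, trained on the Phi-3 datasets. These datasets include both synthetic data and filtered publicly available website data, with an emphasis on high-quality, reasoning-dense content.

The model belongs to the Phi-3 family. The Mini version comes in two variants, 4K and 128K, which is the context length (in tokens) that each supports.

After initial training, the model went through a post-training process that included supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to improve instruction following and adherence to safety measures. On benchmarks covering common sense, language understanding, math, coding, long context, and logical reasoning, Phi-3 Mini-128K-Instruct showed strong performance among models with fewer than 13 billion parameters.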

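Resources and technical documentation

Intended uses

Primary use cases: the model is intended for commercial and research use in English. It suits applications that require:
- Memory/compute constrained environments
- Latency-bound scenarios
- Strong reasoning (especially code, math, and logic)

Use case considerations: our models are not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate accuracy, safety, and fairness issues before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and comply with the laws and regulations that apply to their use case (including privacy and trade compliance laws).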

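Release notes

This update improves the original instruction-tuned Phi-3-mini release based on customer feedback. The model used additional post-training data, which led to substantial gains in long context understanding, instruction following, and structured output. We also improved multi-turn conversation quality, added explicit support for the <|system|> tag, and significantly improved reasoning.

We believe most use cases will benefit from this release, but we encourage users to test it in their particular AI applications. We appreciate the enthusiastic adoption of the Phi-3 model family and continue to welcome all feedback from the community.

The tables below show the improvements of the new release in instruction following, structured output, reasoning, and long context understanding on public and internal benchmark datasets: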

Benchmark	Original	June 2024 update
Instruction Extra Hard	5.7	5.9
Instruction Hard	5.0	5.2
JSON Structure Output	1.9	60.1
XML Structure Output	47.8	52.9
GPQA	25.9	29.7
MMLU	68.1	69.7
Average	25.7	37.3
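
The Average row is consistent with an unweighted mean of the six benchmark rows. A minimal check, using only the figures from the table (the dictionary and helper names are illustrative, not from the source):

```python
# Scores copied from the table above: (original, June 2024 update).
scores = {
    "Instruction Extra Hard": (5.7, 5.9),
    "Instruction Hard": (5.0, 5.2),
    "JSON Structure Output": (1.9, 60.1),
    "XML Structure Output": (47.8, 52.9),
    "GPQA": (25.9, 29.7),
    "MMLU": (68.1, 69.7),
}


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


original_avg = mean(before for before, _ in scores.values())
updated_avg = mean(after for _, after in scores.values())

# The table reports these rounded to one decimal place (25.7 and 37.3).
print(f"original: {original_avg:.2f}, updated: {updated_avg:.2f}")
```

Most of the gain in the average comes from the JSON structured output row, so the average alone overstates the change on the other benchmarks.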

RULER: a retrieval-based benchmark for long context understanding

Model	4K	8K	16K	32K	64K	128K	Average
Original	86.7	78.1	75.6	70.3	58.9	43.3	68.8
June 2024 update	92.4	91.1	90.8	87.9	79.8	65.6	84.6

RepoQA: a benchmark for long context code understanding

Model	Python	C++	Rust	Java	TypeScript	Average
Original	27	29	40	33	33	32.4
June 2024 update	85	63	72	93	72	77

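Note: to use the previous version, check out the git commit ID bb5bf1e4001277a606e11debca0ef80323e5f824. For model conversion (e.g. GGUF and other formats), we invite the community to experiment with various approaches and share feedback.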
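
One way to pin that earlier commit from Python is the `revision` argument of `from_pretrained()`. The sketch below is an illustration under assumptions: the helper names `build_load_kwargs` and `load_model_and_tokenizer` are hypothetical, and it assumes the commit is still available in the `microsoft/Phi-3-mini-128k-instruct` repository.

```python
MODEL_ID = "microsoft/Phi-3-mini-128k-instruct"
# Commit ID quoted in the note above for the pre-update version.
PREVIOUS_REVISION = "bb5bf1e4001277a606e11debca0ef80323e5f824"


def build_load_kwargs(revision=None):
    """Keyword arguments for from_pretrained(); revision=None selects the latest commit."""
    kwargs = {"torch_dtype": "auto", "trust_remote_code": True}
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs


def load_model_and_tokenizer(revision=None):
    """Load model and tokenizer from the same commit (hypothetical helper)."""
    # Imported lazily so build_load_kwargs() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="cuda", **build_load_kwargs(revision)
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=revision)
    return model, tokenizer
```

Calling `load_model_and_tokenizer(PREVIOUS_REVISION)` would then load the earlier weights, which makes it possible to compare both versions on your own prompts.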

📦 Installation guide

Example of required packages

flash_attn==2.5.8
torch==2.3.1
accelerate==0.31.0
transformers==4.41.2
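
To compare an environment against these pins, the standard-library `importlib.metadata` module can be used. This is a minimal sketch: `installed_version` and `compare_with_pins` are illustrative helpers, and ignoring local build suffixes such as `+cu121` when matching is an assumption about what should count as the same version.

```python
from importlib import metadata

# Pins copied from the package list above.
PINNED = {
    "flash_attn": "2.5.8",
    "torch": "2.3.1",
    "accelerate": "0.31.0",
    "transformers": "4.41.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_with_pins(pins, lookup=installed_version):
    """Map each package to (pinned, installed, matches)."""
    report = {}
    for package, pinned in pins.items():
        found = lookup(package)
        # Drop a local build suffix such as "+cu121" before comparing.
        base = found.split("+")[0] if found else None
        report[package] = (pinned, found, base == pinned)
    return report


if __name__ == "__main__":
    for package, (pinned, found, ok) in compare_with_pins(PINNED).items():
        status = "ok" if ok else "differs"
        print(f"{package}: pinned {pinned}, installed {found} ({status})")
```

A mismatch is not necessarily a problem, since the list is given as an example rather than a strict requirement.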

💻 Usage examples

Advanced usage

To use flash attention, pass attn_implementation="flash_attention_2" when calling AutoModelForCausalLM.from_pretrained():

import torch 
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline 

torch.random.manual_seed(0) 
model = AutoModelForCausalLM.from_pretrained( 
    "microsoft/Phi-3-mini-128k-instruct",  
    device_map="cuda",  
    torch_dtype="auto",  
    trust_remote_code=True,
    attn_implementation="flash_attention_2" 
) 

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct") 

messages = [ 
    {"role": "system", "content": "You are a helpful AI assistant."}, 
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"}, 
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."}, 
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}, 
] 

pipe = pipeline( 
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
) 

generation_args = { 
    "max_new_tokens": 500, 
    "return_full_text": False, 
    "temperature": 0.0, 
    "do_sample": False, 
} 

output = pipe(messages, **generation_args) 
print(output[0]['generated_text'])
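
Flash attention needs the separate flash_attn package, so a script that must also run where it is not installed can choose the attention implementation at load time. The sketch below is one possible approach, not part of the original model card: `pick_attn_implementation` and `load_model` are hypothetical helpers, and falling back to "eager" is an assumption. It checks only that the package is importable, not that the GPU supports it.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, otherwise "eager"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


def load_model(model_id="microsoft/Phi-3-mini-128k-instruct"):
    """Load the model with whichever attention implementation is available."""
    # Imported lazily: only needed when the model is actually loaded.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="cuda",
        torch_dtype="auto",
        trust_remote_code=True,
        attn_implementation=pick_attn_implementation(),
    )
```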

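📚 Detailed documentation

Tokenizer

Phi-3 Mini-128K-Instruct supports a vocabulary size of up to 32064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, and they can also be extended up to the model's vocabulary size.

Chat format

Given the nature of the training data, the Phi-3 Mini-128K-Instruct model is best suited for prompts that use the chat format below. You can provide the prompt as a question with this generic template: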

<|system|>
You are a helpful assistant.<|end|>
<|user|>
Question?<|end|>
<|assistant|>

For example:

<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>

The model generates text after <|assistant|>. For few-shot prompting, the prompt can be formatted as follows:

<|system|>
You are a helpful travel assistant.<|end|>
<|user|>
I am going to Paris, what should I see?<|end|>
<|assistant|>
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:\n\n1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.\n2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.\n3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.\n\nThese are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.<|end|>
<|user|>
What is so great about #1?<|end|>
<|assistant|>
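
The chat layout above can be assembled from a list of role/content messages with a few lines of string handling. This is a minimal sketch: `format_chat_prompt` is a hypothetical helper, and the exact newline placement is inferred from the templates shown here. In practice the tokenizer's `apply_chat_template()` method in transformers is the usual route, and its output may differ in whitespace.

```python
def format_chat_prompt(messages, add_generation_prompt=True):
    """Render {"role", "content"} dicts in the <|role|> ... <|end|> layout."""
    allowed_roles = {"system", "user", "assistant"}
    parts = []
    for message in messages:
        role = message["role"]
        if role not in allowed_roles:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # The model continues the text after this tag.
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = format_chat_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
])
print(prompt)
```

For few-shot prompting, earlier user and assistant turns are simply added to the list before the final user message.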

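🔧 Technical details

Model

Architecture: Phi-3 Mini-128K-Instruct has 3.8B parameters and is a dense decoder-only Transformer model. The model is fine-tuned with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to align it with human preferences and safety guidelines.
Inputs: Text. It is best suited for prompts that use the chat format.
Context length: 128K tokens
GPUs: 512 H100-80G
Training time: 10 days
Training data: 4.9T tokens
Outputs: Generated text in response to the input
Dates: Our models were trained between May and June 2024
Status: This is a static model trained on an offline dataset with a cutoff date of October 2023. Future versions of the tuned models may be released as we improve the models.
Release date: June 2024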
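
As a rough sense of scale, the figures above can be combined into GPU-hours and an average token throughput. This is back-of-the-envelope arithmetic only: it assumes all 512 GPUs ran for the full 10 days and that the 4.9T tokens were each processed once, neither of which the source states.

```python
gpus = 512                # H100-80G GPUs, from the list above
training_days = 10        # training time, from the list above
training_tokens = 4.9e12  # 4.9T tokens, from the list above

# Total accelerator time under the full-utilization assumption.
gpu_hours = gpus * training_days * 24

# Average tokens processed per GPU-hour under the single-pass assumption.
tokens_per_gpu_hour = training_tokens / gpu_hours

print(f"GPU-hours: {gpu_hours:,}")
print(f"Tokens per GPU-hour: {tokens_per_gpu_hour:,.0f}")
```

Under these assumptions the run amounts to 122,880 GPU-hours, or a little under 40 million tokens per GPU-hour.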

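Datasets

Our training data includes a wide variety of sources, totaling 4.9 trillion tokens, and is a combination of:

- Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code.
- Newly created synthetic, "textbook-like" data for teaching math, coding, common sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, etc.).
- High-quality chat-format supervised data covering various topics, to reflect human preferences on aspects such as instruction following, truthfulness, honesty, and helpfulness.

We focus on the quality of data that could improve the model's reasoning ability, and we filter publicly available documents so that they contain the right level of knowledge. For example, the result of a Premier League game on a particular day might be good training data for frontier models, but for small models we remove such information to leave more model capacity for reasoning. More details about the data can be found in the Phi-3 Technical Report.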

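Fine-tuning

A basic example of multi-GPU supervised fine-tuning (SFT) with the TRL and Accelerate modules is provided with the model.

Benchmarks

We report results in completion format for Phi-3-Mini-128K-Instruct on standard open-source benchmarks that measure the model's reasoning ability (both common sense reasoning and logical reasoning). We compare it to Mistral-7b-v0.1, Mixtral-8x7b, Gemma 7B, Llama-3-8B-Instruct, and GPT-3.5.

All reported numbers were produced with exactly the same pipeline so that they are comparable. They may differ from other published numbers because of slightly different choices in the evaluation.

As is now standard, we use few-shot prompts to evaluate the models at temperature 0. The prompts and the number of shots are part of a Microsoft internal tool for evaluating language models, and in particular we did not optimize the pipeline for Phi-3. More specifically, we did not change prompts, pick different few-shot examples, change the prompt format, or do any other form of optimization for the model.

The number of few-shot examples is listed for each benchmark.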

Category	Benchmark	Phi-3-Mini-128K-Ins	Gemma-7B	Mistral-7B	Mixtral-8x7B	Llama-3-8B-Ins	GPT3.5-Turbo-1106
Popular aggregated benchmarks	AGI Eval 5-shot	39.5	42.1	35.1	45.2	42	48.4
	MMLU 5-shot	69.7	63.6	61.7	70.5	66.5	71.4
	BigBench Hard 3-shot	72.1	59.6	57.3	69.7	51.5	68.3
Language understanding	ANLI 7-shot	52.3	48.7	47.1	55.2	57.3	58.1
	HellaSwag 5-shot	70.5	49.8	58.5	70.4	71.1	78.8
Reasoning	ARC Challenge 10-shot	85.5	78.3	78.6	87.3	82.8	87.4
	BoolQ 0-shot	77.1	66	72.2	76.6	80.9	79.1
	MedQA 2-shot	56.4	49.6	50	62.2	60.5	63.4
	OpenBookQA 10-shot	78.8	78.6	79.8	85.8	82.6	86
	PIQA 5-shot	80.1	78.1	77.7	86	75.7	86.6
	GPQA 0-shot	29.7	2.9	15	6.9	32.4	29.9
	Social IQA 5-shot	74.7	65.5	74.6	75.9	73.9	68.3
	TruthfulQA (MC2) 10-shot	64.8	52.1	53	60.1	63.2	67.7
	WinoGrande 5-shot	71.0	55.6	54.2	62	65	68.8
Factual knowledge	TriviaQA 5-shot	57.8	72.3	75.2	82.2	67.7	85.8
Math	GSM8K CoT 8-shot	85.3	59.8	46.4	64.7	77.4	78.1
Code generation	HumanEval 0-shot	60.4	34.1	28.0	37.8	60.4	62.2
	MBPP 3-shot	70.0	51.5	50.8	60.2	67.7	77.8
Average		66.4	56.0	56.4	64.4	65.5	70.3

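Long context: Phi-3 Mini-128K-Instruct supports a 128K context length, so the model can handle several long context tasks, including long document/meeting summarization and long document QA.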

Benchmark	Phi-3 Mini-128K-Instruct	Mistral-7B	Mixtral 8x7B	LLaMA-3-8B-Instruct
GovReport	25.3	4.9	20.3	10.3
QMSum	21.9	15.5	20.6	2.9
Qasper	41.6	23.5	26.6	8.1
SQuALITY	24.1	14.7	16.2	25
SummScreenFD	16.8	9.3	11.3	5.1
Average	25.9	13.6	19.0	10.3
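
For long context tasks like these, the prompt and the requested generation length together have to fit within the context window. A minimal budgeting sketch follows. It assumes "128K" means 131,072 tokens, which the source does not spell out, and `fits_in_context` and `count_tokens` are illustrative helpers rather than library functions.

```python
CONTEXT_WINDOW = 128 * 1024  # assumption: "128K" read as 131,072 tokens


def fits_in_context(prompt_tokens, max_new_tokens, context_window=CONTEXT_WINDOW):
    """True if the prompt plus the requested generation length fits in the window."""
    return prompt_tokens + max_new_tokens <= context_window


def count_tokens(text, tokenizer):
    """Count tokens with a Hugging Face tokenizer (calling it returns input_ids)."""
    return len(tokenizer(text)["input_ids"])


# A 120,000-token document leaves room for a 500-token summary.
short_enough = fits_in_context(prompt_tokens=120_000, max_new_tokens=500)
# A 131,000-token document does not.
too_long = fits_in_context(prompt_tokens=131_000, max_new_tokens=500)
print(short_enough, too_long)
```

With a real tokenizer loaded, `count_tokens(document, tokenizer)` would supply the `prompt_tokens` value before the request is sent.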

The table below takes a closer look at different categories across 100 public benchmark datasets:

Category	Phi-3-Mini-128K-Instruct	Gemma-7B	Mistral-7B	Mixtral 8x7B	Llama-3-8B-Instruct	GPT-3.5-Turbo
Popular aggregated benchmarks	60.6	59.4	56.5	66.2	59.9	67.0
Reasoning	69.4	60.3	62.8	68.1	69.6	71.7
Language understanding	57.5	57.6	52.5	66.1	63.2	67.7
Code generation	61.0	45.6	42.9	52.7	56.4	70.4
Math	51.6	35.8	25.4	40.3	41.1	52.8
Factual knowledge	35.8	46.7	49.8	58.6	43.1	63.4
Multilingual	56.4	66.5	57.4	66.7	66.6	71.0
Robustness	61.1	38.4	40.6	51.0	64.5	69.3