Phi-3-small-8k-instruct open-source model: lightweight, efficient inference for English-language commercial and research use

Phi 3 Small 8k Instruct

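Developed by Microsoft

Phi-3-Small-8K-Instruct is a lightweight open-source model with 7 billion parameters. It focuses on high-quality reasoning, supports an 8K-token context length, and is intended for commercial and research use in English.

Large language model

Transformers

License: MIT (other open-source license) #lightweight-inference #multilingual-code-generation #8K-long-text-processing

Downloads: 22.92k

Published: 5/7/2024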

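Model overview

A lightweight, state-of-the-art model trained on the Phi-3 datasets and optimized for reasoning. It is particularly suited to resource-constrained environments and latency-sensitive scenarios.

Model highlights

Lightweight and efficient

A 7-billion-parameter design suited to resource-constrained environments and latency-sensitive scenarios

Strong reasoning

Performs well on common sense, language understanding, math, code, and logical reasoning

Safety alignment

Post-trained with supervised fine-tuning and direct preference optimization (DPO) for instruction following and safety

Long-context support

The Phi-3-Small family comes in 8K and 128K context-length variants; this model is the 8K variant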

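Model capabilities

Text generation

Code generation

Mathematical reasoning

Logical reasoning

Common-sense question answering

Language understanding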

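Use cases

Commercial applications

Customer service assistant

Generates fast, accurate customer service responses

Improves response time and service quality

Content generation

Automatically generates marketing copy, product descriptions, and similar content

Makes content creation more efficient

Research and development

AI research

Serves as a building block for language model research

Speeds up progress in AI technology

Educational tools

Assists with learning programming and math

Provides a personalized learning experience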

🚀 Phi-3-Small-8K-Instruct

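Phi-3-Small-8K-Instruct is a lightweight, state-of-the-art open model with 7 billion parameters. It was trained on the Phi-3 datasets, which combine synthetic data with high-quality publicly available web data, and it performs well on benchmarks covering common sense, language understanding, math, code, long context, and logical reasoning.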

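🚀 Quick start

Install dependencies

Phi-3-Small-8K-Instruct has been integrated into the development version (4.40.2) of transformers. Until an official release that includes it is available through pip, make sure you do the following:

Install tiktoken (0.6.0) and triton (2.3.0).
When loading the model, pass trust_remote_code=True to the from_pretrained() function.
Update your local transformers to the development version: pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers. This command is an alternative to cloning the repository and installing from source.

You can verify the installed transformers version with pip list | grep transformers.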
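The same check can be done from Python, which also covers tiktoken and triton. This is a minimal sketch using only the standard library; the helper name installed_versions and the EXPECTED table are illustrative, not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions named in the installation notes. transformers is left unpinned
# because the notes ask for a development build installed from source.
EXPECTED = {"tiktoken": "0.6.0", "triton": "2.3.0", "transformers": None}


def installed_versions(packages=EXPECTED):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, have in installed_versions().items():
        want = EXPECTED[name] or "development build"
        print(f"{name}: installed={have or 'missing'}, expected={want}")
```

The script only reports what it finds; deciding whether a different version is acceptable is left to the reader.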

Run the model

Phi-3-Small-8K-Instruct is also available in Azure AI. The following sample code runs the model on a GPU:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Seed PyTorch's random number generator for reproducibility.
torch.random.manual_seed(0)
model_id = "microsoft/Phi-3-small-8k-instruct"
# trust_remote_code=True allows transformers to run the model code shipped in the model repository.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,
)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
device = torch.cuda.current_device()
model = model.to(device)
# Passing trust_remote_code=True here as well avoids an interactive prompt
# if the tokenizer is also defined by code in the repository.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# A multi-turn conversation; the model answers the last user message.
messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device
)

# Greedy decoding: with do_sample=False the temperature value has no effect.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

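⚠️ Important

Some applications or frameworks may not add the BOS token (<|endoftext|>) at the start of the conversation. Make sure it is included, because this gives more reliable results.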
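When prompts are assembled by hand as plain strings, a small guard can make sure the BOS token is present. The sketch below is illustrative: ensure_bos is a hypothetical helper, and it assumes the framework downstream does not add a BOS token of its own (if it does, this would duplicate it).

```python
BOS_TOKEN = "<|endoftext|>"  # BOS token named in the note above


def ensure_bos(prompt: str) -> str:
    """Prepend the BOS token unless the prompt already starts with it."""
    if prompt.startswith(BOS_TOKEN):
        return prompt
    return BOS_TOKEN + prompt


print(ensure_bos("<|user|>\nHello<|end|>\n<|assistant|>"))
```

The function is idempotent, so applying it to a prompt that already has the token leaves the prompt unchanged.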

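✨ Key features

Lightweight design: 7 billion parameters, suited to memory- or compute-constrained environments and low-latency scenarios.
Strong reasoning: performs well on code, math, and logical reasoning, and can be used in general-purpose AI systems and applications.
Multilingual support: the tokenizer vocabulary has up to 100352 tokens, and 10% of the training data is multilingual.
Flexible context length: available in 8K and 128K context-length variants.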

💻 Usage examples

Advanced usage

Depending on the application, adjust the parameters in generation_args from the sample code, such as max_new_tokens, temperature, and do_sample, to change how text is generated.
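As one way to organize such adjustments, the sketch below builds either a greedy or a sampling configuration. The helper make_generation_args and the temperature values 0.7 and 0.9 are illustrative assumptions, not recommendations from the model card; the dictionary keys are the same ones used in the sample code.

```python
def make_generation_args(max_new_tokens=500, sample=False, temperature=0.7):
    """Build keyword arguments for a transformers text-generation pipeline call."""
    args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "do_sample": sample,
    }
    if sample:
        # Temperature only matters when sampling, and it must be positive.
        if temperature <= 0:
            raise ValueError("temperature must be positive when sampling")
        args["temperature"] = temperature
    return args


greedy = make_generation_args()  # deterministic decoding
creative = make_generation_args(max_new_tokens=200, sample=True, temperature=0.9)
print(greedy)
print(creative)
```

Either dictionary can be unpacked into the pipeline call in the same way as generation_args, for example pipe(messages, **creative).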

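📚 Detailed documentation

Model summary

Phi-3-Small-8K-Instruct belongs to the Small version of the Phi-3 family, which comes in 8K and 128K context-length variants. The model went through a post-training process of supervised fine-tuning (SFT) and direct preference optimization (DPO) for instruction following and safety.

Intended uses

Primary use cases

Broad commercial and research use in English.
Memory- or compute-constrained environments, low-latency scenarios, and general-purpose AI systems and applications that need strong reasoning.

Use case considerations

The model is not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate accuracy, safety, and fairness issues before using the model in a specific downstream use case, particularly in high-risk scenarios.
Developers should be aware of and comply with the laws and regulations that apply to their use case (including privacy and trade compliance laws).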

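Tokenizer

Phi-3-Small-8K-Instruct supports a vocabulary of up to 100352 tokens.

Chat format

Given the nature of its training data, Phi-3-Small-8K-Instruct works best with prompts in the following chat format: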

<|endoftext|><|user|>\nQuestion <|end|>\n<|assistant|>

For example:

<|endoftext|><|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>

For few-shot prompting, the prompt can be formatted as follows:

<|endoftext|><|user|>
I am going to Paris, what should I see?<|end|>
<|assistant|>
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:\n\n1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.\n2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.\n3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.\n\nThese are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.<|end|>
<|user|>
What is so great about #1?<|end|>
<|assistant|>
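The format above can be produced from a list of role/content messages. The following is a hand-rolled sketch for illustration only: format_chat is a hypothetical helper that follows the layout shown in this section, and in practice the chat template bundled with the tokenizer would normally do this job.

```python
BOS = "<|endoftext|>"


def format_chat(messages):
    """Render [{"role": ..., "content": ...}, ...] in the chat format shown above."""
    parts = [BOS]
    for message in messages:
        role = message["role"]
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role}")
        # Each turn is: role tag, newline, content, end tag, newline.
        parts.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    # Finish with an open assistant tag so the model writes the next reply.
    parts.append("<|assistant|>")
    return "".join(parts)


prompt = format_chat([
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
])
print(prompt)
```

For a few-shot prompt, pass the earlier user and assistant turns in order; the function appends the final <|assistant|> tag itself.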

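🔧 Technical details

Model

Architecture: Phi-3-Small-8K-Instruct has 7 billion parameters and is a dense decoder-only Transformer model that alternates dense and block-sparse attention layers.
Inputs: text, best suited to prompts in the chat format.
Context length: 8K tokens.
GPUs: 1024 NVIDIA H100-80G.
Training time: 18 days.
Training data: 4.8 trillion tokens.
Outputs: generated text in response to the input.
Dates: the model was trained between February and April 2024.
Status: this is a static model trained on an offline dataset with a cutoff date of October 2023. Newer versions of the tuned model may be released as the models are improved.
Release date: the model weights were released on May 21, 2024.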
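The GPU count, training time, and token count listed above allow a rough throughput estimate. This is back-of-the-envelope arithmetic, not a figure reported in the source: it assumes all 1024 GPUs were busy for the full 18 days and that the 4.8 trillion tokens were processed exactly once in that window.

```python
GPUS = 1024       # NVIDIA H100-80G, from the list above
DAYS = 18         # training time, from the list above
TOKENS = 4.8e12   # training data size, from the list above

# Assumption: every GPU runs continuously for the whole training period.
gpu_hours = GPUS * DAYS * 24
tokens_per_gpu_second = TOKENS / (gpu_hours * 3600)

print(f"GPU-hours: {gpu_hours:,}")                                 # 442,368
print(f"Tokens per GPU per second: {tokens_per_gpu_second:,.0f}")  # 3,014
```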

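Datasets

The training data comes from a variety of sources and totals 4.8 trillion tokens (including 10% multilingual data). It is a combination of:

Publicly available documents rigorously filtered for quality, selected high-quality educational data, and code.
Newly created synthetic, "textbook-like" data for teaching math, coding, common-sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, etc.).
High-quality chat-format supervised data covering various topics, reflecting human preferences on aspects such as instruction following, truthfulness, honesty, and helpfulness.

Benchmarks

On standard open-source benchmarks, the reasoning ability of Phi-3-Small-8K-Instruct (both common-sense and logical reasoning) was evaluated and compared with Mixtral-8x7b, Gemini-Pro, Gemma 7B, Llama-3-8B-Instruct, GPT-3.5-Turbo-1106, and GPT-4-Turbo-1106. The results are as follows: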

Benchmark	Phi-3-Small-8K-Instruct 7b	Gemma 7B	Mixtral 8x7B	Llama-3-Instruct 8b	GPT-3.5-Turbo version 1106	Gemini Pro	GPT-4-Turbo version 1106 (Chat)
AGI Eval 5-shot	45.1	42.1	45.2	42.0	48.4	49.0	59.6
MMLU 5-shot	75.7	63.6	70.5	66.5	71.4	66.7	84.0
BigBench Hard 3-shot	79.1	59.6	69.7	51.5	68.3	75.6	87.7
ANLI 7-shot	58.1	48.7	55.2	57.3	58.1	64.2	71.7
HellaSwag 5-shot	77.0	49.8	70.4	71.1	78.8	76.2	88.3
ARC Challenge 10-shot	90.7	78.3	87.3	82.8	87.4	88.3	95.6
ARC Easy 10-shot	97.0	91.4	95.6	93.4	96.3	96.1	98.8
BoolQ 2-shot	84.8	66.0	76.6	80.9	79.1	86.4	91.3
CommonsenseQA 10-shot	80.0	76.2	78.1	79.0	79.6	81.8	86.7
MedQA 2-shot	65.4	49.6	62.2	60.5	63.4	58.2	83.7
OpenBookQA 10-shot	88.0	78.6	85.8	82.6	86.0	86.4	93.4
PIQA 5-shot	86.9	78.1	86.0	75.7	86.6	86.2	90.1
Social IQA 5-shot	79.2	65.5	75.9	73.9	68.3	75.4	81.7
TruthfulQA (MC2) 10-shot	70.2	52.1	60.1	63.2	67.7	72.6	85.2
WinoGrande 5-shot	81.5	55.6	62.0	65.0	68.8	72.2	86.7
TriviaQA 5-shot	58.1	72.3	82.2	67.7	85.8	80.2	73.3
GSM8K Chain of Thought 8-shot	89.6	59.8	64.7	77.4	78.1	80.4	94.2
HumanEval 0-shot	61.0	34.1	37.8	60.4	62.2	64.4	79.9
MBPP 3-shot	71.7	51.5	60.2	67.7	77.8	73.2	86.7
Average	75.7	61.8	69.8	69.4	74.3	75.4	85.2
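The last row of the table is consistent with a plain unweighted mean of the 19 benchmark scores. The short check below recomputes it for the Phi-3-Small-8K-Instruct column; treating the row as an unweighted mean is an assumption, since the source does not say how it was computed.

```python
# Phi-3-Small-8K-Instruct column, copied from the 19 benchmark rows above.
phi3_small_scores = [
    45.1, 75.7, 79.1, 58.1, 77.0, 90.7, 97.0, 84.8, 80.0, 65.4,
    88.0, 86.9, 79.2, 70.2, 81.5, 58.1, 89.6, 61.0, 71.7,
]

average = sum(phi3_small_scores) / len(phi3_small_scores)
print(round(average, 1))  # 75.7
```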

Performance by category

Category	Phi-3-Small-8K-Instruct 7b	Gemma 7B	Mixtral 8x7B	Llama-3-Instruct 8b	GPT-3.5-Turbo version 1106	Gemini Pro	GPT-4-Turbo version 1106 (Chat)
Popular aggregated benchmarks	71.1	59.4	66.2	59.9	67.0	67.5	80.5
Reasoning	82.4	69.1	77.0	75.7	78.3	80.4	89.3
Language understanding	70.6	58.4	64.9	65.4	70.4	75.3	81.6
Code generation	60.7	45.6	52.7	56.4	70.4	66.7	76.1
Math	51.6	35.8	40.3	41.1	52.8	50.9	67.1
Factual knowledge	38.6	46.7	58.6	43.1	63.4	54.6	45.9
Multilingual	62.5	63.2	63.4	65.0	69.1	76.5	82.0
Robustness	72.9	38.4	51.0	64.5	69.3	69.7	84.6

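Software

Hardware

By default, the Phi-3-Small model uses Flash Attention 2 and Triton block-sparse attention, which require certain types of GPU hardware to run. It has been tested on the following GPU types:

NVIDIA A100
NVIDIA A6000
NVIDIA H100

For optimized inference on GPU, CPU, and mobile devices, use the ONNX version of the 8K model.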
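A script can warn early when it runs on a GPU outside this tested list. The sketch below is a heuristic built on assumptions: is_tested_gpu and current_gpu_name are hypothetical helpers, matching the model name inside the device string reported by PyTorch is a guess at how the names appear, and a GPU that is not on the list may still work.

```python
import re

# GPU types the model card lists as tested.
TESTED_GPU_PATTERN = re.compile(r"\b(A100|A6000|H100)\b", re.IGNORECASE)


def is_tested_gpu(device_name: str) -> bool:
    """True if the device name mentions one of the tested GPU types."""
    return TESTED_GPU_PATTERN.search(device_name) is not None


def current_gpu_name():
    """Return the CUDA device name, or None if torch or a GPU is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_name(torch.cuda.current_device())


name = current_gpu_name()
if name is None:
    print("No CUDA GPU detected")
elif not is_tested_gpu(name):
    print(f"Warning: {name} is not among the tested GPU types")
else:
    print(f"{name} is among the tested GPU types")
```

The word-boundary match keeps similarly named devices, such as an A1000, from being mistaken for an A100.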

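Cross-platform support

The ONNX Runtime ecosystem now supports running Phi-3 Small models across platforms and hardware. Optimized Phi-3 models are also published in ONNX format and run on CPU and GPU across devices, including server platforms, Windows, Linux, and Mac desktops, and mobile CPUs, using the precision best suited to each target. DirectML (DML) GPU acceleration is supported on Windows desktop GPUs (AMD, Intel, and NVIDIA). The following optimized configurations have been added:

ONNX model for int4 DML: quantized to int4 via AWQ.
ONNX model for fp16 CUDA.
ONNX model for int4 CUDA: quantized to int4 via RTN.
ONNX model for int4 CPU and mobile: quantized to int4 via RTN.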
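To get a feel for what these precisions mean for memory, the sketch below estimates the size of the weights alone. It is a rough calculation under stated assumptions: it uses the rounded 7 billion parameter figure, counts 2 bytes per fp16 weight and half a byte per int4 weight, and ignores quantization scales, the KV cache, and activations, so real footprints will be larger.

```python
PARAMS = 7e9  # rounded parameter count from the model card

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int4": 0.5}


def weight_gigabytes(params: float, precision: str) -> float:
    """Weights-only size in gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_gigabytes(PARAMS, precision):.1f} GB")
# fp16: 14.0 GB
# int4: 3.5 GB
```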

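📄 License

The model is released under the MIT license.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's trademark and brand guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.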