Phi-4-mini-instruct-float8dq open-source model: lower GPU memory use, faster inference, little accuracy impact

Phi 4 Mini Instruct Float8dq

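Developed by pytorch

The Phi-4-mini-instruct model, quantized with torchao using float8 dynamic activation and float8 weight quantization. On an H100 it uses 36% less GPU memory and runs 15-20% faster, with almost no loss in accuracy.

Large language model

Transformers

License (other open-source): MIT #float8-quantization #efficient-inference #multilingual-chat

Downloads: 1,006

Released: 4/8/2025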

Model overview

A quantized version of Microsoft Phi-4-mini-instruct for text generation, with support for multilingual interaction and mathematical reasoning.

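Model features

Efficient quantization

Uses float8 dynamic activation and float8 weight quantization, cutting peak GPU memory use by 36% on an H100

Performance optimization

15-20% faster inference on an H100

Multi-task support

Supports code generation, mathematical reasoning, and conversation

Accuracy retention

Quantization costs very little accuracy (the overall benchmark score drops by only 0.24 points, from 55.35 to 55.11)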

Model capabilities

Text generation

Math problem solving

Code generation

Multilingual conversation

Logical reasoning

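Use cases

Education support

Math problem solving

Helps students understand how to solve algebraic equations

Can correctly solve equations such as 2x + 3 = 7

Creative generation

Recipe suggestions

Generates creative recipes that combine fruits

Offers concrete ideas such as a banana and dragon fruit smoothie

Technical Q&A

Programming help

Explains code logic or generates code snippets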

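🚀 Phi-4-mini-instruct-float8dq model

The Phi-4-mini-instruct-float8dq model is quantized by the PyTorch team using torchao, with float8 dynamic activation and float8 weight quantization (per-row granularity). The model can be used directly or served with vLLM. On an H100 it uses 36% less GPU memory and runs 15-20% faster, with almost no impact on accuracy.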

🚀 Quick start

This project covers inference with vLLM and Transformers, model quantization, and model evaluation. The sections below describe how to use each.

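✨ Key features

The Phi-4-mini model is quantized with torchao using float8 dynamic activation and float8 weight quantization (per-row granularity).
The quantized model can be used directly or served with vLLM. On an H100 it uses 36% less GPU memory and runs 15-20% faster, with almost no impact on accuracy.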
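
To make "per-row granularity" more concrete, here is a minimal sketch in plain Python that simulates float8 rounding with one scale per row versus one scale for the whole tensor. It assumes the e4m3 float8 format (3 mantissa bits, largest finite value 448) and is not torchao's implementation; `round_to_e4m3`, `quantize_dequantize`, and the toy weights are hypothetical and exist only for illustration.

```python
import math

E4M3_MAX = 448.0    # assumed largest finite value of the float8 e4m3 format
E4M3_MIN_EXP = -6   # assumed exponent of the smallest normal e4m3 value
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated e4m3 grid, saturating at the max."""
    if x == 0.0:
        return 0.0
    # frexp returns |x| = m * 2**e with 0.5 <= m < 1, so floor(log2|x|) is e - 1.
    exp = max(math.frexp(abs(x))[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    rounded = round(x / step) * step
    return max(-E4M3_MAX, min(E4M3_MAX, rounded))


def quantize_dequantize(rows, per_row: bool):
    """Scale each value into float8 range, round it, then scale it back."""
    tensor_amax = max(abs(v) for row in rows for v in row)
    result = []
    for row in rows:
        amax = max(abs(v) for v in row) if per_row else tensor_amax
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        result.append([round_to_e4m3(v / scale) * scale for v in row])
    return result


def max_relative_error(rows, approx):
    """Largest relative error over all non-zero entries."""
    return max(
        abs(a - v) / abs(v)
        for row, approx_row in zip(rows, approx)
        for v, a in zip(row, approx_row)
        if v != 0
    )


# Toy weights: one row is about 100,000 times smaller than the other.
weights = [
    [1e-4, -2e-4, 3e-5],
    [10.0, -20.0, 5.0],
]
per_row_err = max_relative_error(weights, quantize_dequantize(weights, per_row=True))
per_tensor_err = max_relative_error(weights, quantize_dequantize(weights, per_row=False))
print(f"max relative error, per-row scale:    {per_row_err:.4f}")
print(f"max relative error, per-tensor scale: {per_tensor_err:.4f}")
```

With a single per-tensor scale, the small row is pushed toward the bottom of the float8 range and its smallest entry rounds to zero. A separate scale per row keeps every row within the format's normal range, which is the intuition behind choosing per-row granularity.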

📦 Installation

Install the vLLM nightly build

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install torchao

Other dependencies for inference and quantization

pip install git+https://github.com/huggingface/transformers@main
pip install torch
pip install accelerate

Install the torchao nightly build (for quantization)

pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

Dependencies for model evaluation

lm-eval must be installed from source:

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

Dependencies for pushing to the Hugging Face Hub

pip install -U "huggingface_hub[cli]"

Dependencies for vLLM benchmarking

git clone git@github.com:vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .

Download the ShareGPT dataset

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

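💻 Usage examples

Basic usage

Inference with vLLM

# Shell: install the vLLM nightly build to pick up the latest changes
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install torchao

# Python: save the code below as a script and run it
from vllm import LLM, SamplingParams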

# Sample prompts
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create the sampling parameters object
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


if __name__ == '__main__':
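    # Create the LLM instance
    llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq")
    # Generate text from the prompts
    # The output is a list of RequestOutput objects holding the prompt, the generated text, and other information
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs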
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)

Inference with Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model_path = "pytorch/Phi-4-mini-instruct-float8dq"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Quantizing the model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-4-mini-instruct"

from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to the Hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual test
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])

Advanced usage

Serving with vLLM

vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
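
Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes vLLM's OpenAI-compatible chat endpoint at its default address, `http://localhost:8000/v1`; adjust `BASE_URL` if the server was started with a different host or port. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumptions: default vLLM address, and the model name passed to `vllm serve`.
BASE_URL = "http://localhost:8000/v1"
MODEL = "pytorch/Phi-4-mini-instruct-float8dq"


def build_chat_request(user_message: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What about solving the equation 2x + 3 = 7?"))` should print the model's answer. Nothing is sent until `chat` is called.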

Model evaluation

# Evaluate the baseline model
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

# Evaluate the quantized model
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8

Measuring peak memory usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

# Use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
model_id = "pytorch/Phi-4-mini-instruct-float8dq"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

torch.cuda.reset_peak_memory_stats()

prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])

mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")

Model performance benchmarks

# Get the vLLM source
git clone git@github.com:vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .

# Latency benchmark - baseline model
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1

# Latency benchmark - quantized model
VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1

# Serving benchmark - baseline model (start the server in one terminal, then run the benchmark script in another)
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1

# Serving benchmark - quantized model (start the server in one terminal, then run the benchmark script in another)
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1

📚 Detailed documentation

Model quality evaluation

We evaluate the quality of the quantized model with lm-evaluation-harness. The table below compares the baseline and quantized models on each benchmark:

Benchmark	Phi-4-mini-instruct	Phi-4-mini-instruct-float8dq
Popular aggregated benchmarks
mmlu (0-shot)	66.73	66.61
mmlu_pro (5-shot)	46.43	44.58
Reasoning
arc_challenge (0-shot)	56.91	56.66
gpqa_main_zeroshot	30.13	29.46
HellaSwag	54.57	54.55
openbookqa	33.00	33.60
piqa (0-shot)	77.64	77.48
social_iqa	49.59	49.28
truthfulqa_mc2 (0-shot)	48.39	48.09
winogrande (0-shot)	71.11	72.77
Multilingual
mgsm_en_cot_en	60.8	60.0
Math
gsm8k (5-shot)	81.88	80.89
mathqa (0-shot)	42.31	42.51
Overall	55.35	55.11
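
The overall row appears to be the unweighted mean of the 13 benchmark scores. The short sketch below recomputes it from the numbers in the table; the dictionary layout and variable names are illustrative only.

```python
# (baseline, quantized) scores copied from the table above
scores = {
    "mmlu": (66.73, 66.61),
    "mmlu_pro": (46.43, 44.58),
    "arc_challenge": (56.91, 56.66),
    "gpqa_main_zeroshot": (30.13, 29.46),
    "hellaswag": (54.57, 54.55),
    "openbookqa": (33.00, 33.60),
    "piqa": (77.64, 77.48),
    "social_iqa": (49.59, 49.28),
    "truthfulqa_mc2": (48.39, 48.09),
    "winogrande": (71.11, 72.77),
    "mgsm_en_cot_en": (60.8, 60.0),
    "gsm8k": (81.88, 80.89),
    "mathqa": (42.31, 42.51),
}

# Unweighted mean over all benchmarks for each model
baseline_avg = sum(base for base, _ in scores.values()) / len(scores)
quantized_avg = sum(quant for _, quant in scores.values()) / len(scores)

# Benchmarks where the quantized model scores higher than the baseline
improved = sorted(name for name, (base, quant) in scores.items() if quant > base)

print(f"baseline average:  {baseline_avg:.2f}")
print(f"quantized average: {quantized_avg:.2f}")
print("quantized model scores higher on:", ", ".join(improved))
```

The gap between the two averages is a difference in score points, not a relative percentage, and a few benchmarks move in the quantized model's favor, which suggests part of the per-task variation is noise.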

Peak memory usage

Peak memory usage of the baseline and quantized models during inference:

Metric	Phi-4-mini-instruct	Phi-4-mini-instruct-float8dq
Peak memory (GB)	8.91	5.70 (36% reduction)
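
As a quick check, the reduction figure follows directly from the two measurements; the variable names below are illustrative.

```python
baseline_gb = 8.91   # peak memory of the baseline model, from the table
quantized_gb = 5.70  # peak memory of the float8dq model, from the table

# Fraction of the baseline's peak memory that quantization saves
reduction = 1 - quantized_gb / baseline_gb
print(f"Peak memory reduction: {reduction:.1%}")
```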

Model performance

Performance of the baseline and quantized models on an H100 machine:

Benchmark	Phi-4-mini-instruct	Phi-4-mini-instruct-float8dq
Latency (batch_size=1)	1.64s	1.41s (16% speedup)
Latency (batch_size=128)	3.1s	2.72s (14% speedup)
Serving (num_prompts=1)	1.35 req/s	1.57 req/s (16% speedup)
Serving (num_prompts=1000)	66.68 req/s	80.53 req/s (21% speedup)
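
The speedup percentages are consistent with dividing the slower time by the faster one for latency, and the higher throughput by the lower one for serving. The sketch below reproduces them under that assumption; the two helper functions are hypothetical.

```python
def latency_speedup(baseline_s: float, quantized_s: float) -> float:
    """Speedup from a lower latency: how much more work fits in the same time."""
    return baseline_s / quantized_s - 1


def throughput_speedup(baseline_rps: float, quantized_rps: float) -> float:
    """Speedup from a higher request rate."""
    return quantized_rps / baseline_rps - 1


# Measurements copied from the table above
results = {
    "latency, batch_size=1": latency_speedup(1.64, 1.41),
    "latency, batch_size=128": latency_speedup(3.1, 2.72),
    "serving, num_prompts=1": throughput_speedup(1.35, 1.57),
    "serving, num_prompts=1000": throughput_speedup(66.68, 80.53),
}

for name, speedup in results.items():
    print(f"{name}: {speedup:.0%} speedup")
```

Note that a 16% speedup in this sense corresponds to a latency that is about 14% shorter (1.41s versus 1.64s), so the two ways of quoting the same result should not be mixed.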