DeepSeek-R1-Distill-Llama-70B-FP8-dynamic open-source model: optimized inference performance for more efficient processing

Deepseek R1 Distill Llama 70B FP8 Dynamic

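Developed by RedHatAI

An FP8-quantized version of DeepSeek-R1-Distill-Llama-70B that optimizes inference performance by reducing the number of bits used for weights and activations

Large language model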

Transformers

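License: MIT #FP8 quantization #multi-GPU inference #efficient deployment

Downloads: 45.77k

Released: 2/1/2025

Model overview

This is a quantized version of DeepSeek-R1-Distill-Llama-70B. Quantizing the weights and activations to the FP8 data type reduces disk size and GPU memory requirements and noticeably improves inference performance.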

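Model features

FP8 quantization

Both weights and activations are quantized to the FP8 data type, reducing disk size and GPU memory requirements by approximately 50%

Efficient inference

Up to 1.4x speedup in single-stream deployments and up to 3.0x speedup in multi-stream asynchronous deployments

vLLM compatible

Can be deployed efficiently with the vLLM backend, which provides an OpenAI-compatible serving interface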

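Model capabilities

Text generation

Instruction following

Multi-turn chat

Code completion

Document generation

RAG applications

Use cases

Dialogue systems

Multi-turn chat

Supports complex multi-turn conversation scenarios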

With the 512/256 token profile (prompt tokens / generated tokens), reaches a maximum throughput of 19.64 QPS on H100x4 hardware

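Code generation

Code completion

Supports code completion for programming languages

Reaches 81.00% pass@1 on HumanEval

Information retrieval

RAG applications

Supports question-answering systems based on retrieval-augmented generation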

With the 1024/128 token profile (prompt tokens / generated tokens), reaches a maximum throughput of 16.72 QPS on H100x4 hardware

🚀 DeepSeek-R1-Distill-Llama-70B-FP8-dynamic

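This is a quantized version of DeepSeek-R1-Distill-Llama-70B. Quantizing the weights and activations to the FP8 data type reduces disk size and GPU memory requirements and noticeably improves inference performance.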
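As a rough worked example of the memory claim, the sketch below estimates weight storage at 16 and 8 bits per parameter. The parameter count of about 70 billion is an assumption inferred from the model name, and the helper `weight_memory_gb` is a hypothetical name used only for illustration. Real checkpoints keep some tensors unquantized and also store quantization scales, so actual savings are likely somewhat below the ideal 50%.

```python
# Back-of-the-envelope estimate of weight storage; not a measurement.
ASSUMED_NUM_PARAMS = 70e9  # assumption: ~70B parameters, inferred from the model name


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(ASSUMED_NUM_PARAMS, 16)
fp8_gb = weight_memory_gb(ASSUMED_NUM_PARAMS, 8)
reduction = 1 - fp8_gb / bf16_gb

print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:    ~{fp8_gb:.0f} GB")
print(f"Reduction:      ~{reduction:.0%}")
```

The estimate covers weights only; the KV cache and activations need additional GPU memory at inference time.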

🚀 Quick start

Deploying the model with vLLM

This model can be deployed efficiently with the vLLM backend, as shown in the following example:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 2
model_name = "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

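# Apply the chat template and tokenize each conversation
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

# Pass the pre-tokenized prompts as token prompts; the standalone
# prompt_token_ids keyword argument is deprecated in recent vLLM releases
prompts = [{"prompt_token_ids": token_ids} for token_ids in prompt_token_ids]
outputs = llm.generate(prompts, sampling_params=sampling_params)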

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the vLLM documentation for more details.
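To illustrate what talking to an OpenAI-compatible server might look like, here is a minimal sketch using only the Python standard library. It assumes a vLLM server has already been started separately and listens at `http://localhost:8000/v1` (the address used in the benchmarking command of this document). The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the sampling values simply mirror the offline example.

```python
import json
import urllib.request

# Assumptions: a vLLM OpenAI-compatible server is already running locally and
# listens on this base URL; adjust both values for your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"


def build_chat_request(user_message: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.6,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the first generated message (needs a live server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Who are you? Please respond in pirate speak!")
print(request.get_method(), request.full_url)
# With the server running: print(send_chat_request(request))
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.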

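✨ Key features

Model architecture: LlamaForCausalLM; input and output are both text.
Model optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Release date: February 1, 2025
Version: 1.0
Model developers: Neural Magic

This model was obtained by quantizing the weights and activations of DeepSeek-R1-Distill-Llama-70B to the FP8 data type. The optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within Transformer blocks are quantized. Weights are quantized with a symmetric per-channel scheme and activations with a symmetric per-token scheme. Quantization is performed with LLM Compressor.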
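The per-channel and per-token schemes can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LLM Compressor's implementation: it assumes the FP8 E4M3 format (largest finite value 448), approximates FP8 rounding by keeping four significant bits and ignoring special encodings, and treats "dynamic" as meaning that activation scales are recomputed for each token at inference time. All function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3_grid(x: float) -> float:
    """Approximate FP8 E4M3 rounding: clamp, then keep 4 significant bits."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    _, exponent = math.frexp(abs(x))  # abs(x) = m * 2**exponent, 0.5 <= m < 1
    exponent = max(exponent, -5)      # below the smallest normal, spacing stays fixed
    step = 2.0 ** (exponent - 4)      # 1 implicit + 3 stored mantissa bits
    return math.copysign(round(abs(x) / step) * step, x)


def quantize_symmetric(values: list[float]) -> tuple[list[float], float]:
    """Quantize one group (a weight channel or an activation token) with one scale.

    Symmetric means there is no zero point: the scale maps max |value| to 448.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
    return [round_to_e4m3_grid(v / scale) for v in values], scale


def dequantize(quantized: list[float], scale: float) -> list[float]:
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


# Weights: one scale per output channel (row), computed once offline.
weight = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]
weight_q = [quantize_symmetric(row) for row in weight]

# Activations: one scale per token (row), recomputed for every input.
activations = [[3.0, -6.0, 1.5], [0.1, 0.2, -0.05]]
activations_q = [quantize_symmetric(token) for token in activations]

for (q, scale), original in zip(weight_q + activations_q, weight + activations):
    restored = dequantize(q, scale)
    print([round(r, 4) for r in restored], "vs", original)
```

Using one scale per row means a channel or token with small values is not forced to share the range of a row with large values, which is the usual motivation for finer-grained scales.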

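📦 Installation

The source documentation does not list specific installation steps; install vLLM and its dependencies by following their official documentation.

💻 Usage examples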

Model creation example

This model was created with llm-compressor by running the following code snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Load model
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_stub.split("/")[-1]

device_map = calculate_offload_device_map(
    model_stub,
    reserve_for_hessians=True,
    num_gpus=2,
    torch_dtype="auto",
)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map=device_map,
    torch_dtype="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

📚 Detailed documentation

Evaluation

The model was evaluated on OpenLLM Leaderboard V1 and V2 using the following commands:

OpenLLM Leaderboard V1:

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config

OpenLLM Leaderboard V2:

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config

Accuracy

Category	Metric	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic	Recovery
Reasoning	AIME 2024 (pass@1)	67.83	69.17	101.98%
	MATH-500 (pass@1)	95.29	95.14	99.84%
	GPQA Diamond (pass@1)	65.57	65.15	99.36%
	Average score	76.23	76.49	100.34%
OpenLLM V1	ARC-Challenge (Acc-Norm, 25-shot)	63.65	63.05	99.1%
	GSM8K (Strict-Match, 5-shot)	93.03	93.03	100.0%
	HellaSwag (Acc-Norm, 10-shot)	84.85	84.71	99.8%
	MMLU (Acc, 5-shot)	78.04	77.45	99.3%
	TruthfulQA (MC2, 0-shot)	56.67	56.62	99.9%
	Winogrande (Acc, 5-shot)	78.22	78.45	100.3%
	Average score	75.74	75.55	99.8%
OpenLLM V2	IFEval (Inst Level Strict Acc, 0-shot)	42.45	42.11	99.2%
	BBH (Acc-Norm, 3-shot)	21.26	19.77	93.0%
	Math-Hard (Exact-Match, 4-shot)	0.00	0.00	---
	GPQA (Acc-Norm, 0-shot)	9.51	6.97	---
	MUSR (Acc-Norm, 0-shot)	14.87	14.60	---
	MMLU-Pro (Acc, 5-shot)	4.27	5.76	---
	Average score	15.39	14.87	96.6%
Coding	HumanEval (pass@1)	81.10	81.00	99.9%
	HumanEval (pass@10)	87.60	88.60	101.1%
	HumanEval+ (pass@1)	75.20	75.50	100.4%
	HumanEval+ (pass@10)	83.10	84.30	101.4%
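
The recovery column appears to be the quantized score expressed as a percentage of the baseline score. That definition is an assumption, since the source does not state it, but it reproduces the reasoning rows of the table. The function name `recovery_percent` is hypothetical.

```python
def recovery_percent(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100


# (baseline, quantized) pairs copied from the reasoning rows of the table above
reasoning_scores = {
    "AIME 2024 (pass@1)": (67.83, 69.17),
    "MATH-500 (pass@1)": (95.29, 95.14),
    "GPQA Diamond (pass@1)": (65.57, 65.15),
}

for metric, (baseline, quantized) in reasoning_scores.items():
    print(f"{metric}: {recovery_percent(baseline, quantized):.2f}%")
```

A value above 100% means the quantized model scored slightly higher than the baseline on that benchmark, which for small gaps is plausibly within run-to-run variation.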

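Inference performance

This model achieves up to 1.4x speedup in single-stream deployments and up to 3.0x speedup in multi-stream asynchronous deployments, depending on the hardware and use-case scenario. The following performance benchmarks were run with vLLM version 0.7.2 and GuideLLM.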
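
As a quick sanity check on these figures, the sketch below computes speedups from the H100 instruction-following (256 / 128) rows of the benchmark tables in this section. It assumes single-stream speedup is a ratio of latencies and multi-stream speedup a ratio of maximum throughputs; the function names are hypothetical.

```python
def latency_speedup(baseline_latency_s: float, optimized_latency_s: float) -> float:
    """Single-stream speedup: how many times lower the latency is."""
    return baseline_latency_s / optimized_latency_s


def throughput_speedup(baseline_qps: float, optimized_qps: float) -> float:
    """Multi-stream speedup: how many times higher the throughput is."""
    return optimized_qps / baseline_qps


# H100 x2, single stream: 3.8 s (baseline) vs 2.7 s (FP8-dynamic)
single_stream = latency_speedup(3.8, 2.7)

# H100x4, multi-stream: 14.04 QPS (baseline) vs 41.44 QPS (FP8-dynamic)
multi_stream = throughput_speedup(14.04, 41.44)

print(f"single-stream: {single_stream:.2f}x, multi-stream: {multi_stream:.2f}x")
```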

Benchmarking command

guidellm --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server

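Single-stream performance (measured with vLLM version 0.7.2)

GPU class	Number of GPUs	Model	Average cost reduction	Instruction following 256 / 128 latency (s)	Instruction following 256 / 128 QPD	Multi-turn chat 512 / 256 latency (s)	Multi-turn chat 512 / 256 QPD	Docstring generation 768 / 128 latency (s)	Docstring generation 768 / 128 QPD	RAG 1024 / 128 latency (s)	RAG 1024 / 128 QPD	Code completion 256 / 1024 latency (s)	Code completion 256 / 1024 QPD	Code fixing 1024 / 1024 latency (s)	Code fixing 1024 / 1024 QPD	Large summarization 4096 / 512 latency (s)	Large summarization 4096 / 512 QPD	Large RAG 10240 / 1536 latency (s)	Large RAG 10240 / 1536 QPD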
A6000	4	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	7.4	152	14.9	76	7.5	149	7.7	146	57.2	20	58.9	19	31.9	35	98.4	11
	2	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8	1.93	7.7	292	15.2	148	7.8	287	8.0	282	60.7	37	60.2	37	32.3	70	104.0	22
	2	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	2.83	4.9	457	10.0	225	5.5	411	5.8	389	38.9	58	39.2	57	23.7	95	76.6	29
A100	2	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	6.4	157	12.8	79	6.6	153	6.7	151	50.4	20	50.8	20	27.0	37	85.4	12
	2	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8	1.48	4.1	245	8.2	123	4.2	238	4.3	235	32.4	31	32.8	31	17.6	57	90.8	11
	1	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	2.69	4.6	440	9.2	220	4.9	407	5.2	389	35.3	57	36.3	55	21.2	95	68.1	30
H100	2	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	3.8	149	7.6	74	3.9	146	3.9	144	30.0	19	30.4	19	16.1	35	56.5	10
	2	neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic	1.39	2.7	210	5.3	106	2.7	207	2.8	203	21.1	27	21.4	26	11.5	49	47.2	12
	1	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	1.83	4.0	277	7.9	138	4.1	266	4.2	262	31.2	35	31.8	34	17.8	61	61.4	18

Use case profiles: prompt tokens / generated tokens

**QPD: queries per dollar, based on on-demand cost at Lambda Labs (observed on February 18, 2025).
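
One plausible way to derive queries per dollar from a latency or throughput figure is sketched below. Both formulas are assumptions, because the source does not spell out the calculation. The hourly price is a hypothetical placeholder and is not taken from the source; only the 7.4 s latency and 3.65 QPS inputs come from the A6000 baseline rows of the tables in this section.

```python
SECONDS_PER_HOUR = 3600


def qpd_single_stream(latency_s: float, hourly_cost_usd: float) -> float:
    """Queries per dollar when requests are served one at a time."""
    queries_per_hour = SECONDS_PER_HOUR / latency_s
    return queries_per_hour / hourly_cost_usd


def qpd_multi_stream(max_qps: float, hourly_cost_usd: float) -> float:
    """Queries per dollar when the server runs at its maximum throughput."""
    queries_per_hour = max_qps * SECONDS_PER_HOUR
    return queries_per_hour / hourly_cost_usd


# Hypothetical price: total USD per hour for the whole multi-GPU instance.
EXAMPLE_HOURLY_COST_USD = 3.20

single_stream_qpd = qpd_single_stream(7.4, EXAMPLE_HOURLY_COST_USD)
multi_stream_qpd = qpd_multi_stream(3.65, EXAMPLE_HOURLY_COST_USD)

print(f"single-stream QPD: {single_stream_qpd:.0f}")
print(f"multi-stream QPD:  {multi_stream_qpd:.0f}")
```

Because cost scales with the number of GPUs, a configuration that serves the same workload on fewer GPUs can deliver more queries per dollar even when its latency is higher.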

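Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

Hardware	Model	Average cost reduction	Instruction following 256 / 128 maximum throughput (QPS)	Instruction following 256 / 128 QPD	Multi-turn chat 512 / 256 maximum throughput (QPS)	Multi-turn chat 512 / 256 QPD	Docstring generation 768 / 128 maximum throughput (QPS)	Docstring generation 768 / 128 QPD	RAG 1024 / 128 maximum throughput (QPS)	RAG 1024 / 128 QPD	Code completion 256 / 1024 maximum throughput (QPS)	Code completion 256 / 1024 QPD	Code fixing 1024 / 1024 maximum throughput (QPS)	Code fixing 1024 / 1024 QPD	Large summarization 4096 / 512 maximum throughput (QPS)	Large summarization 4096 / 512 QPD	Large RAG 10240 / 1536 maximum throughput (QPS)	Large RAG 10240 / 1536 QPD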
A6000x4	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	3.65	4102	1.56	1757	1.90	2143	1.48	1665	0.44	493	0.34	380	0.22	245	0.05	55
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8	1.76	5.89	6625	2.94	3307	3.36	3775	2.59	2916	0.74	828	0.53	601	0.35	398	0.11	120
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	1.48	4.91	5528	2.01	2259	2.03	2280	1.12	1255	1.11	1251	0.76	852	0.24	267	0.07	81
A100x4	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	10.41	5235	5.10	2565	5.50	2766	4.36	2193	1.49	751	1.21	607	0.89	447	0.19	98
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8	1.63	18.11	9103	8.90	4477	9.41	4730	7.42	3731	2.44	1229	1.89	948	1.26	631	0.30	149
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	1.12	12.63	6353	5.32	2673	5.58	2804	4.27	2144	2.30	1158	1.45	729	0.76	381	0.22	110
H100x4	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	14.04	2113	10.85	1634	12.25	1844	9.93	1494	3.68	554	2.82	425	1.81	273	0.35	52
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic	1.78	41.44	6236	19.64	2956	21.03	3166	16.72	2516	6.01	904	4.46	672	2.55	383	0.49	74
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	1.45	36.61	5509	15.12	2275	16.24	2443	13.22	1990	5.48	825	3.01	453	2.07	312	0.43	64