DeepSeek-R1-Distill-Llama-70B-FP8-dynamic open-source model: optimized inference performance for more efficient processing

Deepseek R1 Distill Llama 70B FP8 Dynamic

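Developed by RedHatAI

An FP8-quantized version of DeepSeek-R1-Distill-Llama-70B that optimizes inference performance by reducing the number of bits used for weights and activations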

Large language model

Transformers

License: MIT #FP8 quantization #Multi-GPU inference #Efficient deployment

Downloads: 45.77k

Release date: February 1, 2025

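Model overview

This is a quantized version of DeepSeek-R1-Distill-Llama-70B. Quantizing the weights and activations to the FP8 data type reduces disk size and GPU memory requirements and significantly improves inference performance.

Model features

FP8 quantization

Both weights and activations are quantized to the FP8 data type, reducing disk size and GPU memory requirements by approximately 50%

Efficient inference

Up to 1.4x speedup in single-stream deployments and up to 3.0x speedup in multi-stream asynchronous deployments

vLLM compatible

Can be deployed efficiently with the vLLM backend, which provides an OpenAI-compatible serving interface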

Model capabilities

Text generation

Instruction following

Multi-turn chat

Code completion

Documentation generation

RAG applications

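Use cases

Conversational systems

Multi-turn chat

Supports complex multi-turn conversation scenarios

With the 512/256 token profile, reaches 19.64 QPS on H100x4 hardware

Code generation

Code completion

Supports code completion for programming languages

Reaches 81.00% pass@1 on HumanEval

Information retrieval

RAG applications

Supports question answering systems built on retrieval-augmented generation

With the 1024/128 token profile, reaches 16.72 QPS on H100x4 hardware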

🚀 DeepSeek-R1-Distill-Llama-70B-FP8-dynamic

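This is a quantized version of DeepSeek-R1-Distill-Llama-70B. Quantizing the weights and activations to the FP8 data type reduces disk size and GPU memory requirements and significantly improves inference performance.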

🚀 Quick start

Deploying the model with vLLM

This model can be deployed efficiently with the vLLM backend, as shown in the following example:

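from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Number of GPUs to shard the model across (tensor parallelism)
number_gpus = 2
model_name = "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"

# The tokenizer supplies the chat template and the end-of-sequence token id
tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

# One conversation per entry; each conversation is a list of chat messages
messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Apply the chat template and tokenize each conversation into prompt token ids
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

# Generate from the pre-tokenized prompts (keyword form as in the original card;
# recent vLLM releases may expect token prompts through the `prompts` argument instead)
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

# Keep the first completion of each request
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)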

vLLM also supports OpenAI-compatible serving. See the vLLM documentation for details.
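As a rough illustration of OpenAI-compatible serving, the sketch below builds a chat completion request and posts it using only the Python standard library. It assumes a vLLM server is already running locally (for example, one started with `vllm serve`) and that it serves the model under the name shown. The base URL matches the target in the benchmark command later in this page. The helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

# Assumptions: a vLLM OpenAI-compatible server is already running at this URL
# and serves the model under this name. Adjust both for your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"


def build_chat_request(user_message: str, max_tokens: int = 256, temperature: float = 0.6) -> dict:
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry the generated message under choices[0]
    return body["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("Who are you? Please respond in pirate speak!"))` should return the generated reply as a string. The sampling values mirror those used in the offline example above.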

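✨ Key features

Model architecture: LlamaForCausalLM, with text input and text output.
Model optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Release date: February 1, 2025
Version: 1.0
Model developer: Neural Magic

This model was obtained by quantizing the weights and activations of DeepSeek-R1-Distill-Llama-70B to the FP8 data type. The optimization reduces the number of bits per parameter from 16 to 8, which cuts disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within Transformer blocks are quantized. Weights use a symmetric per-channel scheme and activations use a symmetric per-token scheme. Quantization is performed with LLM Compressor.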
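The 50% figure follows from simple arithmetic on bits per parameter. The sketch below works it through under the assumption of a nominal 70 billion parameters (taken from the "70B" in the model name, not an exact count). Because layers such as `lm_head` stay unquantized, the real saving would be slightly below this idealized estimate.

```python
def weight_footprint_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: nominal parameter count implied by "70B", not the exact figure
NOMINAL_PARAMS = 70e9

bf16_gib = weight_footprint_gib(NOMINAL_PARAMS, 16)  # 16-bit baseline
fp8_gib = weight_footprint_gib(NOMINAL_PARAMS, 8)    # FP8-quantized weights
reduction = 1 - fp8_gib / bf16_gib

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"FP8 weights:    {fp8_gib:.1f} GiB")
print(f"Reduction:      {reduction:.0%}")
```

This estimate covers weights only. Runtime GPU memory also includes the KV cache and activations, which depend on batch size and sequence length.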

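📦 Installation

The source does not list specific installation steps. Install vLLM and its dependencies by following their official documentation.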

💻 Usage examples

Model creation example

This model was created with llm-compressor by running the following code snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Load model
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_stub.split("/")[-1]

device_map = calculate_offload_device_map(
    model_stub,
    reserve_for_hessians=True,
    num_gpus=2,
    torch_dtype="auto",
)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map=device_map,
    torch_dtype="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
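To make the symmetric per-channel and per-token schemes described earlier more concrete, here is a minimal pure-Python sketch of the scaling step. It is not the LLM Compressor implementation. It assumes the E4M3 FP8 format (maximum finite value 448), which the source does not state. It also assumes that "dynamic" means activation scales are computed at inference time. The function names are hypothetical.

```python
import math

FP8_MAX = 448.0  # assumption: E4M3 FP8 format, whose largest finite value is 448


def round_to_fp8_grid(x: float) -> float:
    """Round x to the nearest value representable with 3 mantissa bits (E4M3-style)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    ax = min(abs(x), FP8_MAX)          # saturate instead of overflowing
    _, exp = math.frexp(ax)            # ax = m * 2**exp with 0.5 <= m < 1
    exp = max(exp - 1, -6)             # clamp to the smallest normal exponent
    step = 2.0 ** (exp - 3)            # grid spacing with 3 mantissa bits
    return sign * min(round(ax / step) * step, FP8_MAX)


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row; returns (quantized, scales)."""
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / FP8_MAX if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_fp8_grid(v / scale) for v in row])
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map quantized values back to the original range."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Weights: one row per output channel, so scales are per-channel (computed once)
weights = [[1.0, -2.0, 0.5], [0.1, 0.2, -0.4]]
w_q, w_scales = quantize_rows(weights)

# Activations: one row per token, so scales are per-token (computed at runtime)
activations = [[0.3, -1.2, 0.7], [5.0, 0.25, -2.5]]
a_q, a_scales = quantize_rows(activations)

print(dequantize_rows(w_q, w_scales))
print(dequantize_rows(a_q, a_scales))
```

The same routine serves both cases. What differs is the axis the rows represent and when the scales are computed. The scheme is symmetric because the scale alone maps values onto a grid centered at zero, with no zero-point offset.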

📚 Detailed documentation

Evaluation

The model was evaluated on the OpenLLM Leaderboard V1 and V2 using the following commands:

OpenLLM Leaderboard V1:

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config

OpenLLM Leaderboard V2:

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config

Accuracy

Category	Metric	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic	Recovery
Reasoning	AIME 2024 (pass@1)	67.83	69.17	101.98%
	MATH-500 (pass@1)	95.29	95.14	99.84%
	GPQA Diamond (pass@1)	65.57	65.15	99.36%
	Average score	76.23	76.49	100.34%
OpenLLM V1	ARC-Challenge (Acc-Norm, 25-shot)	63.65	63.05	99.1%
	GSM8K (Strict-Match, 5-shot)	93.03	93.03	100.0%
	HellaSwag (Acc-Norm, 10-shot)	84.85	84.71	99.8%
	MMLU (Acc, 5-shot)	78.04	77.45	99.3%
	TruthfulQA (MC2, 0-shot)	56.67	56.62	99.9%
	Winogrande (Acc, 5-shot)	78.22	78.45	100.3%
	Average score	75.74	75.55	99.8%
OpenLLM V2	IFEval (Inst Level Strict Acc, 0-shot)	42.45	42.11	99.2%
	BBH (Acc-Norm, 3-shot)	21.26	19.77	93.0%
	Math-Hard (Exact-Match, 4-shot)	0.00	0.00	---
	GPQA (Acc-Norm, 0-shot)	9.51	6.97	---
	MUSR (Acc-Norm, 0-shot)	14.87	14.60	---
	MMLU-Pro (Acc, 5-shot)	4.27	5.76	---
	Average score	15.39	14.87	96.6%
Coding	HumanEval (pass@1)	81.10	81.00	99.9%
	HumanEval (pass@10)	87.60	88.60	101.1%
	HumanEval+ (pass@1)	75.20	75.50	100.4%
	HumanEval+ (pass@10)	83.10	84.30	101.4%
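The recovery column appears to be the quantized model's score expressed as a percentage of the baseline score. The source does not define it explicitly, so treat the formula below as an assumption that happens to match the table. The sketch recomputes recovery for three rows copied from the table.

```python
def recovery_percent(baseline_score: float, quantized_score: float) -> float:
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized_score / baseline_score


# (baseline, quantized) pairs copied from the accuracy table
rows = {
    "AIME 2024 (pass@1)": (67.83, 69.17),
    "BBH (Acc-Norm, 3-shot)": (21.26, 19.77),
    "HumanEval (pass@1)": (81.10, 81.00),
}

for metric, (baseline, quantized) in rows.items():
    print(f"{metric}: {recovery_percent(baseline, quantized):.2f}%")
```

A value above 100% means the quantized model scored slightly higher than the baseline on that benchmark.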

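Inference performance

This model achieves up to 1.4x speedup in single-stream deployments and up to 3.0x speedup in multi-stream asynchronous deployments, depending on hardware and use case. The following benchmarks were run with vLLM version 0.7.2 and GuideLLM.

Benchmark command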

guidellm --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server

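Single-stream performance (measured with vLLM version 0.7.2)

GPU class	Number of GPUs	Model	Average cost reduction	Instruction following 256 / 128 latency (s)	Instruction following 256 / 128 QPD	Multi-turn chat 512 / 256 latency (s)	Multi-turn chat 512 / 256 QPD	Docstring generation 768 / 128 latency (s)	Docstring generation 768 / 128 QPD	RAG 1024 / 128 latency (s)	RAG 1024 / 128 QPD	Code completion 256 / 1024 latency (s)	Code completion 256 / 1024 QPD	Code fixing 1024 / 1024 latency (s)	Code fixing 1024 / 1024 QPD	Large summarization 4096 / 512 latency (s)	Large summarization 4096 / 512 QPD	Large RAG 10240 / 1536 latency (s)	Large RAG 10240 / 1536 QPD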
A6000	4	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	7.4	152	14.9	76	7.5	149	7.7	146	57.2	20	58.9	19	31.9	35	98.4	11
	2	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8	1.93	7.7	292	15.2	148	7.8	287	8.0	282	60.7	37	60.2	37	32.3	70	104.0	22
	2	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	2.83	4.9	457	10.0	225	5.5	411	5.8	389	38.9	58	39.2	57	23.7	95	76.6	29
A100	2	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	6.4	157	12.8	79	6.6	153	6.7	151	50.4	20	50.8	20	27.0	37	85.4	12
	2	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8	1.48	4.1	245	8.2	123	4.2	238	4.3	235	32.4	31	32.8	31	17.6	57	90.8	11
	1	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	2.69	4.6	440	9.2	220	4.9	407	5.2	389	35.3	57	36.3	55	21.2	95	68.1	30
H100	2	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	3.8	149	7.6	74	3.9	146	3.9	144	30.0	19	30.4	19	16.1	35	56.5	10
	2	neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic	1.39	2.7	210	5.3	106	2.7	207	2.8	203	21.1	27	21.4	26	11.5	49	47.2	12
	1	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	1.83	4.0	277	7.9	138	4.1	266	4.2	262	31.2	35	31.8	34	17.8	61	61.4	18

Use case profiles: prompt tokens / generated tokens

QPD: queries per dollar, based on on-demand cost at Lambda Labs (observed on February 18, 2025).
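The source does not spell out how QPD is derived. One plausible reading is queries served per hour divided by the hourly cost of the GPUs used. The sketch below implements that reading. The price of 1.79 USD per GPU-hour is an illustrative assumption, not a figure from the source, and the function names are hypothetical.

```python
def single_stream_qpd(latency_s: float, num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Queries per dollar when requests are served one after another."""
    queries_per_hour = 3600.0 / latency_s
    return queries_per_hour / (num_gpus * usd_per_gpu_hour)


def multi_stream_qpd(max_qps: float, num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Queries per dollar at a sustained maximum throughput (queries per second)."""
    queries_per_hour = max_qps * 3600.0
    return queries_per_hour / (num_gpus * usd_per_gpu_hour)


# Assumption: illustrative on-demand price, not stated in the source
ASSUMED_USD_PER_GPU_HOUR = 1.79

# Baseline model on 2x A100, instruction following 256 / 128, 6.4 s latency
qpd = single_stream_qpd(latency_s=6.4, num_gpus=2, usd_per_gpu_hour=ASSUMED_USD_PER_GPU_HOUR)
print(f"Estimated QPD: {qpd:.0f}")
```

Under this assumed price, the estimate lands close to the 157 QPD reported for that row. This suggests the reading is reasonable but does not confirm it.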

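Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

Hardware	Model	Average cost reduction	Instruction following 256 / 128 maximum throughput (QPS)	Instruction following 256 / 128 QPD	Multi-turn chat 512 / 256 maximum throughput (QPS)	Multi-turn chat 512 / 256 QPD	Docstring generation 768 / 128 maximum throughput (QPS)	Docstring generation 768 / 128 QPD	RAG 1024 / 128 maximum throughput (QPS)	RAG 1024 / 128 QPD	Code completion 256 / 1024 maximum throughput (QPS)	Code completion 256 / 1024 QPD	Code fixing 1024 / 1024 maximum throughput (QPS)	Code fixing 1024 / 1024 QPD	Large summarization 4096 / 512 maximum throughput (QPS)	Large summarization 4096 / 512 QPD	Large RAG 10240 / 1536 maximum throughput (QPS)	Large RAG 10240 / 1536 QPD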
A6000x4	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	3.65	4102	1.56	1757	1.90	2143	1.48	1665	0.44	493	0.34	380	0.22	245	0.05	55
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8	1.76	5.89	6625	2.94	3307	3.36	3775	2.59	2916	0.74	828	0.53	601	0.35	398	0.11	120
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	1.48	4.91	5528	2.01	2259	2.03	2280	1.12	1255	1.11	1251	0.76	852	0.24	267	0.07	81
A100x4	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	10.41	5235	5.10	2565	5.50	2766	4.36	2193	1.49	751	1.21	607	0.89	447	0.19	98
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8	1.63	18.11	9103	8.90	4477	9.41	4730	7.42	3731	2.44	1229	1.89	948	1.26	631	0.30	149
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	1.12	12.63	6353	5.32	2673	5.58	2804	4.27	2144	2.30	1158	1.45	729	0.76	381	0.22	110
H100x4	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	---	14.04	2113	10.85	1634	12.25	1844	9.93	1494	3.68	554	2.82	425	1.81	273	0.35	52
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic	1.78	41.44	6236	19.64	2956	21.03	3166	16.72	2516	6.01	904	4.46	672	2.55	383	0.49	74
	neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16	1.45	36.61	5509	15.12	2275	16.24	2443	13.22	1990	5.48	825	3.01	453	2.07	312	0.43	64
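The headline speedups can be checked against the H100 rows of the two tables. The sketch below assumes that speedup means baseline latency divided by quantized latency for single-stream runs, and quantized throughput divided by baseline throughput for multi-stream runs. The inputs are the instruction following 256 / 128 figures for the baseline and FP8-dynamic models.

```python
def latency_speedup(baseline_latency_s: float, optimized_latency_s: float) -> float:
    """Single-stream speedup: lower latency is better, so baseline goes on top."""
    return baseline_latency_s / optimized_latency_s


def throughput_speedup(baseline_qps: float, optimized_qps: float) -> float:
    """Multi-stream speedup: higher throughput is better, so optimized goes on top."""
    return optimized_qps / baseline_qps


# H100 x2, single stream: 3.8 s (baseline) vs 2.7 s (FP8-dynamic)
single = latency_speedup(3.8, 2.7)

# H100 x4, multi-stream: 14.04 QPS (baseline) vs 41.44 QPS (FP8-dynamic)
multi = throughput_speedup(14.04, 41.44)

print(f"Single-stream speedup: {single:.2f}x")
print(f"Multi-stream speedup:  {multi:.2f}x")
```

Rounded to one decimal place, both results appear consistent with the stated "up to 1.4x" and "up to 3.0x" figures. Other use case profiles in the tables show smaller gains.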