DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
A quantized version of DeepSeek-R1-Distill-Llama-70B, optimized for reduced disk size and GPU memory requirements.
Quick Start
This is a quantized version of DeepSeek-R1-Distill-Llama-70B. It can be deployed efficiently using the vLLM backend.
Features
- Model Architecture: LlamaForCausalLM, taking text as input and outputting text.
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 2/1/2025
- Version: 1.0
- Model Developers: Neural Magic

This optimization reduces the number of bits per parameter from 16 to 8, cutting the disk size and GPU memory requirements by approximately 50%.
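For a rough sense of the scale, the back-of-the-envelope estimate below compares BF16 and FP8 weight storage for a 70B-parameter model; it is only a sketch and ignores activations, KV cache, and any layers left unquantized (such as lm_head).

```python
# Back-of-the-envelope weight-memory estimate (illustrative only; ignores
# activations, KV cache, and unquantized layers such as lm_head).
num_params = 70e9

bf16_gb = num_params * 2 / 1e9  # 2 bytes per parameter at BF16/FP16
fp8_gb = num_params * 1 / 1e9   # 1 byte per parameter at FP8

print(f"BF16 weights: ~{bf16_gb:.0f} GB")           # ~140 GB
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")            # ~70 GB
print(f"Reduction:    ~{1 - fp8_gb / bf16_gb:.0%}") # ~50%
```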
Installation
No dedicated installation steps are required. Installing vLLM (for example with `pip install vllm`) is sufficient to serve the model, and reproducing the quantization below additionally requires the `llmcompressor` package. Usage with vLLM is shown in the examples that follow.
Usage Examples
Basic Usage
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 2
model_name = "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"

# The tokenizer is used to build chat prompts and to supply the EOS stop token
tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

# Load the model across the available GPUs with tensor parallelism
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Apply the chat template to each conversation, then generate
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
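For serving behind an API, the same model can also be exposed through vLLM's OpenAI-compatible server, which is how the benchmarking section below drives it. The sketch assumes a server for this model is already running at http://localhost:8000/v1 and that the OpenAI Python client is installed; adjust the URL and model name to your deployment.

```python
# Minimal sketch of querying a vLLM OpenAI-compatible server, assuming one is
# already running for this model at http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not require a key by default

response = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
    messages=[{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```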
Advanced Usage
The script below shows how the original deepseek-ai/DeepSeek-R1-Distill-Llama-70B checkpoint can be quantized to FP8 weights and dynamic FP8 activations with llm-compressor:
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Load model, spreading weights across two GPUs (with CPU offload if needed)
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_stub.split("/")[-1]

device_map = calculate_offload_device_map(
    model_stub,
    reserve_for_hessians=True,
    num_gpus=2,
    torch_dtype="auto",
)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme:
# FP8 weights and dynamic FP8 activations for all Linear layers except lm_head
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
Documentation
Evaluation
The model was evaluated on reasoning benchmarks (AIME 2024, MATH-500, GPQA Diamond), OpenLLM Leaderboard V1 and V2, and HumanEval coding tasks (see the accuracy table below). The OpenLLM Leaderboard evaluations were run with lm-evaluation-harness using the following commands:
OpenLLM Leaderboard V1:
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
--tasks openllm \
--write_out \
--batch_size auto \
--output_path output_dir \
--show_config
OpenLLM Leaderboard V2:
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
--apply_chat_template \
--fewshot_as_multiturn \
--tasks leaderboard \
--write_out \
--batch_size auto \
--output_path output_dir \
--show_config
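Each run writes its scores as JSON under output_dir; the exact directory layout depends on the lm-eval version, so the hedged sketch below simply globs for result files and prints whatever per-task metrics it finds.

```python
# Hedged sketch: collect per-task metrics from lm-eval JSON output under
# output_dir. The layout varies across lm-eval versions, so we just look for
# any JSON file that contains a top-level "results" section.
import glob
import json

for path in glob.glob("output_dir/**/*.json", recursive=True):
    with open(path) as f:
        data = json.load(f)
    if "results" not in data:
        continue
    print(path)
    for task, metrics in data["results"].items():
        print(f"  {task}: {metrics}")
```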
Accuracy
Category | Metric | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic | Recovery |
---|---|---|---|---|
Reasoning | AIME 2024 (pass@1) | 67.83 | 69.17 | 101.98% |
Reasoning | MATH-500 (pass@1) | 95.29 | 95.14 | 99.84% |
Reasoning | GPQA Diamond (pass@1) | 65.57 | 65.15 | 99.36% |
Reasoning | Average Score | 76.23 | 76.49 | 100.34% |
OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 63.65 | 63.05 | 99.1% |
OpenLLM V1 | GSM8K (Strict-Match, 5-shot) | 93.03 | 93.03 | 100.0% |
OpenLLM V1 | HellaSwag (Acc-Norm, 10-shot) | 84.85 | 84.71 | 99.8% |
OpenLLM V1 | MMLU (Acc, 5-shot) | 78.04 | 77.45 | 99.3% |
OpenLLM V1 | TruthfulQA (MC2, 0-shot) | 56.67 | 56.62 | 99.9% |
OpenLLM V1 | Winogrande (Acc, 5-shot) | 78.22 | 78.45 | 100.3% |
OpenLLM V1 | Average Score | 75.74 | 75.55 | 99.8% |
OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 42.45 | 42.11 | 99.2% |
OpenLLM V2 | BBH (Acc-Norm, 3-shot) | 21.26 | 19.77 | 93.0% |
OpenLLM V2 | Math-Hard (Exact-Match, 4-shot) | 0.00 | 0.00 | --- |
OpenLLM V2 | GPQA (Acc-Norm, 0-shot) | 9.51 | 6.97 | --- |
OpenLLM V2 | MUSR (Acc-Norm, 0-shot) | 14.87 | 14.60 | --- |
OpenLLM V2 | MMLU-Pro (Acc, 5-shot) | 4.27 | 5.76 | --- |
OpenLLM V2 | Average Score | 15.39 | 14.87 | 96.6% |
Coding | HumanEval (pass@1) | 81.10 | 81.00 | 99.9% |
Coding | HumanEval (pass@10) | 87.60 | 88.60 | 101.1% |
Coding | HumanEval+ (pass@1) | 75.20 | 75.50 | 100.4% |
Coding | HumanEval+ (pass@10) | 83.10 | 84.30 | 101.4% |
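The Recovery column appears to be the quantized score expressed as a percentage of the baseline score; for example, AIME 2024 gives 69.17 / 67.83 ≈ 101.98%. A minimal helper reproducing the column:

```python
# Recovery as reported in the table: quantized score as a percentage of the
# baseline (unquantized) score. Interpretation inferred from the table values.
def recovery(baseline: float, quantized: float) -> float:
    return quantized / baseline * 100

print(f"{recovery(67.83, 69.17):.2f}%")  # AIME 2024 -> 101.98%
```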
Inference Performance
This model achieves up to 1.4x speedup in single-stream deployment and up to 3.0x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.
Benchmarking Command
The command below assumes an OpenAI-compatible vLLM server for the model under test is already running at http://localhost:8000.
guidellm --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
Single-stream performance (measured with vLLM version 0.7.2)
GPU class | Number of GPUs | Model | Average cost reduction | Instruction Following 256 / 128 Latency (s) | Instruction Following 256 / 128 QPD | Multi-turn Chat 512 / 256 Latency (s) | Multi-turn Chat 512 / 256 QPD | Docstring Generation 768 / 128 Latency (s) | Docstring Generation 768 / 128 QPD | RAG 1024 / 128 Latency (s) | RAG 1024 / 128 QPD | Code Completion 256 / 1024 Latency (s) | Code Completion 256 / 1024 QPD | Code Fixing 1024 / 1024 Latency (s) | Code Fixing 1024 / 1024 QPD | Large Summarization 4096 / 512 Latency (s) | Large Summarization 4096 / 512 QPD | Large RAG 10240 / 1536 Latency (s) | Large RAG 10240 / 1536 QPD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A6000 | 4 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 7.4 | 152 | 14.9 | 76 | 7.5 | 149 | 7.7 | 146 | 57.2 | 20 | 58.9 | 19 | 31.9 | 35 | 98.4 | 11 |
A6000 | 2 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | 1.93 | 7.7 | 292 | 15.2 | 148 | 7.8 | 287 | 8.0 | 282 | 60.7 | 37 | 60.2 | 37 | 32.3 | 70 | 104.0 | 22 |
A6000 | 2 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 2.83 | 4.9 | 457 | 10.0 | 225 | 5.5 | 411 | 5.8 | 389 | 38.9 | 58 | 39.2 | 57 | 23.7 | 95 | 76.6 | 29 |
A100 | 2 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 6.4 | 157 | 12.8 | 79 | 6.6 | 153 | 6.7 | 151 | 50.4 | 20 | 50.8 | 20 | 27.0 | 37 | 85.4 | 12 |
A100 | 2 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | 1.48 | 4.1 | 245 | 8.2 | 123 | 4.2 | 238 | 4.3 | 235 | 32.4 | 31 | 32.8 | 31 | 17.6 | 57 | 90.8 | 11 |
A100 | 1 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 2.69 | 4.6 | 440 | 9.2 | 220 | 4.9 | 407 | 5.2 | 389 | 35.3 | 57 | 36.3 | 55 | 21.2 | 95 | 68.1 | 30 |
H100 | 2 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 3.8 | 149 | 7.6 | 74 | 3.9 | 146 | 3.9 | 144 | 30.0 | 19 | 30.4 | 19 | 16.1 | 35 | 56.5 | 10 |
H100 | 2 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic | 1.39 | 2.7 | 210 | 5.3 | 106 | 2.7 | 207 | 2.8 | 203 | 21.1 | 27 | 21.4 | 26 | 11.5 | 49 | 47.2 | 12 |
H100 | 1 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 1.83 | 4.0 | 277 | 7.9 | 138 | 4.1 | 266 | 4.2 | 262 | 31.2 | 35 | 31.8 | 34 | 17.8 | 61 | 61.4 | 18 |
**Use case profiles: prompt tokens / generation tokens
**QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
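Single-stream QPD can be reconstructed (approximately) from latency, GPU count, and the per-GPU hourly price: queries per dollar ≈ 3600 / (latency × number of GPUs × price per GPU-hour). The sketch below uses a hypothetical $0.80 per A6000-hour rather than the exact Lambda Labs rate behind the table, so treat the result as illustrative.

```python
# Illustrative reconstruction of the single-stream QPD metric. The hourly GPU
# price is a hypothetical placeholder, not the exact on-demand rate used for
# the table.
def queries_per_dollar(latency_s: float, num_gpus: int, price_per_gpu_hour: float) -> float:
    queries_per_hour = 3600 / latency_s
    cost_per_hour = num_gpus * price_per_gpu_hour
    return queries_per_hour / cost_per_hour

# Example: 4xA6000 at a hypothetical $0.80/GPU-hour, 7.4 s latency (first table row)
print(round(queries_per_dollar(7.4, 4, 0.80)))  # ~152 queries per dollar
```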
Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
Hardware | Model | Average cost reduction | Instruction Following 256 / 128 Maximum throughput (QPS) | Instruction Following 256 / 128 QPD | Multi-turn Chat 512 / 256 Maximum throughput (QPS) | Multi-turn Chat 512 / 256 QPD | Docstring Generation 768 / 128 Maximum throughput (QPS) | Docstring Generation 768 / 128 QPD | RAG 1024 / 128 Maximum throughput (QPS) | RAG 1024 / 128 QPD | Code Completion 256 / 1024 Maximum throughput (QPS) | Code Completion 256 / 1024 QPD | Code Fixing 1024 / 1024 Maximum throughput (QPS) | Code Fixing 1024 / 1024 QPD | Large Summarization 4096 / 512 Maximum throughput (QPS) | Large Summarization 4096 / 512 QPD | Large RAG 10240 / 1536 Maximum throughput (QPS) | Large RAG 10240 / 1536 QPD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A6000x4 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 3.65 | 4102 | 1.56 | 1757 | 1.90 | 2143 | 1.48 | 1665 | 0.44 | 493 | 0.34 | 380 | 0.22 | 245 | 0.05 | 55 |
A6000x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | 1.76 | 5.89 | 6625 | 2.94 | 3307 | 3.36 | 3775 | 2.59 | 2916 | 0.74 | 828 | 0.53 | 601 | 0.35 | 398 | 0.11 | 120 |
A6000x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 1.48 | 4.91 | 5528 | 2.01 | 2259 | 2.03 | 2280 | 1.12 | 1255 | 1.11 | 1251 | 0.76 | 852 | 0.24 | 267 | 0.07 | 81 |
A100x4 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 10.41 | 5235 | 5.10 | 2565 | 5.50 | 2766 | 4.36 | 2193 | 1.49 | 751 | 1.21 | 607 | 0.89 | 447 | 0.19 | 98 |
A100x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | 1.63 | 18.11 | 9103 | 8.90 | 4477 | 9.41 | 4730 | 7.42 | 3731 | 2.44 | 1229 | 1.89 | 948 | 1.26 | 631 | 0.30 | 149 |
A100x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 1.12 | 12.63 | 6353 | 5.32 | 2673 | 5.58 | 2804 | 4.27 | 2144 | 2.30 | 1158 | 1.45 | 729 | 0.76 | 381 | 0.22 | 110 |
H100x4 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 14.04 | 2113 | 10.85 | 1634 | 12.25 | 1844 | 9.93 | 1494 | 3.68 | 554 | 2.82 | 425 | 1.81 | 273 | 0.35 | 52 |
H100x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic | 1.78 | 41.44 | 6236 | 19.64 | 2956 | 21.03 | 3166 | 16.72 | 2516 | 6.01 | 904 | 4.46 | 672 | 2.55 | 383 | 0.49 | 74 |
H100x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 1.45 | 36.61 | 5509 | 15.12 | 2275 | 16.24 | 2443 | 13.22 | 1990 | 5.48 | 825 | 3.01 | 453 | 2.07 | 312 | 0.43 | 64 |
**Use case profiles: prompt tokens / generation tokens
**QPS: Queries per second.
**QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
License
This project is licensed under the MIT License.

