DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
A quantized version of DeepSeek-R1-Distill-Llama-70B, optimized for reduced disk size and GPU memory requirements.
Quick Start
This is a quantized version of DeepSeek-R1-Distill-Llama-70B. It can be deployed efficiently using the vLLM backend.
Features
- Model Architecture: LlamaForCausalLM, taking text as input and outputting text.
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 2/1/2025
- Version: 1.0
- Model Developers: Neural Magic

This optimization reduces the number of bits per parameter from 16 to 8, cutting the disk size and GPU memory requirements by approximately 50%.
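For a rough sense of the scale, the back-of-the-envelope estimate below compares BF16 and FP8 weight storage for a 70B-parameter model; it is only a sketch and ignores activations, KV cache, and any layers left unquantized (such as lm_head).

```python
# Back-of-the-envelope weight-memory estimate (illustrative only; ignores
# activations, KV cache, and unquantized layers such as lm_head).
num_params = 70e9

bf16_gb = num_params * 2 / 1e9  # 2 bytes per parameter at BF16/FP16
fp8_gb = num_params * 1 / 1e9   # 1 byte per parameter at FP8

print(f"BF16 weights: ~{bf16_gb:.0f} GB")           # ~140 GB
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")            # ~70 GB
print(f"Reduction:    ~{1 - fp8_gb / bf16_gb:.0%}") # ~50%
```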
Installation
No dedicated installation steps are required. Installing vLLM (for example with `pip install vllm`) is sufficient to serve the model, and reproducing the quantization below additionally requires the `llmcompressor` package. Usage with vLLM is shown in the examples that follow.
Usage Examples
Basic Usage
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 2
model_name = "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"

# The tokenizer is used to build chat prompts and to supply the EOS stop token
tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

# Load the model across the available GPUs with tensor parallelism
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Apply the chat template to each conversation, then generate
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
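For serving behind an API, the same model can also be exposed through vLLM's OpenAI-compatible server, which is how the benchmarking section below drives it. The sketch assumes a server for this model is already running at http://localhost:8000/v1 and that the OpenAI Python client is installed; adjust the URL and model name to your deployment.

```python
# Minimal sketch of querying a vLLM OpenAI-compatible server, assuming one is
# already running for this model at http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not require a key by default

response = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
    messages=[{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```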
Advanced Usage
The script below shows how the original deepseek-ai/DeepSeek-R1-Distill-Llama-70B checkpoint can be quantized to FP8 weights and dynamic FP8 activations with llm-compressor:
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Load model, spreading weights across two GPUs (with CPU offload if needed)
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_stub.split("/")[-1]

device_map = calculate_offload_device_map(
    model_stub,
    reserve_for_hessians=True,
    num_gpus=2,
    torch_dtype="auto",
)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme:
# FP8 weights and dynamic FP8 activations for all Linear layers except lm_head
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
Documentation
Evaluation
The model was evaluated on reasoning benchmarks (AIME 2024, MATH-500, GPQA Diamond), OpenLLM Leaderboard V1 and V2, and HumanEval coding tasks (see the accuracy table below). The OpenLLM Leaderboard evaluations were run with lm-evaluation-harness using the following commands:
OpenLLM Leaderboard V1:
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
--tasks openllm \
--write_out \
--batch_size auto \
--output_path output_dir \
--show_config
OpenLLM Leaderboard V2:
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
--apply_chat_template \
--fewshot_as_multiturn \
--tasks leaderboard \
--write_out \
--batch_size auto \
--output_path output_dir \
--show_config
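Each run writes its scores as JSON under output_dir; the exact directory layout depends on the lm-eval version, so the hedged sketch below simply globs for result files and prints whatever per-task metrics it finds.

```python
# Hedged sketch: collect per-task metrics from lm-eval JSON output under
# output_dir. The layout varies across lm-eval versions, so we just look for
# any JSON file that contains a top-level "results" section.
import glob
import json

for path in glob.glob("output_dir/**/*.json", recursive=True):
    with open(path) as f:
        data = json.load(f)
    if "results" not in data:
        continue
    print(path)
    for task, metrics in data["results"].items():
        print(f"  {task}: {metrics}")
```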
Accuracy
Category | Metric | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic | Recovery |
---|---|---|---|---|
Reasoning | AIME 2024 (pass@1) | 67.83 | 69.17 | 101.98% |
Reasoning | MATH-500 (pass@1) | 95.29 | 95.14 | 99.84% |
Reasoning | GPQA Diamond (pass@1) | 65.57 | 65.15 | 99.36% |
Reasoning | Average Score | 76.23 | 76.49 | 100.34% |
OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 63.65 | 63.05 | 99.1% |
OpenLLM V1 | GSM8K (Strict-Match, 5-shot) | 93.03 | 93.03 | 100.0% |
OpenLLM V1 | HellaSwag (Acc-Norm, 10-shot) | 84.85 | 84.71 | 99.8% |
OpenLLM V1 | MMLU (Acc, 5-shot) | 78.04 | 77.45 | 99.3% |
OpenLLM V1 | TruthfulQA (MC2, 0-shot) | 56.67 | 56.62 | 99.9% |
OpenLLM V1 | Winogrande (Acc, 5-shot) | 78.22 | 78.45 | 100.3% |
OpenLLM V1 | Average Score | 75.74 | 75.55 | 99.8% |
OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 42.45 | 42.11 | 99.2% |
OpenLLM V2 | BBH (Acc-Norm, 3-shot) | 21.26 | 19.77 | 93.0% |
OpenLLM V2 | Math-Hard (Exact-Match, 4-shot) | 0.00 | 0.00 | --- |
OpenLLM V2 | GPQA (Acc-Norm, 0-shot) | 9.51 | 6.97 | --- |
OpenLLM V2 | MUSR (Acc-Norm, 0-shot) | 14.87 | 14.60 | --- |
OpenLLM V2 | MMLU-Pro (Acc, 5-shot) | 4.27 | 5.76 | --- |
OpenLLM V2 | Average Score | 15.39 | 14.87 | 96.6% |
Coding | HumanEval (pass@1) | 81.10 | 81.00 | 99.9% |
Coding | HumanEval (pass@10) | 87.60 | 88.60 | 101.1% |
Coding | HumanEval+ (pass@1) | 75.20 | 75.50 | 100.4% |
Coding | HumanEval+ (pass@10) | 83.10 | 84.30 | 101.4% |
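The Recovery column appears to be the quantized score expressed as a percentage of the baseline score; for example, AIME 2024 gives 69.17 / 67.83 ≈ 101.98%. A minimal helper reproducing the column:

```python
# Recovery as reported in the table: quantized score as a percentage of the
# baseline (unquantized) score. Interpretation inferred from the table values.
def recovery(baseline: float, quantized: float) -> float:
    return quantized / baseline * 100

print(f"{recovery(67.83, 69.17):.2f}%")  # AIME 2024 -> 101.98%
```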
Inference Performance
This model achieves up to 1.4x speedup in single-stream deployment and up to 3.0x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.
Benchmarking Command
The command below assumes an OpenAI-compatible vLLM server for the model under test is already running at http://localhost:8000.
guidellm --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
Single-stream performance (measured with vLLM version 0.7.2)
GPU class | Number of GPUs | Model | Average cost reduction | Instruction Following 256 / 128 Latency (s) | Instruction Following 256 / 128 QPD | Multi-turn Chat 512 / 256 Latency (s) | Multi-turn Chat 512 / 256 QPD | Docstring Generation 768 / 128 Latency (s) | Docstring Generation 768 / 128 QPD | RAG 1024 / 128 Latency (s) | RAG 1024 / 128 QPD | Code Completion 256 / 1024 Latency (s) | Code Completion 256 / 1024 QPD | Code Fixing 1024 / 1024 Latency (s) | Code Fixing 1024 / 1024 QPD | Large Summarization 4096 / 512 Latency (s) | Large Summarization 4096 / 512 QPD | Large RAG 10240 / 1536 Latency (s) | Large RAG 10240 / 1536 QPD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A6000 | 4 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 7.4 | 152 | 14.9 | 76 | 7.5 | 149 | 7.7 | 146 | 57.2 | 20 | 58.9 | 19 | 31.9 | 35 | 98.4 | 11 |
A6000 | 2 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | 1.93 | 7.7 | 292 | 15.2 | 148 | 7.8 | 287 | 8.0 | 282 | 60.7 | 37 | 60.2 | 37 | 32.3 | 70 | 104.0 | 22 |
A6000 | 2 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 2.83 | 4.9 | 457 | 10.0 | 225 | 5.5 | 411 | 5.8 | 389 | 38.9 | 58 | 39.2 | 57 | 23.7 | 95 | 76.6 | 29 |
A100 | 2 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 6.4 | 157 | 12.8 | 79 | 6.6 | 153 | 6.7 | 151 | 50.4 | 20 | 50.8 | 20 | 27.0 | 37 | 85.4 | 12 |
A100 | 2 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | 1.48 | 4.1 | 245 | 8.2 | 123 | 4.2 | 238 | 4.3 | 235 | 32.4 | 31 | 32.8 | 31 | 17.6 | 57 | 90.8 | 11 |
A100 | 1 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 2.69 | 4.6 | 440 | 9.2 | 220 | 4.9 | 407 | 5.2 | 389 | 35.3 | 57 | 36.3 | 55 | 21.2 | 95 | 68.1 | 30 |
H100 | 2 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 3.8 | 149 | 7.6 | 74 | 3.9 | 146 | 3.9 | 144 | 30.0 | 19 | 30.4 | 19 | 16.1 | 35 | 56.5 | 10 |
H100 | 2 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic | 1.39 | 2.7 | 210 | 5.3 | 106 | 2.7 | 207 | 2.8 | 203 | 21.1 | 27 | 21.4 | 26 | 11.5 | 49 | 47.2 | 12 |
H100 | 1 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 1.83 | 4.0 | 277 | 7.9 | 138 | 4.1 | 266 | 4.2 | 262 | 31.2 | 35 | 31.8 | 34 | 17.8 | 61 | 61.4 | 18 |
**Use case profiles: prompt tokens / generation tokens
**QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
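Single-stream QPD can be reconstructed (approximately) from latency, GPU count, and the per-GPU hourly price: queries per dollar ≈ 3600 / (latency × number of GPUs × price per GPU-hour). The sketch below uses a hypothetical $0.80 per A6000-hour rather than the exact Lambda Labs rate behind the table, so treat the result as illustrative.

```python
# Illustrative reconstruction of the single-stream QPD metric. The hourly GPU
# price is a hypothetical placeholder, not the exact on-demand rate used for
# the table.
def queries_per_dollar(latency_s: float, num_gpus: int, price_per_gpu_hour: float) -> float:
    queries_per_hour = 3600 / latency_s
    cost_per_hour = num_gpus * price_per_gpu_hour
    return queries_per_hour / cost_per_hour

# Example: 4xA6000 at a hypothetical $0.80/GPU-hour, 7.4 s latency (first table row)
print(round(queries_per_dollar(7.4, 4, 0.80)))  # ~152 queries per dollar
```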
Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
Hardware | Model | Average cost reduction | Instruction Following 256 / 128 Maximum throughput (QPS) | Instruction Following 256 / 128 QPD | Multi-turn Chat 512 / 256 Maximum throughput (QPS) | Multi-turn Chat 512 / 256 QPD | Docstring Generation 768 / 128 Maximum throughput (QPS) | Docstring Generation 768 / 128 QPD | RAG 1024 / 128 Maximum throughput (QPS) | RAG 1024 / 128 QPD | Code Completion 256 / 1024 Maximum throughput (QPS) | Code Completion 256 / 1024 QPD | Code Fixing 1024 / 1024 Maximum throughput (QPS) | Code Fixing 1024 / 1024 QPD | Large Summarization 4096 / 512 Maximum throughput (QPS) | Large Summarization 4096 / 512 QPD | Large RAG 10240 / 1536 Maximum throughput (QPS) | Large RAG 10240 / 1536 QPD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A6000x4 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 3.65 | 4102 | 1.56 | 1757 | 1.90 | 2143 | 1.48 | 1665 | 0.44 | 493 | 0.34 | 380 | 0.22 | 245 | 0.05 | 55 |
A6000x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | 1.76 | 5.89 | 6625 | 2.94 | 3307 | 3.36 | 3775 | 2.59 | 2916 | 0.74 | 828 | 0.53 | 601 | 0.35 | 398 | 0.11 | 120 |
A6000x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 1.48 | 4.91 | 5528 | 2.01 | 2259 | 2.03 | 2280 | 1.12 | 1255 | 1.11 | 1251 | 0.76 | 852 | 0.24 | 267 | 0.07 | 81 |
A100x4 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 10.41 | 5235 | 5.10 | 2565 | 5.50 | 2766 | 4.36 | 2193 | 1.49 | 751 | 1.21 | 607 | 0.89 | 447 | 0.19 | 98 |
A100x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | 1.63 | 18.11 | 9103 | 8.90 | 4477 | 9.41 | 4730 | 7.42 | 3731 | 2.44 | 1229 | 1.89 | 948 | 1.26 | 631 | 0.30 | 149 |
A100x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 1.12 | 12.63 | 6353 | 5.32 | 2673 | 5.58 | 2804 | 4.27 | 2144 | 2.30 | 1158 | 1.45 | 729 | 0.76 | 381 | 0.22 | 110 |
H100x4 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | --- | 14.04 | 2113 | 10.85 | 1634 | 12.25 | 1844 | 9.93 | 1494 | 3.68 | 554 | 2.82 | 425 | 1.81 | 273 | 0.35 | 52 |
H100x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic | 1.78 | 41.44 | 6236 | 19.64 | 2956 | 21.03 | 3166 | 16.72 | 2516 | 6.01 | 904 | 4.46 | 672 | 2.55 | 383 | 0.49 | 74 |
H100x4 | neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | 1.45 | 36.61 | 5509 | 15.12 | 2275 | 16.24 | 2443 | 13.22 | 1990 | 5.48 | 825 | 3.01 | 453 | 2.07 | 312 | 0.43 | 64 |
**Use case profiles: prompt tokens / generation tokens
**QPS: Queries per second.
**QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
License
This project is licensed under the MIT License.

