DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8
A quantized version of DeepSeek-R1-Distill-Qwen-32B, optimized for reduced GPU memory usage and increased compute throughput.
🚀 Quick Start
This quantized model significantly reduces memory usage and improves computational efficiency. You can deploy it quickly with the vLLM backend, as shown in the usage examples below.
✨ Features
- Quantized Model: The weights and activations of this model are quantized to the INT8 data type, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x.
- Efficient Deployment: Can be efficiently deployed using the vLLM backend, which also supports OpenAI-compatible serving (see the serving sketch after this list).
- Good Performance: Achieves high accuracy on various benchmarks, with up to 1.8x speedup in single-stream deployment and up to 2.2x speedup in multi-stream asynchronous deployment.
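As an illustration of OpenAI-compatible serving, the sketch below assumes the model has already been launched with vLLM's server (for example via `vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8`) and that the `openai` Python client is installed; the endpoint URL and API key are placeholders.
```python
# Minimal sketch: query a vLLM OpenAI-compatible endpoint.
# Assumes a server is already running locally, e.g.:
#   vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",
    messages=[{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```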
📦 Installation
To use this model, you need to have the necessary libraries installed. You can install them using pip:
```bash
pip install transformers vllm llmcompressor
```
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Apply the chat template and tokenize each conversation before generation.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```
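DeepSeek-R1 distills typically emit their reasoning between `<think>` and `</think>` tags before the final answer. The snippet below is a minimal sketch for separating the two, assuming the generated text follows that format; it reuses `generated_text` from the example above.
```python
# Minimal sketch: split a DeepSeek-R1-style completion into reasoning and answer.
# Assumes the model wraps its chain of thought in <think>...</think> tags.
def split_reasoning(text: str) -> tuple[str, str]:
    if "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

reasoning, answer = split_reasoning(generated_text[0])
print("Reasoning:", reasoning[:200])
print("Answer:", answer)
```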
Advanced Usage
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Load and preprocess the calibration dataset
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme (SmoothQuant followed by GPTQ)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.01,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
📚 Documentation
Model Overview
| Property | Details |
|---|---|
| Model Type | Qwen2ForCausalLM |
| Input | Text |
| Output | Text |
| Model Optimizations | Weight quantization: INT8; Activation quantization: INT8 |
| Release Date | 2/5/2025 |
| Version | 1.0 |
| Model Developers | Neural Magic |
| Base Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
This model was obtained by quantizing the weights and activations of DeepSeek-R1-Distill-Qwen-32B to the INT8 data type. Only the weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.
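As a worked illustration of symmetric INT8 quantization (a simple round-to-nearest sketch, not the GPTQ/llm-compressor implementation), the snippet below computes one scale per weight output channel and one scale per activation token, maps values into the int8 range, and dequantizes them to inspect the error.
```python
# Illustrative sketch of symmetric INT8 quantization; not the llm-compressor code path.
import torch

def symmetric_quantize(x: torch.Tensor, dim: int):
    # One scale per slice along `dim`, chosen so the largest magnitude maps to 127.
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

weight = torch.randn(4096, 4096)      # [out_features, in_features]
activations = torch.randn(16, 4096)   # [tokens, hidden]

w_q, w_scale = symmetric_quantize(weight, dim=1)       # per output channel
a_q, a_scale = symmetric_quantize(activations, dim=1)  # per token

# Dequantize to inspect the quantization error.
w_hat = w_q.float() * w_scale
print("max weight error:", (weight - w_hat).abs().max().item())
```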
Evaluation
The model was evaluated on OpenLLM Leaderboard V1 and V2, using the following commands:
OpenLLM Leaderboard V1:
```bash
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```
OpenLLM Leaderboard V2:
```bash
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```
Accuracy
| Category | Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | Recovery |
|---|---|---|---|---|
| Reasoning | AIME 2024 (pass@1) | 69.75 | 68.17 | 97.73% |
| | MATH-500 (pass@1) | 95.09 | 94.98 | 99.88% |
| | GPQA Diamond (pass@1) | 64.05 | 64.75 | 101.09% |
| | Average Score | 76.30 | 75.97 | 99.57% |
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 64.59 | 64.08 | 99.2% |
| | GSM8K (Strict-Match, 5-shot) | 82.71 | 83.85 | 101.4% |
| | HellaSwag (Acc-Norm, 10-shot) | 83.80 | 83.66 | 99.8% |
| | MMLU (Acc, 5-shot) | 81.12 | 80.94 | 99.8% |
| | TruthfulQA (MC2, 0-shot) | 58.41 | 58.47 | 100.1% |
| | Winogrande (Acc, 5-shot) | 76.40 | 76.01 | 99.5% |
| | Average Score | 74.51 | 74.50 | 100.0% |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 42.87 | 41.92 | 97.8% |
| | BBH (Acc-Norm, 3-shot) | 57.96 | 58.20 | 100.4% |
| | Math-Hard (Exact-Match, 4-shot) | 0.00 | 0.00 | --- |
| | GPQA (Acc-Norm, 0-shot) | 26.95 | 28.80 | 106.9% |
| | MUSR (Acc-Norm, 0-shot) | 43.95 | 43.95 | 100.0% |
| | MMLU-Pro (Acc, 5-shot) | 49.82 | 49.14 | 98.6% |
| | Average Score | 36.92 | 37.00 | 100.2% |
| Coding | HumanEval (pass@1) | 86.00 | 85.80 | 99.8% |
| | HumanEval (pass@10) | 92.50 | 93.00 | 100.5% |
| | HumanEval+ (pass@1) | 82.00 | 81.80 | 99.8% |
| | HumanEval+ (pass@10) | 88.70 | 89.40 | 100.8% |
Inference Performance
This model achieves up to 1.8x speedup in single-stream deployment and up to 2.2x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.
Benchmarking Command
```bash
guidellm --model neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
```
Single-stream performance (measured with vLLM version 0.7.2)

| GPU class | Number of GPUs | Model | Average cost reduction | Instruction Following 256 / 128 Latency (s) | Instruction Following 256 / 128 QPD | Multi-turn Chat 512 / 256 Latency (s) | Multi-turn Chat 512 / 256 QPD | Docstring Generation 768 / 128 Latency (s) | Docstring Generation 768 / 128 QPD | RAG 1024 / 128 Latency (s) | RAG 1024 / 128 QPD | Code Completion 256 / 1024 Latency (s) | Code Completion 256 / 1024 QPD | Code Fixing 1024 / 1024 Latency (s) | Code Fixing 1024 / 1024 QPD | Large Summarization 4096 / 512 Latency (s) | Large Summarization 4096 / 512 QPD | Large RAG 10240 / 1536 Latency (s) | Large RAG 10240 / 1536 QPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A6000 | 2 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 6.3 | 359 | 12.8 | 176 | 6.5 | 347 | 6.6 | 342 | 49.9 | 45 | 50.8 | 44 | 26.6 | 85 | 83.4 | 27 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | 1.81 | 6.9 | 648 | 13.8 | 325 | 7.2 | 629 | 7.2 | 622 | 54.8 | 82 | 55.6 | 81 | 30.0 | 150 | 94.8 | 47 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 3.07 | 3.9 | 1168 | 7.8 | 580 | 4.3 | 1041 | 4.6 | 975 | 29.7 | 151 | 30.9 | 146 | 19.3 | 233 | 61.4 | 73 |
| A100 | 1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 5.6 | 361 | 11.1 | 180 | 5.7 | 350 | 5.8 | 347 | 44.0 | 46 | 44.7 | 45 | 23.6 | 85 | 73.7 | 27 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | 1.50 | 3.7 | 547 | 7.3 | 275 | 3.8 | 536 | 3.8 | 528 | 29.0 | 69 | 29.5 | 68 | 15.7 | 128 | 53.1 | 38 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 2.30 | 2.2 | 894 | 4.5 | 449 | 2.4 | 831 | 2.5 | 798 | 17.4 | 116 | 18.0 | 112 | 10.5 | 191 | 49.5 | 41 |
| H100 | 1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 3.3 | 327 | 6.7 | 163 | 3.4 | 320 | 3.4 | 317 | 26.6 | 41 | 26.9 | 41 | 14.3 | 77 | 47.8 | 23 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic | 1.52 | 2.2 | 503 | 4.3 | 252 | 2.2 | 490 | 2.3 | 485 | 17.3 | 63 | 17.5 | 63 | 9.5 | 116 | 33.4 | 33 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 1.61 | 2.1 | 532 | 4.1 | 268 | 2.1 | 516 | 2.1 | 513 | 16.1 | 68 | 16.5 | 66 | 9.1 | 120 | 31.9 | 34 |
Use case profiles: prompt tokens / generation tokens
QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
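As a rough illustration of how QPD relates to latency and hourly GPU price for a single-stream deployment (the price below is an assumed placeholder, not a quoted Lambda Labs rate):
```python
# Rough illustration only: queries per dollar (QPD) for a single-stream deployment.
latency_s = 5.6          # seconds per query when serving one request at a time
hourly_cost_usd = 1.80   # assumed placeholder on-demand price for the GPU(s) used
num_gpus = 1

queries_per_hour = 3600 / latency_s
qpd = queries_per_hour / (hourly_cost_usd * num_gpus)
print(f"~{qpd:.0f} queries per dollar")  # ~357 with these assumed numbers
```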
Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

| Hardware | Model | Average cost reduction | Instruction Following 256 / 128 Maximum throughput (QPS) | Instruction Following 256 / 128 QPD | Multi-turn Chat 512 / 256 Maximum throughput (QPS) | Multi-turn Chat 512 / 256 QPD | Docstring Generation 768 / 128 Maximum throughput (QPS) | Docstring Generation 768 / 128 QPD | RAG 1024 / 128 Maximum throughput (QPS) | RAG 1024 / 128 QPD | Code Completion 256 / 1024 Maximum throughput (QPS) | Code Completion 256 / 1024 QPD | Code Fixing 1024 / 1024 Maximum throughput (QPS) | Code Fixing 1024 / 1024 QPD | Large Summarization 4096 / 512 Maximum throughput (QPS) | Large Summarization 4096 / 512 QPD | Large RAG 10240 / 1536 Maximum throughput (QPS) | Large RAG 10240 / 1536 QPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A6000x2 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 6.2 | 13940 | 1.9 | 4348 | 2.7 | 6153 | 2.1 | 4778 | 0.6 | 1382 | 0.4 | 930 | 0.3 | 685 | 0.1 | 124 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | 1.80 | 8.7 | 19492 | 4.2 | 9474 | 4.1 | 9290 | 3.0 | 6802 | 1.2 | 2734 | 0.9 | 1962 | 0.5 | 1177 | 0.1 | 254 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 1.30 | 5.9 | 13366 | 2.5 | 5733 | 2.4 | 5409 | 1.6 | 3525 | 1.2 | 2757 | 0.7 | 1663 | 0.3 | 676 | 0.1 | 214 |
| A100x2 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 12.9 | 13016 | 5.8 | 5848 | 6.3 | 6348 | 5.1 | 5146 | 2.0 | 1988 | 1.5 | 1463 | 0.9 | 869 | 0.2 | 192 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | 1.52 | 21.4 | 21479 | 8.9 | 8948 | 10.6 | 10611 | 8.2 | 8197 | 3.0 | 3018 | 2.0 | 2054 | 1.2 | 1241 | 0.3 | 264 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 1.09 | 13.5 | 13568 | 6.5 | 6509 | 6.0 | 6075 | 4.7 | 4754 | 2.8 | 2790 | 1.6 | 1651 | 0.9 | 862 | 0.2 | 225 |
| H100x2 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 25.5 | 14392 | 12.5 | 7035 | 14.0 | 7877 | 11.3 | 6364 | 3.6 | 2041 | 2.7 | 1549 | 1.9 | 1057 | 0.4 | 200 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic | 1.46 | 46.7 | 25538 | 20.3 | 11082 | 23.3 | 12728 | 18.4 | 10049 | 5.3 | 2881 | 3.7 | 2097 | 2.6 | 1445 | 0.5 | 256 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 1.23 | 36.9 | 20172 | 17.4 | 9500 | 18.0 | 9822 | 14.2 | 7755 | 5.3 | 2900 | 3.3 | 1867 | 2.3 | 1265 | 0.4 | 241 |
Use case profiles: prompt tokens / generation tokens
🔧 Technical Details
The quantization process is based on the GPTQ algorithm, as implemented in the llm-compressor library. Only the weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized using a symmetric per-channel scheme, and activations are quantized using a symmetric per-token scheme.
📄 License
This project is licensed under the MIT License.

