DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8
This is a quantized version of DeepSeek-R1-Distill-Qwen-14B. Quantizing the weights and activations to INT8 reduces GPU memory requirements and increases compute throughput.
Quick Start
This model can be deployed efficiently using the vLLM backend. Here is a simple example:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Apply the chat template and tokenize each conversation.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
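For example, after starting an OpenAI-compatible server (e.g. with `vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8`), the model can be queried with any OpenAI client. The snippet below is a minimal sketch, assuming the server is listening on the default port 8000:

```python
# Minimal sketch: query a vLLM OpenAI-compatible server with the openai client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key by default

response = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8",
    messages=[{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```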
Features
Model Overview
- Model Architecture: Qwen2ForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
- Weight quantization: INT8
- Activation quantization: INT8
- Release Date: 2/4/2025
- Version: 1.0
- Model Developers: Neural Magic
Model Optimizations
This model was obtained by quantizing the weights and activations of DeepSeek-R1-Distill-Qwen-14B to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
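As a back-of-the-envelope illustration of the memory claim (a sketch only; the parameter count is approximate and the estimate ignores the unquantized lm_head, embeddings, and KV cache):

```python
# Rough weight-memory estimate for a ~14B-parameter model at BF16 vs INT8.
num_params = 14.7e9                    # approximate parameter count (assumption)
bf16_gib = num_params * 2 / 2**30      # 2 bytes per parameter at BF16/FP16
int8_gib = num_params * 1 / 2**30      # 1 byte per parameter at INT8
print(f"BF16 weights: ~{bf16_gib:.1f} GiB, INT8 weights: ~{int8_gib:.1f} GiB "
      f"({int8_gib / bf16_gib:.0%} of the original)")
```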
Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.
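For intuition, the toy sketch below (not the llm-compressor implementation; the `symmetric_quantize` helper is hypothetical) shows what symmetric per-channel weight quantization and symmetric per-token activation quantization look like for a single linear layer:

```python
import torch

def symmetric_quantize(x: torch.Tensor, dim: int):
    """Symmetric INT8 quantization along `dim`: scale = max|x| / 127, no zero point."""
    scale = x.abs().amax(dim=dim, keepdim=True) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

weight = torch.randn(1024, 4096)        # [out_features, in_features]
activations = torch.randn(16, 4096)     # [tokens, in_features]

w_q, w_scale = symmetric_quantize(weight, dim=1)       # one scale per output channel
a_q, a_scale = symmetric_quantize(activations, dim=1)  # one scale per token

# Dequantized INT8 matmul approximates the original floating-point matmul.
approx = (a_q.float() * a_scale) @ (w_q.float() * w_scale).T
print((approx - activations @ weight.T).abs().mean())
```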
Creation
The model was created with the llm-compressor library by running the code below:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Load the calibration dataset and render each example with the chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.1,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
Documentation
Evaluation
The model was evaluated on the OpenLLM Leaderboard V1 and V2 benchmarks using lm-evaluation-harness (lm_eval) with the vLLM backend, via the following commands:
OpenLLM Leaderboard V1:
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
--tasks openllm \
--write_out \
--batch_size auto \
--output_path output_dir \
--show_config
OpenLLM Leaderboard V2:
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
--apply_chat_template \
--fewshot_as_multiturn \
--tasks leaderboard \
--write_out \
--batch_size auto \
--output_path output_dir \
--show_config
Accuracy
Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score.

| Category | Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | Recovery |
|---|---|---|---|---|
| Reasoning | AIME 2024 (pass@1) | 66.67 | 66.31 | 99.46% |
| Reasoning | MATH-500 (pass@1) | 94.66 | 94.68 | 100.02% |
| Reasoning | GPQA Diamond (pass@1) | 59.35 | 58.32 | 98.26% |
| Reasoning | Average Score | 73.56 | 73.10 | 99.37% |
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 58.79 | 57.85 | 98.4% |
| OpenLLM V1 | GSM8K (Strict-Match, 5-shot) | 87.04 | 87.79 | 100.9% |
| OpenLLM V1 | HellaSwag (Acc-Norm, 10-shot) | 81.51 | 81.04 | 99.4% |
| OpenLLM V1 | MMLU (Acc, 5-shot) | 74.46 | 74.26 | 99.7% |
| OpenLLM V1 | TruthfulQA (MC2, 0-shot) | 54.77 | 54.94 | 100.3% |
| OpenLLM V1 | Winogrande (Acc, 5-shot) | 69.38 | 70.48 | 101.6% |
| OpenLLM V1 | Average Score | 70.99 | 71.06 | 100.1% |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 42.11 | 41.62 | 98.6% |
| OpenLLM V2 | BBH (Acc-Norm, 3-shot) | 13.73 | 14.29 | --- |
| OpenLLM V2 | Math-Hard (Exact-Match, 4-shot) | 0.00 | 0.00 | --- |
| OpenLLM V2 | GPQA (Acc-Norm, 0-shot) | 35.07 | 37.22 | 106.2% |
| OpenLLM V2 | MUSR (Acc-Norm, 0-shot) | 45.14 | 43.56 | 96.5% |
| OpenLLM V2 | MMLU-Pro (Acc, 5-shot) | 34.86 | 33.63 | 96.5% |
| OpenLLM V2 | Average Score | 34.21 | 34.12 | 99.7% |
| Coding | HumanEval (pass@1) | 78.90 | 78.40 | 99.4% |
| Coding | HumanEval (pass@10) | 89.80 | 90.10 | 100.3% |
| Coding | HumanEval+ (pass@1) | 72.60 | 72.40 | 99.7% |
| Coding | HumanEval+ (pass@10) | 84.90 | 84.90 | 100.0% |
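The Recovery column follows directly from the two score columns; for instance, a quick check (not an official evaluation script) for AIME 2024 (pass@1):

```python
# Recovery = quantized score / baseline score, e.g. for AIME 2024 (pass@1).
baseline, quantized = 66.67, 66.31
print(f"Recovery: {quantized / baseline:.2%}")  # -> 99.46%
```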
Technical Details
Inference Performance
This model achieves up to 1.6x speedup in both single-stream and multi-stream asynchronous deployment, depending on hardware and use case. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.
Benchmarking Command
guidellm --model neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
The command assumes an OpenAI-compatible vLLM server for the model is already running at http://localhost:8000 (for example, started with `vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8`).
Single-stream performance (measured with vLLM version 0.7.2)
| Hardware | Model | Average cost reduction | Instruction Following (256/128) Latency (s) | Instruction Following (256/128) QPD | Multi-turn Chat (512/256) Latency (s) | Multi-turn Chat (512/256) QPD | Docstring Generation (768/128) Latency (s) | Docstring Generation (768/128) QPD | RAG (1024/128) Latency (s) | RAG (1024/128) QPD | Code Completion (256/1024) Latency (s) | Code Completion (256/1024) QPD | Code Fixing (1024/1024) Latency (s) | Code Fixing (1024/1024) QPD | Large Summarization (4096/512) Latency (s) | Large Summarization (4096/512) QPD | Large RAG (10240/1536) Latency (s) | Large RAG (10240/1536) QPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A6000x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 5.4 | 837 | 10.7 | 419 | 5.5 | 813 | 5.6 | 805 | 42.2 | 107 | 42.8 | 105 | 22.9 | 197 | 71.7 | 63 |
| A6000x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | 1.59 | 3.3 | 1345 | 6.7 | 673 | 3.4 | 1315 | 3.5 | 1296 | 26.5 | 170 | 26.8 | 168 | 14.5 | 310 | 48.3 | 93 |
| A6000x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 2.51 | 2.0 | 2275 | 4.0 | 1127 | 2.2 | 2072 | 2.3 | 1945 | 15.3 | 294 | 15.9 | 283 | 9.9 | 456 | 36.6 | 123 |
| A100x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 2.6 | 765 | 5.2 | 383 | 2.7 | 746 | 2.7 | 732 | 20.8 | 97 | 21.2 | 95 | 11.3 | 179 | 36.7 | 55 |
| A100x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | 1.34 | 1.9 | 1072 | 3.8 | 533 | 1.9 | 1045 | 1.9 | 1032 | 14.8 | 136 | 15.2 | 132 | 8.1 | 248 | 39.6 | 51 |
| A100x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 1.93 | 1.2 | 1627 | 2.5 | 810 | 1.3 | 1530 | 1.4 | 1474 | 9.7 | 208 | 10.2 | 197 | 5.8 | 348 | 37.6 | 53 |
| H100x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 1.6 | 672 | 3.3 | 334 | 1.7 | 662 | 1.7 | 652 | 12.8 | 85 | 13.0 | 84 | 7.0 | 155 | 25.2 | 43 |
| H100x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic | 1.33 | 1.2 | 925 | 2.3 | 467 | 1.2 | 908 | 1.2 | 896 | 9.3 | 118 | 9.5 | 115 | 5.2 | 210 | 23.9 | 46 |
| H100x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 1.37 | 1.2 | 944 | 2.3 | 474 | 1.2 | 931 | 1.2 | 907 | 9.1 | 121 | 9.2 | 119 | 5.1 | 214 | 22.5 | 49 |

**Use case profiles:** prompt tokens / generation tokens.
**QPD:** Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
| Hardware | Model | Average cost reduction | Instruction Following (256/128) Maximum throughput (QPS) | Instruction Following (256/128) QPD | Multi-turn Chat (512/256) Maximum throughput (QPS) | Multi-turn Chat (512/256) QPD | Docstring Generation (768/128) Maximum throughput (QPS) | Docstring Generation (768/128) QPD | RAG (1024/128) Maximum throughput (QPS) | RAG (1024/128) QPD | Code Completion (256/1024) Maximum throughput (QPS) | Code Completion (256/1024) QPD | Code Fixing (1024/1024) Maximum throughput (QPS) | Code Fixing (1024/1024) QPD | Large Summarization (4096/512) Maximum throughput (QPS) | Large Summarization (4096/512) QPD | Large RAG (10240/1536) Maximum throughput (QPS) | Large RAG (10240/1536) QPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A6000x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 13.7 | 30785 | 5.5 | 12327 | 6.5 | 14517 | 5.1 | 11439 | 2.0 | 4434 | 1.3 | 2982 | 0.6 | 1462 | 0.2 | 371 |
| A6000x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | 1.44 | 21.4 | 48181 | 8.2 | 18421 | 9.8 | 22051 | 7.8 | 17462 | 2.8 | 6281 | 1.7 | 3758 | 1.0 | 2335 | 0.2 | 419 |
| A6000x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 0.98 | 12.7 | 28540 | 5.7 | 12796 | 5.4 | 12218 | 3.7 | 8401 | 2.5 | 5583 | 1.3 | 2987 | 0.7 | 1489 | 0.2 | 368 |
| A100x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 15.6 | 31306 | 7.1 | 14192 | 7.7 | 15435 | 6.0 | 11971 | 2.4 | 4878 | 1.6 | 3298 | 0.9 | 1862 | 0.2 | 355 |
| A100x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | 1.31 | 20.8 | 41907 | 9.3 | 18724 | 10.5 | 21043 | 8.4 | 16886 | 3.0 | 5975 | 1.9 | 3917 | 1.2 | 2481 | 0.2 | 464 |
| A100x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 0.94 | 14.0 | 28146 | 6.5 | 13042 | 6.5 | 12987 | 5.1 | 10194 | 2.6 | 5269 | 1.5 | 2925 | 0.9 | 1849 | 0.2 | 382 |
| H100x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 31.4 | 34404 | 14.1 | 15482 | 16.6 | 18149 | 13.3 | 14572 | 4.7 | 5099 | 2.6 | 2849 | 1.9 | 2060 | 0.3 | 347 |
| H100x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic | 1.31 | 40.9 | 44729 | 18.5 | 20260 | 22.1 | 24165 | 18.1 | 19779 | 5.7 | 6246 | 3.4 | 3681 | 2.5 | 2746 | 0.4 | 474 |
| H100x1 | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 1.12 | 33.3 | 36387 | 15.0 | 16453 | 17.6 | 19241 | 14.2 | 15576 | 4.6 | 5034 | 3.0 | 3292 | 2.2 | 2412 | 0.4 | 481 |

**Use case profiles:** prompt tokens / generation tokens.
**QPS:** Queries per second.
**QPD:** Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
License
This project is licensed under the MIT license.