DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8
A quantized version of DeepSeek-R1-Distill-Qwen-32B, optimized for reduced GPU memory usage and increased compute throughput.
🚀 Quick Start
This quantized model significantly reduces memory usage and improves computational efficiency. You can deploy it quickly with the vLLM backend, as shown in the usage examples below.
✨ Features
- Quantized Model: The weights and activations of this model are quantized to the INT8 data type, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x.
- Efficient Deployment: Can be efficiently deployed using the vLLM backend, which also supports OpenAI-compatible serving (see the serving sketch after this list).
- Good Performance: Achieves high accuracy on various benchmarks, with up to 1.8x speedup in single-stream deployment and up to 2.2x speedup in multi-stream asynchronous deployment.
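As an illustration of OpenAI-compatible serving, the sketch below assumes the model has already been launched with vLLM's server (for example via `vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8`) and that the `openai` Python client is installed; the endpoint URL and API key are placeholders.
```python
# Minimal sketch: query a vLLM OpenAI-compatible endpoint.
# Assumes a server is already running locally, e.g.:
#   vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",
    messages=[{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```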
📦 Installation
To use this model, you need to have the necessary libraries installed. You can install them using pip:
```bash
pip install transformers vllm llmcompressor
```
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Apply the chat template and tokenize each conversation before generation.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```
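DeepSeek-R1 distills typically emit their reasoning between `<think>` and `</think>` tags before the final answer. The snippet below is a minimal sketch for separating the two, assuming the generated text follows that format; it reuses `generated_text` from the example above.
```python
# Minimal sketch: split a DeepSeek-R1-style completion into reasoning and answer.
# Assumes the model wraps its chain of thought in <think>...</think> tags.
def split_reasoning(text: str) -> tuple[str, str]:
    if "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

reasoning, answer = split_reasoning(generated_text[0])
print("Reasoning:", reasoning[:200])
print("Answer:", answer)
```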
Advanced Usage
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Load and preprocess the calibration dataset
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme (SmoothQuant followed by GPTQ)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.01,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
📚 Documentation
Model Overview
| Property | Details |
|---|---|
| Model Type | Qwen2ForCausalLM |
| Input | Text |
| Output | Text |
| Model Optimizations | Weight quantization: INT8; Activation quantization: INT8 |
| Release Date | 2/5/2025 |
| Version | 1.0 |
| Model Developers | Neural Magic |
| Base Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
This model was obtained by quantizing the weights and activations of DeepSeek-R1-Distill-Qwen-32B to the INT8 data type. Only the weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.
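As a worked illustration of symmetric INT8 quantization (a simple round-to-nearest sketch, not the GPTQ/llm-compressor implementation), the snippet below computes one scale per weight output channel and one scale per activation token, maps values into the int8 range, and dequantizes them to inspect the error.
```python
# Illustrative sketch of symmetric INT8 quantization; not the llm-compressor code path.
import torch

def symmetric_quantize(x: torch.Tensor, dim: int):
    # One scale per slice along `dim`, chosen so the largest magnitude maps to 127.
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

weight = torch.randn(4096, 4096)      # [out_features, in_features]
activations = torch.randn(16, 4096)   # [tokens, hidden]

w_q, w_scale = symmetric_quantize(weight, dim=1)       # per output channel
a_q, a_scale = symmetric_quantize(activations, dim=1)  # per token

# Dequantize to inspect the quantization error.
w_hat = w_q.float() * w_scale
print("max weight error:", (weight - w_hat).abs().max().item())
```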
Evaluation
The model was evaluated on OpenLLM Leaderboard V1 and V2, using the following commands:
OpenLLM Leaderboard V1:
```bash
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```
OpenLLM Leaderboard V2:
```bash
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```
Accuracy
| Category | Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | Recovery |
|---|---|---|---|---|
| Reasoning | AIME 2024 (pass@1) | 69.75 | 68.17 | 97.73% |
| | MATH-500 (pass@1) | 95.09 | 94.98 | 99.88% |
| | GPQA Diamond (pass@1) | 64.05 | 64.75 | 101.09% |
| | Average Score | 76.30 | 75.97 | 99.57% |
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 64.59 | 64.08 | 99.2% |
| | GSM8K (Strict-Match, 5-shot) | 82.71 | 83.85 | 101.4% |
| | HellaSwag (Acc-Norm, 10-shot) | 83.80 | 83.66 | 99.8% |
| | MMLU (Acc, 5-shot) | 81.12 | 80.94 | 99.8% |
| | TruthfulQA (MC2, 0-shot) | 58.41 | 58.47 | 100.1% |
| | Winogrande (Acc, 5-shot) | 76.40 | 76.01 | 99.5% |
| | Average Score | 74.51 | 74.50 | 100.0% |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 42.87 | 41.92 | 97.8% |
| | BBH (Acc-Norm, 3-shot) | 57.96 | 58.20 | 100.4% |
| | Math-Hard (Exact-Match, 4-shot) | 0.00 | 0.00 | --- |
| | GPQA (Acc-Norm, 0-shot) | 26.95 | 28.80 | 106.9% |
| | MUSR (Acc-Norm, 0-shot) | 43.95 | 43.95 | 100.0% |
| | MMLU-Pro (Acc, 5-shot) | 49.82 | 49.14 | 98.6% |
| | Average Score | 36.92 | 37.00 | 100.2% |
| Coding | HumanEval (pass@1) | 86.00 | 85.80 | 99.8% |
| | HumanEval (pass@10) | 92.50 | 93.00 | 100.5% |
| | HumanEval+ (pass@1) | 82.00 | 81.80 | 99.8% |
| | HumanEval+ (pass@10) | 88.70 | 89.40 | 100.8% |
Inference Performance
This model achieves up to 1.8x speedup in single-stream deployment and up to 2.2x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.
Benchmarking Command
```bash
guidellm --model neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
```
Single-stream performance (measured with vLLM version 0.7.2)

| GPU class | Number of GPUs | Model | Average cost reduction | Instruction Following 256 / 128 Latency (s) | Instruction Following 256 / 128 QPD | Multi-turn Chat 512 / 256 Latency (s) | Multi-turn Chat 512 / 256 QPD | Docstring Generation 768 / 128 Latency (s) | Docstring Generation 768 / 128 QPD | RAG 1024 / 128 Latency (s) | RAG 1024 / 128 QPD | Code Completion 256 / 1024 Latency (s) | Code Completion 256 / 1024 QPD | Code Fixing 1024 / 1024 Latency (s) | Code Fixing 1024 / 1024 QPD | Large Summarization 4096 / 512 Latency (s) | Large Summarization 4096 / 512 QPD | Large RAG 10240 / 1536 Latency (s) | Large RAG 10240 / 1536 QPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A6000 | 2 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 6.3 | 359 | 12.8 | 176 | 6.5 | 347 | 6.6 | 342 | 49.9 | 45 | 50.8 | 44 | 26.6 | 85 | 83.4 | 27 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | 1.81 | 6.9 | 648 | 13.8 | 325 | 7.2 | 629 | 7.2 | 622 | 54.8 | 82 | 55.6 | 81 | 30.0 | 150 | 94.8 | 47 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 3.07 | 3.9 | 1168 | 7.8 | 580 | 4.3 | 1041 | 4.6 | 975 | 29.7 | 151 | 30.9 | 146 | 19.3 | 233 | 61.4 | 73 |
| A100 | 1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 5.6 | 361 | 11.1 | 180 | 5.7 | 350 | 5.8 | 347 | 44.0 | 46 | 44.7 | 45 | 23.6 | 85 | 73.7 | 27 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | 1.50 | 3.7 | 547 | 7.3 | 275 | 3.8 | 536 | 3.8 | 528 | 29.0 | 69 | 29.5 | 68 | 15.7 | 128 | 53.1 | 38 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 2.30 | 2.2 | 894 | 4.5 | 449 | 2.4 | 831 | 2.5 | 798 | 17.4 | 116 | 18.0 | 112 | 10.5 | 191 | 49.5 | 41 |
| H100 | 1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 3.3 | 327 | 6.7 | 163 | 3.4 | 320 | 3.4 | 317 | 26.6 | 41 | 26.9 | 41 | 14.3 | 77 | 47.8 | 23 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic | 1.52 | 2.2 | 503 | 4.3 | 252 | 2.2 | 490 | 2.3 | 485 | 17.3 | 63 | 17.5 | 63 | 9.5 | 116 | 33.4 | 33 |
| | 1 | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 1.61 | 2.1 | 532 | 4.1 | 268 | 2.1 | 516 | 2.1 | 513 | 16.1 | 68 | 16.5 | 66 | 9.1 | 120 | 31.9 | 34 |
Use case profiles: prompt tokens / generation tokens
QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
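As a rough illustration of how QPD relates to latency and hourly GPU price for a single-stream deployment (the price below is an assumed placeholder, not a quoted Lambda Labs rate):
```python
# Rough illustration only: queries per dollar (QPD) for a single-stream deployment.
latency_s = 5.6          # seconds per query when serving one request at a time
hourly_cost_usd = 1.80   # assumed placeholder on-demand price for the GPU(s) used
num_gpus = 1

queries_per_hour = 3600 / latency_s
qpd = queries_per_hour / (hourly_cost_usd * num_gpus)
print(f"~{qpd:.0f} queries per dollar")  # ~357 with these assumed numbers
```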
Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

| Hardware | Model | Average cost reduction | Instruction Following 256 / 128 Maximum throughput (QPS) | Instruction Following 256 / 128 QPD | Multi-turn Chat 512 / 256 Maximum throughput (QPS) | Multi-turn Chat 512 / 256 QPD | Docstring Generation 768 / 128 Maximum throughput (QPS) | Docstring Generation 768 / 128 QPD | RAG 1024 / 128 Maximum throughput (QPS) | RAG 1024 / 128 QPD | Code Completion 256 / 1024 Maximum throughput (QPS) | Code Completion 256 / 1024 QPD | Code Fixing 1024 / 1024 Maximum throughput (QPS) | Code Fixing 1024 / 1024 QPD | Large Summarization 4096 / 512 Maximum throughput (QPS) | Large Summarization 4096 / 512 QPD | Large RAG 10240 / 1536 Maximum throughput (QPS) | Large RAG 10240 / 1536 QPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A6000x2 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 6.2 | 13940 | 1.9 | 4348 | 2.7 | 6153 | 2.1 | 4778 | 0.6 | 1382 | 0.4 | 930 | 0.3 | 685 | 0.1 | 124 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | 1.80 | 8.7 | 19492 | 4.2 | 9474 | 4.1 | 9290 | 3.0 | 6802 | 1.2 | 2734 | 0.9 | 1962 | 0.5 | 1177 | 0.1 | 254 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 1.30 | 5.9 | 13366 | 2.5 | 5733 | 2.4 | 5409 | 1.6 | 3525 | 1.2 | 2757 | 0.7 | 1663 | 0.3 | 676 | 0.1 | 214 |
| A100x2 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 12.9 | 13016 | 5.8 | 5848 | 6.3 | 6348 | 5.1 | 5146 | 2.0 | 1988 | 1.5 | 1463 | 0.9 | 869 | 0.2 | 192 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 | 1.52 | 21.4 | 21479 | 8.9 | 8948 | 10.6 | 10611 | 8.2 | 8197 | 3.0 | 3018 | 2.0 | 2054 | 1.2 | 1241 | 0.3 | 264 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 1.09 | 13.5 | 13568 | 6.5 | 6509 | 6.0 | 6075 | 4.7 | 4754 | 2.8 | 2790 | 1.6 | 1651 | 0.9 | 862 | 0.2 | 225 |
| H100x2 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | --- | 25.5 | 14392 | 12.5 | 7035 | 14.0 | 7877 | 11.3 | 6364 | 3.6 | 2041 | 2.7 | 1549 | 1.9 | 1057 | 0.4 | 200 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic | 1.46 | 46.7 | 25538 | 20.3 | 11082 | 23.3 | 12728 | 18.4 | 10049 | 5.3 | 2881 | 3.7 | 2097 | 2.6 | 1445 | 0.5 | 256 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 | 1.23 | 36.9 | 20172 | 17.4 | 9500 | 18.0 | 9822 | 14.2 | 7755 | 5.3 | 2900 | 3.3 | 1867 | 2.3 | 1265 | 0.4 | 241 |
Use case profiles: prompt tokens / generation tokens
🔧 Technical Details
The quantization process is based on the GPTQ algorithm, as implemented in the llm-compressor library. Only the weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized using a symmetric per-channel scheme, and activations are quantized using a symmetric per-token scheme.
📄 License
This project is licensed under the MIT License.

