Deepseek R1 Distill Qwen 32B Quantized.w8a8

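Developed by neuralmagic

An INT8-quantized version of DeepSeek-R1-Distill-Qwen-32B. Quantizing both weights and activations reduces VRAM usage and improves compute efficiency.

Large language model

Transformers

Open-source license: MIT #INT8 quantization #Inference acceleration #VRAM optimization

Downloads: 2,324

Release date: 2/5/2025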

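Model Summary

A quantized model based on DeepSeek-R1-Distill-Qwen-32B. Weights and activations are quantized to INT8, which substantially reduces VRAM requirements and improves inference speed.

Model Features

INT8 quantization

Both weights and activations are quantized to INT8, cutting GPU VRAM usage by approximately 50% and increasing matrix-multiply throughput by approximately 2x.

Efficient inference

Supports efficient deployment with the vLLM backend, optimizing inference performance for large language models.

Accuracy retention

After quantization, the average score on each benchmark suite stays above 99% of the original model's. Individual benchmarks range from 97.7% to 106.9% recovery.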

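Model Capabilities

Text generation

Dialogue systems

Code generation

Mathematical reasoning

Use Cases

Dialogue systems

Intelligent customer support

Used to build efficient intelligent customer-support systems that handle user queries.

Supports multi-turn dialogue with fast response times.

Code generation

Programming assistance

Helps developers generate code snippets and solve programming problems.

Achieves 85.8% pass@1 on the HumanEval benchmark.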

license: mit
tags:
- deepseek
- int8
- vllm
- llmcompressor
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
library_name: transformers

DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8

Model Overview

Model Architecture: Qwen2ForCausalLM
- Input: Text
- Output: Text
Model Optimizations:
- Weight quantization: INT8
- Activation quantization: INT8
Release Date: 2/5/2025
Version: 1.0
Model Developers: Neural Magic

Quantized version of DeepSeek-R1-Distill-Qwen-32B.

Model Optimizations

This model was obtained by quantizing the weights and activations of DeepSeek-R1-Distill-Qwen-32B to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
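As a rough back-of-the-envelope check of the 50% figure, the sketch below estimates weight storage at 16 and 8 bits per parameter. The 32-billion parameter count is a round-number assumption taken from the "32B" in the model name, not an exact figure. The estimate also ignores the KV cache, activations, quantization scales, and the layers that stay unquantized.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~32e9 parameters, inferred from the "32B" in the model name.
NUM_PARAMS = 32e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
int8_gb = weight_memory_gb(NUM_PARAMS, 8)
reduction = 1 - int8_gb / bf16_gb

print(f"16-bit weights: {bf16_gb:.0f} GB")
print(f"INT8 weights:   {int8_gb:.0f} GB")
print(f"Reduction:      {reduction:.0%}")
```

Under this assumption the weights alone drop from roughly 64 GB to roughly 32 GB.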

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.
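To illustrate what "symmetric per-channel" and "symmetric per-token" mean, here is a minimal pure-Python sketch. It is not the llm-compressor implementation: the function names and toy matrices are made up for illustration, and it uses plain round-to-nearest rather than GPTQ. The point is only the scale granularity, with one scale per weight output channel and one scale per activation token.

```python
def quantize_rows_symmetric(matrix, num_bits=8):
    """Symmetric quantization with one scale per row (zero-point fixed at 0).

    For weights stored as [out_channels][in_features], a row is an output
    channel (per-channel). For activations stored as [tokens][features],
    a row is a token (per-token).
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        quantized.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map integer codes back to approximate real values."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Hypothetical toy data: 2 output channels and 2 tokens, 3 features each.
weights = [[0.50, -1.27, 0.10], [0.03, 0.01, -0.04]]
activations = [[3.0, -0.5, 1.0], [0.2, 0.1, -0.3]]

w_q, w_scales = quantize_rows_symmetric(weights)      # per-channel
a_q, a_scales = quantize_rows_symmetric(activations)  # per-token

print(w_q, w_scales)
print(a_q, a_scales)
```

Because each row gets its own scale, a channel or token with small values (the second row in each toy matrix) still uses the full INT8 range instead of being crushed by a larger neighbor.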

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
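As a hedged illustration of the OpenAI-compatible route, the sketch below sends a chat completion request using only the Python standard library. It assumes a server has already been started locally (for example with `vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8`) and is listening on port 8000. The helper names `build_chat_request` and `chat` are hypothetical and exist only for this example.

```python
import json
import urllib.request

MODEL = "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8"
# Assumption: a vLLM OpenAI-compatible server is running locally on port 8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "max_tokens": max_tokens,
    }


def chat(prompt: str) -> str:
    """POST the payload to the chat completions endpoint and return the reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Who are you? Please respond in pirate speak!"))` would send the same prompt as the offline example above.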

Creation

This model was created with llm-compressor by running the code snippet below.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
model_name = model_stub.split("/")[-1]

# Calibration settings
num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Render each calibration conversation with the model's chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    # GPTQ, as stated under Model Optimizations; dampening_frac is a GPTQModifier argument
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.01,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds, 
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

Evaluation

The model was evaluated on OpenLLM Leaderboard V1 and V2, using the following commands:

OpenLLM Leaderboard V1:

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config

OpenLLM Leaderboard V2:

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config

Accuracy

Category	Metric	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8	Recovery
Reasoning	AIME 2024 (pass@1)	69.75	68.17	97.73%
	MATH-500 (pass@1)	95.09	94.98	99.88%
	GPQA Diamond (pass@1)	64.05	64.75	101.09%
	Average Score	76.3	75.97	99.57%
OpenLLM V1	ARC-Challenge (Acc-Norm, 25-shot)	64.59	64.08	99.2%
	GSM8K (Strict-Match, 5-shot)	82.71	83.85	101.4%
	HellaSwag (Acc-Norm, 10-shot)	83.80	83.66	99.8%
	MMLU (Acc, 5-shot)	81.12	80.94	99.8%
	TruthfulQA (MC2, 0-shot)	58.41	58.47	100.1%
	Winogrande (Acc, 5-shot)	76.40	76.01	99.5%
	Average Score	74.51	74.50	100.0%
OpenLLM V2	IFEval (Inst Level Strict Acc, 0-shot)	42.87	41.92	97.8%
	BBH (Acc-Norm, 3-shot)	57.96	58.20	100.4%
	Math-Hard (Exact-Match, 4-shot)	0.00	0.00	---
	GPQA (Acc-Norm, 0-shot)	26.95	28.80	106.9%
	MUSR (Acc-Norm, 0-shot)	43.95	43.95	100.0%
	MMLU-Pro (Acc, 5-shot)	49.82	49.14	98.6%
	Average Score	36.92	37.00	100.2%
Coding	HumanEval (pass@1)	86.00	85.80	99.8%
	HumanEval (pass@10)	92.50	93.00	100.5%
	HumanEval+ (pass@1)	82.00	81.80	99.8%
	HumanEval+ (pass@10)	88.70	89.40	100.8%
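The Recovery column is the quantized score expressed as a percentage of the baseline score. A minimal sketch that reproduces the Reasoning rows from the table values (the helper name `recovery` is just for illustration):

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100


# (baseline, quantized) pass@1 scores copied from the Reasoning rows above.
reasoning = {
    "AIME 2024": (69.75, 68.17),
    "MATH-500": (95.09, 94.98),
    "GPQA Diamond": (64.05, 64.75),
}

for name, (base, quant) in reasoning.items():
    print(f"{name}: {recovery(base, quant):.2f}%")
```

A value above 100% means the quantized model scored slightly higher than the baseline on that benchmark.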

Inference Performance

This model achieves up to 1.8x speedup in single-stream deployment and up to 2.2x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.

Benchmarking Command

guidellm --model neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server

Single-stream performance (measured with vLLM version 0.7.2)

				Instruction Following 256 / 128		Multi-turn Chat 512 / 256		Docstring Generation 768 / 128		RAG 1024 / 128		Code Completion 256 / 1024		Code Fixing 1024 / 1024		Large Summarization 4096 / 512		Large RAG 10240 / 1536
GPU class	Number of GPUs	Model	Average cost reduction	Latency (s)	QPD	Latency (s)	QPD	Latency (s)	QPD	Latency (s)	QPD	Latency (s)	QPD	Latency (s)	QPD	Latency (s)	QPD	Latency (s)	QPD
A6000	2	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	---	6.3	359	12.8	176	6.5	347	6.6	342	49.9	45	50.8	44	26.6	85	83.4	27
	1	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8	1.81	6.9	648	13.8	325	7.2	629	7.2	622	54.8	82	55.6	81	30.0	150	94.8	47
	1	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16	3.07	3.9	1168	7.8	580	4.3	1041	4.6	975	29.7	151	30.9	146	19.3	233	61.4	73
A100	1	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	---	5.6	361	11.1	180	5.7	350	5.8	347	44.0	46	44.7	45	23.6	85	73.7	27
	1	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8	1.50	3.7	547	7.3	275	3.8	536	3.8	528	29.0	69	29.5	68	15.7	128	53.1	38
	1	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16	2.30	2.2	894	4.5	449	2.4	831	2.5	798	17.4	116	18.0	112	10.5	191	49.5	41
H100	1	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	---	3.3	327	6.7	163	3.4	320	3.4	317	26.6	41	26.9	41	14.3	77	47.8	23
	1	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic	1.52	2.2	503	4.3	252	2.2	490	2.3	485	17.3	63	17.5	63	9.5	116	33.4	33
	1	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16	1.61	2.1	532	4.1	268	2.1	516	2.1	513	16.1	68	16.5	66	9.1	120	31.9	34

Use case profiles: prompt tokens / generation tokens

QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
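The card does not state how "Average cost reduction" is computed. The sketch below assumes it is the mean, across the eight use cases, of the ratio between a model's QPD and the baseline's QPD on the same GPU class. That reading is an inference from the numbers, not a documented formula, but it reproduces the single-stream A6000 figure for the w8a8 model.

```python
from statistics import mean

# QPD values for the eight use cases, copied from the A6000 rows above.
baseline_qpd = [359, 176, 347, 342, 45, 44, 85, 27]  # 2x A6000, unquantized
w8a8_qpd = [648, 325, 629, 622, 82, 81, 150, 47]     # 1x A6000, W8A8

# Assumed definition: average of per-use-case QPD ratios.
ratios = [q / b for q, b in zip(w8a8_qpd, baseline_qpd)]
avg_cost_reduction = mean(ratios)

print(f"{avg_cost_reduction:.2f}")
```

Note that on the A6000 the gain comes mostly from serving on one GPU instead of two: per-request latency is slightly higher for the w8a8 model, yet each dollar buys about 1.8x as many queries.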

Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

			Instruction Following 256 / 128		Multi-turn Chat 512 / 256		Docstring Generation 768 / 128		RAG 1024 / 128		Code Completion 256 / 1024		Code Fixing 1024 / 1024		Large Summarization 4096 / 512		Large RAG 10240 / 1536
Hardware	Model	Average cost reduction	Maximum throughput (QPS)	QPD	Maximum throughput (QPS)	QPD	Maximum throughput (QPS)	QPD	Maximum throughput (QPS)	QPD	Maximum throughput (QPS)	QPD	Maximum throughput (QPS)	QPD	Maximum throughput (QPS)	QPD	Maximum throughput (QPS)	QPD
A6000x2	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	---	6.2	13940	1.9	4348	2.7	6153	2.1	4778	0.6	1382	0.4	930	0.3	685	0.1	124
	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8	1.80	8.7	19492	4.2	9474	4.1	9290	3.0	6802	1.2	2734	0.9	1962	0.5	1177	0.1	254
	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16	1.30	5.9	13366	2.5	5733	2.4	5409	1.6	3525	1.2	2757	0.7	1663	0.3	676	0.1	214
A100x2	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	---	12.9	13016	5.8	5848	6.3	6348	5.1	5146	2.0	1988	1.5	1463	0.9	869	0.2	192
	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8	1.52	21.4	21479	8.9	8948	10.6	10611	8.2	8197	3.0	3018	2.0	2054	1.2	1241	0.3	264
	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16	1.09	13.5	13568	6.5	6509	6.0	6075	4.7	4754	2.8	2790	1.6	1651	0.9	862	0.2	225
H100x2	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	---	25.5	14392	12.5	7035	14.0	7877	11.3	6364	3.6	2041	2.7	1549	1.9	1057	0.4	200
	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic	1.46	46.7	25538	20.3	11082	23.3	12728	18.4	10049	5.3	2881	3.7	2097	2.6	1445	0.5	256
	neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16	1.23	36.9	20172	17.4	9500	18.0	9822	14.2	7755	5.3	2900	3.3	1867	2.3	1265	0.4	241