
DeepSeek-R1-quantized.w4a16

Developed by RedHatAI
INT4 weight-quantized version of DeepSeek-R1, reducing GPU memory and disk space requirements by approximately 50% while maintaining original model performance.
Downloads: 119
Release Time: 4/17/2025

Model Overview

This model is a weight-quantized version of DeepSeek-R1 in which weights are reduced from 8 bits to 4 bits per parameter, significantly lowering resource requirements while preserving the original model's performance. It is suitable for large language model applications that require efficient deployment.
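As a rough sanity check on the ~50% figure: weight storage scales linearly with bit width, so halving the bits per weight halves the footprint. The sketch below uses DeepSeek-R1's published total of about 671B parameters and ignores quantization overhead such as per-group scales.

```python
def weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes at a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

# DeepSeek-R1 has roughly 671B total parameters (published figure).
params = 671e9
fp8_gb = weight_gb(params, 8)   # ~671 GB at 8 bits per weight
int4_gb = weight_gb(params, 4)  # ~335.5 GB at 4 bits per weight
print(f"reduction: {1 - int4_gb / fp8_gb:.0%}")  # prints "reduction: 50%"
```

In practice the saving is slightly below 50% because quantization scales and any unquantized layers add overhead, which is why the card says "approximately".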

Model Features

INT4 weight quantization: weights are reduced from 8-bit to 4-bit, decreasing GPU memory and disk space requirements by approximately 50%
Efficient deployment: supports efficient inference through the vLLM backend, suitable for large-scale production environments
Performance retention: maintains performance close to the original model after quantization
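For the vLLM deployment path mentioned above, a minimal sketch of composing a `vllm serve` launch command. The model ID and tensor-parallel size are assumptions, not values from this card; check the RedHatAI Hugging Face organization for the exact repository name.

```python
# Assumed model ID; verify on the RedHatAI Hugging Face organization page.
MODEL_ID = "RedHatAI/DeepSeek-R1-quantized.w4a16"

def vllm_serve_cmd(model_id: str, tensor_parallel_size: int = 8) -> str:
    """Compose a `vllm serve` command. --tensor-parallel-size shards the
    model across multiple GPUs, which a model of this size still requires
    even after INT4 quantization."""
    return (
        f"vllm serve {model_id} "
        f"--tensor-parallel-size {tensor_parallel_size}"
    )

print(vllm_serve_cmd(MODEL_ID))
```

The resulting command starts an OpenAI-compatible HTTP server; the tensor-parallel degree should match the number of GPUs available on the node.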

Model Capabilities

Text generation
Language understanding
Reasoning task processing

Use Cases

Education (MATH-500): solving complex math problems; achieved 97.08% accuracy on the MATH-500 test
Professional testing (AIME 2024): handling American Invitational Mathematics Examination level problems; achieved 77.00% accuracy on the AIME 2024 test
General knowledge Q&A (MMLU): handling multidisciplinary multiple-choice questions; achieved 86.99% accuracy on the MMLU test