🚀 DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact
This repository provides an Int4 + selectively-Int8 GPTQ quantization of deepseek-ai/DeepSeek-R1-0528. Only layers that are highly sensitive to quantization remain in Int8, while the rest stay Int4, preserving generation quality with minimal file-size overhead.
🚀 Quick Start
Preliminary trials show that converting the entire model to pure Int4 (AWQ/GPTQ) under the quantization layout used in vLLM’s current DeepSeek-R1 implementation degrades inference accuracy and can produce faulty outputs. Layer-wise fine-grained quantization substantially mitigates this issue.
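The layer-wise idea can be illustrated with a small sketch. This is not the repo's actual quantization script, and the layer-name patterns here are hypothetical; it only shows the kind of per-layer bit-width assignment a mixed Int4/Int8 layout implies.

```python
# Illustrative sketch (not the actual quantization script): assign a
# bit-width per layer, keeping quantization-sensitive layers at Int8
# and everything else at Int4. The name patterns are hypothetical.
def choose_bits(layer_name, sensitive_patterns=("down_proj",)):
    """Return 8 for quantization-sensitive layers, 4 otherwise."""
    if any(p in layer_name for p in sensitive_patterns):
        return 8
    return 4

print(choose_bits("model.layers.0.mlp.down_proj"))  # -> 8
print(choose_bits("model.layers.0.mlp.gate_proj"))  # -> 4
```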
⚠️ Important Note
vLLM == 0.9.0 does not yet natively support per-layer quantization for MoE modules. As an interim fix, we added get_moe_quant_method to gptq_marlin.py. Until the upstream PR is merged, please replace the original file with the one provided in this repo.
✨ Features
Variant Overview
| Variant | Characteristics | File Size | Recommended Scenario |
|---|---|---|---|
| Lite | Only the most critical layers upgraded to Int8; size close to pure Int4 | 355 GB | Resource-constrained, lightweight server deployments |
| Compact | More Int8 layers, relatively higher output quality | 414 GB | VRAM-sufficient deployments focused on answer quality (e.g., 8 × A100) |
| Medium | Compact plus fully-Int8 attention layers; high quality with reduced long-context loss | 445 GB | VRAM-rich deployments needing both top answer quality and high concurrency (e.g., 8 × H20) |
Choose the variant that best matches your hardware and quality requirements.
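As a rough sizing aid, the per-GPU weight footprint under tensor parallelism is approximately the file size divided by the GPU count. This simple estimate ignores KV cache and activation memory, which also need headroom; the sizes come from the table above.

```python
# Rough sketch: per-GPU weight footprint under tensor parallelism.
# Ignores KV cache and activation memory; file sizes from the table above.
def per_gpu_weights_gb(file_size_gb, num_gpus):
    return file_size_gb / num_gpus

print(per_gpu_weights_gb(414, 8))  # Compact on 8 GPUs -> 51.75
print(per_gpu_weights_gb(445, 8))  # Medium on 8 GPUs -> 55.625
```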
Model Update Date
2025-05-31
1. First commit
Dependencies
vllm==0.9.0
transformers==4.52.3
⚠️ Important Note
We recommend using the V0 inference mode. Before launching vLLM, set the environment variable:
export VLLM_USE_V1=0
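A full launch might then look like the sketch below. The flags shown are common vLLM serve options, not settings prescribed by this repo; tune the parallelism and context length to your hardware.

```shell
# Illustrative launch for an 8-GPU node (flags are assumptions; adjust
# tensor parallelism and max context length to your deployment).
export VLLM_USE_V1=0
vllm serve QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --trust-remote-code
```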
📦 Installation
Model Download
from huggingface_hub import snapshot_download

# Download the quantized weights into a local cache directory
snapshot_download(
    'QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact',
    cache_dir="local_path",  # replace with your download path
)
💻 Usage Examples
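Once the server is running, it can be queried over the OpenAI-compatible HTTP API that vLLM exposes. The sketch below uses only the standard library; the localhost URL is an assumption for a local deployment, and the sampling settings mirror those used in the evaluation section below (temperature 0.6, top-p 0.95).

```python
# Minimal sketch: query a served model via the OpenAI-compatible
# chat-completions endpoint. The URL is an assumption for a local server.
import json
import urllib.request

MODEL = "QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact"

def build_chat_request(prompt, model=MODEL):
    """Assemble a chat-completions payload with the sampling settings
    from the evaluation section (temperature 0.6, top-p 0.95)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
    }

def send(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload and return the first choice's message content."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(send(build_chat_request("Prove that sqrt(2) is irrational.")))
```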
Base Model Information
Paper Link
Introduction
The DeepSeek R1 model has undergone a minor version upgrade, with the current version being DeepSeek-R1-0528. In the latest update, DeepSeek R1 has significantly improved its depth of reasoning and inference capabilities by leveraging increased computational resources and introducing algorithmic optimization mechanisms during post-training. The model has demonstrated outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic. Its overall performance is now approaching that of leading models, such as O3 and Gemini 2.5 Pro.
Compared to the previous version, the upgraded model shows significant improvements in handling complex reasoning tasks. For instance, in the AIME 2025 test, the model’s accuracy has increased from 70% in the previous version to 87.5% in the current version. This advancement stems from enhanced thinking depth during the reasoning process: in the AIME test set, the previous model used an average of 12K tokens per question, whereas the new version averages 23K tokens per question.
Beyond its improved reasoning capabilities, this version also offers a reduced hallucination rate, enhanced support for function calling, and better experience for vibe coding.
Evaluation Results
For all our models, the maximum generation length is set to 64K tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 16 responses per query to estimate pass@1.
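The estimation procedure above reduces to a simple average: with 16 samples per query, pass@1 is the fraction of correct samples per query, averaged over queries. A minimal sketch:

```python
# Sketch: estimating pass@1 from n sampled responses per query
# (n = 16 in the setup above). pass@1 is the per-query fraction of
# correct samples, averaged over all queries.
def pass_at_1(correct_counts, n=16):
    """correct_counts: number of correct samples (out of n) per query."""
    return sum(c / n for c in correct_counts) / len(correct_counts)

# e.g., three queries with 16, 8, and 0 correct samples out of 16:
print(pass_at_1([16, 8, 0]))  # -> 0.5
```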
| Category | Benchmark (Metric) | DeepSeek R1 | DeepSeek R1 0528 |
|---|---|---|---|
| General | MMLU-Redux (EM) | 92.9 | 93.4 |
| | MMLU-Pro (EM) | 84.0 | 85.0 |
| | GPQA-Diamond (Pass@1) | 71.5 | 81.0 |
| | SimpleQA (Correct) | 30.1 | 27.8 |
| | FRAMES (Acc.) | 82.5 | 83.0 |
| | Humanity's Last Exam (Pass@1) | 8.5 | 17.7 |
| Code | LiveCodeBench (2408 - 2505) (Pass@1) | 63.5 | 73.3 |
| | Codeforces-Div1 (Rating) | 1530 | 1930 |
| | SWE Verified (Resolved) | 49.2 | 57.6 |
| | Aider-Polyglot (Acc.) | 53.3 | 71.6 |
| Math | AIME 2024 (Pass@1) | 79.8 | 91.4 |
| | AIME 2025 (Pass@1) | 70.0 | 87.5 |
| | HMMT 2025 (Pass@1) | 41.7 | 79.4 |
| | CNMO 2024 (Pass@1) | 78.8 | 86.9 |
| Tools | BFCL_v3_MultiTurn (Acc) | - | 37.0 |
| | Tau-Bench (Pass@1) | - | 53.5 (Airline) / 63.9 (Retail) |
Note: We use the Agentless framework to evaluate model performance on SWE-Verified. We evaluate only text-only prompts from the HLE test set. GPT-4.1 is employed to play the user role in the Tau-Bench evaluation.
📚 Documentation
License
This code repository is licensed under the MIT License. The use of DeepSeek-R1 models is also subject to the MIT License. The DeepSeek-R1 series (including Base and Chat) supports commercial use and distillation.
Citation
@misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
author={DeepSeek-AI},
year={2025},
eprint={2501.12948},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.12948},
}
Contact
If you have any questions, please raise an issue or contact us at service@deepseek.com.