DeepSeek-R1-0528-AWQ
AWQ quantization of DeepSeek R1 0528, quantized by [Eric Hartford](https://huggingface.co/ehartford) and [v2ray](https://huggingface.co/v2ray). Compute for this quantization was generously provided by [Hot Aisle](https://hotaisle.xyz/); thank you for supporting the community!
This quantized release modifies some of the model code to fix an overflow issue when using float16.
Quick Start
To deploy the model using vLLM and eight 80GB GPUs, use the following command:
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --max-seq-len-to-capture 65536 --enable-chunked-prefill --enable-prefix-caching --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --served-model-name deepseek-chat --model cognitivecomputations/DeepSeek-R1-0528-AWQ
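Once the server is up, it exposes an OpenAI-compatible API. As a quick sanity check, a minimal sketch assuming the launch command above (server listening on port 12345 of the local machine):

```bash
# List the models exposed by the OpenAI-compatible server started above.
# Assumes it is reachable at localhost:12345; it should report the name
# passed via --served-model-name, i.e. deepseek-chat.
curl http://localhost:12345/v1/models
```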
You can download the wheel I built for PyTorch 2.6 and Python 3.12 by clicking [here](https://huggingface.co/x2ray/wheels/resolve/main/vllm-0.8.3.dev250%2Bg10afedcfd.cu128-cp312-cp312-linux_x86_64.whl). The benchmarks below were run with this wheel. It contains [2 PR merges](https://github.com/vllm-project/vllm/issues?q=is%3Apr+is%3Aopen+author%3Ajinzhen-lin) and an unoptimized FlashMLA for the A100 (still faster than Triton), which greatly improves performance. The vLLM repository containing the A100 FlashMLA is available at [LagPixelLOL/vllm@sm80_flashmla](https://github.com/LagPixelLOL/vllm/tree/sm80_flashmla), a fork of [vllm-project/vllm](https://github.com/vllm-project/vllm). The A100 FlashMLA it uses is based on [LagPixelLOL/FlashMLA@vllm](https://github.com/LagPixelLOL/FlashMLA/tree/vllm), a fork of [pzhao-eng/FlashMLA](https://github.com/pzhao-eng/FlashMLA).
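If you want to use that wheel, a minimal sketch of installing it directly from the URL above (assuming a Python 3.12 environment with PyTorch 2.6 and a compatible CUDA setup, as described):

```bash
# Install the prebuilt vLLM wheel linked above straight from Hugging Face.
# Assumes Python 3.12 and PyTorch 2.6, per the text; adjust your
# environment to match before installing.
pip install "https://huggingface.co/x2ray/wheels/resolve/main/vllm-0.8.3.dev250%2Bg10afedcfd.cu128-cp312-cp312-linux_x86_64.whl"
```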
Usage Examples
Basic Usage
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --max-seq-len-to-capture 65536 --enable-chunked-prefill --enable-prefix-caching --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --served-model-name deepseek-chat --model cognitivecomputations/DeepSeek-R1-0528-AWQ
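With the server running, requests go through the standard OpenAI chat completions endpoint. A minimal sketch, assuming the server is reachable on localhost:12345 and was started with --served-model-name deepseek-chat as above (the prompt and max_tokens are illustrative):

```bash
# Send a simple chat completion request to the OpenAI-compatible endpoint.
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": "Explain AWQ quantization in one paragraph."}],
        "max_tokens": 512
      }'
```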
Detailed Documentation
Tokens per second (TPS) per request
| GPU \ Batch, Input, Output | B: 1, I: 2K, O: 2K | B: 32, I: 4K, O: 256 | B: 1, I: 63K, O: 2K | Prefill (TPS) |
| --- | --- | --- | --- | --- |
| 8x H100/H200 | 61.5 | 30.1 | 54.3 | 4732.2 |
| 4x H200 | 58.4 | 19.8 | 53.7 | 2653.1 |
| 8x A100 80GB | 46.8 | 12.8 | 30.4 | 2442.4 |
| 8x L40S | 46.3 | OOM (out of memory) | OOM (out of memory) | 688.5 |
**Important Note**
- The A100 configuration uses an unoptimized FlashMLA implementation; it only outperforms Triton at high context lengths during inference, and it would be faster still if optimized.
- The L40S configuration does not support FlashMLA, so the Triton implementation is used instead, which makes it extremely slow at high context lengths. The L40S also lacks the memory to hold much context and has no fast GPU-to-GPU interconnect, which slows it down further. Serving with this configuration is not recommended; if you must, limit the context to <= 4096, set `--gpu-memory-utilization` to 0.98, and `--max-num-seqs` to 4 (a sketch of such a launch command follows this list).
- Except for the L40S, all GPUs used in the benchmarks are the SXM form factor.
- Inference speed is better than FP8 at low batch sizes but worse than FP8 at high batch sizes; this is characteristic of low-bit quantization.
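For reference, here is a hypothetical L40S launch command under the constraints described in the notes above (context capped at 4096, `--gpu-memory-utilization` at 0.98, `--max-num-seqs` at 4). This is a sketch derived from the Quick Start command, not a tested configuration:

```bash
# Sketch of a constrained 8x L40S deployment, per the notes above.
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 12345 \
  --max-model-len 4096 --max-seq-len-to-capture 4096 \
  --max-num-seqs 4 \
  --enable-chunked-prefill --enable-prefix-caching --trust-remote-code \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.98 \
  --served-model-name deepseek-chat \
  --model cognitivecomputations/DeepSeek-R1-0528-AWQ
```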
vLLM now supports MLA for AWQ, so you can run this model at full context length with just eight 80GB GPUs.
License
This project is licensed under the MIT License.
Model Information
| Property | Details |
| --- | --- |
| Base model | deepseek-ai/DeepSeek-R1-0528 |
| Task type | Text Generation |
| Library name | transformers |