DeepSeek R1 AWQ
An AWQ quantization of DeepSeek R1 for optimized text generation.
This project provides the AWQ (Activation-aware Weight Quantization) version of the DeepSeek R1 model, quantized by Eric Hartford and v2ray. Part of the model code was modified to resolve an overflow issue when running in float16.
Quick Start
To serve the model using vLLM with 8x 80GB GPUs, execute the following command:
```bash
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 \
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 12345 \
  --max-model-len 65536 \
  --max-seq-len-to-capture 65536 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --served-model-name deepseek-reasoner \
  --model cognitivecomputations/DeepSeek-R1-AWQ
```
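Once the server is up, you can verify it is reachable. This is a minimal sketch, assuming the host, port, and served model name from the command above:

```bash
# List the models exposed by the vLLM OpenAI-compatible server
# (assumes the server above is running locally on port 12345).
curl http://localhost:12345/v1/models
```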
You can download the wheel built for PyTorch 2.6 and Python 3.12 by clicking here. The benchmark below was conducted using this wheel, which includes 2 PR merges and an unoptimized FlashMLA (still faster than Triton) for A100, significantly enhancing performance. The vLLM repo containing A100 FlashMLA can be found at LagPixelLOL/vllm@sm80_flashmla, a fork of vllm-project/vllm. The A100 FlashMLA it uses is based on LagPixelLOL/FlashMLA@vllm, a fork of pzhao-eng/FlashMLA.
Usage Examples
Basic Usage
For basic usage, serve the model with vLLM using the following command:
```bash
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 \
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 12345 \
  --max-model-len 65536 \
  --max-seq-len-to-capture 65536 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --served-model-name deepseek-reasoner \
  --model cognitivecomputations/DeepSeek-R1-AWQ
```
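As a minimal sketch of querying the served model, assuming the server from the command above is running on localhost:12345 and serving the model under the name deepseek-reasoner:

```bash
# Send a chat completion request to the OpenAI-compatible endpoint.
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-reasoner",
        "messages": [{"role": "user", "content": "Explain AWQ quantization in one sentence."}],
        "max_tokens": 256
      }'
```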
Advanced Usage
For advanced usage, you can download the wheel built for PyTorch 2.6 and Python 3.12 (linked above) and use it for benchmarking or other tasks.
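As an illustrative sketch only: the filename below is a placeholder (use the actual wheel downloaded from the link above), and the install step assumes a Python 3.12 environment with PyTorch 2.6 already set up:

```bash
# Install the downloaded wheel into the serving environment
# (placeholder filename; substitute the actual wheel file).
pip install ./vllm-*.whl
```

After installing, launch the server with the same command as in the Quick Start section.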
Documentation
| Property | Details |
|----------|---------|
| Model Type | AWQ of DeepSeek R1 |
| Base Model | deepseek-ai/DeepSeek-R1 |
| Pipeline Tag | text-generation |
| Library Name | transformers |
TPS (Tokens Per Second) Per Request

| GPU \ Batch, Input, Output | B: 1 I: 2 O: 2K | B: 32 I: 4K O: 256 | B: 1 I: 63K O: 2K | Prefill |
|---|---|---|---|---|
| 8x H100/H200 | 61.5 | 30.1 | 54.3 | 4732.2 |
| 4x H200 | 58.4 | 19.8 | 53.7 | 2653.1 |
| 8x A100 80GB | 46.8 | 12.8 | 30.4 | 2442.4 |
| 8x L40S | 46.3 | OOM | OOM | 688.5 |
Important Notes
- The A100 config uses an unoptimized FlashMLA implementation, which only outperforms the Triton implementation during high-context inference; it would be faster if optimized.
- The L40S config doesn't support FlashMLA, so the Triton implementation is used, which makes it extremely slow with high context. The L40S also has limited VRAM and lacks fast GPU-to-GPU interconnect bandwidth, making it even slower. Serving with this config is not recommended; if you do, you must limit the context to <= 4096, set --gpu-memory-utilization to 0.98, and --max-num-seqs to 4 (see the example command after these notes).
- All GPUs used during the benchmark are the SXM form factor, except the L40S.
- Inference speed will be better than FP8 at low batch sizes but worse than FP8 at high batch sizes; this is the nature of low-bit quantization.
- vLLM now supports MLA for AWQ, allowing you to run this model with the full context length on just 8x 80GB GPUs.
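For reference, here is a sketch of a more constrained launch command for the 8x L40S configuration, applying the limits from the notes above (context <= 4096, --gpu-memory-utilization 0.98, --max-num-seqs 4); the remaining flags mirror the Quick Start command and may need adjustment for your environment:

```bash
# Constrained launch sketch for 8x L40S (limits taken from the notes above).
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 \
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 12345 \
  --max-model-len 4096 \
  --max-seq-len-to-capture 4096 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.98 \
  --served-model-name deepseek-reasoner \
  --model cognitivecomputations/DeepSeek-R1-AWQ
```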
License
This project is licensed under the MIT license.