DeepSeek-R1-0528-AWQ
AWQ quantization of DeepSeek R1 0528, quantized by [Eric Hartford](https://huggingface.co/ehartford) and [v2ray](https://huggingface.co/v2ray). Compute for this quantization was generously provided by [Hot Aisle](https://hotaisle.xyz/); thank you for supporting the community!
This quantized release modifies some of the model code to fix an overflow issue when using float16.
Quick Start
To deploy the model using vLLM and eight 80GB GPUs, use the following command:
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --max-seq-len-to-capture 65536 --enable-chunked-prefill --enable-prefix-caching --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --served-model-name deepseek-chat --model cognitivecomputations/DeepSeek-R1-0528-AWQ
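Once the server is up, it exposes an OpenAI-compatible API. As a quick sanity check, a minimal sketch assuming the launch command above (server listening on port 12345 of the local machine):

```bash
# List the models exposed by the OpenAI-compatible server started above.
# Assumes it is reachable at localhost:12345; it should report the name
# passed via --served-model-name, i.e. deepseek-chat.
curl http://localhost:12345/v1/models
```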
You can download the wheel I built for PyTorch 2.6 and Python 3.12 by clicking [here](https://huggingface.co/x2ray/wheels/resolve/main/vllm-0.8.3.dev250%2Bg10afedcfd.cu128-cp312-cp312-linux_x86_64.whl). The benchmarks below were run with this wheel. It contains [2 PR merges](https://github.com/vllm-project/vllm/issues?q=is%3Apr+is%3Aopen+author%3Ajinzhen-lin) and an unoptimized FlashMLA for the A100 (still faster than Triton), which greatly improves performance. The vLLM repository containing the A100 FlashMLA is available at [LagPixelLOL/vllm@sm80_flashmla](https://github.com/LagPixelLOL/vllm/tree/sm80_flashmla), a fork of [vllm-project/vllm](https://github.com/vllm-project/vllm). The A100 FlashMLA it uses is based on [LagPixelLOL/FlashMLA@vllm](https://github.com/LagPixelLOL/FlashMLA/tree/vllm), a fork of [pzhao-eng/FlashMLA](https://github.com/pzhao-eng/FlashMLA).
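If you want to use that wheel, a minimal sketch of installing it directly from the URL above (assuming a Python 3.12 environment with PyTorch 2.6 and a compatible CUDA setup, as described):

```bash
# Install the prebuilt vLLM wheel linked above straight from Hugging Face.
# Assumes Python 3.12 and PyTorch 2.6, per the text; adjust your
# environment to match before installing.
pip install "https://huggingface.co/x2ray/wheels/resolve/main/vllm-0.8.3.dev250%2Bg10afedcfd.cu128-cp312-cp312-linux_x86_64.whl"
```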
Usage Examples
Basic Usage
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --max-seq-len-to-capture 65536 --enable-chunked-prefill --enable-prefix-caching --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --served-model-name deepseek-chat --model cognitivecomputations/DeepSeek-R1-0528-AWQ
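With the server running, requests go through the standard OpenAI chat completions endpoint. A minimal sketch, assuming the server is reachable on localhost:12345 and was started with --served-model-name deepseek-chat as above (the prompt and max_tokens are illustrative):

```bash
# Send a simple chat completion request to the OpenAI-compatible endpoint.
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": "Explain AWQ quantization in one paragraph."}],
        "max_tokens": 512
      }'
```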
Detailed Documentation
Tokens per second (TPS) per request
| GPU \ Batch, Input, Output | B: 1, I: 2K, O: 2K | B: 32, I: 4K, O: 256 | B: 1, I: 63K, O: 2K | Prefill (TPS) |
| --- | --- | --- | --- | --- |
| 8x H100/H200 | 61.5 | 30.1 | 54.3 | 4732.2 |
| 4x H200 | 58.4 | 19.8 | 53.7 | 2653.1 |
| 8x A100 80GB | 46.8 | 12.8 | 30.4 | 2442.4 |
| 8x L40S | 46.3 | OOM (out of memory) | OOM (out of memory) | 688.5 |
**Important Note**
- The A100 configuration uses an unoptimized FlashMLA implementation; it only outperforms Triton at high context lengths during inference, and it would be faster still if optimized.
- The L40S configuration does not support FlashMLA, so the Triton implementation is used instead, which makes it extremely slow at high context lengths. The L40S also lacks the memory to hold much context and has no fast GPU-to-GPU interconnect, which slows it down further. Serving with this configuration is not recommended; if you must, limit the context to <= 4096, set `--gpu-memory-utilization` to 0.98, and `--max-num-seqs` to 4 (a sketch of such a launch command follows this list).
- Except for the L40S, all GPUs used in the benchmarks are the SXM form factor.
- Inference speed is better than FP8 at low batch sizes but worse than FP8 at high batch sizes; this is characteristic of low-bit quantization.
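For reference, here is a hypothetical L40S launch command under the constraints described in the notes above (context capped at 4096, `--gpu-memory-utilization` at 0.98, `--max-num-seqs` at 4). This is a sketch derived from the Quick Start command, not a tested configuration:

```bash
# Sketch of a constrained 8x L40S deployment, per the notes above.
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 12345 \
  --max-model-len 4096 --max-seq-len-to-capture 4096 \
  --max-num-seqs 4 \
  --enable-chunked-prefill --enable-prefix-caching --trust-remote-code \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.98 \
  --served-model-name deepseek-chat \
  --model cognitivecomputations/DeepSeek-R1-0528-AWQ
```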
vLLM now supports MLA for AWQ, so you can run this model at full context length with just eight 80GB GPUs.
License
This project is licensed under the MIT License.
Model Information
| Property | Details |
| --- | --- |
| Base model | deepseek-ai/DeepSeek-R1-0528 |
| Task type | Text Generation |
| Library name | transformers |