DeepSeek R1 AWQ
An AWQ quantization of DeepSeek R1 for optimized text generation.
This project provides the AWQ (Activation-aware Weight Quantization) version of the DeepSeek R1 model, quantized by Eric Hartford and v2ray. Part of the model code was modified to resolve an overflow issue when running in float16.
Quick Start
To serve the model using vLLM with 8x 80GB GPUs, execute the following command:
```bash
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 \
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 12345 \
  --max-model-len 65536 \
  --max-seq-len-to-capture 65536 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --served-model-name deepseek-reasoner \
  --model cognitivecomputations/DeepSeek-R1-AWQ
```
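Once the server is up, you can verify it is reachable. This is a minimal sketch, assuming the host, port, and served model name from the command above:

```bash
# List the models exposed by the vLLM OpenAI-compatible server
# (assumes the server above is running locally on port 12345).
curl http://localhost:12345/v1/models
```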
You can download the wheel built for PyTorch 2.6 and Python 3.12 by clicking here. The benchmark below was conducted using this wheel, which includes 2 PR merges and an unoptimized FlashMLA (still faster than Triton) for A100, significantly enhancing performance. The vLLM repo containing A100 FlashMLA can be found at LagPixelLOL/vllm@sm80_flashmla, a fork of vllm-project/vllm. The A100 FlashMLA it uses is based on LagPixelLOL/FlashMLA@vllm, a fork of pzhao-eng/FlashMLA.
Usage Examples
Basic Usage
For basic usage, serve the model with vLLM using the following command:
```bash
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 \
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 12345 \
  --max-model-len 65536 \
  --max-seq-len-to-capture 65536 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --served-model-name deepseek-reasoner \
  --model cognitivecomputations/DeepSeek-R1-AWQ
```
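As a minimal sketch of querying the served model, assuming the server from the command above is running on localhost:12345 and serving the model under the name deepseek-reasoner:

```bash
# Send a chat completion request to the OpenAI-compatible endpoint.
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-reasoner",
        "messages": [{"role": "user", "content": "Explain AWQ quantization in one sentence."}],
        "max_tokens": 256
      }'
```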
Advanced Usage
For advanced usage, you can download the wheel built for PyTorch 2.6 and Python 3.12 (linked above) and use it for benchmarking or other tasks.
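As an illustrative sketch only: the filename below is a placeholder (use the actual wheel downloaded from the link above), and the install step assumes a Python 3.12 environment with PyTorch 2.6 already set up:

```bash
# Install the downloaded wheel into the serving environment
# (placeholder filename; substitute the actual wheel file).
pip install ./vllm-*.whl
```

After installing, launch the server with the same command as in the Quick Start section.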
Documentation
| Property | Details |
|----------|---------|
| Model Type | AWQ of DeepSeek R1 |
| Base Model | deepseek-ai/DeepSeek-R1 |
| Pipeline Tag | text-generation |
| Library Name | transformers |
TPS (Tokens Per Second) Per Request

| GPU \ Batch, Input, Output | B: 1 I: 2 O: 2K | B: 32 I: 4K O: 256 | B: 1 I: 63K O: 2K | Prefill |
|---|---|---|---|---|
| 8x H100/H200 | 61.5 | 30.1 | 54.3 | 4732.2 |
| 4x H200 | 58.4 | 19.8 | 53.7 | 2653.1 |
| 8x A100 80GB | 46.8 | 12.8 | 30.4 | 2442.4 |
| 8x L40S | 46.3 | OOM | OOM | 688.5 |
Important Notes
- The A100 config uses an unoptimized FlashMLA implementation, which only outperforms the Triton implementation during high-context inference; it would be faster if optimized.
- The L40S config doesn't support FlashMLA, so the Triton implementation is used, which makes it extremely slow with high context. The L40S also has limited VRAM and lacks fast GPU-to-GPU interconnect bandwidth, making it even slower. Serving with this config is not recommended; if you do, you must limit the context to <= 4096, set --gpu-memory-utilization to 0.98, and --max-num-seqs to 4 (see the example command after these notes).
- All GPUs used during the benchmark are the SXM form factor, except the L40S.
- Inference speed will be better than FP8 at low batch sizes but worse than FP8 at high batch sizes; this is the nature of low-bit quantization.
- vLLM now supports MLA for AWQ, allowing you to run this model with the full context length on just 8x 80GB GPUs.
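For reference, here is a sketch of a more constrained launch command for the 8x L40S configuration, applying the limits from the notes above (context <= 4096, --gpu-memory-utilization 0.98, --max-num-seqs 4); the remaining flags mirror the Quick Start command and may need adjustment for your environment:

```bash
# Constrained launch sketch for 8x L40S (limits taken from the notes above).
VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 \
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 12345 \
  --max-model-len 4096 \
  --max-seq-len-to-capture 4096 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.98 \
  --served-model-name deepseek-reasoner \
  --model cognitivecomputations/DeepSeek-R1-AWQ
```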
License
This project is licensed under the MIT license.