# 🚀 NVIDIA DeepSeek R1 FP4 Model
The NVIDIA DeepSeek R1 FP4 model is a quantized version of DeepSeek AI's DeepSeek R1, an auto-regressive language model with an optimized transformer architecture. It was quantized using the [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) and is available for both commercial and non-commercial use.
## 🚀 Quick Start
### Deploy with TensorRT-LLM
To deploy the quantized FP4 checkpoint with the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) LLM API, use the following example code (requires 8x B200 GPUs and TensorRT-LLM built from source from the latest main branch):
```python
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(max_tokens=32)

    # Load the FP4 checkpoint across 8 GPUs with tensor parallelism.
    llm = LLM(model="nvidia/DeepSeek-R1-FP4", tensor_parallel_size=8, enable_attention_dp=True)

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == '__main__':
    main()
```
## ✨ Features
- Quantized Model: Quantized to the FP4 data type, reducing disk size and GPU memory requirements.
- Commercial Use: Available for both commercial and non-commercial applications.
- High-Performance Runtime: Supported by the TensorRT-LLM runtime engine.
## 📦 Installation
No separate installation steps are required beyond TensorRT-LLM itself, which must be built from source from the latest main branch (see the Quick Start section above).
## 💻 Usage Examples
### Basic Usage
Basic usage is demonstrated in the deployment example above, which generates text for a batch of prompts with TensorRT-LLM. Decoding behavior can be tuned through `SamplingParams`, as sketched below.
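As a minimal sketch of tuning decoding (assuming `SamplingParams` exposes the common `temperature` and `top_p` fields in addition to the `max_tokens` used above; verify the exact signature against your TensorRT-LLM build):

```python
from tensorrt_llm import SamplingParams

# Assumption: temperature/top_p are accepted by SamplingParams in your
# TensorRT-LLM version; only max_tokens appears in the example above.
greedy = SamplingParams(max_tokens=32, temperature=0.0)                # deterministic decoding
sampled = SamplingParams(max_tokens=128, temperature=0.8, top_p=0.95)  # nucleus sampling
```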
### Evaluation
The accuracy benchmark results are presented in the table below:
| Precision | MMLU | GSM8K | AIME2024 | GPQA Diamond | MATH-500 |
|-----------|------|-------|----------|--------------|----------|
| FP8       | 90.8 | 96.3  | 80.0     | 69.7         | 95.4     |
| FP4       | 90.7 | 96.1  | 80.0     | 69.2         | 94.2     |
## 📚 Documentation
### Model Overview
The NVIDIA DeepSeek R1 FP4 model is the quantized version of DeepSeek AI's DeepSeek R1, which uses an optimized transformer architecture. For more information, see the [DeepSeek R1 model card](https://huggingface.co/deepseek-ai/DeepSeek-R1).
### Third-Party Community Consideration
This model is not owned or developed by NVIDIA; it was developed and built to a third-party's requirements. See the Non-NVIDIA [DeepSeek R1 Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1).
### Model Architecture
| Property | Details |
|----------|---------|
| Model Type | Transformers |
| Network Architecture | DeepSeek R1 |
### Input
| Property | Details |
|----------|---------|
| Input Type(s) | Text |
| Input Format(s) | String |
| Input Parameters | 1D (One Dimensional): Sequences |
| Other Properties Related to Input | Context length up to 128K |
### Output
| Property | Details |
|----------|---------|
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | 1D (One Dimensional): Sequences |
| Other Properties Related to Output | N/A |
### Software Integration
| Property | Details |
|----------|---------|
| Supported Runtime Engine(s) | TensorRT-LLM |
| Supported Hardware Microarchitecture Compatibility | NVIDIA Blackwell |
| Preferred Operating System(s) | Linux |
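Before launching the 8-GPU deployment from the Quick Start, a quick preflight check can catch mismatched hardware early. This is a sketch assuming Blackwell parts such as B200 report CUDA compute capability 10.x; verify against your driver and CUDA stack.

```python
import torch

# Sketch: verify GPU count and architecture before the 8-way tensor-parallel run.
# Assumption: Blackwell (e.g., B200) reports compute capability major == 10.
n_gpus = torch.cuda.device_count()
assert n_gpus >= 8, f"tensor_parallel_size=8 needs 8 GPUs, found {n_gpus}"
major, minor = torch.cuda.get_device_capability(0)
assert major >= 10, f"FP4 inference targets Blackwell; found sm_{major}{minor}"
print(f"OK: {n_gpus} GPUs, compute capability {major}.{minor}")
```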
### Model Version(s)
The model is quantized with nvidia-modelopt v0.23.0.
### Datasets
| Dataset Type | Details |
|--------------|---------|
| Calibration Dataset | cnn_dailymail (data collection method: Automated; labeling method: Unknown) |
| Evaluation Dataset | MMLU (data collection method: Unknown; labeling method: N/A) |
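For context, calibration-based post-training quantization with TensorRT Model Optimizer generally follows the pattern below. This is an illustrative sketch only: `load_model()` and `calib_loader` are hypothetical placeholders, and the `NVFP4_DEFAULT_CFG` config name is an assumption to check against the modelopt release in use (this checkpoint was produced with nvidia-modelopt v0.23.0).

```python
import modelopt.torch.quantization as mtq

# Illustrative sketch of FP4 post-training quantization with ModelOpt.
# `load_model()` and `calib_loader` are hypothetical placeholders, and
# NVFP4_DEFAULT_CFG is an assumed config name; consult the TensorRT Model
# Optimizer docs for the release you use.
model = load_model()  # hypothetical: the full-precision DeepSeek R1

def forward_loop(m):
    # Feed calibration samples (e.g., cnn_dailymail) through the model so
    # activation ranges can be observed before choosing FP4 scales.
    for batch in calib_loader:  # hypothetical dataloader
        m(**batch)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```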
### Inference
| Property | Details |
|----------|---------|
| Engine | TensorRT-LLM |
| Test Hardware | B200 |
### Post Training Quantization
This model was obtained by quantizing the weights and activations of DeepSeek R1 to the FP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within the transformer blocks are quantized. This optimization reduces the number of bits per parameter from 8 to 4, cutting disk size and GPU memory requirements by approximately 1.6x.
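A back-of-the-envelope check of the ~1.6x figure (a sketch; the fraction of parameters sitting in quantized linear layers is an illustrative assumption, not a published number): if a fraction f of parameters drops from 8 to 4 bits while the rest stays at 8 bits, the overall reduction is 1 / (f/2 + (1 - f)), which reaches 1.6x at f ≈ 0.75.

```python
# Sketch: overall size reduction when a fraction f of parameters is quantized
# from 8 bits to 4 bits and the remaining (1 - f) stays at 8 bits.
def compression_ratio(f: float) -> float:
    return 1.0 / (0.5 * f + (1.0 - f))

for f in (0.5, 0.75, 0.9):
    print(f"quantized fraction {f:.2f} -> {compression_ratio(f):.2f}x smaller")
# f = 0.75 yields exactly 1.6x, consistent with the ~1.6x reported above.
```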
### Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and has established policies and practices for AI development. Developers should ensure this model meets industry requirements and address potential misuse. Report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## 🔧 Technical Details
The quantization process converts the weights and activations of the DeepSeek R1 model to the FP4 data type. Only the linear operators within the transformer blocks have their weights and activations quantized, reducing the bits per parameter from 8 to 4 and yielding an approximately 1.6x reduction in disk size and GPU memory requirements.
## 📄 License
This model is released under the MIT license.