NVIDIA Llama 3.1 405B Instruct FP8 Model
The NVIDIA Llama 3.1 405B Instruct FP8 model is a quantized version of Meta's Llama 3.1 405B Instruct model, offering efficient text generation capabilities.
Quick Start
The NVIDIA Llama 3.1 405B Instruct FP8 model is a quantized auto-regressive language model that uses an optimized transformer architecture. For more details, visit here.
Features
- Quantization: Quantized with the TensorRT Model Optimizer to the FP8 data type, reducing disk size and GPU memory requirements by about 50% (see the sizing note after this list).
- High performance: Achieves a ~1.7x inference speedup on H200.
- Multiple deployment options: Can be deployed with TensorRT-LLM or vLLM.
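As a rough check on the ~50% figure: 405B parameters occupy roughly 810 GB at 16 bits per parameter versus roughly 405 GB at 8 bits, before accounting for the parts of the network that are left unquantized.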
Installation
This model is deployed rather than installed in the traditional sense; follow the steps in the "Usage Examples" section below.
Usage Examples
Deploy with TensorRT-LLM
To deploy the quantized checkpoint with TensorRT-LLM, follow these steps:
Checkpoint conversion
python examples/llama/convert_checkpoint.py --model_dir Llama-3.1-405B-Instruct-FP8 --output_dir /ckpt --use_fp8
Build engines
trtllm-build --checkpoint_dir /ckpt --output_dir /engine
Throughput evaluation
Refer to the TensorRT-LLM benchmarking documentation for details.
Evaluation
| Precision | MMLU | GSM8K (CoT) | ARC Challenge | IFEVAL | TPS |
|-----------|------|-------------|---------------|--------|-----|
| BF16 | 87.3 | 96.8 | 96.9 | 88.6 | 275.0 |
| FP8 | 87.4 | 96.2 | 96.4 | 90.4 | 469.78 |
We benchmarked with tensorrt-llm v0.13 on 8 H200 GPUs, using batch size 1024 with in-flight batching enabled for the throughput measurement. We achieved a ~1.7x speedup with FP8.
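The ~1.7x figure follows directly from the measured throughput in the table above: 469.78 TPS with FP8 versus 275.0 TPS with BF16, a ratio of about 1.71.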
Deploy with vLLM
To deploy the quantized checkpoint with vLLM, follow these steps:
- Install vLLM following the directions here.
- Use the following Python code as an example:
from vllm import LLM, SamplingParams

# FP8-quantized checkpoint and tensor-parallel degree (8 GPUs).
model_id = "nvidia/Llama-3.1-405B-Instruct-FP8"
tp_size = 8
sampling_params = SamplingParams(temperature=0.8, top_p=0.9)
max_model_len = 8192

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Load the checkpoint with ModelOpt quantization support enabled.
llm = LLM(model=model_id, quantization='modelopt', tensor_parallel_size=tp_size, max_model_len=max_model_len)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
This model can also be deployed with an OpenAI-compatible server via the vLLM backend; see the instructions here.
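As a minimal client-side sketch (assuming the server has already been started per the vLLM instructions and is listening on its default local endpoint, http://localhost:8000/v1; the endpoint, API key, and prompt below are illustrative), a request could look like this:

from openai import OpenAI

# Point the standard OpenAI client at the locally running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-405B-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain FP8 quantization in one sentence."}],
    temperature=0.8,
    top_p=0.9,
)
print(response.choices[0].message.content)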
Documentation
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case. See the link to the Non-NVIDIA (Meta-Llama-3.1-405B-Instruct) Model Card.
License/Terms of Use
Model Architecture
| Property | Details |
|----------|---------|
| Architecture Type | Transformers |
| Network Architecture | Llama 3.1 |
Input
| Property | Details |
|----------|---------|
| Input Type(s) | Text |
| Input Format(s) | String |
| Input Parameters | Sequences |
| Other Properties Related to Input | Context length up to 128K |
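Note that the vLLM example above sets max_model_len to 8192 for illustration; to use more of the 128K context window, max_model_len can be raised (for example to 131072), at the cost of additional GPU memory for the KV cache.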
Output
| Property | Details |
|----------|---------|
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | Sequences |
| Other Properties Related to Output | N/A |
Software Integration
| Property | Details |
|----------|---------|
| Supported Runtime Engine(s) | TensorRT-LLM, vLLM |
| Supported Hardware Microarchitecture Compatibility | NVIDIA Blackwell, NVIDIA Hopper, NVIDIA Lovelace |
| Preferred Operating System(s) | Linux |
Model Version(s)
The model is quantized with nvidia-modelopt v0.15.1.
Datasets
Inference
| Property | Details |
|----------|---------|
| Engine | TensorRT-LLM or vLLM |
| Test Hardware | H200 |
Post Training Quantization
This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-405B-Instruct to the FP8 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within the transformer blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. On H200, we achieved a 1.7x speedup.
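For illustration, here is a minimal sketch of a generic TensorRT Model Optimizer (nvidia-modelopt) FP8 post-training quantization flow. This is not the exact recipe used to produce this checkpoint; the source model ID, calibration prompts, and single-process loading are placeholder assumptions (a 405B model would in practice require multi-GPU loading and a representative calibration set).

import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder source checkpoint; the published model quantizes Meta-Llama-3.1-405B-Instruct.
source_id = "meta-llama/Llama-3.1-405B-Instruct"
model = AutoModelForCausalLM.from_pretrained(source_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(source_id)

# Tiny placeholder calibration set; a real run would use a representative corpus.
calib_texts = ["Hello, my name is", "The capital of France is"]

def forward_loop(m):
    # Run calibration data through the model so ModelOpt can collect activation statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Quantize the weights and activations of linear layers to FP8 using the default FP8 config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)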
License