NVIDIA Llama 3.1 405B Instruct FP8 Model
The NVIDIA Llama 3.1 405B Instruct FP8 model is a quantized version of Meta's Llama 3.1 405B Instruct model, offering efficient text generation capabilities.
Quick Start
The NVIDIA Llama 3.1 405B Instruct FP8 model is a quantized auto-regressive language model that uses an optimized transformer architecture. For more details, visit here.
Features
- Quantization: Quantized with the TensorRT Model Optimizer to the FP8 data type, reducing disk size and GPU memory requirements by about 50% (see the sizing note after this list).
- High performance: Achieves a ~1.7x inference speedup on H200.
- Multiple deployment options: Can be deployed with TensorRT-LLM or vLLM.
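As a rough check on the ~50% figure: 405B parameters occupy roughly 810 GB at 16 bits per parameter versus roughly 405 GB at 8 bits, before accounting for the parts of the network that are left unquantized.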
Installation
This model is deployed rather than installed in the traditional sense; follow the steps in the "Usage Examples" section below.
Usage Examples
Deploy with TensorRT-LLM
To deploy the quantized checkpoint with TensorRT-LLM, follow these steps:
Checkpoint conversion
python examples/llama/convert_checkpoint.py --model_dir Llama-3.1-405B-Instruct-FP8 --output_dir /ckpt --use_fp8
Build engines
trtllm-build --checkpoint_dir /ckpt --output_dir /engine
Throughput evaluation
Refer to the TensorRT-LLM benchmarking documentation for details.
Evaluation
| Precision | MMLU | GSM8K (CoT) | ARC Challenge | IFEVAL | TPS |
|-----------|------|-------------|---------------|--------|-----|
| BF16 | 87.3 | 96.8 | 96.9 | 88.6 | 275.0 |
| FP8 | 87.4 | 96.2 | 96.4 | 90.4 | 469.78 |
We benchmarked with tensorrt-llm v0.13 on 8 H200 GPUs, using batch size 1024 with in-flight batching enabled for the throughput measurement. We achieved a ~1.7x speedup with FP8.
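The ~1.7x figure follows directly from the measured throughput in the table above: 469.78 TPS with FP8 versus 275.0 TPS with BF16, a ratio of about 1.71.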
Deploy with vLLM
To deploy the quantized checkpoint with vLLM, follow these steps:
- Install vLLM following the directions here.
- Use the following Python code as an example:
from vllm import LLM, SamplingParams

# FP8-quantized checkpoint and tensor-parallel degree (8 GPUs).
model_id = "nvidia/Llama-3.1-405B-Instruct-FP8"
tp_size = 8
sampling_params = SamplingParams(temperature=0.8, top_p=0.9)
max_model_len = 8192

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Load the checkpoint with ModelOpt quantization support enabled.
llm = LLM(model=model_id, quantization='modelopt', tensor_parallel_size=tp_size, max_model_len=max_model_len)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
This model can also be deployed with an OpenAI-compatible server via the vLLM backend; see the instructions here.
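As a minimal client-side sketch (assuming the server has already been started per the vLLM instructions and is listening on its default local endpoint, http://localhost:8000/v1; the endpoint, API key, and prompt below are illustrative), a request could look like this:

from openai import OpenAI

# Point the standard OpenAI client at the locally running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-405B-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain FP8 quantization in one sentence."}],
    temperature=0.8,
    top_p=0.9,
)
print(response.choices[0].message.content)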
Documentation
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case. See the link to the Non-NVIDIA (Meta-Llama-3.1-405B-Instruct) Model Card.
License/Terms of Use
Model Architecture
| Property | Details |
|----------|---------|
| Architecture Type | Transformers |
| Network Architecture | Llama 3.1 |
Input
| Property | Details |
|----------|---------|
| Input Type(s) | Text |
| Input Format(s) | String |
| Input Parameters | Sequences |
| Other Properties Related to Input | Context length up to 128K |
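Note that the vLLM example above sets max_model_len to 8192 for illustration; to use more of the 128K context window, max_model_len can be raised (for example to 131072), at the cost of additional GPU memory for the KV cache.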
Output
| Property | Details |
|----------|---------|
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | Sequences |
| Other Properties Related to Output | N/A |
Software Integration
| Property | Details |
|----------|---------|
| Supported Runtime Engine(s) | TensorRT-LLM, vLLM |
| Supported Hardware Microarchitecture Compatibility | NVIDIA Blackwell, NVIDIA Hopper, NVIDIA Lovelace |
| Preferred Operating System(s) | Linux |
Model Version(s)
The model is quantized with nvidia-modelopt v0.15.1.
Datasets
Inference
| Property | Details |
|----------|---------|
| Engine | TensorRT-LLM or vLLM |
| Test Hardware | H200 |
Post Training Quantization
This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-405B-Instruct to the FP8 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within the transformer blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. On H200, we achieved a 1.7x speedup.
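For illustration, here is a minimal sketch of a generic TensorRT Model Optimizer (nvidia-modelopt) FP8 post-training quantization flow. This is not the exact recipe used to produce this checkpoint; the source model ID, calibration prompts, and single-process loading are placeholder assumptions (a 405B model would in practice require multi-GPU loading and a representative calibration set).

import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder source checkpoint; the published model quantizes Meta-Llama-3.1-405B-Instruct.
source_id = "meta-llama/Llama-3.1-405B-Instruct"
model = AutoModelForCausalLM.from_pretrained(source_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(source_id)

# Tiny placeholder calibration set; a real run would use a representative corpus.
calib_texts = ["Hello, my name is", "The capital of France is"]

def forward_loop(m):
    # Run calibration data through the model so ModelOpt can collect activation statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Quantize the weights and activations of linear layers to FP8 using the default FP8 config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)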
License