Llama-3.1-8B-Instruct-FP8 Open-source Language Model - Supports 128K Long-text Dialogue Comprehension and Analysis

Llama 3.1 8B Instruct FP8

Developed by nvidia

FP8 quantized version of Meta Llama 3.1 8B Instruct model, featuring an optimized transformer architecture autoregressive language model with 128K context length support.

Large Language Model

Transformers

#FP8 Quantized Inference #128K Long Context #TensorRT Optimization

Downloads 3,700

Release Time : 8/29/2024

Model Overview

This model is the FP8 quantized version of Meta Llama 3.1 8B Instruct, optimized for TensorRT-LLM and vLLM inference, suitable for text generation tasks.

Model Features

FP8 Quantization

Reduces model disk size and GPU memory requirements by approximately 50% with FP8 quantization technology, achieving 1.3x speedup on H100.

Long Context Support

Supports 128K context length, ideal for long-text processing tasks.

High-Performance Inference

Optimized for TensorRT-LLM and vLLM, delivering efficient inference performance.

Model Capabilities

Text Generation

Long Text Processing

Instruction Following

Use Cases

Content Generation

Article Continuation

Generates coherent article content based on given prompts

Dialogue Systems

Builds intelligent conversational assistants

Education

Problem-Solving Assistance

Helps solve problems in subjects like math and science

Achieves 83.1% accuracy on GSM8K dataset

🚀 NVIDIA Llama 3.1 8B Instruct FP8 Model

The NVIDIA Llama 3.1 8B Instruct FP8 model is a quantized version of Meta's Llama 3.1 8B Instruct model, offering efficient text generation capabilities.

Metadata

Property	Details
Base Model	meta-llama/Llama-3.1-8B-Instruct
License	llama3.1
Pipeline Tag	text-generation
Library Name	transformers

🚀 Quick Start

The NVIDIA Llama 3.1 8B Instruct FP8 model is the quantized version of the Meta's Llama 3.1 8B Instruct model, which is an auto - regressive language model using an optimized transformer architecture. For more information, check here. It is quantized with TensorRT Model Optimizer. This model is ready for both commercial and non - commercial use.

✨ Features

Quantized Model: Reduces disk size and GPU memory requirements by approximately 50% through post - training quantization to FP8.
Multiple Runtime Support: Compatible with Tensor(RT) - LLM and vLLM for inference.
Wide Hardware Compatibility: Supports NVIDIA Blackwell, Hopper, and Lovelace microarchitectures.

📦 Installation

Deploy with TensorRT - LLM

To deploy the quantized checkpoint with TensorRT - LLM, follow these steps:

Checkpoint convertion:

python examples/llama/convert_checkpoint.py --model_dir Llama-3.1-8B-Instruct-FP8 --output_dir /ckpt --use_fp8

Build engines:

trtllm-build --checkpoint_dir /ckpt --output_dir /engine

Throughputs evaluation: Refer to the TensorRT - LLM benchmarking documentation for details.

Deploy with vLLM

To deploy the quantized checkpoint with vLLM, follow these instructions:

Install vLLM from directions here.
When using a Model Optimizer PTQ checkpoint with vLLM, pass the quantization = modelopt flag into the config while initializing the LLM Engine.

💻 Usage Examples

Basic Usage

# Example deployment on H100 with vLLM
from vllm import LLM, SamplingParams

model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

llm = LLM(model=model_id, quantization="modelopt")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

📚 Documentation

Model Overview

Third - Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third - party’s requirements for this application and use case. See the link to the Non - NVIDIA (Meta - Llama - 3.1 - 8B - Instruct) Model Card.

License/Terms of Use

Model Architecture

Architecture Type: Transformers
Network Architecture: Llama3.1

Input

Input Type(s): Text
Input Format(s): String
Input Parameters: Sequences
Other Properties Related to Input: Context length up to 128K

Output

Output Type(s): Text
Output Format: String
Output Parameters: Sequences
Other Properties Related to Output: N/A

Software Integration

Supported Runtime Engine(s):
- Tensor(RT) - LLM
- vLLM
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Lovelace
Preferred Operating System(s):
- Linux

Model Version(s)

The model is quantized with nvidia - modelopt v0.27.0

Datasets

Calibration Dataset: cnn_dailymail
Evaluation Dataset: MMLU

Inference

Engine: Tensor(RT) - LLM or vLLM
Test Hardware: H100

Post Training Quantization

This model was obtained by quantizing the weights and activations of Meta - Llama - 3.1 - 8B - Instruct to FP8 data type, ready for inference with TensorRT - LLM and vLLM. Only the weights and activations of the linear operators within transformers blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. On H100, a 1.3x speedup was achieved.

Evaluation

Precision	MMLU	GSM8K (CoT)	ARC Challenge	IFEVAL	TPS
BF16	69.4	84.5	83.4	80.4	8,579.93
FP8	68.7	83.1	83.3	81.8	11,062.90

We benchmarked with tensorrt - llm v0.13 on 8 H100 GPUs, using a batch size of 1024 for throughputs with in - flight batching enabled. An approximately ~1.3x speedup was achieved with FP8.

Deploy with vLLM

This model can be deployed with an OpenAI Compatible Server via the vLLM backend. See instructions here.

📄 License

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご