Llama-3_1-Nemotron-51B-Instruct Open-source Large Language Model - Balancing accuracy and efficiency, the top choice for commercial use!

Llama-3_1-Nemotron-51B-Instruct

Developed by NVIDIA

Llama-3_1-Nemotron-51B-instruct is a large language model that achieves an excellent balance between model accuracy and efficiency, suitable for commercial use.

Large Language Model

Transformers

Language: English
Open Source License: Other
Tags: Efficient single-GPU inference, Optimized English chat, NAS architecture search

Downloads 65.87k

Release Time: 9/22/2024

Model Overview

This model reduces its memory footprint through a novel neural architecture search (NAS) approach and can serve heavy workloads on a single GPU. It is a general-purpose chat model intended for English and programming languages; other non-English languages are also supported.

Model Features

Balance between efficiency and accuracy

Achieves an excellent balance between model accuracy and efficiency, offering high cost-performance.

Low memory usage

Significantly reduces the model's memory usage through a novel neural architecture search (NAS) method.

Single-GPU support

Can run at high load on a single H100-80GB GPU.

Knowledge distillation optimization

Optimized through knowledge distillation (KD) for English single-turn and multi-turn chat use cases.

Model Capabilities

Text generation

Multi-turn dialogue

Code generation

Multi-language support

Use Cases

Chat applications

English chat

Supports English single-turn and multi-turn chat.

Aligned with human chat preferences.

Non-English chat

Supports chat in other non-English languages.

Coding assistance

Code generation

Supports code generation and programming assistance.

🚀 Llama-3_1-Nemotron-51B-instruct

Llama-3_1-Nemotron-51B-instruct offers an excellent balance between model accuracy and efficiency, with a reduced memory footprint and high throughput, making it suitable for commercial use.

🚀 Quick Start

The code requires the transformers package, version 4.44.2 or higher.

See the snippet below for usage with transformers:

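import torch
import transformers

model_id = "nvidia/Llama-3_1-Nemotron-51B-Instruct"
# trust_remote_code lets transformers load the model's custom code from the Hub repository;
# device_map="auto" places the BF16 weights across the available GPUs.
model_kwargs = {"torch_dtype": torch.bfloat16, "trust_remote_code": True, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Reuse the end-of-sequence token for padding.
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=20,  # short reply; raise this for longer answers
    **model_kwargs
)
# Chat-style input: a list of role/content messages.
print(pipeline([{"role": "user", "content": "Hey how are you?"}]))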
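
For multi-turn chat, the same pipeline can be called again with the growing message list. The sketch below is an illustration only: `chat_turn` and `fake_pipe` are hypothetical helpers, not part of transformers, and the sketch assumes the pipeline returns the input messages followed by the new assistant message under the "generated_text" key, which is how the text-generation pipeline behaves for chat-style input.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(pipe: Callable, history: List[Message], user_text: str) -> List[Message]:
    """Append a user message, call the pipeline, and return the updated history.

    Assumption: `pipe` takes a list of role/content messages and returns
    [{"generated_text": [...input messages..., {"role": "assistant", ...}]}].
    """
    messages = history + [{"role": "user", "content": user_text}]
    result = pipe(messages)
    # The last entry of generated_text is the newly generated assistant reply.
    reply = result[0]["generated_text"][-1]
    return messages + [reply]


# Stand-in for the real pipeline so the sketch runs without downloading the model.
def fake_pipe(messages: List[Message]) -> List[dict]:
    echo = {"role": "assistant", "content": f"(reply to: {messages[-1]['content']})"}
    return [{"generated_text": messages + [echo]}]


history: List[Message] = []
history = chat_turn(fake_pipe, history, "Hey how are you?")
history = chat_turn(fake_pipe, history, "Can you write a haiku about GPUs?")
for message in history:
    print(f"{message['role']}: {message['content']}")
```

With the real model, pass the pipeline object built above in place of `fake_pipe`, and consider a larger `max_new_tokens` than 20 so replies are not cut short.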

✨ Features

Llama-3_1-Nemotron-51B-instruct offers a great tradeoff between model accuracy and efficiency, providing great ‘quality-per-dollar’.
Using a novel Neural Architecture Search (NAS) approach, it greatly reduces the model’s memory footprint, enabling larger workloads and fitting on a single GPU at high workloads (H100-80GB).
This model is ready for commercial use.

📦 Installation

The only stated requirement is the transformers package, version 4.44.2 or higher.
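
A quick way to confirm the version requirement before loading the model is sketched below. The helper names (`version_tuple`, `meets_minimum`, `check_transformers`) are made up for this example, and the comparison looks only at the first three numeric components of the version string.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.44.2"  # minimum version stated in this document


def version_tuple(version: str) -> tuple:
    """Turn '4.44.2' (or '4.45.0.dev0') into a comparable tuple such as (4, 44, 2)."""
    parts = re.findall(r"\d+", version)[:3]
    return tuple(int(p) for p in parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers() -> str:
    """Return a human-readable status for the locally installed package."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    status = "OK" if meets_minimum(installed) else f"too old, need >= {MIN_TRANSFORMERS}"
    return f"transformers {installed}: {status}"


print(check_transformers())
```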

📚 Documentation

Model Overview

Llama-3_1-Nemotron-51B-instruct is a model which offers a great tradeoff between model accuracy and efficiency. Efficiency (throughput) directly translates to price, providing great ‘quality-per-dollar’. Using a novel Neural Architecture Search (NAS) approach we greatly reduce the model’s memory footprint, enabling larger workloads, as well as fitting the model on a single GPU at high workloads (H100-80GB). This NAS approach enables the selection of a desired point in the accuracy-efficiency tradeoff. This model is ready for commercial use.

How was the model developed

Llama-3_1-Nemotron-51B-instruct is a large language model (LLM) derived from Llama-3.1-70B-instruct (the reference model). We use block-wise distillation of the reference model: for each block we create multiple variants that offer different tradeoffs between quality and computational complexity. We then search over the blocks to build a model that meets the required throughput and memory (optimized for a single H100-80GB GPU) while minimizing quality degradation. The model then undergoes knowledge distillation (KD), with a focus on English single-turn and multi-turn chat use cases. The KD step used 40 billion tokens drawn from a mixture of three datasets: FineWeb, Buzz-V1.2, and Dolma.
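
The search step can be pictured as choosing one variant per block so that the total cost stays within a budget while the total quality loss is as small as possible. The toy sketch below illustrates that framing only: the source does not describe the actual search algorithm, so exhaustive enumeration is a stand-in, and the variant names, costs, and loss values are invented.

```python
from itertools import product
from typing import List, NamedTuple, Optional, Tuple


class Variant(NamedTuple):
    name: str
    cost: int          # hypothetical runtime/memory cost, arbitrary integer units
    quality_loss: int  # hypothetical degradation vs. the reference block


def search_blocks(blocks: List[List[Variant]], budget: int) -> Optional[Tuple[Variant, ...]]:
    """Pick one variant per block, minimizing total quality loss within the cost budget.

    Exhaustive enumeration: fine for a toy example, far too slow for a real model.
    """
    best, best_loss = None, None
    for choice in product(*blocks):
        total_cost = sum(v.cost for v in choice)
        total_loss = sum(v.quality_loss for v in choice)
        if total_cost <= budget and (best_loss is None or total_loss < best_loss):
            best, best_loss = choice, total_loss
    return best


full = Variant("full", cost=10, quality_loss=0)
slim = Variant("fewer_kv_heads", cost=6, quality_loss=2)
skip = Variant("no_attention", cost=3, quality_loss=10)
blocks = [[full, slim, skip] for _ in range(4)]

chosen = search_blocks(blocks, budget=28)
print([v.name for v in chosen])
print("cost:", sum(v.cost for v in chosen), "loss:", sum(v.quality_loss for v in chosen))
```

Tightening the budget pushes the search toward cheaper variants at the price of more quality loss, which is the accuracy-efficiency tradeoff the NAS approach is meant to expose.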

This results in a final model that is aligned for human chat preferences.

Model Developers: NVIDIA
Model Input: Text only
Model Output: Text only
Model Dates: Llama-3_1-Nemotron-51B-instruct was trained between August and September 2024
Data Freshness: The pretraining data has a cutoff of 2023
Sequence Length Used During Distillation: 8192

Required Hardware

FP8 Inference (recommended):

1x H100-80GB GPU

BF16 Inference:

2x H100-80GB GPUs
2x A100-80GB GPUs
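
A rough back-of-the-envelope calculation suggests why FP8 fits on one 80 GB GPU while BF16 needs two. The sketch assumes about 51 billion parameters (inferred from the model name, not stated as an exact count), 1 byte per parameter in FP8 and 2 bytes in BF16, and it counts weights only, ignoring the KV cache and activations.

```python
PARAMS = 51e9        # assumption: roughly 51B parameters, taken from the model name
GPU_MEMORY_GB = 80   # nominal memory of an H100-80GB or A100-80GB

BYTES_PER_PARAM = {"FP8": 1, "BF16": 2}


def weight_memory_gb(precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9


for precision in ("FP8", "BF16"):
    weights = weight_memory_gb(precision)
    headroom = GPU_MEMORY_GB - weights
    print(f"{precision}: weights ~{weights:.0f} GB, "
          f"headroom on one {GPU_MEMORY_GB} GB GPU: {headroom:.0f} GB")
```

Under these assumptions the FP8 weights leave room on a single GPU for the KV cache, whereas the BF16 weights alone exceed 80 GB.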

Model Architecture

The model is a derivative of Llama-3.1-70B, using Neural Architecture Search (NAS). The NAS algorithm results in non-standard and non-repetitive blocks. This includes the following:

Variable Grouped Query Attention (VGQA) - each block can have a different number of KV (keys and values) heads, ranging from 1 to Llama’s typical 8.
Skip attention - in some blocks the attention is skipped entirely, or replaced with a single linear layer.
Variable FFN - the expansion/compression ratio in the FFN layer is different between blocks.
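
One way to picture these non-repetitive blocks is as a per-block configuration list. The sketch below is illustrative only: `BlockConfig` and its fields are hypothetical and do not mirror the model's real configuration format, the four-block layouts are invented, and the head dimension of 128 is an assumption carried over from Llama-3.1-70B. It shows how fewer KV heads and skipped attention shrink the per-token KV cache.

```python
from dataclasses import dataclass
from typing import List, Optional

HEAD_DIM = 128        # assumption: head dimension of Llama-3.1-70B
BYTES_PER_VALUE = 2   # BF16 cache entries


@dataclass
class BlockConfig:
    n_kv_heads: Optional[int]  # None: attention skipped or replaced by a linear layer
    ffn_ratio: float           # hypothetical FFN width relative to the reference block


def kv_cache_bytes_per_token(blocks: List[BlockConfig]) -> int:
    """Bytes of KV cache that one token occupies, summed over all blocks."""
    total = 0
    for block in blocks:
        if block.n_kv_heads is None:
            continue  # no attention in this block, so nothing is cached
        total += 2 * block.n_kv_heads * HEAD_DIM * BYTES_PER_VALUE  # keys + values
    return total


# Four reference-style blocks vs. four blocks with mixed, made-up settings.
reference = [BlockConfig(n_kv_heads=8, ffn_ratio=1.0) for _ in range(4)]
searched = [
    BlockConfig(n_kv_heads=8, ffn_ratio=1.0),
    BlockConfig(n_kv_heads=2, ffn_ratio=0.5),
    BlockConfig(n_kv_heads=None, ffn_ratio=0.75),
    BlockConfig(n_kv_heads=1, ffn_ratio=0.25),
]
print(kv_cache_bytes_per_token(reference))
print(kv_cache_bytes_per_token(searched))
```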

Architecture Type: Transformer Decoder (auto-regressive language model)

Software Integration

Runtime Engine(s):

NeMo 24.05

Supported Hardware Architecture Compatibility: NVIDIA H100, A100 80GB (BF16 quantization).

Supported Operating System(s):

Linux

Intended use

Llama-3_1-Nemotron-51B-Instruct is a general purpose chat model intended to be used in English and coding languages. Other non-English languages are also supported.

Evaluation Results

Data Collection Method by dataset: Automated

MT-Bench

Evaluated using select datasets from "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena".

MT-Bench: 8.99

MMLU

Evaluated using the Multi-task Language Understanding benchmarks as introduced in "Measuring Massive Multitask Language Understanding".

MMLU (5-shot): 80.2%

GSM8K

Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in "Training Verifiers to Solve Math Word Problems".

GSM8K (5-shot): 91.43%

Winogrande

Winogrande (5-shot): 84.53%

Arc-C

Arc challenge (25-shot): 69.20%

Hellaswag

Hellaswag (10-shot): 85.58%

Truthful QA

TruthfulQA (0-shot): 58.63%

Limitations

The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive.

The model demonstrates weakness to alignment-breaking attacks. Users are advised to deploy language model guardrails alongside this model to prevent potentially harmful outputs.

Adversarial Testing and Red Teaming Efforts

The Llama-3_1-Nemotron-51B-instruct model underwent extensive safety evaluation including adversarial testing via three distinct methods:

Garak: an automated LLM vulnerability scanner that probes for common weaknesses, including prompt injection and data leakage.
AEGIS: a content safety evaluation dataset and LLM-based content safety classifier model that adheres to a broad taxonomy of 13 categories of critical risks in human-LLM interactions.
Human Content Red Teaming: human interaction with the model and evaluation of its responses.

Inference

Engine: Tensor(RT)
Test Hardware: H100-80GB

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.

🔧 Technical Details

The model uses a novel Neural Architecture Search (NAS) approach to reduce the memory footprint and enable larger workloads. It is built by block-wise distillation of the reference model, followed by knowledge distillation (KD).
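
As a rough illustration of what a knowledge distillation objective can look like, the sketch below computes the KL divergence between a teacher's and a student's next-token distributions. This is an assumption for illustration: the source does not state which KD loss was used, and `softmax` and `kd_loss` are helper names invented for this example.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits: List[float], student_logits: List[float],
            temperature: float = 1.0) -> float:
    """KL(teacher || student) for a single next-token distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 1.0, 0.1]   # made-up logits over a 3-token vocabulary
student = [1.5, 1.2, 0.3]
print(f"loss when student differs: {kd_loss(teacher, student):.4f}")
print(f"loss when student matches: {kd_loss(teacher, teacher):.4f}")
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pulls the student toward the reference model's behavior.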

📄 License

Your use of this model is governed by the NVIDIA Open Model License. Additional Information: Llama 3.1 Community License Agreement. Built with Llama.
