Quantized Llama 3.1 Nemotron 70B Model
This project provides a 4-bit quantized version of the Llama 3.1 Nemotron 70B model, enabling efficient inference with reduced memory requirements. It supports multiple languages and offers various usage methods.
🚀 Quick Start
This repository contains an AWQ 4-bit quantized version of the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF model, an NVIDIA-customized version of meta-llama/Meta-Llama-3.1-70B-Instruct originally released by Meta AI.
✨ Features
- Multi-language Support: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Quantized Model: Quantized from FP16 to INT4 using AutoAWQ, reducing memory usage.
- Multiple Usage Methods: Compatible with transformers, autoawq, text-generation-inference (TGI), and vLLM.
📦 Installation
Transformers
To run the inference with Llama 3.1 Nemotron 70B Instruct AWQ in INT4, you need to install the following packages:
pip install -q --upgrade transformers autoawq accelerate
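To confirm the environment resolved correctly before downloading the weights, a quick import check can be run (a minimal sketch; the version numbers printed will depend on your environment):
# Optional sanity check that the required packages are installed.
from importlib.metadata import version

for pkg in ("transformers", "autoawq", "accelerate"):
    print(pkg, version(pkg))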
AutoAWQ
The installation command is the same as for Transformers:
pip install -q --upgrade transformers autoawq accelerate
Text Generation Inference (TGI)
First, install the necessary Python package and log in to the Hugging Face Hub:
pip install -q --upgrade huggingface_hub
huggingface-cli login
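If you prefer to authenticate from Python rather than the CLI, huggingface_hub also exposes a login helper (a minimal sketch; the token string below is a placeholder for your own access token):
# Programmatic alternative to `huggingface-cli login`.
from huggingface_hub import login

login(token="hf_xxx")  # placeholder; use your own Hugging Face access token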
vLLM
You need to have Docker installed.
💻 Usage Examples
Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
model_id = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
quantization_config = AwqConfig(
bits=4,
fuse_max_seq_len=512, # Note: update this as per your use-case
do_fuse=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="auto",
quantization_config=quantization_config
)
prompt = [
{"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
{"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
prompt,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
return_dict=True,
).to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])
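If you prefer to see tokens as they are produced instead of waiting for the full completion, a TextStreamer can be passed to generate. This optional sketch reuses the model, tokenizer, and inputs objects defined above:
# Stream generated tokens to stdout as they are produced.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, do_sample=True, max_new_tokens=256, streamer=streamer)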
AutoAWQ
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="auto",
)
prompt = [
{"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
{"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
prompt,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
return_dict=True,
).to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])
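To see how much VRAM the quantized checkpoint actually consumed during generation, PyTorch's memory statistics can be queried afterwards (an optional sketch; figures vary with context length and hardware, but should be broadly consistent with the ~35 GiB noted below):
# Report peak GPU memory allocated on each visible device.
import torch

for i in range(torch.cuda.device_count()):
    peak_gib = torch.cuda.max_memory_allocated(i) / 1024**3
    print(f"GPU {i}: peak allocated {peak_gib:.1f} GiB")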
Text Generation Inference (TGI)
Run the TGI Docker container:
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
-v hf_cache:/data \
-e MODEL_ID=ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
-e NUM_SHARD=4 \
-e QUANTIZE=awq \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e MAX_INPUT_LENGTH=4000 \
-e MAX_TOTAL_TOKENS=4096 \
ghcr.io/huggingface/text-generation-inference:2.2.0
Send a request to the deployed TGI endpoint:
curl 0.0.0.0:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
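The same endpoint can be queried from Python with huggingface_hub's InferenceClient instead of curl (a sketch assuming the container above is reachable on port 8080):
# Query the TGI container's chat completions API from Python.
from huggingface_hub import InferenceClient

client = InferenceClient("http://0.0.0.0:8080")
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)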
vLLM
Run the vLLM Docker container:
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
-v hf_cache:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
--tensor-parallel-size 4 \
--max-model-len 4096
Send a request to the deployed vLLM endpoint:
curl 0.0.0.0:8000/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
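Because the vLLM container exposes an OpenAI-compatible API, the openai Python package (installed separately with pip install openai) can be used instead of curl (a sketch assuming the server above is reachable on port 8000):
# Query the vLLM OpenAI-compatible endpoint from Python.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")  # vLLM ignores the key unless --api-key is set
completion = client.chat.completions.create(
    model="ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
print(completion.choices[0].message.content)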
📚 Documentation
Original Model Information
Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.
This model reaches an Arena Hard score of 85.0, AlpacaEval 2 LC of 57.6, and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, metrics known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).
As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
As of 24 Oct 2024, the model has an Elo score of 1267 (±7), rank 9, and a style-controlled rank of 26 on the Chatbot Arena leaderboard.
The original model was trained using RLHF (specifically, REINFORCE), [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward), and HelpSteer2-Preference prompts on a Llama-3.1-70B-Instruct model as the initial policy.
[nvidia/Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) has been converted from [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) to support it in the Hugging Face Transformers codebase. Please note that evaluation results might differ slightly from those of [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) as evaluated in NeMo-Aligner, on which the reported evaluation results are based.
Quantization Reproduction Information
To quantize Llama 3.1 Nemotron 70B Instruct using AutoAWQ, you will need an instance with enough CPU RAM to fit the whole model (~140 GiB) and an NVIDIA GPU with at least 40 GiB of VRAM.
First, install the following packages:
pip install -q --upgrade transformers autoawq accelerate
The quantization was produced using a single node with an Intel Xeon CPU E5-2699A v4 @ 2.40GHz, 256GB of RAM, and 2x NVIDIA RTX 3090 (24GB VRAM each, for a total of 48GB VRAM).
Adapted from the following sources:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch
# Empty Cache
torch.cuda.empty_cache()
# Memory Limits - Set this according to your hardware limits
max_memory = {0: "22GiB", 1: "22GiB", "cpu": "160GiB"}
model_path = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
quant_path = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
# Load model - Note: while this loads the layers into CPU memory, the GPUs (and their VRAM) are still required for quantization! (Verified with nvidia-smi)
model = AutoAWQForCausalLM.from_pretrained(
model_path,
use_cache=False,
max_memory=max_memory,
device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantize
model.quantize(
tokenizer,
quant_config=quant_config
)
# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
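As an optional sanity check, the saved output can be reloaded with AutoAWQ's from_quantized loader (a minimal sketch; adjust device placement to your hardware):
# Reload the quantized weights and tokenizer from quant_path.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_model = AutoAWQForCausalLM.from_quantized(quant_path, device_map="auto")
quant_tokenizer = AutoTokenizer.from_pretrained(quant_path)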
🔧 Technical Details
This model was quantized using [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) from FP16 down to INT4 using GEMM kernels, with zero-point quantization and a group size of 128.
Hardware used: Intel Xeon CPU E5-2699A v4 @ 2.40GHz, 256GB of RAM, and 2x NVIDIA RTX 3090. The quantized model should run on any platform that supports Llama 3.1 70B Instruct AWQ INT4.
📄 License
This model is distributed under the llama3.1 license.
⚠️ Important Note
This repository is an AWQ 4-bit quantized version of the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF model. Note from Terrell: quantization to AWQ 4-bit will further affect evaluation results.
⚠️ Important Note
To run inference with Llama 3.1 Nemotron 70B Instruct AWQ in INT4, around 35 GiB of VRAM is needed just to load the model checkpoint, not including the KV cache or CUDA graphs, so somewhat more than that should be available.
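Before loading the checkpoint, the free VRAM across visible GPUs can be checked with PyTorch (a minimal sketch; it reports free memory only and does not account for KV cache growth during generation):
# Print free/total VRAM per GPU before loading the ~35 GiB checkpoint.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")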
⚠️ Important Note
To quantize Llama 3.1 Nemotron 70B Instruct using AutoAWQ, you will need an instance with enough CPU RAM to fit the whole model (~140 GiB) and an NVIDIA GPU with at least 40 GiB of VRAM.
| Property | Details |
|---|---|
| Model Type | AWQ 4-bit quantized version of Llama-3.1-Nemotron-70B-Instruct-HF |
| Training Data | nvidia/HelpSteer2 |
| Base Model | nvidia/Llama-3.1-Nemotron-70B-Instruct-HF |
| Library Name | transformers |
| Pipeline Tag | text-generation |
| Tags | llama-3.1, meta, autoawq |
| Supported Languages | en, de, fr, it, pt, hi, es, th |

