Quantized Llama 3.1 Nemotron 70B Model
This project provides a 4-bit quantized version of the Llama 3.1 Nemotron 70B model, enabling efficient inference with reduced memory requirements. It supports multiple languages and offers various usage methods.
🚀 Quick Start
This repository contains an AWQ 4-bit quantized version of the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF model, an NVIDIA-customized version of meta-llama/Meta-Llama-3.1-70B-Instruct originally released by Meta AI.
✨ Features
- Multi-language Support: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Quantized Model: Quantized from FP16 to INT4 using AutoAWQ, reducing memory usage.
- Multiple Usage Methods: Compatible with transformers, autoawq, text-generation-inference (TGI), and vLLM.
📦 Installation
Transformers
To run the inference with Llama 3.1 Nemotron 70B Instruct AWQ in INT4, you need to install the following packages:
pip install -q --upgrade transformers autoawq accelerate
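To confirm the environment resolved correctly before downloading the weights, a quick import check can be run (a minimal sketch; the version numbers printed will depend on your environment):
# Optional sanity check that the required packages are installed.
from importlib.metadata import version

for pkg in ("transformers", "autoawq", "accelerate"):
    print(pkg, version(pkg))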
AutoAWQ
The installation command is the same as for Transformers:
pip install -q --upgrade transformers autoawq accelerate
Text Generation Inference (TGI)
First, install the necessary Python package and log in to the Hugging Face Hub:
pip install -q --upgrade huggingface_hub
huggingface-cli login
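If you prefer to authenticate from Python rather than the CLI, huggingface_hub also exposes a login helper (a minimal sketch; the token string below is a placeholder for your own access token):
# Programmatic alternative to `huggingface-cli login`.
from huggingface_hub import login

login(token="hf_xxx")  # placeholder; use your own Hugging Face access token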
vLLM
You need to have Docker installed.
💻 Usage Examples
Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
model_id = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
quantization_config = AwqConfig(
bits=4,
fuse_max_seq_len=512, # Note: update this as per your use-case
do_fuse=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="auto",
quantization_config=quantization_config
)
prompt = [
{"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
{"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
prompt,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
return_dict=True,
).to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])
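If you prefer to see tokens as they are produced instead of waiting for the full completion, a TextStreamer can be passed to generate. This optional sketch reuses the model, tokenizer, and inputs objects defined above:
# Stream generated tokens to stdout as they are produced.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, do_sample=True, max_new_tokens=256, streamer=streamer)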
AutoAWQ
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="auto",
)
prompt = [
{"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
{"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
prompt,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
return_dict=True,
).to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])
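To see how much VRAM the quantized checkpoint actually consumed during generation, PyTorch's memory statistics can be queried afterwards (an optional sketch; figures vary with context length and hardware, but should be broadly consistent with the ~35 GiB noted below):
# Report peak GPU memory allocated on each visible device.
import torch

for i in range(torch.cuda.device_count()):
    peak_gib = torch.cuda.max_memory_allocated(i) / 1024**3
    print(f"GPU {i}: peak allocated {peak_gib:.1f} GiB")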
Text Generation Inference (TGI)
Run the TGI Docker container:
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
-v hf_cache:/data \
-e MODEL_ID=ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
-e NUM_SHARD=4 \
-e QUANTIZE=awq \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e MAX_INPUT_LENGTH=4000 \
-e MAX_TOTAL_TOKENS=4096 \
ghcr.io/huggingface/text-generation-inference:2.2.0
Send a request to the deployed TGI endpoint:
curl 0.0.0.0:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
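The same endpoint can be queried from Python with huggingface_hub's InferenceClient instead of curl (a sketch assuming the container above is reachable on port 8080):
# Query the TGI container's chat completions API from Python.
from huggingface_hub import InferenceClient

client = InferenceClient("http://0.0.0.0:8080")
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)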
vLLM
Run the vLLM Docker container:
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
-v hf_cache:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
--tensor-parallel-size 4 \
--max-model-len 4096
Send a request to the deployed vLLM endpoint:
curl 0.0.0.0:8000/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
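Because the vLLM container exposes an OpenAI-compatible API, the openai Python package (installed separately with pip install openai) can be used instead of curl (a sketch assuming the server above is reachable on port 8000):
# Query the vLLM OpenAI-compatible endpoint from Python.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")  # vLLM ignores the key unless --api-key is set
completion = client.chat.completions.create(
    model="ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
print(completion.choices[0].message.content)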
📚 Documentation
Original Model Information
Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.
This model reaches an Arena Hard score of 85.0, AlpacaEval 2 LC of 57.6, and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, metrics known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).
As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
As of 24 Oct 2024, the model has an Elo score of 1267 (±7), rank 9, and a style-controlled rank of 26 on the Chatbot Arena leaderboard.
The original model was trained using RLHF (specifically, REINFORCE), [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward), and HelpSteer2-Preference prompts on a Llama-3.1-70B-Instruct model as the initial policy.
[nvidia/Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) has been converted from [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) to support it in the Hugging Face Transformers codebase. Please note that evaluation results might differ slightly from those of [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) as evaluated in NeMo-Aligner, on which the reported evaluation results are based.
Quantization Reproduction Information
To quantize Llama 3.1 Nemotron 70B Instruct using AutoAWQ, you will need an instance with enough CPU RAM to fit the whole model (~140 GiB) and an NVIDIA GPU with at least 40 GiB of VRAM.
First, install the following packages:
pip install -q --upgrade transformers autoawq accelerate
The quantization was produced using a single node with an Intel Xeon CPU E5-2699A v4 @ 2.40GHz, 256GB of RAM, and 2x NVIDIA RTX 3090 (24GB VRAM each, for a total of 48GB VRAM).
Adapted from the following sources:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch
# Empty Cache
torch.cuda.empty_cache()
# Memory Limits - Set this according to your hardware limits
max_memory = {0: "22GiB", 1: "22GiB", "cpu": "160GiB"}
model_path = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
quant_path = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
# Load model - Note: while this loads the layers into CPU memory, the GPUs (and their VRAM) are still required for quantization! (Verified with nvidia-smi)
model = AutoAWQForCausalLM.from_pretrained(
model_path,
use_cache=False,
max_memory=max_memory,
device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantize
model.quantize(
tokenizer,
quant_config=quant_config
)
# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
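As an optional sanity check, the saved output can be reloaded with AutoAWQ's from_quantized loader (a minimal sketch; adjust device placement to your hardware):
# Reload the quantized weights and tokenizer from quant_path.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_model = AutoAWQForCausalLM.from_quantized(quant_path, device_map="auto")
quant_tokenizer = AutoTokenizer.from_pretrained(quant_path)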
🔧 Technical Details
This model was quantized using [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) from FP16 down to INT4 using GEMM kernels, with zero-point quantization and a group size of 128.
Hardware used: Intel Xeon CPU E5-2699A v4 @ 2.40GHz, 256GB of RAM, and 2x NVIDIA RTX 3090. The quantized model should run on any platform that supports Llama 3.1 70B Instruct AWQ INT4.
📄 License
This model is distributed under the llama3.1 license.
⚠️ Important Note
This repository is an AWQ 4-bit quantized version of the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF model. Note from Terrell: quantization to AWQ 4-bit will further affect evaluation results.
⚠️ Important Note
To run inference with Llama 3.1 Nemotron 70B Instruct AWQ in INT4, around 35 GiB of VRAM is needed just to load the model checkpoint, not including the KV cache or CUDA graphs, so somewhat more than that should be available.
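Before loading the checkpoint, the free VRAM across visible GPUs can be checked with PyTorch (a minimal sketch; it reports free memory only and does not account for KV cache growth during generation):
# Print free/total VRAM per GPU before loading the ~35 GiB checkpoint.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")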
⚠️ Important Note
To quantize Llama 3.1 Nemotron 70B Instruct using AutoAWQ, you will need an instance with enough CPU RAM to fit the whole model (~140 GiB) and an NVIDIA GPU with at least 40 GiB of VRAM.
| Property | Details |
|---|---|
| Model Type | AWQ 4-bit quantized version of Llama-3.1-Nemotron-70B-Instruct-HF |
| Training Data | nvidia/HelpSteer2 |
| Base Model | nvidia/Llama-3.1-Nemotron-70B-Instruct-HF |
| Library Name | transformers |
| Pipeline Tag | text-generation |
| Tags | llama-3.1, meta, autoawq |
| Supported Languages | en, de, fr, it, pt, hi, es, th |

