🚀 AWQ Int4 Quantization of Llama-3.1-Nemotron-70B-Instruct
This project provides an AWQ Int4 quantization of Llama-3.1-Nemotron-70B-Instruct, reducing the memory footprint of the large language model and improving its inference efficiency.
🚀 Quick Start
Prerequisites
To use this model, you need 2 or more 80GB GPUs (NVIDIA Ampere or newer) and at least 150GB of free disk space to accommodate the download. This code has been tested with Transformers v4.44.0, torch v2.4.0, and 2 A100 80GB GPUs, but any setup that supports meta-llama/Llama-3.1-70B-Instruct should support this model as well. If you run into problems, consider running `pip install -U transformers`.
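If you want to verify that your machine meets these requirements before downloading, a quick check such as the following can help (a minimal sketch; the GPU and disk-space figures simply mirror the prerequisites above):

```python
import shutil
import torch

# List visible GPUs and their memory (2 or more 80GB GPUs are recommended).
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")

# Check free disk space in the current directory (~150GB is needed for the download).
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space: {free_gb:.0f} GB")
```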
Usage Example
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

# Load the model in bfloat16 and shard it across all available GPUs.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in strawberry?"
messages = [{"role": "user", "content": prompt}]

# Apply the chat template and generate a response.
tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=4096, pad_token_id=tokenizer.eos_token_id)

# Strip the prompt tokens and decode only the newly generated text.
generated_tokens = response_token_ids[:, len(tokenized_message['input_ids'][0]):]
generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_text)
```
✨ Features
- Multi-language support: Supports multiple languages including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- High-performance quantization: Uses AWQ Int4 quantization to optimize the model.
- Good performance: Achieves high scores on multiple evaluation benchmarks, outperforming many strong models.
📦 Installation
No installation steps beyond the general requirements above are needed for inference. Reproducing the quantization (see the usage examples below) additionally requires the AutoAWQ library, typically installed with `pip install autoawq`.
💻 Usage Examples
Basic Usage
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "nvidia/Llama-3.1-Nemotron-70B-Instruct/"
quant_path = "./quantized/Llama-3.1-Nemotron-70B-Instruct-AWQ-INT4"

# AWQ quantization configuration: 4-bit weights, group size 128, with zero points.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Load the full-precision model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration and quantize the weights.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```
Advanced Usage
The basic example above can serve as a starting point for more complex scenarios: you can adjust the quantization configuration to trade accuracy against size and speed, for example by changing `q_group_size`, `w_bit`, or `zero_point`, or by supplying your own calibration data (see the AutoAWQ documentation). A sketch for loading the resulting quantized checkpoint is shown below.
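Once the quantized checkpoint has been saved, it can be loaded back for inference. The following is a minimal sketch, assuming the local `quant_path` produced by the example above and a recent Transformers release with AutoAWQ installed (so the AWQ quantization config stored in the checkpoint is picked up automatically):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local path produced by the quantization example above (assumed, not a published repo).
quant_path = "./quantized/Llama-3.1-Nemotron-70B-Instruct-AWQ-INT4"

# Transformers detects the AWQ config saved with the checkpoint and loads the Int4 weights.
model = AutoModelForCausalLM.from_pretrained(quant_path, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path)

messages = [{"role": "user", "content": "How many r in strawberry?"}]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```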
📚 Documentation
Model Overview
This is an AWQ Int4 quantization of Llama-3.1-Nemotron-70B-Instruct. Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries.
Reproducing Quantization Results
To reproduce this model, run the code in the basic usage example above.
Troubleshooting
If you run into errors on a multi-GPU machine, setting `CUDA_VISIBLE_DEVICES=0` (so that only a single GPU is visible to the process) can help; an example follows below.
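For example, the variable can be exported before launching the script, or set at the very top of the script before torch initializes CUDA (a minimal sketch):

```python
import os

# Restrict the process to GPU 0; this must run before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should now report 1
```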
Model Performance
As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Instruct (the unquantized base of this checkpoint) reaches an Arena Hard score of 85.0, an AlpacaEval 2 LC score of 57.6, and a GPT-4-Turbo MT-Bench score of 8.98. It is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
Evaluation Metrics
| Model | Arena Hard (95% CI) | AlpacaEval 2 LC (SE) | MT-Bench (GPT-4-Turbo) | Mean Response Length (# of Characters for MT-Bench) |
| --- | --- | --- | --- | --- |
| Llama-3.1-Nemotron-70B-Instruct | 85.0 (-1.5, 1.5) | 57.6 (1.65) | 8.98 | 2199.8 |
| Llama-3.1-70B-Instruct | 55.7 (-2.9, 2.7) | 38.1 (0.90) | 8.22 | 1728.6 |
| Llama-3.1-405B-Instruct | 69.3 (-2.4, 2.2) | 39.3 (1.43) | 8.49 | 1664.7 |
| Claude-3-5-Sonnet-20240620 | 79.2 (-1.9, 1.7) | 52.4 (1.47) | 8.81 | 1619.9 |
| GPT-4o-2024-05-13 | 79.3 (-2.1, 2.0) | 57.5 (1.47) | 8.74 | 1752.2 |
Terms of Use
By accessing this model, you are agreeing to the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and Meta's privacy policy.
Model Architecture
| Property | Details |
| --- | --- |
| Model Type | Transformer |
| Network Architecture | Llama 3.1 |
Input
| Property | Details |
| --- | --- |
| Input Type(s) | Text |
| Input Format | String |
| Input Parameters | One Dimensional (1D) |
| Other Properties Related to Input | Max of 128k tokens |
Output
| Property | Details |
| --- | --- |
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | One Dimensional (1D) |
| Other Properties Related to Output | Max of 4k tokens |
Software Integration
| Property | Details |
| --- | --- |
| Supported Hardware Microarchitecture Compatibility | NVIDIA Ampere, NVIDIA Hopper, NVIDIA Turing |
| Supported Operating System(s) | Linux |
Model Version
v1.0
Training & Evaluation
Alignment methodology
REINFORCE implemented in NeMo Aligner
Datasets
- Data Collection Method by dataset: [Hybrid: Human, Synthetic]
- Labeling Method by dataset: [Human]
- Link: [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)
- Properties: 21,362 prompt-responses built to make models more aligned with human preference, specifically more helpful, factually correct, coherent, and customizable based on complexity and verbosity. 20,324 prompt-responses are used for training and 1,038 for validation.
Inference
| Property | Details |
| --- | --- |
| Engine | [Triton](https://developer.nvidia.com/triton-inference-server) |
| Test Hardware | H100, A100 80GB, A100 40GB |
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
Citation
If you find this model useful, please cite the following work:
```bibtex
@misc{wang2024helpsteer2preferencecomplementingratingspreferences,
  title={HelpSteer2-Preference: Complementing Ratings with Preferences},
  author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong},
  year={2024},
  eprint={2410.01257},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.01257},
}
```
🔧 Technical Details
The model uses AWQ Int4 quantization to optimize the large-scale Llama-3.1-Nemotron-70B-Instruct model. The quantization process is implemented using the `AutoAWQForCausalLM` class from the `awq` (AutoAWQ) library. The underlying model is trained using RLHF (specifically, REINFORCE) with [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) and HelpSteer2-Preference prompts on a Llama-3.1-70B-Instruct model as the initial policy.
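To give intuition for what the `w_bit`, `q_group_size`, and `zero_point` settings in the quantization config mean, the following is a conceptual sketch of plain group-wise Int4 quantization with zero points. It is illustrative only: real AWQ additionally searches for activation-aware per-channel scales before quantizing, which this sketch omits.

```python
import torch

def groupwise_int4_quantize(w: torch.Tensor, group_size: int = 128):
    """Quantize a 2D weight matrix to unsigned int4, one scale/zero-point per group of columns."""
    rows, cols = w.shape
    w_groups = w.reshape(rows, cols // group_size, group_size)

    # Per-group min/max define an asymmetric (zero-point) 4-bit range [0, 15].
    w_min = w_groups.amin(dim=-1, keepdim=True)
    w_max = w_groups.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0
    zero_point = (-w_min / scale).round()

    # Quantize, clamp to the int4 range, then dequantize to inspect the error.
    q = (w_groups / scale + zero_point).round().clamp(0, 15)
    w_deq = (q - zero_point) * scale
    return q.reshape(rows, cols), w_deq.reshape(rows, cols)

w = torch.randn(8, 256)
q, w_deq = groupwise_int4_quantize(w, group_size=128)
print("max abs quantization error:", (w - w_deq).abs().max().item())
```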
📄 License
The model is under the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and Meta's privacy policy.