🚀 AWQ Int4 Quantization of Llama-3.1-Nemotron-70B-Instruct
This project provides an AWQ Int4 quantization of Llama-3.1-Nemotron-70B-Instruct, reducing the memory footprint of the large language model and improving its inference efficiency.
🚀 Quick Start
Prerequisites
To use this model, you need 2 or more 80GB GPUs (NVIDIA Ampere or newer) and at least 150GB of free disk space to accommodate the download. This code has been tested with Transformers v4.44.0, torch v2.4.0, and 2 A100 80GB GPUs, but any setup that supports meta-llama/Llama-3.1-70B-Instruct should support this model as well. If you run into problems, consider running `pip install -U transformers`.
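If you want to verify that your machine meets these requirements before downloading, a quick check such as the following can help (a minimal sketch; the GPU and disk-space figures simply mirror the prerequisites above):

```python
import shutil
import torch

# List visible GPUs and their memory (2 or more 80GB GPUs are recommended).
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")

# Check free disk space in the current directory (~150GB is needed for the download).
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space: {free_gb:.0f} GB")
```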
Usage Example
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

# Load the model in bfloat16 and shard it across all available GPUs.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in strawberry?"
messages = [{"role": "user", "content": prompt}]

# Apply the chat template and generate a response.
tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=4096, pad_token_id=tokenizer.eos_token_id)

# Strip the prompt tokens and decode only the newly generated text.
generated_tokens = response_token_ids[:, len(tokenized_message['input_ids'][0]):]
generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_text)
```
✨ Features
- Multi-language support: Supports multiple languages including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- High-performance quantization: Uses AWQ Int4 quantization to optimize the model.
- Good performance: Achieves high scores on multiple evaluation benchmarks, outperforming many strong models.
📦 Installation
No installation steps beyond the general requirements above are needed for inference. Reproducing the quantization (see the usage examples below) additionally requires the AutoAWQ library, typically installed with `pip install autoawq`.
💻 Usage Examples
Basic Usage
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "nvidia/Llama-3.1-Nemotron-70B-Instruct/"
quant_path = "./quantized/Llama-3.1-Nemotron-70B-Instruct-AWQ-INT4"

# AWQ quantization configuration: 4-bit weights, group size 128, with zero points.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Load the full-precision model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration and quantize the weights.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```
Advanced Usage
The basic example above can serve as a starting point for more complex scenarios: you can adjust the quantization configuration to trade accuracy against size and speed, for example by changing `q_group_size`, `w_bit`, or `zero_point`, or by supplying your own calibration data (see the AutoAWQ documentation). A sketch for loading the resulting quantized checkpoint is shown below.
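Once the quantized checkpoint has been saved, it can be loaded back for inference. The following is a minimal sketch, assuming the local `quant_path` produced by the example above and a recent Transformers release with AutoAWQ installed (so the AWQ quantization config stored in the checkpoint is picked up automatically):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local path produced by the quantization example above (assumed, not a published repo).
quant_path = "./quantized/Llama-3.1-Nemotron-70B-Instruct-AWQ-INT4"

# Transformers detects the AWQ config saved with the checkpoint and loads the Int4 weights.
model = AutoModelForCausalLM.from_pretrained(quant_path, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path)

messages = [{"role": "user", "content": "How many r in strawberry?"}]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```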
📚 Documentation
Model Overview
This is an AWQ Int4 quantization of Llama-3.1-Nemotron-70B-Instruct. Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries.
Reproducing Quantization Results
To reproduce this model, run the code in the basic usage example above.
Troubleshooting
If you run into errors on a multi-GPU machine, setting `CUDA_VISIBLE_DEVICES=0` (so that only a single GPU is visible to the process) can help; an example follows below.
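For example, the variable can be exported before launching the script, or set at the very top of the script before torch initializes CUDA (a minimal sketch):

```python
import os

# Restrict the process to GPU 0; this must run before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should now report 1
```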
Model Performance
As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Instruct (the unquantized base of this checkpoint) reaches an Arena Hard score of 85.0, an AlpacaEval 2 LC score of 57.6, and a GPT-4-Turbo MT-Bench score of 8.98. It is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
Evaluation Metrics
| Model | Arena Hard (95% CI) | AlpacaEval 2 LC (SE) | MT-Bench (GPT-4-Turbo) | Mean Response Length (# of Characters for MT-Bench) |
| --- | --- | --- | --- | --- |
| Llama-3.1-Nemotron-70B-Instruct | 85.0 (-1.5, 1.5) | 57.6 (1.65) | 8.98 | 2199.8 |
| Llama-3.1-70B-Instruct | 55.7 (-2.9, 2.7) | 38.1 (0.90) | 8.22 | 1728.6 |
| Llama-3.1-405B-Instruct | 69.3 (-2.4, 2.2) | 39.3 (1.43) | 8.49 | 1664.7 |
| Claude-3-5-Sonnet-20240620 | 79.2 (-1.9, 1.7) | 52.4 (1.47) | 8.81 | 1619.9 |
| GPT-4o-2024-05-13 | 79.3 (-2.1, 2.0) | 57.5 (1.47) | 8.74 | 1752.2 |
Terms of Use
By accessing this model, you are agreeing to the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and Meta's privacy policy.
Model Architecture
| Property | Details |
| --- | --- |
| Model Type | Transformer |
| Network Architecture | Llama 3.1 |
Input
| Property | Details |
| --- | --- |
| Input Type(s) | Text |
| Input Format | String |
| Input Parameters | One Dimensional (1D) |
| Other Properties Related to Input | Max of 128k tokens |
Output
| Property | Details |
| --- | --- |
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | One Dimensional (1D) |
| Other Properties Related to Output | Max of 4k tokens |
Software Integration
| Property | Details |
| --- | --- |
| Supported Hardware Microarchitecture Compatibility | NVIDIA Ampere, NVIDIA Hopper, NVIDIA Turing |
| Supported Operating System(s) | Linux |
Model Version
v1.0
Training & Evaluation
Alignment methodology
REINFORCE implemented in NeMo Aligner
Datasets
- Data Collection Method by dataset: [Hybrid: Human, Synthetic]
- Labeling Method by dataset: [Human]
- Link: [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)
- Properties: 21,362 prompt-responses built to make models more aligned with human preference, specifically more helpful, factually correct, coherent, and customizable based on complexity and verbosity. 20,324 prompt-responses are used for training and 1,038 for validation.
Inference
| Property | Details |
| --- | --- |
| Engine | [Triton](https://developer.nvidia.com/triton-inference-server) |
| Test Hardware | H100, A100 80GB, A100 40GB |
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
Citation
If you find this model useful, please cite the following work:
```bibtex
@misc{wang2024helpsteer2preferencecomplementingratingspreferences,
  title={HelpSteer2-Preference: Complementing Ratings with Preferences},
  author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong},
  year={2024},
  eprint={2410.01257},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.01257},
}
```
🔧 Technical Details
The model uses AWQ Int4 quantization to optimize the large-scale Llama-3.1-Nemotron-70B-Instruct model. The quantization process is implemented using the `AutoAWQForCausalLM` class from the `awq` (AutoAWQ) library. The underlying model is trained using RLHF (specifically, REINFORCE) with [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) and HelpSteer2-Preference prompts on a Llama-3.1-70B-Instruct model as the initial policy.
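To give intuition for what the `w_bit`, `q_group_size`, and `zero_point` settings in the quantization config mean, the following is a conceptual sketch of plain group-wise Int4 quantization with zero points. It is illustrative only: real AWQ additionally searches for activation-aware per-channel scales before quantizing, which this sketch omits.

```python
import torch

def groupwise_int4_quantize(w: torch.Tensor, group_size: int = 128):
    """Quantize a 2D weight matrix to unsigned int4, one scale/zero-point per group of columns."""
    rows, cols = w.shape
    w_groups = w.reshape(rows, cols // group_size, group_size)

    # Per-group min/max define an asymmetric (zero-point) 4-bit range [0, 15].
    w_min = w_groups.amin(dim=-1, keepdim=True)
    w_max = w_groups.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0
    zero_point = (-w_min / scale).round()

    # Quantize, clamp to the int4 range, then dequantize to inspect the error.
    q = (w_groups / scale + zero_point).round().clamp(0, 15)
    w_deq = (q - zero_point) * scale
    return q.reshape(rows, cols), w_deq.reshape(rows, cols)

w = torch.randn(8, 256)
q, w_deq = groupwise_int4_quantize(w, group_size=128)
print("max abs quantization error:", (w - w_deq).abs().max().item())
```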
📄 License
The model is under the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and Meta's privacy policy.