Llama-3.1-Nemotron-70B-Instruct-AWQ-INT4: an open-source large language model customized by NVIDIA, with strong benchmark results

Llama 3.1 Nemotron 70B Instruct AWQ INT4

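Developed by joshmiller656

A 70B-parameter large language model customized by NVIDIA and quantized with AWQ Int4; it performs strongly on several automatic alignment benchmarks.

Large Language Model

Transformers

Supports multiple languages #Multilingual instruction tuning #High-performance alignment benchmarks #NVIDIA-customized

Downloads: 1,591

Release date: 10/29/2024

Model Overview

An instruction-tuned model designed to make large language model responses more helpful, with support for multilingual interaction.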

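Model Features

High-performance quantization

Uses AWQ Int4 quantization to substantially reduce resource requirements while preserving model quality.

Multilingual support

Supports text generation and understanding in 8 major languages.

Instruction tuning

NVIDIA's customization noticeably improves the quality of responses to user queries.

Benchmark-leading

Outperforms GPT-4o and Claude 3.5 Sonnet on Arena Hard, AlpacaEval 2 LC, and MT Bench.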

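Model Capabilities

Multi-turn dialogue generation

Multilingual text understanding

Complex instruction following

Long-form text generation (up to 4k tokens)

Context-aware responses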

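Use Cases

Intelligent assistants

Question answering systems

Answers a wide range of knowledge questions from users.

Performs well on factual-accuracy benchmarks.

Content generation

Multilingual content creation

Generates marketing copy or creative content in multiple languages.

Supports fluent generation in 8 languages.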

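🚀 Llama-3.1-Nemotron-70B-Instruct Model

This project is an AWQ Int4 quantization of Llama-3.1-Nemotron-70B-Instruct, a model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.

🚀 Quick Start

Reproducing the Quantization

To reproduce this model, run the following code: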

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Source checkpoint (as given in the original card) and output directory
model_path = "nvidia/Llama-3.1-Nemotron-70B-Instruct/"
quant_path = "./quantized/Llama-3.1-Nemotron-70B-Instruct-AWQ-INT4"
quant_config = {
  "zero_point": True,    # asymmetric quantization with a per-group zero point
  "q_group_size": 128,   # one scale and zero point per group of 128 weights
  "w_bit": 4,            # 4-bit weights
  "version": "GEMM",     # AutoAWQ kernel variant
}

# Load the model
model = AutoAWQForCausalLM.from_pretrained(
  model_path, low_cpu_mem_usage=True, use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
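
To give a feel for what the `zero_point`, `q_group_size`, and `w_bit` settings mean, here is a minimal pure-Python sketch of group-wise zero-point quantization. It is a simplified illustration, not AutoAWQ's implementation: AWQ additionally rescales weights based on activation statistics before rounding, which this sketch omits. The function names and the sine-wave test weights are hypothetical.

```python
import math

def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) round-to-nearest quantization of one weight group."""
    qmax = 2 ** w_bit - 1                      # 15 for 4-bit weights
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:                           # constant group: avoid dividing by zero
        scale = 1.0
    zero = max(0, min(qmax, round(-w_min / scale)))
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def quantize(weights, q_group_size=128, w_bit=4):
    """Split a flat weight list into groups; each group gets its own scale and zero point."""
    return [
        quantize_group(weights[start:start + q_group_size], w_bit)
        for start in range(0, len(weights), q_group_size)
    ]

def dequantize(groups):
    """Map the integer codes back to approximate float weights."""
    restored = []
    for q, scale, zero in groups:
        restored.extend((v - zero) * scale for v in q)
    return restored

# Hypothetical weights: 256 values, so two groups of 128
weights = [0.1 * math.sin(i) for i in range(256)]
groups = quantize(weights)
restored = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(groups)} max reconstruction error={max_err:.5f}")
```

A smaller group size gives each scale fewer weights to cover, which usually lowers the rounding error at the cost of storing more scales and zero points.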

Troubleshooting

If you hit errors when running on a multi-GPU machine, try setting CUDA_VISIBLE_DEVICES=0.
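
If you would rather set the variable from inside the script than in the shell, a minimal sketch looks like this. The one point to watch is ordering: the variable generally needs to be set before any CUDA-using library initializes the GPU in the process, so the assignment goes at the very top of the script.

```python
import os

# Restrict this process to the first GPU. Set this before importing torch or awq,
# since CUDA reads the variable when it is first initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"Visible GPU indices: {visible}")
```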

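Using the Model

You can use the model with the HuggingFace Transformers library. It requires 2 or more 80GB GPUs (NVIDIA Ampere or newer) and at least 150GB of free disk space for the download. This code has been tested with Transformers v4.44.0, torch v2.4.0, and 2 A100 80GB GPUs, but any setup that supports meta-llama/Llama-3.1-70B-Instruct should also support this model. If you run into issues, consider running pip install -U transformers.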

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: this loads the original NVIDIA checkpoint in bfloat16, not the AWQ INT4 weights
model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in strawberry?"
messages = [{"role": "user", "content": prompt}]

# Apply the chat template and tokenize in one step
tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
response_token_ids = model.generate(
    tokenized_message["input_ids"].cuda(),
    attention_mask=tokenized_message["attention_mask"].cuda(),
    max_new_tokens=4096,
    pad_token_id=tokenizer.eos_token_id,
)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_tokens = response_token_ids[:, len(tokenized_message["input_ids"][0]):]
generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_text)

# See the response at the top of the model card
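
The hardware figures above follow from simple arithmetic on the weight size. The sketch below is a rough estimate only. It assumes a nominal 70 billion parameters, counts weights alone (no KV cache, activations, or framework overhead), and for INT4 assumes every weight is quantized with one 16-bit scale and one 4-bit zero point per group of 128. In practice some layers typically stay in higher precision, so real checkpoints will be somewhat larger.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9  # assumption: nominal parameter count implied by "70B"

# bfloat16 stores 16 bits per weight
bf16_gb = weight_memory_gb(N_PARAMS, 16)

# INT4 with group size 128: 4 bits per weight plus an assumed
# 16-bit scale and 4-bit zero point shared by each group of 128 weights
int4_bits = 4 + (16 + 4) / 128
int4_gb = weight_memory_gb(N_PARAMS, int4_bits)

print(f"bf16 weights: ~{bf16_gb:.0f} GB")
print(f"INT4 weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the bfloat16 weights come to about 140 GB, which is consistent with needing two 80GB GPUs and roughly 150GB of disk, while the INT4 weights are roughly a quarter of that.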

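✨ Key Features

Strong performance: Performs well on several automatic alignment benchmarks. As of October 1, 2024, it ranked first on Arena Hard, AlpacaEval 2 LC (verified tab), and MT Bench (GPT-4-Turbo).
Improved response quality: Customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.
Multilingual support: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

📦 Installation

Using this model requires the HuggingFace Transformers library. If you run into issues, run pip install -U transformers.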

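📚 Documentation

Model Description

Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries. The model performs well on several automatic alignment benchmarks. As of October 1, 2024, it ranked first on Arena Hard, AlpacaEval 2 LC (verified tab), and MT Bench (GPT-4-Turbo), ahead of strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

Evaluation Metrics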

Model	Arena Hard (95% CI)	AlpacaEval 2 LC (SE)	MT-Bench (GPT-4-Turbo)	Mean Response Length (# of characters, MT-Bench)
Llama-3.1-Nemotron-70B-Instruct	85.0 (-1.5, 1.5)	57.6 (1.65)	8.98	2199.8
Llama-3.1-70B-Instruct	55.7 (-2.9, 2.7)	38.1 (0.90)	8.22	1728.6
Llama-3.1-405B-Instruct	69.3 (-2.4, 2.2)	39.3 (1.43)	8.49	1664.7
Claude-3-5-Sonnet-20240620	79.2 (-1.9, 1.7)	52.4 (1.47)	8.81	1619.9
GPT-4o-2024-05-13	79.3 (-2.1, 2.0)	57.5 (1.47)	8.74	1752.2
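
To read the uncertainty columns in the table, the sketch below compares the top row against the two frontier baselines using only the numbers listed above. It is an informal check, not a significance test: it asks whether the Arena Hard 95% confidence intervals overlap and whether the AlpacaEval 2 LC gap exceeds the larger of the two standard errors. The helper names are hypothetical.

```python
# (score, CI minus, CI plus) for Arena Hard and (score, SE) for AlpacaEval 2 LC, from the table
results = {
    "Llama-3.1-Nemotron-70B-Instruct": {"arena": (85.0, 1.5, 1.5), "alpaca": (57.6, 1.65)},
    "Claude-3-5-Sonnet-20240620": {"arena": (79.2, 1.9, 1.7), "alpaca": (52.4, 1.47)},
    "GPT-4o-2024-05-13": {"arena": (79.3, 2.1, 2.0), "alpaca": (57.5, 1.47)},
}

def arena_interval(entry):
    """Turn (score, minus, plus) into a (low, high) confidence interval."""
    score, minus, plus = entry["arena"]
    return (score - minus, score + plus)

def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def alpaca_gap_within_se(a, b):
    """True if the score gap is no larger than the bigger of the two standard errors."""
    (score_a, se_a), (score_b, se_b) = a["alpaca"], b["alpaca"]
    return abs(score_a - score_b) <= max(se_a, se_b)

nemotron = results["Llama-3.1-Nemotron-70B-Instruct"]
for name in ("Claude-3-5-Sonnet-20240620", "GPT-4o-2024-05-13"):
    other = results[name]
    overlap = intervals_overlap(arena_interval(nemotron), arena_interval(other))
    close = alpaca_gap_within_se(nemotron, other)
    print(f"{name}: Arena Hard CIs overlap={overlap}, AlpacaEval gap within SE={close}")
```

On these numbers the Arena Hard lead is outside both baselines' intervals, while the AlpacaEval 2 LC lead over GPT-4o (57.6 vs 57.5) is much smaller than the reported standard errors.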

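Terms of Use

By accessing this model, you agree to the Llama 3.1 terms and conditions of the license, the acceptable use policy, and Meta's privacy policy.

Citation

If you find this model useful, please cite the following work: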

@misc{wang2024helpsteer2preferencecomplementingratingspreferences,
      title={HelpSteer2-Preference: Complementing Ratings with Preferences}, 
      author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong},
      year={2024},
      eprint={2410.01257},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.01257}, 
}

🔧 Technical Details

Model Architecture

Property	Details
Architecture Type	Transformer
Network Architecture	Llama 3.1

Input

Property	Details
Input Type	Text
Input Format	String
Input Parameters	One-dimensional (1D)
Other Properties Related to Input	Maximum of 128k tokens

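Output

Property	Details
Output Type	Text
Output Format	String
Output Parameters	One-dimensional (1D)
Other Properties Related to Output	Maximum of 4k tokens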
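
The input and output limits above can be turned into a simple budget check before calling generate. This is a minimal sketch under stated assumptions: "128k" and "4k" are read as 131,072 and 4,096 tokens, and the prompt and the generated tokens are assumed to share the same context window. The constant and function names are hypothetical.

```python
MAX_CONTEXT_TOKENS = 128 * 1024  # assumption: "128k" read as 131,072 tokens
MAX_OUTPUT_TOKENS = 4 * 1024     # assumption: "4k" read as 4,096 tokens

def max_new_tokens_for(prompt_tokens, requested=MAX_OUTPUT_TOKENS):
    """Clamp a requested generation length to the output cap and the remaining context."""
    if prompt_tokens >= MAX_CONTEXT_TOKENS:
        raise ValueError("prompt already fills the context window")
    remaining = MAX_CONTEXT_TOKENS - prompt_tokens
    return min(requested, MAX_OUTPUT_TOKENS, remaining)

print(max_new_tokens_for(1_000))                   # short prompt: limited by the output cap
print(max_new_tokens_for(MAX_CONTEXT_TOKENS - 100))  # near-full context: limited by remaining space
```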

Software Integration

Property	Details
Supported Hardware Microarchitecture Compatibility	NVIDIA Ampere, NVIDIA Hopper, NVIDIA Turing
Supported Operating System	Linux

Model Version

v1.0

Training and Evaluation

Alignment Method

REINFORCE, as implemented in NeMo Aligner.
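
For readers unfamiliar with REINFORCE, here is a toy illustration of the underlying policy-gradient update on a two-armed bandit. It is a generic textbook sketch and is not NeMo Aligner's implementation: in language-model alignment the "actions" are sampled responses and the rewards are scores assigned to those responses. All names, rewards, and hyperparameters below are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(rewards, steps=200, lr=0.5, seed=0):
    """Train a softmax policy over bandit arms with the REINFORCE update."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample an action from the current policy
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. logit k is 1[k == action] - probs[k];
        # REINFORCE scales it by the observed reward.
        for k in range(len(logits)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_prob
    return softmax(logits)

# Arm 1 always pays 1.0, arm 0 pays nothing
final_probs = reinforce_bandit(rewards=[0.0, 1.0])
print(f"P(arm 0)={final_probs[0]:.3f}  P(arm 1)={final_probs[1]:.3f}")
```

The policy shifts probability toward the rewarded arm because only rewarded samples move the logits, and each such update raises the log-probability of the action that earned the reward.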

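Dataset

Property	Details
Data Collection Method	Hybrid: human, synthetic
Labeling Method	Human
Link	HelpSteer2
Properties (Quantity, Dataset Descriptions, Sensors)	21,362 prompt-responses built to make more models better aligned with human preference: specifically more helpful, factually correct, coherent, and customizable in terms of complexity and verbosity. Of these, 20,324 prompt-responses are used for training and 1,038 for validation.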