Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 open-source model - free to deploy, tuned for more helpful generated responses

Nvidia Llama 3.1 Nemotron 70B Instruct HF AWQ INT4

Developed by ibnzterrell

This is an AWQ 4-bit quantized version of Llama-3.1-Nemotron-70B-Instruct, the model NVIDIA customized from Meta's Llama-3.1-70B-Instruct to improve the helpfulness of generated responses.

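Large language model

Transformers

Supports multiple languages #Multilingual instruction tuning #70B-parameter quantization #High-performance dialogue generation

Downloads: 206

Released: 10/24/2024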

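Model overview

A large language model optimized to give high-quality answers. It supports multiple languages and is suited to text generation tasks.

Model features

High-performance quantization

Quantized from FP16 to INT4 with AutoAWQ, using GEMM kernels, zero-point quantization, and a group size of 128 to improve inference efficiency.

Multilingual support

Supports English, German, French, Spanish, and other languages, which makes it suitable for international applications.

Reinforced alignment training

The original model was aligned with RLHF using HelpSteer2-Preference prompts to make its generated responses more helpful.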

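Model capabilities

Text generation

Multilingual support

Dialogue systems

Use cases

Dialogue systems

Intelligent customer service

Build multilingual customer-service systems that give high-quality answers.

The original model scores 85.0 on Arena Hard and 57.6 on AlpacaEval 2 LC.

Content generation

Multilingual content creation

Generate high-quality multilingual text for news, blogs, and similar content.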

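🚀 Llama 3.1-Nemotron-70B-Instruct-HF AWQ quantized model

This project provides an AWQ 4-bit quantized version of nvidia/Llama-3.1-Nemotron-70B-Instruct-HF, a large language model that NVIDIA customized from meta-llama/Meta-Llama-3.1-70B-Instruct, released by Meta AI. The quantized model runs efficiently on suitable hardware while largely retaining the original model's performance.

🚀 Quick start

This repository is an AWQ 4-bit quantized version of nvidia/Llama-3.1-Nemotron-70B-Instruct-HF, NVIDIA's customized version of meta-llama/Meta-Llama-3.1-70B-Instruct, which was originally released by Meta AI.

The model was quantized from FP16 to INT4 with AutoAWQ, using GEMM kernels, zero-point quantization, and a group size of 128.

Reference hardware: Intel Xeon CPU E5-2699A v4 @ 2.40GHz, 256GB of RAM, and 2x NVIDIA RTX 3090. Any platform that supports Llama 3.1 70B Instruct AWQ INT4 should be able to run this model.

The sections below cover model usage (inference) with Transformers, AutoAWQ, Text Generation Inference (TGI), and vLLM, along with details for reproducing the quantization.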

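✨ Key features

Quantized model

Efficient compression: quantizing from FP16 to INT4 with AutoAWQ reduces the memory footprint.
Broad compatibility: runs on any platform that supports Llama 3.1 70B Instruct AWQ INT4.

Original model

High performance: strong results on several benchmarks, including 85.0 on Arena Hard, 57.6 on AlpacaEval 2 LC, and 8.98 on [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158).
Leading rankings: as of October 1, 2024, ranked first on three automatic alignment benchmarks; as of October 24, 2024, an Elo score of 1267 (±7) on the ChatBot Arena leaderboard, rank 9 overall and rank 26 with style control.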

📦 Installation

Transformers

To run inference with Llama 3.1 Nemotron 70B Instruct AWQ INT4, install the following packages:

pip install -q --upgrade transformers autoawq accelerate

AutoAWQ

To run inference with Llama 3.1 Nemotron 70B Instruct AWQ INT4, install the following packages:

pip install -q --upgrade transformers autoawq accelerate
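After installing, it can help to confirm that the packages are actually visible to the interpreter you plan to run inference with. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative, and the distribution names are taken from the pip command above.

```python
from importlib import metadata

# Distribution names taken from the pip command above.
REQUIRED = ("transformers", "autoawq", "accelerate")


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print a short report; missing packages are flagged rather than raising.
for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED needs to be installed in the same environment before running the usage examples.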

Text Generation Inference (TGI)

To run text-generation-launcher, install Docker (see the installation instructions) and the huggingface_hub Python package, then log in to the Hugging Face Hub:

pip install -q --upgrade huggingface_hub
huggingface-cli login

vLLM

To run vLLM, install Docker (see the installation instructions).

💻 Usage examples

Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Note: update this value to fit your use case
    do_fuse=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  device_map="auto",
  quantization_config=quantization_config
)

prompt = [
  {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
  {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])

AutoAWQ

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  device_map="auto",
)

prompt = [
  {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
  {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])

Text Generation Inference (TGI)

docker run --gpus all --shm-size 1g -ti -p 8080:80 \
  -v hf_cache:/data \
  -e MODEL_ID=ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
  -e NUM_SHARD=4 \
  -e QUANTIZE=awq \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  -e MAX_INPUT_LENGTH=4000 \
  -e MAX_TOTAL_TOKENS=4096 \
  ghcr.io/huggingface/text-generation-inference:2.2.0

Example request:

curl 0.0.0.0:8080/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is Deep Learning?"
      }
    ],
    "max_tokens": 128
  }'
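The same request can be issued from Python. This is a minimal sketch using only the standard library, assuming the TGI container is reachable at http://localhost:8080 and returns an OpenAI-style chat completion body; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of TGI.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message,
                       system_message="You are a helpful assistant.",
                       max_tokens=128):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the text of the first returned choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the server answers with an OpenAI-style "choices" list.
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request(build_chat_request("http://localhost:8080", "tgi", "What is Deep Learning?"))` returns the assistant's reply as a string. Pointing `base_url` and `model` at another OpenAI-compatible endpoint should work the same way.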

vLLM

docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
  --tensor-parallel-size 4 \
  --max-model-len 4096

Example request:

curl 0.0.0.0:8000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is Deep Learning?"
      }
    ],
    "max_tokens": 128
  }'
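Both container commands above shard the model four ways (NUM_SHARD=4 for TGI, --tensor-parallel-size 4 for vLLM), while the reference hardware listed in this document has two GPUs, so the shard count usually needs adjusting to the machine. The sketch below picks a shard count and estimates weight memory per GPU. It rests on stated assumptions: 64 attention heads (the value in the Llama 3.1 70B configuration), the roughly 35 GiB checkpoint size quoted in this document, an even split of weights across shards, and the common tensor-parallel constraint that the shard count divides the head count. The function names are illustrative.

```python
NUM_ATTENTION_HEADS = 64  # assumption: taken from the Llama 3.1 70B config
CHECKPOINT_GIB = 35.0     # approximate checkpoint size quoted in this document


def valid_shard_counts(num_gpus, num_heads=NUM_ATTENTION_HEADS):
    """Shard counts that fit on num_gpus and divide the attention heads evenly."""
    return [n for n in range(1, num_gpus + 1) if num_heads % n == 0]


def weights_per_gpu_gib(shards, checkpoint_gib=CHECKPOINT_GIB):
    """Approximate weight memory per GPU, assuming an even split across shards."""
    return checkpoint_gib / shards


for gpus in (2, 4):
    shards = max(valid_shard_counts(gpus))
    print(f"{gpus} GPUs -> {shards} shards, "
          f"~{weights_per_gpu_gib(shards):.2f} GiB of weights per GPU")
# 2 GPUs -> 2 shards, ~17.50 GiB of weights per GPU
# 4 GPUs -> 4 shards, ~8.75 GiB of weights per GPU
```

The per-GPU figure covers weights only; the KV cache and CUDA graphs need additional headroom on each device.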

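🔧 Technical details

Quantization details

The model was quantized from FP16 to INT4 with AutoAWQ, using GEMM kernels, zero-point quantization, and a group size of 128.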
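To make "zero-point quantization" and "group size of 128" concrete, the sketch below quantizes one group of 128 weights to 4-bit codes with a shared scale and zero point, then reconstructs it. It illustrates only the storage scheme with plain round-to-nearest; it does not reproduce AWQ's activation-aware scaling search or AutoAWQ's packed GEMM layout, and the function names are illustrative.

```python
import random

GROUP_SIZE = 128     # matches the group size stated above
BITS = 4
QMAX = 2**BITS - 1   # 15: INT4 codes span 0..15


def quantize_group(weights):
    """Asymmetric (zero-point) quantization of one weight group to 4-bit codes."""
    w_min, w_max = min(weights), max(weights)
    span = w_max - w_min
    scale = span / QMAX if span > 0 else 1.0   # guard against a constant group
    # The zero point is the integer code that dequantizes to exactly 0.0.
    zero_point = max(0, min(QMAX, round(-w_min / scale)))
    codes = [max(0, min(QMAX, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [(code - zero_point) * scale for code in codes]


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]  # synthetic weights
codes, scale, zero_point = quantize_group(group)
restored = dequantize_group(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.6f} zero_point={zero_point} max_error={max_error:.6f}")
```

Because each group carries its own scale and zero point, a few outlier weights only widen the quantization step for their own 128 values instead of the whole tensor. For a group whose values straddle zero, the reconstruction error stays within half a quantization step.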

Reference hardware

CPU: Intel Xeon CPU E5-2699A v4 @ 2.40GHz
RAM: 256GB
GPU: 2x NVIDIA RTX 3090

Reproducing the quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

# Clear the CUDA cache
torch.cuda.empty_cache()

# Memory limits - set these to match your hardware
max_memory = {0: "22GiB", 1: "22GiB", "cpu": "160GiB"}

model_path = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
quant_path = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
quant_config = {
  "zero_point": True,
  "q_group_size": 128,
  "w_bit": 4,
  "version": "GEMM",
}

# Load the model - note: even though the layers are loaded onto the CPU, quantization still needs a GPU and VRAM (verify with nvidia-smi)
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    use_cache=False,
    max_memory=max_memory,
    device_map="cpu"
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
model.quantize(
    tokenizer,
    quant_config=quant_config
)

# Save the quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

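📄 License

This model is released under the llama3.1 license.

Notes

⚠️ Important

This repository is an AWQ 4-bit quantized version of nvidia/Llama-3.1-Nemotron-70B-Instruct-HF.

⚠️ Important

Running inference with Llama 3.1 Nemotron 70B Instruct AWQ INT4 takes about 35 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so make sure somewhat more VRAM than that is available.

⚠️ Important

Quantizing Llama 3.1 Nemotron 70B Instruct with AutoAWQ requires an instance with enough CPU RAM to hold the entire model (about 140 GiB) and an NVIDIA GPU with 40 GiB of VRAM.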
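The memory figures in these notes can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a measurement: it assumes a nominal 70 billion parameters, one FP16 scale and one 4-bit zero point per group of 128 weights, and it ignores layers that stay unquantized. The function name `weight_bytes` is illustrative.

```python
GIB = 1024**3
NUM_PARAMS = 70e9  # assumption: nominal "70B" parameter count


def weight_bytes(num_params, bits_per_weight, group_size=None,
                 scale_bits=16, zero_bits=4):
    """Approximate storage for the weights alone.

    With group_size set, each group also stores one scale and one zero
    point, which adds (scale_bits + zero_bits) / group_size bits per weight.
    """
    bits = bits_per_weight
    if group_size:
        bits += (scale_bits + zero_bits) / group_size
    return num_params * bits / 8


fp16_gib = weight_bytes(NUM_PARAMS, 16) / GIB
int4_gib = weight_bytes(NUM_PARAMS, 4, group_size=128) / GIB
print(f"FP16 weights: ~{fp16_gib:.0f} GiB")
print(f"INT4 weights (group size 128): ~{int4_gib:.0f} GiB")
```

Under these assumptions the estimates come out near 130 GiB for FP16 and 34 GiB for INT4, in the same range as the figures of about 140 GiB and 35 GiB quoted above. The exact parameter count and any unquantized layers would plausibly account for the remaining gap.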

Usage tips

💡 Tip

When running inference, adjust parameters in the code, such as fuse_max_seq_len and max_memory, to match your hardware.
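As one way to choose fuse_max_seq_len: per the Transformers AWQ documentation, the value is the total sequence length and should cover both the prompt and the expected generation length. The helper below is a hypothetical sketch of that rule; rounding up to a multiple of 64 is an arbitrary choice made here for illustration, not a library requirement.

```python
def suggest_fuse_max_seq_len(prompt_tokens, max_new_tokens, multiple=64):
    """Smallest multiple of `multiple` that covers prompt plus generated tokens."""
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = prompt_tokens + max_new_tokens
    rounded_up = -(-total // multiple) * multiple  # ceiling to the next multiple
    return max(multiple, rounded_up)


print(suggest_fuse_max_seq_len(prompt_tokens=40, max_new_tokens=256))   # 320
print(suggest_fuse_max_seq_len(prompt_tokens=300, max_new_tokens=256))  # 576
```

In the second case the value of 512 used in the Transformers example would be too small, since a 300-token prompt plus 256 generated tokens needs at least 556 positions.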

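Model information

Property	Details
Model type	Llama-3.1-Nemotron-70B-Instruct-HF, AWQ INT4 quantized
Training data	HelpSteer2-Preference prompts
Base model	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
Supported languages	English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
Library	transformers
Task	Text generation
Tags	llama-3.1, meta, autoawq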