🚀 Meta Llama 3.1 8B Instruct Quantized Model
This repository is a community-driven quantized version of the original model meta-llama/Meta-Llama-3.1-8B-Instruct, Meta AI's official FP16 half-precision release. The model was quantized from FP16 to INT4 with AutoGPTQ, using the GPTQ kernel with zero-point quantization and a group size of 128.
🚀 Quick Start
Before using this quantized model, make sure your environment meets the requirements for running inference. INT4 inference with Llama 3.1 8B Instruct GPTQ needs roughly 4 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so somewhat more VRAM than that should be available.
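To sanity-check the available GPU memory before loading the checkpoint, the short sketch below (an illustrative addition, not part of the original card) queries the free VRAM on the current CUDA device with PyTorch:
import torch

# Free and total VRAM on the current CUDA device, in bytes.
free_bytes, total_bytes = torch.cuda.mem_get_info()
free_gib = free_bytes / 2**30
print(f"Free VRAM: {free_gib:.1f} GiB of {total_bytes / 2**30:.1f} GiB total")
if free_gib < 4:
    print("Warning: less than ~4 GiB free; loading the INT4 checkpoint may fail.")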
✨ Key Features
- Multilingual support: covers English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Quantization optimization: the model is quantized from FP16 to INT4 with AutoGPTQ, reducing memory usage and improving inference efficiency.
- Multiple usage options: inference can be run with transformers, autogptq, text-generation-inference, or vLLM.
📦 Installation
Run inference with transformers or AutoGPTQ
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
Run inference with text-generation-inference
pip install -q --upgrade huggingface_hub
huggingface-cli login
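As an optional alternative to the CLI login (a sketch not taken from the original card), you can authenticate programmatically with huggingface_hub, assuming the token is available in the HF_TOKEN environment variable:
import os

from huggingface_hub import login

# Log in with a token read from the environment instead of the interactive CLI prompt.
login(token=os.environ["HF_TOKEN"])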
Run inference with vLLM
Docker must be installed (see the installation instructions).
💻 Usage Examples
Basic Usage
🤗 transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
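If you prefer to see tokens as they are generated instead of waiting for the full completion, the following sketch (an illustrative addition reusing the model, tokenizer, and inputs defined above) streams the output with transformers' TextStreamer:
from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated, skipping the prompt itself.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, do_sample=True, max_new_tokens=256, streamer=streamer)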
AutoGPTQ
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the pre-quantized INT4 checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
Advanced Usage
🤗 Text Generation Inference (TGI)
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
    -v hf_cache:/data \
    -e MODEL_ID=hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    -e QUANTIZE=gptq \
    -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
    -e MAX_INPUT_LENGTH=4000 \
    -e MAX_TOTAL_TOKENS=4096 \
    ghcr.io/huggingface/text-generation-inference:2.2.0
Send a request to the deployed TGI endpoint:
curl 0.0.0.0:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "tgi",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
        "max_tokens": 128
    }'
Send a request with the Python client:
import os
from huggingface_hub import InferenceClient
client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))
chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
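The snippet above only builds the response object; to display the generated text you can add one extra line (not in the original snippet):
print(chat_completion.choices[0].message.content)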
vLLM
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
    -v hf_cache:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    --quantization gptq_marlin \
    --max-model-len 4096
Send a request to the deployed vLLM endpoint:
curl 0.0.0.0:8000/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
        "max_tokens": 128
    }'
Send a request with the Python client:
import os
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))
chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
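vLLM can also be driven directly from Python without the OpenAI-compatible server. The sketch below is an illustrative addition (not from the original card) that assumes a recent vLLM release providing LLM.chat; its arguments mirror the flags of the Docker command above:
from vllm import LLM, SamplingParams

# Load the INT4 GPTQ checkpoint with the Marlin GPTQ kernel, matching the server flags above.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    quantization="gptq_marlin",
    max_model_len=4096,
)
outputs = llm.chat(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)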
🔧 Technical Details
Reproducing the Quantization
Quantizing Llama 3.1 8B Instruct to INT4 with GPTQ requires an instance with enough CPU RAM to hold the whole model (roughly 8 GiB) and an NVIDIA GPU with 16 GiB of VRAM.
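As a rough pre-flight check (an illustrative sketch, not part of the original card, and assuming psutil is installed), you can confirm the machine meets these requirements before starting:
import psutil
import torch

# The full FP16 model must fit in CPU RAM while it is being quantized (~8 GiB).
ram_gib = psutil.virtual_memory().total / 2**30
# A GPU with 16 GiB of VRAM is recommended for the GPTQ calibration passes.
vram_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
print(f"CPU RAM: {ram_gib:.1f} GiB, GPU VRAM: {vram_gib:.1f} GiB")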
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
Run the following script to quantize the model:
import random
import numpy as np
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer
pretrained_model_dir = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantized_model_dir = "meta-llama/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
print("Loading tokenizer, dataset, and tokenizing the dataset...")
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="train")
encodings = tokenizer("\n\n".join(dataset["text"]), return_tensors="pt")
print("Setting random seeds...")
random.seed(0)
np.random.seed(0)
torch.random.manual_seed(0)
print("Setting calibration samples...")
nsamples = 128
seqlen = 2048
calibration_samples = []
for _ in range(nsamples):
    i = random.randint(0, encodings.input_ids.shape[1] - seqlen - 1)
    j = i + seqlen
    input_ids = encodings.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    calibration_samples.append({"input_ids": input_ids, "attention_mask": attention_mask})
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # a group size of 128 is recommended
    desc_act=True,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
    sym=True,  # symmetric quantization keeps the range symmetric so that zero is exactly representable (can provide speedups)
    damp_percent=0.1,  # see https://github.com/AutoGPTQ/AutoGPTQ/issues/196
)
# load the unquantized model; by default it is loaded into CPU memory
print("Load unquantized model...")
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
# quantize the model; each calibration example must be a dict whose only keys are "input_ids" and "attention_mask"
print("Quantize model with calibration samples...")
model.quantize(calibration_samples)
# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
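Optionally (an illustrative follow-up, not part of the original script), you can save the tokenizer next to the quantized weights and reload the checkpoint to verify it loads for inference:
# Save the tokenizer so the output directory is self-contained.
tokenizer.save_pretrained(quantized_model_dir)
# Reload the quantized checkpoint to confirm it can be loaded for inference.
quantized_model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device_map="auto")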
📄 License
This project is released under the llama3.1 license.
Property | Details |
---|---|
Model type | Multilingual large language model (LLM) |
Training data | Not specified |
⚠️ Important Notes
This repository is a community-driven quantized version of the original model. Running INT4 inference with Llama 3.1 8B Instruct GPTQ needs roughly 4 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so somewhat more VRAM than that should be available. Quantizing Llama 3.1 8B Instruct requires an instance with enough CPU RAM to hold the whole model (roughly 8 GiB) and an NVIDIA GPU with 16 GiB of VRAM.
💡 Usage Tips
Choose the approach that matches your needs: transformers is a good fit for everyday development, text-generation-inference for serving deployments, and vLLM for high-throughput inference. Whichever tool you use, follow the corresponding installation and launch steps.



