🚀 Meta Llama 3.1 8B Instruct Quantized Model
This repository is a community-driven quantized version of the original model meta-llama/Meta-Llama-3.1-8B-Instruct, Meta AI's official FP16 half-precision release. The model was quantized from FP16 to INT4 with AutoGPTQ, using the GPTQ kernel with zero-point quantization and a group size of 128.
🚀 Quick Start
Before using this quantized model, make sure your environment meets the requirements for running inference. INT4 inference with Llama 3.1 8B Instruct GPTQ needs roughly 4 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so somewhat more VRAM than that should be available.
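To sanity-check the available GPU memory before loading the checkpoint, the short sketch below (an illustrative addition, not part of the original card) queries the free VRAM on the current CUDA device with PyTorch:
import torch

# Free and total VRAM on the current CUDA device, in bytes.
free_bytes, total_bytes = torch.cuda.mem_get_info()
free_gib = free_bytes / 2**30
print(f"Free VRAM: {free_gib:.1f} GiB of {total_bytes / 2**30:.1f} GiB total")
if free_gib < 4:
    print("Warning: less than ~4 GiB free; loading the INT4 checkpoint may fail.")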
✨ Key Features
- Multilingual support: covers English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Quantization optimization: the model is quantized from FP16 to INT4 with AutoGPTQ, reducing memory usage and improving inference efficiency.
- Multiple usage options: inference can be run with transformers, autogptq, text-generation-inference, or vLLM.
📦 Installation
Run inference with transformers or AutoGPTQ
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
Run inference with text-generation-inference
pip install -q --upgrade huggingface_hub
huggingface-cli login
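As an optional alternative to the CLI login (a sketch not taken from the original card), you can authenticate programmatically with huggingface_hub, assuming the token is available in the HF_TOKEN environment variable:
import os

from huggingface_hub import login

# Log in with a token read from the environment instead of the interactive CLI prompt.
login(token=os.environ["HF_TOKEN"])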
Run inference with vLLM
Docker must be installed (see the installation instructions).
💻 Usage Examples
Basic Usage
🤗 transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
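If you prefer to see tokens as they are generated instead of waiting for the full completion, the following sketch (an illustrative addition reusing the model, tokenizer, and inputs defined above) streams the output with transformers' TextStreamer:
from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated, skipping the prompt itself.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, do_sample=True, max_new_tokens=256, streamer=streamer)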
AutoGPTQ
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the pre-quantized INT4 checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
Advanced Usage
🤗 Text Generation Inference (TGI)
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
    -v hf_cache:/data \
    -e MODEL_ID=hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    -e QUANTIZE=gptq \
    -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
    -e MAX_INPUT_LENGTH=4000 \
    -e MAX_TOTAL_TOKENS=4096 \
    ghcr.io/huggingface/text-generation-inference:2.2.0
Send a request to the deployed TGI endpoint:
curl 0.0.0.0:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "tgi",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
        "max_tokens": 128
    }'
Send a request with the Python client:
import os
from huggingface_hub import InferenceClient
client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))
chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
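The snippet above only builds the response object; to display the generated text you can add one extra line (not in the original snippet):
print(chat_completion.choices[0].message.content)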
vLLM
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
    -v hf_cache:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    --quantization gptq_marlin \
    --max-model-len 4096
Send a request to the deployed vLLM endpoint:
curl 0.0.0.0:8000/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
        "max_tokens": 128
    }'
Send a request with the Python client:
import os
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))
chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
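vLLM can also be driven directly from Python without the OpenAI-compatible server. The sketch below is an illustrative addition (not from the original card) that assumes a recent vLLM release providing LLM.chat; its arguments mirror the flags of the Docker command above:
from vllm import LLM, SamplingParams

# Load the INT4 GPTQ checkpoint with the Marlin GPTQ kernel, matching the server flags above.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    quantization="gptq_marlin",
    max_model_len=4096,
)
outputs = llm.chat(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)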
🔧 Technical Details
Reproducing the Quantization
Quantizing Llama 3.1 8B Instruct to INT4 with GPTQ requires an instance with enough CPU RAM to hold the whole model (roughly 8 GiB) and an NVIDIA GPU with 16 GiB of VRAM.
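As a rough pre-flight check (an illustrative sketch, not part of the original card, and assuming psutil is installed), you can confirm the machine meets these requirements before starting:
import psutil
import torch

# The full FP16 model must fit in CPU RAM while it is being quantized (~8 GiB).
ram_gib = psutil.virtual_memory().total / 2**30
# A GPU with 16 GiB of VRAM is recommended for the GPTQ calibration passes.
vram_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
print(f"CPU RAM: {ram_gib:.1f} GiB, GPU VRAM: {vram_gib:.1f} GiB")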
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
Run the following script to quantize the model:
import random
import numpy as np
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer
pretrained_model_dir = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantized_model_dir = "meta-llama/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
print("Loading tokenizer, dataset, and tokenizing the dataset...")
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="train")
encodings = tokenizer("\n\n".join(dataset["text"]), return_tensors="pt")
print("Setting random seeds...")
random.seed(0)
np.random.seed(0)
torch.random.manual_seed(0)
print("Setting calibration samples...")
nsamples = 128
seqlen = 2048
calibration_samples = []
for _ in range(nsamples):
    i = random.randint(0, encodings.input_ids.shape[1] - seqlen - 1)
    j = i + seqlen
    input_ids = encodings.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    calibration_samples.append({"input_ids": input_ids, "attention_mask": attention_mask})
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # a group size of 128 is recommended
    desc_act=True,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
    sym=True,  # symmetric quantization keeps the range symmetric so that zero is exactly representable (can provide speedups)
    damp_percent=0.1,  # see https://github.com/AutoGPTQ/AutoGPTQ/issues/196
)
# load the unquantized model; by default it is loaded into CPU memory
print("Load unquantized model...")
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
# quantize the model; each calibration example must be a dict whose only keys are "input_ids" and "attention_mask"
print("Quantize model with calibration samples...")
model.quantize(calibration_samples)
# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
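Optionally (an illustrative follow-up, not part of the original script), you can save the tokenizer next to the quantized weights and reload the checkpoint to verify it loads for inference:
# Save the tokenizer so the output directory is self-contained.
tokenizer.save_pretrained(quantized_model_dir)
# Reload the quantized checkpoint to confirm it can be loaded for inference.
quantized_model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device_map="auto")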
📄 License
This project is released under the llama3.1 license.
Property | Details |
---|---|
Model type | Multilingual large language model (LLM) |
Training data | Not specified |
⚠️ Important Notes
This repository is a community-driven quantized version of the original model. Running INT4 inference with Llama 3.1 8B Instruct GPTQ needs roughly 4 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so somewhat more VRAM than that should be available. Quantizing Llama 3.1 8B Instruct requires an instance with enough CPU RAM to hold the whole model (roughly 8 GiB) and an NVIDIA GPU with 16 GiB of VRAM.
💡 Usage Tips
Choose the approach that matches your needs: transformers is a good fit for everyday development, text-generation-inference for serving deployments, and vLLM for high-throughput inference. Whichever tool you use, follow the corresponding installation and launch steps.



