🚀 Meta Llama 3.1 8B Instruct Quantized Model
This repository is a community-driven quantized version of the original model meta-llama/Meta-Llama-3.1-8B-Instruct, Meta AI's official FP16 half-precision release. The model was quantized from FP16 to INT4 with AutoGPTQ, using the GPTQ kernels with zero-point quantization and a group size of 128.
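These settings can be checked locally without downloading the weights. The snippet below is a minimal sketch; it assumes the parameters are recorded under quantization_config in the repository's config.json, as is usual for GPTQ checkpoints loadable with 🤗 transformers.
from transformers import AutoConfig

# Fetch only the configuration file; no model weights are downloaded for this check.
config = AutoConfig.from_pretrained("hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4")

# For GPTQ checkpoints this typically reports bits, group_size, desc_act, sym, and related fields.
print(config.quantization_config)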
🚀 Quick Start
Before using this quantized model, make sure your environment meets the requirements for running inference. INT4 inference with Llama 3.1 8B Instruct GPTQ needs roughly 4 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so somewhat more free VRAM than that should be available.
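As a rough pre-flight check, the sketch below (assuming PyTorch is installed and a CUDA device is visible) compares the GPU's total memory against that ~4 GiB floor:
import torch

if torch.cuda.is_available():
    # The INT4 checkpoint alone needs roughly 4 GiB; leave headroom for the KV cache and CUDA graphs.
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 total VRAM: {total_gib:.1f} GiB")
else:
    print("No CUDA device detected.")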
✨ Key Features
- Multilingual support: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Quantization: the model is quantized from FP16 to INT4 with AutoGPTQ, reducing the memory footprint and improving inference efficiency.
- Multiple inference options: can be run with transformers, AutoGPTQ, text-generation-inference, or vLLM.
📦 Installation
Run inference with transformers or AutoGPTQ
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
Run inference with text-generation-inference
pip install -q --upgrade huggingface_hub
huggingface-cli login
Run inference with vLLM
Docker must be installed (see the installation instructions).
💻 Usage Examples
Basic Usage
🤗 transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
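Note that batch_decode above returns the prompt together with the completion. If only the newly generated text is wanted, one option (a small follow-up sketch reusing the variables above) is to slice off the prompt tokens before decoding:
# Keep only the tokens generated after the prompt, then decode those.
generated = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(generated, skip_special_tokens=True))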
AutoGPTQ
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the already-quantized checkpoint with `from_quantized`, AutoGPTQ's entry point for GPTQ weights.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
Advanced Usage
🤗 Text Generation Inference (TGI)
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
-v hf_cache:/data \
-e MODEL_ID=hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
-e QUANTIZE=gptq \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e MAX_INPUT_LENGTH=4000 \
-e MAX_TOTAL_TOKENS=4096 \
ghcr.io/huggingface/text-generation-inference:2.2.0
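Before sending requests, you can wait for the server to finish loading the model. The sketch below assumes TGI's /health route (available in recent releases) and uses only the Python standard library:
import urllib.request

# A 200 response means TGI has loaded the model and is ready to serve requests.
status = urllib.request.urlopen("http://0.0.0.0:8080/health").status
print("TGI ready" if status == 200 else f"Unexpected status: {status}")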
Send a request to the deployed TGI endpoint:
curl 0.0.0.0:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
Send a request via the Python client:
import os
from huggingface_hub import InferenceClient
client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))
chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
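To inspect the reply from the call above, read the generated message from the returned object (this assumes the OpenAI-style response shape that huggingface_hub's InferenceClient returns):
# Print just the assistant's reply text from the chat completion.
print(chat_completion.choices[0].message.content)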
vLLM
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
-v hf_cache:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
--quantization gptq_marlin \
--max-model-len 4096
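Once the container is up, you can confirm that the model is registered with the OpenAI-compatible server. This sketch assumes the standard /v1/models listing route and uses only the Python standard library:
import urllib.request

# Lists the models served by the vLLM OpenAI-compatible server; the quantized model ID should appear here.
print(urllib.request.urlopen("http://0.0.0.0:8000/v1/models").read().decode())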
Send a request to the deployed vLLM endpoint:
curl 0.0.0.0:8000/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
Send a request via the Python client:
import os
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))
chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
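As with the TGI client, the assistant's reply from the call above can be printed from the standard OpenAI response object:
# Print just the assistant's reply text from the chat completion.
print(chat_completion.choices[0].message.content)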
🔧 Technical Details
Quantization Reproduction
Quantizing Llama 3.1 8B Instruct to INT4 with GPTQ requires an instance with enough CPU RAM to hold the whole model (roughly 8 GiB) and an NVIDIA GPU with 16 GiB of VRAM.
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
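Before launching the quantization, a quick hardware check can confirm the figures above. This is a minimal sketch that assumes a Linux host (for the sysconf RAM query) and a visible CUDA GPU:
import os
import torch

# Total system RAM (Linux-specific sysconf query); the FP16 model needs roughly 8 GiB of CPU RAM.
ram_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
# Total VRAM of the first CUDA device; about 16 GiB is needed for the GPTQ calibration passes.
vram_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"CPU RAM: {ram_gib:.1f} GiB, GPU 0 VRAM: {vram_gib:.1f} GiB")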
Run the following script to quantize the model:
import random
import numpy as np
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer
pretrained_model_dir = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantized_model_dir = "meta-llama/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
print("Loading tokenizer, dataset, and tokenizing the dataset...")
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="train")
encodings = tokenizer("\n\n".join(dataset["text"]), return_tensors="pt")
print("Setting random seeds...")
random.seed(0)
np.random.seed(0)
torch.random.manual_seed(0)
print("Setting calibration samples...")
nsamples = 128
seqlen = 2048
calibration_samples = []
for _ in range(nsamples):
    i = random.randint(0, encodings.input_ids.shape[1] - seqlen - 1)
    j = i + seqlen
    input_ids = encodings.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    calibration_samples.append({"input_ids": input_ids, "attention_mask": attention_mask})
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # a group size of 128 is recommended
    desc_act=True,  # setting this to False significantly speeds up inference, at a slight cost in perplexity
    sym=True,  # symmetric quantization keeps the range symmetric so that 0 is represented exactly (can provide speedups)
    damp_percent=0.1,  # see https://github.com/AutoGPTQ/AutoGPTQ/issues/196
)
# load the unquantized model; by default it is loaded into CPU memory
print("Load unquantized model...")
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize the model; each calibration example must be a dict containing only "input_ids" and "attention_mask"
print("Quantize model with calibration samples...")
model.quantize(calibration_samples)

# save the quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
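The script above saves only the quantized weights. As an optional follow-up (a sketch, not part of the original script), the tokenizer can be stored alongside them so the output directory is self-contained and directly loadable:
# Save the tokenizer next to the quantized weights.
tokenizer.save_pretrained(quantized_model_dir)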
📄 License
This project is released under the llama3.1 license.
Attribute | Details
--- | ---
Model type | Multilingual large language model (LLM)
Training data | Not specified
⚠️ Important Note
This repository is a community-driven quantized version of the original model. Running INT4 inference with Llama 3.1 8B Instruct GPTQ needs roughly 4 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so slightly more free VRAM should be available. Quantizing Llama 3.1 8B Instruct requires an instance with enough CPU RAM to hold the whole model (roughly 8 GiB) and an NVIDIA GPU with 16 GiB of VRAM.
💡 Usage Tips
Choose the option that fits your needs: transformers for general development, text-generation-inference for serving deployments, and vLLM for high-throughput inference. Whichever tool you use, follow the corresponding installation and run steps.