🚀 Meta Llama 3.1 8B Instruct Quantized Model
This repository is a community-driven quantized version of the original model meta-llama/Meta-Llama-3.1-8B-Instruct, Meta AI's official FP16 half-precision release. The model was quantized from FP16 to INT4 with AutoGPTQ, using the GPTQ kernels with zero-point quantization and a group size of 128.
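These settings can be checked locally without downloading the weights. The snippet below is a minimal sketch; it assumes the parameters are recorded under quantization_config in the repository's config.json, as is usual for GPTQ checkpoints loadable with 🤗 transformers.
from transformers import AutoConfig

# Fetch only the configuration file; no model weights are downloaded for this check.
config = AutoConfig.from_pretrained("hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4")

# For GPTQ checkpoints this typically reports bits, group_size, desc_act, sym, and related fields.
print(config.quantization_config)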
🚀 Quick Start
Before using this quantized model, make sure your environment meets the requirements for running inference. INT4 inference with Llama 3.1 8B Instruct GPTQ needs roughly 4 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so somewhat more free VRAM than that should be available.
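As a rough pre-flight check, the sketch below (assuming PyTorch is installed and a CUDA device is visible) compares the GPU's total memory against that ~4 GiB floor:
import torch

if torch.cuda.is_available():
    # The INT4 checkpoint alone needs roughly 4 GiB; leave headroom for the KV cache and CUDA graphs.
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 total VRAM: {total_gib:.1f} GiB")
else:
    print("No CUDA device detected.")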
✨ Key Features
- Multilingual support: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Quantization: the model is quantized from FP16 to INT4 with AutoGPTQ, reducing the memory footprint and improving inference efficiency.
- Multiple inference options: can be run with transformers, AutoGPTQ, text-generation-inference, or vLLM.
📦 Installation
Run inference with transformers or AutoGPTQ
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
Run inference with text-generation-inference
pip install -q --upgrade huggingface_hub
huggingface-cli login
Run inference with vLLM
Docker must be installed (see the installation instructions).
💻 Usage Examples
Basic Usage
🤗 transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
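Note that batch_decode above returns the prompt together with the completion. If only the newly generated text is wanted, one option (a small follow-up sketch reusing the variables above) is to slice off the prompt tokens before decoding:
# Keep only the tokens generated after the prompt, then decode those.
generated = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(generated, skip_special_tokens=True))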
AutoGPTQ
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the already-quantized checkpoint with `from_quantized`, AutoGPTQ's entry point for GPTQ weights.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
Advanced Usage
🤗 Text Generation Inference (TGI)
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
-v hf_cache:/data \
-e MODEL_ID=hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
-e QUANTIZE=gptq \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e MAX_INPUT_LENGTH=4000 \
-e MAX_TOTAL_TOKENS=4096 \
ghcr.io/huggingface/text-generation-inference:2.2.0
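Before sending requests, you can wait for the server to finish loading the model. The sketch below assumes TGI's /health route (available in recent releases) and uses only the Python standard library:
import urllib.request

# A 200 response means TGI has loaded the model and is ready to serve requests.
status = urllib.request.urlopen("http://0.0.0.0:8080/health").status
print("TGI ready" if status == 200 else f"Unexpected status: {status}")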
Send a request to the deployed TGI endpoint:
curl 0.0.0.0:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
Send a request via the Python client:
import os
from huggingface_hub import InferenceClient
client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))
chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
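To inspect the reply from the call above, read the generated message from the returned object (this assumes the OpenAI-style response shape that huggingface_hub's InferenceClient returns):
# Print just the assistant's reply text from the chat completion.
print(chat_completion.choices[0].message.content)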
vLLM
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
-v hf_cache:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
--quantization gptq_marlin \
--max-model-len 4096
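Once the container is up, you can confirm that the model is registered with the OpenAI-compatible server. This sketch assumes the standard /v1/models listing route and uses only the Python standard library:
import urllib.request

# Lists the models served by the vLLM OpenAI-compatible server; the quantized model ID should appear here.
print(urllib.request.urlopen("http://0.0.0.0:8000/v1/models").read().decode())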
Send a request to the deployed vLLM endpoint:
curl 0.0.0.0:8000/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
Send a request via the Python client:
import os
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))
chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
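As with the TGI client, the assistant's reply from the call above can be printed from the standard OpenAI response object:
# Print just the assistant's reply text from the chat completion.
print(chat_completion.choices[0].message.content)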
🔧 Technical Details
Quantization Reproduction
Quantizing Llama 3.1 8B Instruct to INT4 with GPTQ requires an instance with enough CPU RAM to hold the whole model (roughly 8 GiB) and an NVIDIA GPU with 16 GiB of VRAM.
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
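Before launching the quantization, a quick hardware check can confirm the figures above. This is a minimal sketch that assumes a Linux host (for the sysconf RAM query) and a visible CUDA GPU:
import os
import torch

# Total system RAM (Linux-specific sysconf query); the FP16 model needs roughly 8 GiB of CPU RAM.
ram_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
# Total VRAM of the first CUDA device; about 16 GiB is needed for the GPTQ calibration passes.
vram_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"CPU RAM: {ram_gib:.1f} GiB, GPU 0 VRAM: {vram_gib:.1f} GiB")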
Run the following script to quantize the model:
import random
import numpy as np
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer
pretrained_model_dir = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantized_model_dir = "meta-llama/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
print("Loading tokenizer, dataset, and tokenizing the dataset...")
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="train")
encodings = tokenizer("\n\n".join(dataset["text"]), return_tensors="pt")
print("Setting random seeds...")
random.seed(0)
np.random.seed(0)
torch.random.manual_seed(0)
print("Setting calibration samples...")
nsamples = 128
seqlen = 2048
calibration_samples = []
for _ in range(nsamples):
    i = random.randint(0, encodings.input_ids.shape[1] - seqlen - 1)
    j = i + seqlen
    input_ids = encodings.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    calibration_samples.append({"input_ids": input_ids, "attention_mask": attention_mask})
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # a group size of 128 is recommended
    desc_act=True,  # setting this to False significantly speeds up inference, at a slight cost in perplexity
    sym=True,  # symmetric quantization keeps the range symmetric so that 0 is represented exactly (can provide speedups)
    damp_percent=0.1,  # see https://github.com/AutoGPTQ/AutoGPTQ/issues/196
)
# load the unquantized model; by default it is loaded into CPU memory
print("Load unquantized model...")
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize the model; each calibration example must be a dict containing only "input_ids" and "attention_mask"
print("Quantize model with calibration samples...")
model.quantize(calibration_samples)

# save the quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
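The script above saves only the quantized weights. As an optional follow-up (a sketch, not part of the original script), the tokenizer can be stored alongside them so the output directory is self-contained and directly loadable:
# Save the tokenizer next to the quantized weights.
tokenizer.save_pretrained(quantized_model_dir)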
📄 License
This project is released under the llama3.1 license.
Attribute | Details
--- | ---
Model type | Multilingual large language model (LLM)
Training data | Not specified
⚠️ Important Note
This repository is a community-driven quantized version of the original model. Running INT4 inference with Llama 3.1 8B Instruct GPTQ needs roughly 4 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so slightly more free VRAM should be available. Quantizing Llama 3.1 8B Instruct requires an instance with enough CPU RAM to hold the whole model (roughly 8 GiB) and an NVIDIA GPU with 16 GiB of VRAM.
💡 Usage Tips
Choose the option that fits your needs: transformers for general development, text-generation-inference for serving deployments, and vLLM for high-throughput inference. Whichever tool you use, follow the corresponding installation and run steps.