Llama-3.3-70B-Instruct Quantized Model (Open Source): Multilingual, Resource-Efficient, and High-Performance for Commercial and Research Use

Llama 3.3 70B Instruct Quantized.w4a16

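Developed by RedHatAI

A quantized model based on the Meta-Llama-3.1 architecture. It supports multiple languages, targets commercial and research use, and reduces resource requirements while retaining most of the original model's accuracy.

Large language model

Transformers

Supports multiple languages #MultilingualLLM #INT4Quantization #CommercialAndResearchUse

Downloads: 19.25k

Published: 1/2/2025

Model introduction

A 70B-parameter large language model whose weights are quantized to INT4, cutting disk size and GPU memory requirements by approximately 75%. It supports natural language generation in multiple languages.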

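Model features

Efficient quantization

INT4 weight quantization reduces disk size and GPU memory requirements by approximately 75%.

Multilingual support

Generates text in 8 languages, including English, French, and Italian.

Retained accuracy

After quantization, the model retains more than 98% of the original model's score across multiple benchmarks.

Commercially friendly

Suitable for commercial and research use across a range of deployment scenarios.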

Model capabilities

Multilingual text generation

Dialogue systems

Code generation

Knowledge question answering

Text summarization

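Use cases

Dialogue systems

Multilingual customer-service chatbot

Deploy an intelligent customer-service system that supports multiple languages.

Reaches 80.62% accuracy on MMLU (5-shot).

Code generation

Programming assistance

Help developers generate and optimize code.

Reaches 83.40% on HumanEval pass@1.

Education and research

Academic question-answering system

Build knowledge question-answering systems for education.

Reaches 49.49% accuracy on ARC Challenge (0-shot).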

🚀 Llama-3.3-70B-Instruct-quantized.w4a16

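Llama-3.3-70B-Instruct-quantized.w4a16 is a quantized model based on the Meta-Llama-3.1 architecture. It supports multiple languages, can be used in commercial and research settings, and reduces resource requirements while retaining high accuracy.

🚀 Quick start

The model can be deployed efficiently with the vLLM backend, as shown in the following example: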

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

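# add_generation_prompt=True appends the assistant header so the model answers as the assistant
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)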

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
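
To make the OpenAI-compatible route concrete, here is a minimal sketch that builds a chat-completions request using only the Python standard library. It assumes a server was started separately (for example with `vllm serve RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16`) and is listening on `http://localhost:8000`. The base URL and the helper names `build_chat_request` and `send` are illustrative assumptions, not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16"
BASE_URL = "http://localhost:8000"  # assumption: a locally running server


def build_chat_request(messages, max_tokens=256, temperature=0.6, top_p=0.9):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request([{"role": "user", "content": "Who are you?"}])
print(request.get_method(), request.full_url)
# With a server running, the reply could be fetched with: print(send(request))
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.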

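✨ Key features

Multilingual support: English, French, Italian, Portuguese, Hindi, Spanish, Thai, and German.
Model optimization: quantizing the weights of Llama-3.3-70B-Instruct to the INT4 data type reduces disk size and GPU memory requirements by approximately 75%.
Broad applicability: suitable for commercial and research use, including assistant-like chat and a variety of natural language generation tasks.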
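
The 75% figure follows from simple arithmetic on bits per weight. The sketch below is a back-of-the-envelope estimate only: it assumes a nominal 70 billion parameters, assumes the original weights are stored in 16 bits, and ignores quantization metadata and any layers left unquantized, so real checkpoint sizes will differ somewhat.

```python
NUM_PARAMS = 70e9  # assumption: nominal parameter count for a "70B" model


def weight_gigabytes(num_params, bits_per_weight):
    """Storage needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_gigabytes(NUM_PARAMS, 16)
int4_gb = weight_gigabytes(NUM_PARAMS, 4)
reduction = 1 - int4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"INT4 weights:   {int4_gb:.0f} GB")
print(f"Reduction:      {reduction:.0%}")
```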

📦 Installation guide

Deploy on Red Hat AI Inference Server

podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16

See the Red Hat AI Inference Server documentation for more details.
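
The `--max-model-len 32768` flag in the command above caps the context length, which bounds how much KV cache a single sequence can occupy. The sketch below gives a rough upper bound. The architecture values (80 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Llama 3 70B configuration rather than from this document, the cache is assumed to be 16-bit, and vLLM's real allocation (paged blocks, memory utilization limits) will not match this number exactly.

```python
# Assumed architecture values for Llama 3 70B-class models (not stated in this document).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # assumption: 16-bit KV cache

MAX_MODEL_LEN = 32768  # from --max-model-len
TENSOR_PARALLEL = 8    # from --tensor-parallel-size

# Each token stores one key vector and one value vector per layer and KV head.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

# Cache for one sequence at the maximum length, in GiB.
total_gib = kv_bytes_per_token * MAX_MODEL_LEN / 2**30

# Assumption: KV heads are split evenly across the tensor-parallel GPUs.
per_gpu_gib = total_gib / TENSOR_PARALLEL

print(f"KV cache per token:          {kv_bytes_per_token / 1024:.0f} KiB")
print(f"One full-length sequence:    {total_gib:.2f} GiB")
print(f"Per GPU with {TENSOR_PARALLEL}-way parallel: {per_gpu_gib:.2f} GiB")
```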

Deploy on Red Hat Enterprise Linux AI

# Download the model from the Red Hat Registry via docker
# Note: unless --model-dir is specified, the model is downloaded to ~/.cache/instructlab/models
ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-3-70b-instruct-quantized-w4a16:1.5

# Serve the model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/llama-3-3-70b-instruct-quantized-w4a16

# Chat with the model
ilab model chat --model ~/.cache/instructlab/models/llama-3-3-70b-instruct-quantized-w4a16

See the Red Hat Enterprise Linux AI documentation for more details.

Deploy on Red Hat OpenShift AI

# Set up the vllm server with a ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
 name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
 annotations:
   openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
   opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
 labels:
   opendatahub.io/dashboard: 'true'
spec:
 annotations:
   prometheus.io/port: '8080'
   prometheus.io/path: '/metrics'
 multiModel: false
 supportedModelFormats:
   - autoSelect: true
     name: vLLM
 containers:
   - name: kserve-container
     image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
     command:
       - python
       - -m
       - vllm.entrypoints.openai.api_server
     args:
       - "--port=8080"
       - "--model=/mnt/models"
       - "--served-model-name={{.Name}}"
     env:
       - name: HF_HOME
         value: /tmp/hf_home
     ports:
       - containerPort: 8080
         protocol: TCP

# Attach the model to the vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: llama-3-3-70b-instruct-quantized-w4a16 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: llama-3-3-70b-instruct-quantized-w4a16          # Specify the model name. This value is used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2'              # this is model specific
          memory: 8Gi           # this is model specific
          nvidia.com/gpu: '1'   # this is accelerator specific
        requests:               # the same applies to this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime  # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-3-70b-instruct-quantized-w4a16:1.5
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists

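# First, make sure you are in the project where you want to deploy the model
# oc project <project-name>

# Apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml

# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - If you are unsure, run `oc get inferenceservice` to find your URL.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \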
        -H "Content-Type: application/json" \
        -d '{
    "model": "llama-3-3-70b-instruct-quantized-w4a16",
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
        {
            "role": "user",
            "content": "How can a bee fly when its wings are so small?"
        }
    ]
}'

See the Red Hat OpenShift AI documentation for more details.
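
Because the curl request sets `"stream": true` with `"include_usage": true`, the server replies with a sequence of server-sent events rather than a single JSON body. The sketch below shows one way to collect the streamed text and the final usage block. It assumes the OpenAI-style chunk layout (`data: {...}` lines terminated by `data: [DONE]`); the function name `parse_sse_lines` and the sample lines are hypothetical and only illustrate the shape of the data.

```python
import json


def parse_sse_lines(lines):
    """Collect streamed text and the final usage block from SSE lines."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]  # usage arrives in a chunk with no choices
    return "".join(text_parts), usage


# Hypothetical sample stream, for illustration only.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Bees"}}]}',
    'data: {"choices": [], "usage": {"prompt_tokens": 21, "completion_tokens": 1, "total_tokens": 22}}',
    "data: [DONE]",
]

text, usage = parse_sse_lines(sample)
print(text, usage)
```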

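📚 Detailed documentation

Model overview

Property	Details
Model architecture	Meta-Llama-3.1
Input	Text
Output	Text
Model optimization	Weights quantized to INT4
Intended use cases	Intended for commercial and research use in multiple languages. Suited to assistant-like chat and a variety of natural language generation tasks. Model outputs may also be used to improve other models, including synthetic data generation and distillation.
Out of scope	Use in any manner that violates applicable laws or regulations (including trade compliance laws); use in any other way that is prohibited by the Acceptable Use Policy and the Llama 3.3 Community License; use in languages other than English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Release date	December 11, 2024
Version	1.0
License	llama3.3
Model developers	Red Hat (Neural Magic)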

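Model optimization

This model was obtained by quantizing the weights of Llama-3.3-70B-Instruct to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within Transformer blocks are quantized. Weights are quantized with a symmetric per-group scheme with a group size of 128, using the GPTQ algorithm as implemented in the llm-compressor library.

Creation details

This model was created with llm-compressor by running the code snippet below: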
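
To illustrate what "symmetric per-group with a group size of 128" means, the sketch below quantizes one row of weights using plain round-to-nearest. This is deliberately not GPTQ: GPTQ additionally uses calibration data to compensate for rounding error, which is omitted here. The signed range [-8, 7], the 16-bit scale size, and all function names are assumptions made for illustration and may not match how llm-compressor stores the checkpoint.

```python
import random

GROUP_SIZE = 128
QMIN, QMAX = -8, 7  # assumption: signed 4-bit integer range


def quantize_group(group):
    """Symmetric round-to-nearest: one scale per group, zero point fixed at 0."""
    max_abs = max(abs(w) for w in group)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in group]
    return codes, scale


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into consecutive groups and quantize each one."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]


def dequantize_row(quantized):
    """Map the integer codes back to approximate floating-point weights."""
    return [code * scale for codes, scale in quantized for code in codes]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights, two groups

quantized = quantize_row(row)
restored = dequantize_row(quantized)
max_error = max(abs(a - b) for a, b in zip(row, restored))

# Assumption: one 16-bit scale is stored per group, adding 16 / 128 bits per weight.
bits_per_weight = 4 + 16 / GROUP_SIZE

print("groups:", len(quantized))
print("max abs rounding error:", max_error)
print("effective bits per weight:", bits_per_weight)
```

Smaller groups track local weight magnitudes more closely but store more scales, so the group size trades accuracy against the metadata overhead shown in the last line.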

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset

# Load model
model_stub = "meta-llama/Llama-3.3-70B-Instruct"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    sequential_targets=["LlamaDecoderLayer"],
    dampening_frac=0.01,
)

# Apply quantization
oneshot(
    model=model,
    dataset=ds, 
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

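Evaluation

This model was evaluated on the well-known OpenLLM v1, HumanEval, and HumanEval+ benchmarks. In all cases, model outputs were generated with the vLLM engine. OpenLLM v1 evaluations were run with lm-evaluation-harness, using the prompting style of Meta-Llama-3.1-Instruct-evals where available. HumanEval and HumanEval+ evaluations were run with Neural Magic's fork of the EvalPlus repository.

Evaluation details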

**MMLU**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
  --tasks mmlu_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto
```

**MMLU-CoT**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks mmlu_cot_llama \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

**ARC-Challenge**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
  --tasks arc_challenge_llama \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

**GSM-8K**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 8 \
  --batch_size auto
```

**Hellaswag**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size auto
```

**Winogrande**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size auto
```

**TruthfulQA**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --batch_size auto
```

**HumanEval and HumanEval+**

*Generation*
```
python3 codegen/generate.py \
  --model RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*
```
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Llama-3.3-70B-Instruct-quantized.w4a16_vllm_temp_0.2
```

*Evaluation*
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Llama-3.3-70B-Instruct-quantized.w4a16_vllm_temp_0.2-sanitized
```
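
The HumanEval commands draw 50 samples per problem at temperature 0.2, and the scores are reported as pass@1. A common way to turn n samples with c correct ones into a pass@k score is the unbiased estimator introduced with the original HumanEval benchmark. The sketch below implements that estimator; it assumes, without confirmation from this document, that the reported scores were computed this way, and the per-problem counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average the per-problem estimates; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples drawn, samples that passed the tests).
results = [(50, 50), (50, 40), (50, 0)]

print("pass@1:", benchmark_pass_at_k(results, k=1))
print("pass@10:", benchmark_pass_at_k(results, k=10))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, averaged over problems.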

Accuracy

Category	Benchmark	Llama-3.3-70B-Instruct	Llama-3.3-70B-Instruct-quantized.w4a16 (this model)	Recovery
OpenLLM v1	MMLU (5-shot)	81.60	80.62	98.8%
OpenLLM v1	MMLU (CoT, 0-shot)	86.58	85.81	99.1%
OpenLLM v1	ARC Challenge (0-shot)	49.23	49.49	100.5%
OpenLLM v1	GSM-8K (CoT, 8-shot, strict-match)	94.16	94.47	100.3%
OpenLLM v1	Hellaswag (10-shot)	86.49	85.97	99.4%
OpenLLM v1	Winogrande (5-shot)	84.77		%
OpenLLM v1	TruthfulQA (0-shot, mc2)	62.75	61.66	98.3%
OpenLLM v1	Average	77.94	77.49	99.4%
Coding	HumanEval pass@1	83.20	83.40	100.2%
Coding	HumanEval+ pass@1	78.40	78.60	100.3%
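
The recovery column appears to be the quantized score expressed as a percentage of the baseline score. The sketch below recomputes it for three rows of the table under that assumption; the function name `recovery` is illustrative.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


# (baseline, quantized) pairs copied from the accuracy table.
scores = {
    "MMLU (5-shot)": (81.60, 80.62),
    "ARC Challenge (0-shot)": (49.23, 49.49),
    "HumanEval pass@1": (83.20, 83.40),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
```

A value above 100% means the quantized model scored slightly higher than the baseline on that benchmark, which is within normal run-to-run variation for these evaluations.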