Llama-3.3-70B-Instruct Quantized Model Open-Sourced: Multilingual Support, Resource-Efficient and High-Performing for Commercial and Research Use

Llama 3.3 70B Instruct Quantized.w4a16

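Developed by RedHatAI

A quantized, optimized model based on the Meta-Llama-3.1 architecture. It supports multiple languages, suits commercial and research use, and keeps high performance while reducing resource requirements.

Large Language Model

Transformers

Multilingual #MultilingualLLM #INT4Quantization #CommercialAndResearchUse

Downloads: 19.25k

Released: 1/2/2025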

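Model Overview

This is a quantized 70B-parameter large language model. INT4 weight quantization cuts its storage and memory requirements by approximately 75%, and it supports natural language generation tasks in multiple languages.

Model Features

Efficient Quantization

INT4 weight quantization reduces disk size and GPU memory requirements by approximately 75%.

Multilingual Support

Supports text generation in 8 languages, including English, French, and Italian.

Performance Retention

After quantization, the model retains over 98% of the original model's performance on multiple benchmarks.

Commercially Friendly

Suitable for commercial and research use, with support for multiple deployment scenarios.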

Model Capabilities

Multilingual text generation

Dialogue systems

Code generation

Knowledge question answering

Text summarization

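Use Cases

Dialogue Systems

Multilingual customer service chatbot

Deploy an intelligent customer service system that supports multiple languages.

Achieves 80.62% accuracy on the MMLU benchmark.

Code Generation

Programming assistance

Help developers generate and optimize code.

Achieves 83.40% on HumanEval pass@1.

Education and Research

Academic question-answering system

Build knowledge question-answering systems for the education domain.

Achieves 49.49% accuracy on the ARC Challenge benchmark.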

🚀 Llama-3.3-70B-Instruct-quantized.w4a16

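Llama-3.3-70B-Instruct-quantized.w4a16 is a quantized, optimized model based on the Meta-Llama-3.1 architecture. It supports multiple languages, can be used in commercial and research settings, and retains high performance while reducing resource requirements.

🚀 Quick Start

This model can be deployed efficiently using the vLLM backend, as shown in the example below: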

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

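# add_generation_prompt=True appends the assistant header so the model starts its reply directly
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)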

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
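As an illustration of the OpenAI-compatible route, the sketch below builds a chat-completions request using only the Python standard library. It assumes a vLLM server is already running and reachable at `http://localhost:8000` (the port used in the container example in this document) and that the served model name is the Hugging Face model ID; both `BASE_URL` and the helper names are assumptions for this example, not part of the original card.

```python
import json
import urllib.request

# Assumption: a vLLM OpenAI-compatible server is listening locally on port 8000.
BASE_URL = "http://localhost:8000/v1"
# Assumption: the server exposes the model under its Hugging Face ID.
MODEL_ID = "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16"


def build_chat_request(messages, max_tokens=256, temperature=0.6, top_p=0.9):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat([{"role": "user", "content": "Who are you?"}])` against a running server should return the reply as a string. The sampling values mirror the ones used in the offline example.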

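✨ Key Features

Multilingual support: Supports English, French, Italian, Portuguese, Hindi, Spanish, Thai, and German.
Model optimization: Quantizing the weights of Llama-3.3-70B-Instruct to the INT4 data type reduces disk size and GPU memory requirements by approximately 75%.
Broad applicability: Suitable for commercial and research use, including assistant-like chat and a variety of natural language generation tasks.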
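The 75% figure follows directly from the bit widths. The sketch below works through the arithmetic for a nominal 70 billion parameters. It is a back-of-the-envelope estimate only: it ignores quantization scales, layers left unquantized, and runtime memory such as the KV cache, and the helper name is made up for this example.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9  # nominal parameter count; the exact count differs slightly

fp16_gb = weight_footprint_gb(NUM_PARAMS, 16)
int4_gb = weight_footprint_gb(NUM_PARAMS, 4)
reduction = 1 - int4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"INT4 weights:   {int4_gb:.0f} GB")
print(f"Reduction:      {reduction:.0%}")
```

Under these assumptions the weights shrink from roughly 140 GB to roughly 35 GB.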

📦 Installation Guide

Deploy on Red Hat AI Inference Server

podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
 --ipc=host \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
--name=vllm \
registry.access.redhat.com/rhaiis/rh-vllm-cuda \
vllm serve \
--tensor-parallel-size 8 \
--max-model-len 32768  \
--enforce-eager --model RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16

See the Red Hat AI Inference Server documentation for more details.

Deploy on Red Hat Enterprise Linux AI

# Download the model from the Red Hat Registry via docker
# Note: the model is downloaded to ~/.cache/instructlab/models unless --model-dir is specified
ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-3-70b-instruct-quantized-w4a16:1.5

# Serve the model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/llama-3-3-70b-instruct-quantized-w4a16

# Chat with the model
ilab model chat --model ~/.cache/instructlab/models/llama-3-3-70b-instruct-quantized-w4a16

See the Red Hat Enterprise Linux AI documentation for more details.

Deploy on Red Hat OpenShift AI

# Set up the vllm server with a ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
 name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
 annotations:
   openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
   opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
 labels:
   opendatahub.io/dashboard: 'true'
spec:
 annotations:
   prometheus.io/port: '8080'
   prometheus.io/path: '/metrics'
 multiModel: false
 supportedModelFormats:
   - autoSelect: true
     name: vLLM
 containers:
   - name: kserve-container
     image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
     command:
       - python
       - -m
       - vllm.entrypoints.openai.api_server
     args:
       - "--port=8080"
       - "--model=/mnt/models"
       - "--served-model-name={{.Name}}"
     env:
       - name: HF_HOME
         value: /tmp/hf_home
     ports:
       - containerPort: 8080
         protocol: TCP

# Attach the model to the vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: llama-3-3-70b-instruct-quantized-w4a16 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: llama-3-3-70b-instruct-quantized-w4a16          # specify the model name. This value is used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2'  # this is model specific
          memory: 8Gi  # this is model specific
          nvidia.com/gpu: '1'  # this is accelerator specific
        requests:  # the same applies to this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime  # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-3-70b-instruct-quantized-w4a16:1.5
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists

# First make sure you are in the project where you want to deploy the model
# oc project <project-name>

# Apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml

# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
    "model": "llama-3-3-70b-instruct-quantized-w4a16",
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
        {
            "role": "user",
            "content": "How can a bee fly when its wings are so small?"
        }
    ]
}'

See the Red Hat OpenShift AI documentation for more details.
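Because the curl request above sets `"stream": true`, the server replies with a sequence of server-sent-event lines rather than a single JSON body. The sketch below shows one way such a stream could be decoded. It assumes the chunk layout of the OpenAI chat-completions streaming format (`data: {...}` lines ending with `data: [DONE]`, text in `choices[].delta.content`, and a final chunk carrying `usage` when `include_usage` is set). The sample lines are hand-written for illustration and are not captured server output.

```python
import json


def iter_stream_events(lines):
    """Yield decoded JSON chunks from an iterable of server-sent-event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        yield json.loads(data)


def collect_stream(lines):
    """Concatenate streamed text deltas and capture the usage block, if any."""
    text_parts, usage = [], None
    for chunk in iter_stream_events(lines):
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(text_parts), usage


# Hand-written sample lines for illustration only.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Bees"}}]}',
    'data: {"choices": [{"delta": {"content": " fly"}}]}',
    'data: {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 2, "total_tokens": 14}}',
    "data: [DONE]",
]
text, usage = collect_stream(sample)
print(text)
print(usage["total_tokens"])
```

In practice the `lines` argument would be the decoded lines of the HTTP response body. Keeping the parser separate from the transport makes it easy to test without a running server.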


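📚 Detailed Documentation

Model Overview

Attribute	Details
Model type	Meta-Llama-3.1
Input	Text
Output	Text
Model optimization	Weights quantized to INT4
Intended use cases	Intended for commercial and research use in multiple languages, including assistant-like chat and a variety of natural language generation tasks. Model outputs may also be used to improve other models, including synthetic data generation and distillation.
Out of scope	Use in any manner that violates applicable laws or regulations (including trade compliance laws); use in any other way prohibited by the Acceptable Use Policy and the Llama 3.3 Community License; use in languages other than English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Release date	December 11, 2024
Version	1.0
License	llama3.3
Model developers	Red Hat (Neural Magic)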

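Model Optimization

This model was obtained by quantizing the weights of Llama-3.3-70B-Instruct to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, which reduces disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within Transformer blocks are quantized. Weights are quantized with a symmetric per-group scheme using a group size of 128. Quantization uses the GPTQ algorithm as implemented in the llm-compressor library.

Creation Details

This model was created with llm-compressor using the code snippet below: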
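To make "symmetric per-group, group size 128" concrete, here is a minimal pure-Python sketch of the number format. It uses plain round-to-nearest, not GPTQ: GPTQ additionally uses calibration data to choose roundings that reduce each layer's output error. The scale convention (largest magnitude in the group mapped to 7) and all function names are assumptions for illustration and may differ from what llm-compressor does internally.

```python
import math

GROUP_SIZE = 128  # group size stated in the text
QMAX = 7          # assumption: scale maps the largest |w| in a group to 7


def quantize_group(weights):
    """Quantize one group of floats to signed 4-bit integers with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    # Symmetric scheme: no zero point, values clipped to the signed 4-bit range.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map 4-bit integers back to floats."""
    return [v * scale for v in q]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into consecutive groups and quantize each independently."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]


# Synthetic weights for illustration only.
row = [0.05 * math.sin(i) for i in range(256)]
groups = quantize_row(row)

restored = [w for q, scale in groups for w in dequantize_group(q, scale)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), "groups, max abs error:", max_error)
```

Each group of 128 weights shares one scale, so an outlier in one group does not degrade the resolution of the others. The rounding error per weight is bounded by half of that group's scale.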

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset

# Load the model
model_stub = "meta-llama/Llama-3.3-70B-Instruct"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    sequential_targets=["LlamaDecoderLayer"],
    dampening_frac=0.01,
)

# Apply quantization
oneshot(
    model=model,
    dataset=ds, 
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

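Evaluation

This model was evaluated on the well-known OpenLLM v1, HumanEval, and HumanEval+ benchmarks. In all cases, model outputs were generated with the vLLM engine. OpenLLM v1 evaluations were run with lm-evaluation-harness, using the prompting style of Meta-Llama-3.1-Instruct-evals where available. HumanEval and HumanEval+ evaluations were run with Neural Magic's fork of the EvalPlus repository.

Evaluation Details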

**MMLU**

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
  --tasks mmlu_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto
```

**MMLU-CoT**

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks mmlu_cot_llama \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

**ARC-Challenge**

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
  --tasks arc_challenge_llama \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

**GSM-8K**

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 8 \
  --batch_size auto
```

**Hellaswag**

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size auto
```

**Winogrande**

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size auto
```

**TruthfulQA**

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --batch_size auto
```

**HumanEval and HumanEval+**

*Generation*

```
python3 codegen/generate.py \
  --model RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*

```
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Llama-3.3-70B-Instruct-quantized.w4a16_vllm_temp_0.2
```

*Evaluation*

```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Llama-3.3-70B-Instruct-quantized.w4a16_vllm_temp_0.2-sanitized
```
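The HumanEval commands above draw 50 samples per problem, and the reported metric is pass@1. The sketch below shows the commonly used unbiased pass@k estimator, which for a problem with n samples of which c are correct is 1 - C(n-c, k) / C(n, k). Whether the EvalPlus fork used here computes the metric exactly this way is an assumption, and the per-problem counts in the example are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts with n = 50 samples each.
results = [(50, 50), (50, 25), (50, 0)]
print(mean_pass_at_k(results, k=1))
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems. Drawing many samples per problem mainly reduces the variance of that estimate.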

Accuracy

Category	Benchmark	Llama-3.3-70B-Instruct	Llama-3.3-70B-Instruct-quantized.w4a16 (this model)	Recovery
OpenLLM v1	MMLU (5-shot)	81.60	80.62	98.8%
OpenLLM v1	MMLU (CoT, 0-shot)	86.58	85.81	99.1%
OpenLLM v1	ARC Challenge (0-shot)	49.23	49.49	100.5%
OpenLLM v1	GSM-8K (CoT, 8-shot, strict-match)	94.16	94.47	100.3%
OpenLLM v1	Hellaswag (10-shot)	86.49	85.97	99.4%
OpenLLM v1	Winogrande (5-shot)	84.77		%
OpenLLM v1	TruthfulQA (0-shot, mc2)	62.75	61.66	98.3%
OpenLLM v1	Average	77.94	77.49	99.4%
Coding	HumanEval pass@1	83.20	83.40	100.2%
Coding	HumanEval+ pass@1	78.40	78.60	100.3%
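The recovery column is the quantized score expressed as a percentage of the baseline score. The short sketch below recomputes it for the rows where both scores are present in the table. The Winogrande row is left out because its quantized score is missing from the source, and the function name is chosen for this example.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100


# (baseline, quantized) pairs copied from the table above.
scores = {
    "MMLU (5-shot)": (81.60, 80.62),
    "MMLU (CoT, 0-shot)": (86.58, 85.81),
    "ARC Challenge (0-shot)": (49.23, 49.49),
    "GSM-8K (CoT, 8-shot, strict-match)": (94.16, 94.47),
    "Hellaswag (10-shot)": (86.49, 85.97),
    "TruthfulQA (0-shot, mc2)": (62.75, 61.66),
    "HumanEval pass@1": (83.20, 83.40),
    "HumanEval+ pass@1": (78.40, 78.60),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
```

Values above 100% mean the quantized model scored slightly higher than the baseline on that benchmark, which is within the run-to-run variation typical of such evaluations.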