Qwen2.5-7B-Instruct Quantized Open-Source Model - Suited to Multilingual Scenarios, Optimized for Memory and Compute

Qwen2.5 7B Instruct Quantized.w8a8

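Developed by RedHatAI

An INT8-quantized version of Qwen2.5-7B-Instruct for multilingual commercial and research use, with reduced memory requirements and higher compute throughput.

Large language model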

Safetensors

Language: English  License: Apache-2.0  #INT8-quantization #multilingual-assistant #efficient-inference

Downloads: 412

Released: 10/9/2024

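Model overview

This model is an INT8-quantized version of Qwen2.5-7B-Instruct. Representing weights and activations with fewer bits lowers GPU memory requirements and improves compute efficiency. It is intended for assistant-like chat.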

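Model features

INT8 quantization

Quantizing both weights and activations to INT8 substantially reduces GPU memory requirements and disk usage while increasing compute throughput.

Efficient deployment

The model can be deployed efficiently with the vLLM backend, which makes it suitable for large-scale production environments.

Multilingual support

Suited to multilingual scenarios, particularly for commercial and research use.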

Model capabilities

Text generation

Multilingual chat

Commercial and research use

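Use cases

Chat assistant

Multilingual chat

Assistant-like chat with support for multilingual interaction.

Provides a fluent conversational experience for commercial and research settings.

Commercial applications

Customer support

Automated customer support systems that respond quickly.

Reduces staffing costs and improves customer satisfaction.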

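🚀 Qwen2.5-7B-Instruct Quantized Model (w8a8)

This project is a quantized version of Qwen2.5-7B-Instruct in which weights and activations are quantized to the INT8 data type. This reduces GPU memory requirements and disk usage while increasing compute throughput. The model is intended for multilingual commercial and research use.

🚀 Quick start

The model can be deployed efficiently with the vLLM backend, as in the following example: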

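from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8"
number_gpus = 1       # tensor-parallel degree
max_model_len = 8192  # maximum context length (prompt plus generated tokens)

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language model."},
]

# Render the chat template to a string; add_generation_prompt=True appends the
# assistant turn header so the model answers instead of continuing the user turn.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)

outputs = llm.generate(prompts, sampling_params)

# One prompt was submitted, so read the first completion of the first request.
generated_text = outputs[0].outputs[0].text
print(generated_text)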

vLLM also supports OpenAI-compatible serving; see the documentation for details.
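As a rough illustration of the OpenAI-compatible route, the sketch below builds a chat completion request and posts it with the Python standard library. It assumes a server has already been started separately (for example with `vllm serve RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8`) and is listening on `http://localhost:8000`. The URL, the helper names `build_chat_request` and `send_chat_request`, and the reuse of the sampling values from the quick start are illustrative assumptions, not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8"
# Assumption: a vLLM OpenAI-compatible server is running locally on port 8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(user_text, model=MODEL_ID, max_tokens=256):
    """Build the JSON body for a /chat/completions call (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload and return the assistant's reply text (hypothetical helper)."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("Give me a short introduction to large language models.")
# Uncomment once a server is reachable at BASE_URL:
# print(send_chat_request(payload))
```

With this route the server applies the chat template itself, so the client sends structured messages and not a pre-rendered prompt string.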

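✨ Key features

Model architecture: based on the Qwen2 architecture; both input and output are text.
Model optimization: the activations and weights of Qwen2.5-7B-Instruct are quantized to the INT8 data type, which reduces GPU memory requirements by about 50%, increases matrix-multiplication compute throughput by about 2x, and reduces disk space requirements by about 50%.
Intended use cases: multilingual commercial and research use. Like Qwen2.5-7B, the model is intended for assistant-like chat.
Out of scope: any use that violates applicable laws or regulations (including trade compliance laws).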
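The "about 50%" figure follows from simple arithmetic on bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes roughly 7.6 billion parameters and a 16-bit baseline format, neither of which is stated on this page, and it counts weights only.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumptions (not stated in the model card): roughly 7.6e9 parameters and a
# 16-bit (2-byte) baseline format such as BF16.
ASSUMED_NUM_PARAMS = 7.6e9

baseline_gib = weight_memory_gib(ASSUMED_NUM_PARAMS, bytes_per_param=2)
int8_gib = weight_memory_gib(ASSUMED_NUM_PARAMS, bytes_per_param=1)
reduction = 1 - int8_gib / baseline_gib

print(f"16-bit weights: {baseline_gib:.1f} GiB")
print(f"INT8 weights:   {int8_gib:.1f} GiB")
print(f"Reduction:      {reduction:.0%}")
```

In practice the saving is likely somewhat below this idealized half, since any tensors left unquantized keep their original precision and runtime memory such as the KV cache is not counted here.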

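📚 Detailed documentation

Model optimization

This model was obtained by quantizing the activations and weights of Qwen2.5-7B-Instruct to the INT8 data type. Only the weights and activations of the linear operators within Transformer blocks are quantized. Weights use a symmetric, static, per-channel quantization scheme; activations use a symmetric, dynamic, per-token scheme. Quantization combines the SmoothQuant and GPTQ algorithms, as implemented in the llm-compressor library.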
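To make the two schemes concrete, here is a minimal, dependency-free sketch of symmetric INT8 quantization with one scale per row. Applied to a weight matrix of shape (out_features, in_features), a row is an output channel and the scales are computed once offline (static). Applied to activations of shape (tokens, features), a row is a token and the scales are computed from the live input (dynamic). It uses plain round-to-nearest, not GPTQ, and the function names are illustrative and not taken from llm-compressor or vLLM.

```python
def quantize_rows_symmetric(matrix, num_bits=8):
    """Symmetric quantization with one scale per row (no zero point)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric range.
        q_rows.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_linear(x, w_q, w_scales):
    """Compute x @ W^T from quantized weights and dynamically quantized activations."""
    x_q, x_scales = quantize_rows_symmetric(x)  # dynamic: scales from the live input
    out = []
    for xq_row, xs in zip(x_q, x_scales):
        out_row = []
        for wq_row, ws in zip(w_q, w_scales):
            acc = sum(a * b for a, b in zip(xq_row, wq_row))  # integer accumulation
            out_row.append(acc * xs * ws)  # rescale back to floating point
        out.append(out_row)
    return out


weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]  # (out_features=2, in_features=3)
w_q, w_scales = quantize_rows_symmetric(weights)  # static: computed once, offline
activations = [[1.0, 2.0, -3.0]]                  # a single token
result = int8_linear(activations, w_q, w_scales)
reference = [[sum(a * w for a, w in zip(activations[0], row)) for row in weights]]
print(result, reference)
```

Because each output element is an integer dot product multiplied by one activation scale and one weight scale, the matrix multiplication itself can run entirely in INT8.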

Model creation

This model was created with llm-compressor using the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset

# Load model
model_stub = "Qwen/Qwen2.5-7B-Instruct"
model_name = model_stub.split("/")[-1]

num_samples = 512
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

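# Render each calibration conversation to plain text with the model's chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

# Calibration data used by SmoothQuant and GPTQ
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)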

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
    ),
    GPTQModifier(
        ignore=["lm_head"],
        sequential_targets=["Qwen2DecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
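The `smoothing_strength=0.8` setting in the recipe controls how much quantization difficulty SmoothQuant shifts from activations to weights. The sketch below follows the published SmoothQuant formulation, s_j = max|X_j|^α / max|W_j|^(1-α) per input channel j, and assumes that `smoothing_strength` plays the role of α. It is a toy illustration on small lists, not llm-compressor's implementation, and all function names are hypothetical.

```python
def column_abs_max(matrix):
    """Largest absolute value in each column (input channel)."""
    return [max(abs(row[j]) for row in matrix) for j in range(len(matrix[0]))]


def smoothquant_scales(act_abs_max, weight_abs_max, alpha=0.8):
    """Per-input-channel smoothing factors s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_abs_max, weight_abs_max)]


def apply_smoothing(x, w, scales):
    """Divide activations by s and multiply weights by s; x @ w^T is unchanged."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v, s in zip(row, scales)] for row in w]
    return x_s, w_s


def matmul_t(x, w):
    """Plain floating-point x @ w^T."""
    return [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w] for x_row in x]


x = [[0.1, 20.0], [-0.2, -16.0]]  # (tokens, in_features); channel 1 is an outlier
w = [[0.5, 0.4], [-0.3, 0.2]]     # (out_features, in_features)

scales = smoothquant_scales(column_abs_max(x), column_abs_max(w), alpha=0.8)
x_s, w_s = apply_smoothing(x, w, scales)

before = column_abs_max(x)
after = column_abs_max(x_s)
spread_before = max(before) / min(before)
spread_after = max(after) / min(after)
print(f"activation channel spread: {spread_before:.1f} -> {spread_after:.1f}")
```

The layer output is mathematically unchanged, but the activation channels end up with much more similar ranges, which makes per-token INT8 quantization less lossy.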

Model evaluation

This model was evaluated on the OpenLLM leaderboard tasks (version 1) using lm-evaluation-harness (commit 387Bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, with the following command:

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Qwen2.5-7B-Instruct-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=4096,add_bos_token=True,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --batch_size auto

Accuracy

Evaluation scores on the Open LLM Leaderboard:

Benchmark	Qwen2.5-7B-Instruct	Qwen2.5-7B-Instruct quantized (w8a8)	Recovery
MMLU (5-shot)	74.24	73.87	99.5%
ARC Challenge (25-shot)	63.40	63.23	99.7%
GSM-8K (5-shot, strict-match)	80.36	80.74	100.5%
Hellaswag (10-shot)	81.52	81.06	99.4%
Winogrande (5-shot)	74.66	74.82	100.2%
TruthfulQA (0-shot, mc2)	64.76	64.58	99.7%
Average	73.16	73.05	99.8%
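The per-benchmark recovery values can be reproduced from the two score columns, assuming recovery is defined as the quantized score expressed as a percentage of the baseline score. The short sketch below uses only the numbers from the table; the `recovery` helper is illustrative.

```python
# (baseline score, quantized score) copied from the table above
scores = {
    "MMLU (5-shot)": (74.24, 73.87),
    "ARC Challenge (25-shot)": (63.40, 63.23),
    "GSM-8K (5-shot, strict-match)": (80.36, 80.74),
    "Hellaswag (10-shot)": (81.52, 81.06),
    "Winogrande (5-shot)": (74.66, 74.82),
    "TruthfulQA (0-shot, mc2)": (64.76, 64.58),
}


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * quantized / baseline, 1)


recoveries = {name: recovery(base, quant) for name, (base, quant) in scores.items()}
for name, value in recoveries.items():
    print(f"{name}: {value}%")
```

A value above 100% means the quantized model scored slightly higher than the baseline on that benchmark.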