Zephyr 7B Beta: an open-source, AWQ-quantized model for efficient inference

Zephyr 7B Beta AWQ

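Quantized by TheBloke

Zephyr 7B Beta is a 7B-parameter model from Hugging Face H4, built on the Mistral architecture. This release is quantized with AWQ for efficient inference.

Large language model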

Transformers

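Language: English
License: MIT
Tags: #Efficient 4-bit quantization #Multi-platform inference support #Optimized for dialogue systems

Downloads: 1,728

Released: 10/27/2023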

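Model Overview

Zephyr 7B Beta is a language model for text generation. This version is quantized with AWQ so that it runs efficiently in a range of inference environments.

Model Features

Efficient quantization

Uses AWQ for 4-bit quantization, which substantially reduces memory use and inference time while retaining relatively high accuracy.

Multi-platform support

Supports inference with text-generation-webui, vLLM, Hugging Face Text Generation Inference (TGI), and AutoAWQ.

Multiple versions available

AWQ, GPTQ, and GGUF quantized versions of the model are available to suit different needs.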

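Model Capabilities

Text generation

Dialogue systems

Question answering

Use Cases

Dialogue systems

Conversational assistants

Used to build conversational systems that support natural-language interaction.

Generates fluent, natural replies.

Question answering

Knowledge Q&A

Used to answer questions posed by users.

Aims to provide accurate, relevant answers.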

🚀 Zephyr 7B Beta - AWQ

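This project provides an AWQ-quantized version of Hugging Face H4's Zephyr 7B Beta model for efficient inference. AWQ quantization substantially reduces memory use and inference time while preserving most of the model's accuracy, which makes the model suitable for a range of inference environments.

🚀 Quick Start

The sections below describe the AWQ-quantized Zephyr 7B Beta model and how to use it.

✨ Key Features

Efficient quantization: quantized to 4 bits with AWQ, which speeds up inference while maintaining accuracy.
Multi-platform support: supports inference with text-generation-webui, vLLM, Hugging Face Text Generation Inference (TGI), and AutoAWQ.
Multiple versions available: in addition to the AWQ model, GPTQ and GGUF quantized versions are also provided.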

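📦 Installation Guide

Using the model in text-generation-webui

Make sure you are using the latest version of text-generation-webui. Using the one-click installer is strongly recommended unless you are sure you know how to install manually.

Click the Model tab.
Under Download custom model or LoRA, enter TheBloke/zephyr-7B-beta-AWQ.
Click Download.
The model will start downloading. When the download finishes, it will say "Done".
In the top left, click the refresh icon next to Model.
In the Model dropdown, choose the model you just downloaded: zephyr-7B-beta-AWQ.
Select Loader: AutoAWQ.
Click Load. The model will load and be ready for use.
If you want custom settings, set them, click Save settings for this model, and then click Reload the Model in the top right.
When you are ready, click the Text Generation tab and enter a prompt to get started.

Inference from Python code using AutoAWQ

Installing the AutoAWQ package

AutoAWQ 0.1.1 or later is required.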

pip3 install autoawq

If you have problems installing AutoAWQ from the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

💻 Usage Example

Basic Usage

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)
# Load the quantized model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)

prompt = "Tell me about AI"
prompt_template=f'''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''

print("*** Running model.generate:")

token_input = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    token_input,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

# Take the output tokens, decode them, and print the result
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("LLM output: ", text_output)
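The decoded text in the example above contains the prompt as well as the completion, because `generate` returns the input tokens followed by the new ones. As a minimal sketch, the reply alone could be recovered with plain string handling. The helper name `extract_reply` is hypothetical, and the sketch assumes the `<|assistant|>` and `</s>` markers appear literally in the decoded string, which depends on how the tokenizer decodes special tokens.

```python
def extract_reply(decoded_text, assistant_tag="<|assistant|>", eos_token="</s>"):
    """Return only the assistant's reply from a decoded Zephyr-style transcript.

    Hypothetical helper: assumes the markers survive decoding as literal text.
    """
    # Everything after the last assistant tag is the model's reply.
    # If the tag is missing, rpartition leaves the whole text in `reply`.
    _, _, reply = decoded_text.rpartition(assistant_tag)
    reply = reply.strip()
    # Drop the end-of-sequence marker if generation stopped on it.
    if reply.endswith(eos_token):
        reply = reply[: -len(eos_token)]
    return reply.strip()


decoded = (
    "<s><|system|>\n</s>\n<|user|>\nTell me about AI</s>\n"
    "<|assistant|>\nAI is a field of study.</s>"
)
print(extract_reply(decoded))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` is another option, although it may also remove markers that this helper relies on.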

📚 Detailed Documentation

Repositories available

Prompt template: Zephyr

<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
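A small helper can fill in this template so that it is not retyped in every script. This is a sketch only: the function name `build_zephyr_prompt` and the optional `system_message` argument are illustrative assumptions, and the layout simply mirrors the template shown above.

```python
def build_zephyr_prompt(user_message, system_message=""):
    """Format a single-turn prompt using the Zephyr template shown above.

    Hypothetical helper: an empty system message reproduces the template
    exactly as printed in this document.
    """
    return (
        "<|system|>\n"
        f"{system_message}</s>\n"
        "<|user|>\n"
        f"{user_message}</s>\n"
        "<|assistant|>\n"
    )


print(build_zephyr_prompt("Tell me about AI"))
```

The returned string ends with the assistant tag and a newline, so the model's completion begins immediately after it.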

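Provided files and AWQ parameters

For this first release of AWQ models, only 128g models are provided. 32g models may be added if there is interest and once perplexity and evaluation comparisons are complete, but 32g models have not yet been fully tested with AutoAWQ and vLLM. Models are released as sharded safetensors files.

Branch	Bits	GS	AWQ dataset	Sequence length	Size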
main	4	128	wikitext	4096	4.15 GB
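The Bits and GS (group size) columns together determine most of the file size. The sketch below is rough arithmetic and not a description of the exact file layout. It assumes each group of `group_size` weights stores one 16-bit scale and one 4-bit zero point, and it ignores any layers left unquantized, so it will not reproduce the listed size exactly. The function name is hypothetical.

```python
def estimate_quantized_bytes(n_params, bits=4, group_size=128,
                             scale_bits=16, zero_bits=4):
    """Approximate storage in bytes for group-wise quantized weights.

    Assumption (not taken from the source): one scale and one zero point
    are stored per group of `group_size` weights.
    """
    weight_bytes = n_params * bits / 8
    n_groups = n_params / group_size
    overhead_bytes = n_groups * (scale_bits + zero_bits) / 8
    return weight_bytes + overhead_bytes


# Illustrative round number of parameters, not the real model's count.
one_billion = 1_000_000_000
print(estimate_quantized_bytes(one_billion) / 1e9, "GB per billion weights")
```

Under these assumptions, a smaller group size such as 32 would raise the per-group overhead fourfold, which is one reason 128g and 32g files differ in size.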

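Multi-user inference server: vLLM

For instructions on installing and using vLLM, see the official vLLM documentation.

Make sure you are using vLLM version 0.2 or later.
When using vLLM as a server, pass the --quantization awq parameter. For example:

python3 -m vllm.entrypoints.api_server --model TheBloke/zephyr-7B-beta-AWQ --quantization awq

When using vLLM from Python code, likewise set quantization="awq". For example: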

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
# Plain string, not an f-string, so that {prompt} is filled in by .format() below
prompt_template = '''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/zephyr-7B-beta-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
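When the model runs in server mode with `vllm.entrypoints.api_server`, a client talks to it over HTTP instead of importing vLLM. The sketch below uses only the standard library. It assumes the server listens on `http://localhost:8000` and accepts a JSON POST at `/generate` with a `prompt` field plus sampling parameters. That matches my understanding of this simple API server, but treat the URL, field names, and response shape as assumptions to check against the installed vLLM version. Both function names are hypothetical.

```python
import json
import urllib.request


def build_generate_payload(user_message, max_tokens=256, temperature=0.8, top_p=0.95):
    """Build the JSON body for a generate request (field names are assumptions)."""
    prompt = f"<|system|>\n</s>\n<|user|>\n{user_message}</s>\n<|assistant|>\n"
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def post_generate(payload, url="http://localhost:8000/generate"):
    """Send the payload to a running server and return the parsed JSON reply."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_generate_payload("Tell me about AI")
print(json.dumps(payload, indent=2))
# With a server running, the call would be: result = post_generate(payload)
```

The request itself is left commented out so that the snippet runs without a server.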

Multi-user inference server: Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/zephyr-7B-beta-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
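The two length limits in these parameters interact: the prompt may use at most `--max-input-length` tokens, and the prompt plus generated tokens may use at most `--max-total-tokens`. The sketch below works through that arithmetic for the values shown above. The function names are hypothetical, and token counts are assumed to come from the model's tokenizer.

```python
def generation_budget(max_input_length, max_total_tokens):
    """Tokens left for generation when the prompt uses the full input limit."""
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    return max_total_tokens - max_input_length


def request_fits(prompt_tokens, max_new_tokens,
                 max_input_length=3696, max_total_tokens=4096):
    """Check a request against both limits from the example parameters."""
    within_input = prompt_tokens <= max_input_length
    within_total = prompt_tokens + max_new_tokens <= max_total_tokens
    return within_input and within_total


print(generation_budget(3696, 4096))  # 4096 - 3696 = 400
print(request_fits(prompt_tokens=1000, max_new_tokens=128))
```

With these settings, a prompt that fills the input limit leaves 400 tokens for the reply, so a larger `max_new_tokens` only fits alongside shorter prompts.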

Example Python code for interacting with TGI (requires huggingface-hub 0.17.0 or later):

pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''

client = InferenceClient(endpoint_url)
# Send the formatted template rather than the bare prompt
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output:", response)

🔧 Technical Details

Compatibility

The provided files have been tested with:

text-generation-webui, using Loader: AutoAWQ.
vLLM version 0.2.0 and later.
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later.
AutoAWQ version 0.1.1 and later.
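The minimum versions listed here can be checked in the current Python environment before the model is loaded. The sketch below uses `importlib.metadata` from the standard library. The package names `autoawq`, `vllm`, and `huggingface-hub` are the pip names assumed for these tools, and the version parser is deliberately simple: it compares only the first three numeric components and is not a full implementation of Python's version rules.

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn '0.2.0+cu118' into (0, 2, 0); non-numeric parts count as 0."""
    parts = []
    for part in version.split(".")[:3]:
        match = re.match(r"\d+", part)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# Minimum versions taken from this document; pip names are assumptions.
MINIMUMS = {"autoawq": "0.1.1", "vllm": "0.2.0", "huggingface-hub": "0.17.0"}


def check_installed(minimums=MINIMUMS):
    """Map each package to True, False, or None when it is not installed."""
    report = {}
    for package, minimum in minimums.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = None
    return report


print(check_installed())
```

TGI and text-generation-webui are not covered, since they are typically run as a container and an application rather than imported as packages.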

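📄 License

This project is released under the MIT license.

Model information

Property	Details
Model type	Mistral
Training data	HuggingFaceH4/ultrachat_200k, HuggingFaceH4/ultrafeedback_binarized
License	MIT
Model creator	Hugging Face H4
Model name	Zephyr 7B Beta
Quantized by	TheBloke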