Zephyr 7B Beta - an open-source, AWQ-quantized model for efficient inference

Zephyr 7B Beta AWQ

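Quantized and published by TheBloke

Zephyr 7B Beta is a 7B-parameter model from Hugging Face H4, built on the Mistral architecture. This release is quantized with AWQ for efficient inference.

Large Language Model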

Transformers

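Language: English | License: MIT | Tags: #Efficient 4-bit quantization #Multi-platform inference support #Dialogue system optimization

Downloads: 1,728

Release date: 10/27/2023

Model Overview

Zephyr 7B Beta is a language model quantized with AWQ for efficient inference. It runs in several inference environments and supports text generation tasks.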

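Model Features

Efficient quantization

Uses AWQ for 4-bit quantization, which substantially reduces memory usage and inference time while retaining most of the model's accuracy.

Multi-platform support

Supports inference with text-generation-webui, vLLM, Hugging Face Text Generation Inference (TGI), and AutoAWQ.

Multiple versions available

AWQ, GPTQ, and GGUF quantized versions of the model are available to suit different needs.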

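Model Capabilities

Text generation

Dialogue systems

Question answering

Use Cases

Dialogue systems

Intelligent conversation

Build conversational systems that support natural-language interaction.

Generates fluent, natural-sounding replies.

Question answering

Knowledge Q&A

Answer a wide range of user questions.

Aims to give accurate, relevant answers.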

🚀 Zephyr 7B Beta - AWQ

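This repository provides the AWQ-quantized version of Hugging Face H4's Zephyr 7B Beta model for efficient inference. AWQ quantization substantially reduces memory usage and inference time while retaining most of the model's accuracy, and the files work in several inference environments.

🚀 Quick Start

The sections below describe the model files and how to use them.

✨ Key Features

Efficient quantization: uses AWQ with 4-bit weights, improving inference speed while retaining most of the model's accuracy.
Multi-platform support: supports inference with text-generation-webui, vLLM, Hugging Face Text Generation Inference (TGI), and AutoAWQ.
Multiple versions available: besides the AWQ model, GPTQ and GGUF quantized versions are also provided.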

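📦 Installation

Using the model in text-generation-webui

Make sure you are using the latest version of text-generation-webui. The one-click installers are strongly recommended unless you are sure you know how to do a manual install.

1. Click the Model tab.
2. Under Download custom model or LoRA, enter TheBloke/zephyr-7B-beta-AWQ.
3. Click Download.
4. The model will start downloading. Once it has finished, it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: zephyr-7B-beta-AWQ.
7. Select Loader: AutoAWQ.
8. Click Load. The model will load and be ready for use.
9. If you want custom settings, set them, click Save settings for this model, and then click Reload the Model in the top right.
10. Once you're ready, click the Text Generation tab and enter a prompt to get started!

Running inference from Python code with AutoAWQ

Install the AutoAWQ package

AutoAWQ 0.1.1 or later is required.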

pip3 install autoawq

If you have problems installing AutoAWQ from the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

💻 Usage Example

Basic usage

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)
# Load the quantized model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)

prompt = "Tell me about AI"
prompt_template=f'''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''

print("*** Running model.generate:")

token_input = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    token_input,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

# Take the output tokens, decode them, and print the result
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("LLM output: ", text_output)

📚 Detailed Documentation

Available repositories

Prompt template: Zephyr

<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
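
The template above can be filled in programmatically. The following is a minimal sketch, not part of the original model card: the function name `build_zephyr_prompt` and the optional `system_message` parameter are hypothetical, and the template in this document leaves the system turn empty.

```python
def build_zephyr_prompt(user_message: str, system_message: str = "") -> str:
    """Fill the Zephyr prompt template with a user message.

    `system_message` is a hypothetical convenience parameter; the template
    shown in this document uses an empty system turn.
    """
    return (
        "<|system|>\n"
        f"{system_message}</s>\n"
        "<|user|>\n"
        f"{user_message}</s>\n"
        "<|assistant|>\n"
    )


if __name__ == "__main__":
    # Prints the same text as the template with {prompt} replaced.
    print(build_zephyr_prompt("Tell me about AI"))
```

Keeping the template in one function avoids the copies in each example drifting apart.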

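Provided files and AWQ parameters

For this first AWQ release, only 128g models are provided. A 32g model may be added if there is interest and once perplexity and evaluation comparisons are done, but 32g models have not yet been fully tested with AutoAWQ and vLLM. Models are released as sharded safetensors files.

Branch	Bits	Group size (GS)	AWQ dataset	Sequence length	Size
main	4	128	wikitext	4096	4.15 GB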
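
To give a feel for what "4 bits, group size 128" means for storage, here is a rough back-of-the-envelope sketch. It rests on assumptions that are not stated in this document: each group of 128 weights is assumed to carry one 16-bit scale and one 4-bit zero point, and the parameter count of roughly 7.24 billion is an approximate figure for a Mistral-style 7B model. Layers that are usually left unquantized (embeddings, norms) are ignored, so the real file size differs from this estimate.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Bits stored per quantized weight, including per-group metadata.

    Assumes one scale and one zero point per group (an assumption about the
    storage layout, not a statement from the model card).
    """
    return bits + (scale_bits + zero_bits) / group_size


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Convert a parameter count and bits-per-weight into gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    bpw = effective_bits_per_weight(bits=4, group_size=128)
    approx_params = 7.24e9  # assumed approximate parameter count
    print(f"effective bits per weight: {bpw}")
    print(f"quantized estimate: {estimate_size_gb(approx_params, bpw):.2f} GB")
    print(f"fp16 estimate:      {estimate_size_gb(approx_params, 16):.2f} GB")
```

Under these assumptions a smaller group size such as 32 would raise the per-weight overhead, which is one reason group size appears as a separate column in the table.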

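Multi-user inference server: vLLM

For instructions on installing and using vLLM, see the vLLM documentation.

Make sure you are using vLLM version 0.2 or later.
When using vLLM as a server, pass the --quantization awq parameter. For example:

python3 -m vllm.entrypoints.api_server --model TheBloke/zephyr-7B-beta-AWQ --quantization awq

When using vLLM from Python code, set quantization="awq" as well. For example: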

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
# Plain (non-f) string: {prompt} is filled in by .format() below
prompt_template='''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/zephyr-7B-beta-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
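
When vLLM runs as a server rather than in-process, a client has to send prompts over HTTP. The sketch below is an assumption-laden illustration, not taken from the model card: it assumes the demo API server listens on localhost port 8000 and accepts a JSON POST at /generate containing a "prompt" field plus sampling parameters. Check these details against the vLLM version you have installed. The helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt: str, host: str = "localhost",
                           port: int = 8000, **sampling) -> urllib.request.Request:
    """Build (but do not send) a POST request for the assumed /generate endpoint."""
    payload = {"prompt": prompt, **sampling}
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Building the request needs no running server; calling generate() does.
    zephyr_prompt = "<|system|>\n</s>\n<|user|>\nTell me about AI</s>\n<|assistant|>\n"
    request = build_generate_request(zephyr_prompt, temperature=0.8, top_p=0.95)
    print(request.full_url)
```

Separating request construction from sending makes it possible to inspect the payload without a server running.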

Multi-user inference server: Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/zephyr-7B-beta-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
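
The two length limits in these parameters interact: the prompt may use at most 3696 tokens, and prompt plus generated tokens may not exceed 4096. The following sketch shows the arithmetic; the function names are hypothetical, and the interpretation of the flags (input limit on the prompt, total limit on prompt plus output) is stated here as an assumption to verify against the TGI documentation.

```python
def fits_limits(prompt_tokens: int, max_new_tokens: int,
                max_input_length: int = 3696,
                max_total_tokens: int = 4096) -> bool:
    """Return True if a request respects both length limits."""
    if prompt_tokens > max_input_length:
        return False
    return prompt_tokens + max_new_tokens <= max_total_tokens


def max_new_tokens_budget(prompt_tokens: int,
                          max_total_tokens: int = 4096) -> int:
    """Largest max_new_tokens that still fits under the total limit."""
    return max(0, max_total_tokens - prompt_tokens)


if __name__ == "__main__":
    # A prompt at the input limit leaves 4096 - 3696 = 400 tokens for output.
    print(max_new_tokens_budget(3696))
    print(fits_limits(prompt_tokens=1000, max_new_tokens=128))
```

With the example values, even the longest accepted prompt leaves room for 400 generated tokens.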

Example Python code for interacting with TGI (requires huggingface-hub 0.17.0 or later):

pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''

client = InferenceClient(endpoint_url)
# Send the templated prompt, not the bare user text
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output: ", response)

🔧 Technical Details

Compatibility

The provided files have been tested to work with:

text-generation-webui, using Loader: AutoAWQ.
vLLM version 0.2.0 and later.
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later.
AutoAWQ version 0.1.1 and later.
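
The minimum versions listed above can be checked in a local Python environment. This is a small sketch using only the standard library; the helper names are hypothetical, the distribution names `autoawq`, `vllm`, and `huggingface-hub` are assumed to match what pip installed, and the version parser is deliberately simple (it reads leading digits of each dot-separated part and ignores suffixes). TGI runs as a separate server, so it is not covered here.

```python
from importlib import metadata

# Minimum versions taken from this document.
MINIMUM_VERSIONS = {
    "autoawq": "0.1.1",
    "vllm": "0.2.0",
    "huggingface-hub": "0.17.0",
}


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric parts of a version string, e.g. '0.2.0rc1' -> (0, 2, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding them to the same length with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(minimums: dict = MINIMUM_VERSIONS) -> dict:
    """Report, per package, whether the installed version is new enough."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = f"{installed} ({'ok' if ok else 'need >= ' + minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_installed().items():
        print(f"{package}: {status}")
```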

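📄 License

This project is released under the MIT license.

Model information

Property	Details
Model type	Mistral
Training data	HuggingFaceH4/ultrachat_200k, HuggingFaceH4/ultrafeedback_binarized
License	MIT
Model creator	Hugging Face H4
Model name	Zephyr 7B Beta
Quantized by	TheBloke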