Mistral-7B-Instruct-v0.1-GPTQ open-source model - runs under two frameworks and handles a wide range of tasks efficiently

Mistral 7B Instruct V0.1 GPTQ

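Developed by TheBloke

A GPTQ-quantized version of Mistral 7B Instruct v0.1 that runs under ExLlama or Transformers

Large language model

Transformers

License: Apache-2.0 #instruction-tuned #4/8-bit quantization #long-sequence processing

Downloads: 7,481

Released: 9/28/2023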

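Model overview

This is a GPTQ-quantized model based on Mistral 7B Instruct v0.1. It is offered with several quantization parameter sets to suit inference on different hardware.

Model features

Multiple quantization parameter sets

Several combinations of quantization parameters are provided, so you can pick the one that best fits your hardware and requirements

Multi-framework compatibility

The model runs under ExLlama or Transformers

Efficient inference

GPTQ quantization reduces model size and memory footprint while keeping inference quality high

Long-sequence support

Supports sequence lengths of up to 32768 tokens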

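Model capabilities

Instruction following

Text generation

Dialogue systems

Question answering

Use cases

Dialogue systems

Intelligent assistants

Build assistants that understand and respond to natural-language instructions

Content generation

Article writing

Generate coherent, logically structured article content from a prompt

Question answering

Knowledge Q&A

Answer users' knowledge questions across a range of topics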

🚀 Mistral 7B Instruct v0.1 - GPTQ

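This project provides GPTQ-quantized versions of Mistral AI's Mistral 7B Instruct v0.1 model. They run under either ExLlama or Transformers, so you can choose the inference stack that suits your setup.

🚀 Quick start

Environment setup

To use this model, install the following dependencies: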

pip3 install optimum
pip3 install git+https://github.com/huggingface/transformers.git@72958fcd3c98a7afdc61f953aa58c544ebda2f79
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # for CUDA 11.7, replace cu118 with cu117

If you run into problems installing AutoGPTQ this way, install it from source instead:

pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.4.2
pip3 install .
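After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package names are the distribution names from the pip commands above.

```python
from importlib import metadata

# Distribution names as they appear in the pip commands above.
REQUIRED = ("optimum", "transformers", "auto-gptq")


def installed_versions(packages=REQUIRED):
    """Return {package: version string, or None if the package is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `None` entry means the corresponding install step did not succeed in the active environment.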

Code example

The following example calls the model from Python:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
# To use a different branch, change the revision argument
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''<s>[INST] {prompt} [/INST]
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done with the transformers pipeline
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])

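✨ Key features

Multiple quantization parameter sets: several combinations of quantization parameters are provided, so you can pick the one that best fits your hardware and requirements.
Multi-framework compatibility: the model runs under ExLlama or Transformers.
One branch per variant: each quantized version lives in its own branch, so you download only the one you need.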

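📦 Installation guide

Downloading in text-generation-webui

1. Click the Model tab.
2. Under Download custom model or LoRA, enter TheBloke/Mistral-7B-Instruct-v0.1-GPTQ.
   - To download from a specific branch, append :branchname, for example TheBloke/Mistral-7B-Instruct-v0.1-GPTQ:gptq-4bit-32g-actorder_True.
   - For the list of branches, see the Provided files and GPTQ parameters section below.
3. Click Download.
4. The model starts downloading. When it finishes, "Done" is displayed.
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: Mistral-7B-Instruct-v0.1-GPTQ.
7. The model loads automatically and is then ready to use.
8. If you want custom settings, set them, click Save settings for this model, then click Reload the Model in the top right.
   - Note: you do not need to set GPTQ parameters manually. They are loaded automatically from the quantize_config.json file.
9. Once you are ready, click the Text Generation tab and enter a prompt to get started.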
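The `repo:branchname` notation used in the download box can also be handy in your own scripts. Below is a minimal sketch of how such a string might be split into a repository ID and a branch. `split_model_spec` is a hypothetical helper written for illustration, not a function from text-generation-webui, and defaulting to `main` is an assumption based on the branch list in this document.

```python
def split_model_spec(spec, default_branch="main"):
    """Split 'owner/repo:branch' into (repo_id, branch).

    A spec without ':' falls back to default_branch.
    """
    repo_id, sep, branch = spec.partition(":")
    if not repo_id or (sep and not branch):
        raise ValueError(f"malformed model spec: {spec!r}")
    return repo_id, branch or default_branch


repo_id, branch = split_model_spec(
    "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ:gptq-4bit-32g-actorder_True"
)
print(repo_id)  # TheBloke/Mistral-7B-Instruct-v0.1-GPTQ
print(branch)   # gptq-4bit-32g-actorder_True
```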

Downloading from the command line

The huggingface-hub Python library is the recommended way to download:

pip3 install huggingface-hub

To download the main branch into a folder named Mistral-7B-Instruct-v0.1-GPTQ:

mkdir Mistral-7B-Instruct-v0.1-GPTQ
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --local-dir Mistral-7B-Instruct-v0.1-GPTQ --local-dir-use-symlinks False

To download from a different branch, add the --revision parameter:

mkdir Mistral-7B-Instruct-v0.1-GPTQ
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir Mistral-7B-Instruct-v0.1-GPTQ --local-dir-use-symlinks False
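The same download can be scripted from Python. The sketch below assumes the `huggingface_hub` package installed above and uses its `snapshot_download` function. The wrapper names `download_kwargs` and `download_branch` are hypothetical, and the library is imported lazily so the argument-building part can be inspected without network access.

```python
def download_kwargs(repo_id, local_dir, revision="main"):
    """Build the keyword arguments passed to huggingface_hub.snapshot_download."""
    return {"repo_id": repo_id, "revision": revision, "local_dir": local_dir}


def download_branch(repo_id, local_dir, revision="main"):
    """Download one branch of a repository into local_dir and return its path."""
    # Imported here so download_kwargs stays usable without the library.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(repo_id, local_dir, revision))


# Example call (downloads several GB, so it is left commented out):
# download_branch(
#     "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ",
#     "Mistral-7B-Instruct-v0.1-GPTQ",
#     revision="gptq-4bit-32g-actorder_True",
# )
```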

Downloading with `git` (not recommended)

To clone a specific branch, use:

git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GPTQ

Downloading with Git is not recommended: it is slower than huggingface-hub and uses twice the disk space.

💻 Usage examples

Basic usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''<s>[INST] {prompt} [/INST]
'''

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Advanced usage

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
# Use a specific branch
revision = "gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision=revision)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''<s>[INST] {prompt} [/INST]
'''

# Run inference with a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
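With the default settings, the text printed in these examples contains the prompt followed by the model's answer. A minimal post-processing sketch is shown below. `extract_reply` is a hypothetical helper, and it assumes the answer is everything after the last `[/INST]` marker, optionally terminated by `</s>`.

```python
def extract_reply(generated_text, end_of_instruction="[/INST]", eos="</s>"):
    """Return the text after the last [/INST] marker, without a trailing </s>."""
    # rpartition returns ('', '', text) when the marker is absent,
    # so unmarked text is passed through unchanged (apart from stripping).
    _, _, reply = generated_text.rpartition(end_of_instruction)
    reply = reply.strip()
    if reply.endswith(eos):
        reply = reply[: -len(eos)].rstrip()
    return reply


sample = "<s>[INST] Tell me about AI [/INST]\nAI is a field of study.</s>"
print(extract_reply(sample))  # AI is a field of study.
```

As an alternative, the Transformers text-generation pipeline accepts `return_full_text=False`, which should return only the newly generated text.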

📚 Detailed documentation

Available model repositories

Prompt template

<s>[INST] {prompt} [/INST]
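The template can be wrapped in a small function so that prompts are built the same way everywhere. This is a minimal single-turn sketch. `build_prompt` and its `include_bos` switch are hypothetical; the switch is there because some tokenizers add the `<s>` token themselves, in which case writing it into the string as well may duplicate it.

```python
def build_prompt(user_message, include_bos=True):
    """Wrap a user message in the single-turn instruction template."""
    message = user_message.strip()
    if not message:
        raise ValueError("user_message must not be empty")
    bos = "<s>" if include_bos else ""
    return f"{bos}[INST] {message} [/INST]"


print(build_prompt("Tell me about AI"))
# <s>[INST] Tell me about AI [/INST]
```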

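Provided files and GPTQ parameters

Several quantization parameter sets are provided so that you can choose the best one for your hardware and requirements. Each quantized version lives in its own branch, as listed below.

GPTQ parameter reference

Bits: the bit width of the quantized model.
GS: the GPTQ group size. Higher values use less VRAM but give lower quantization accuracy. "None" is the lowest possible value.
Act Order: a boolean, also known as desc_act. True gives better quantization accuracy. Some GPTQ clients used to have problems with models that combine Act Order and Group Size, but this is now largely resolved.
Damp %: a GPTQ parameter that affects how samples are processed during quantization. The default is 0.01, but 0.1 gives slightly better accuracy.
GPTQ dataset: the calibration dataset used during quantization. A dataset that more closely matches the model's training data can improve quantization accuracy. Note that the GPTQ calibration dataset is not the dataset used to train the model; see the original model repository for details of the training data.
Sequence Length: the length of the dataset sequences used during quantization. Ideally this matches the model's sequence length. For some very long-sequence models (16K and above), a shorter sequence length may have to be used. A shorter sequence length does not limit the sequence length of the quantized model; it only affects quantization accuracy on longer inference sequences.
ExLlama Compatibility: whether the file can be loaded with ExLlama, which currently supports only 4-bit Llama models.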
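To make the group-size trade-off concrete, the sketch below estimates the storage cost per weight for a given bit width and group size. It rests on an assumption that is not stated in this document: each group stores one 16-bit scale and one zero point of `bits` bits, and other metadata is ignored. The function name `approx_bits_per_weight` and the sample configuration are illustrative only; of the field names shown, only desc_act appears in the text above.

```python
import json


def approx_bits_per_weight(bits, group_size, scale_bits=16):
    """Estimate stored bits per weight, including per-group metadata.

    Assumption: every group of `group_size` weights carries one scale of
    `scale_bits` bits and one zero point of `bits` bits. A group_size <= 0
    means "no grouping", where this overhead is treated as negligible.
    """
    if group_size <= 0:
        return float(bits)
    return bits + (scale_bits + bits) / group_size


# Illustrative configuration in the style of a quantize_config.json file.
config = json.loads(
    '{"bits": 4, "group_size": 128, "desc_act": true, "damp_percent": 0.1}'
)

print(approx_bits_per_weight(config["bits"], config["group_size"]))  # 4.15625
print(approx_bits_per_weight(4, 32))                                 # 4.625
```

Under this assumption, moving from group size 128 to 32 at 4 bits adds roughly half a bit per weight, which is in line with the larger file size of the 32g branch.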

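Branch	Bits	GS	Act Order	Damp %	GPTQ dataset	Sequence Length	Size	ExLlama Compatibility	Description
main	4	128	Yes	0.1	wikitext	32768	4.16 GB	Yes	4-bit, with Act Order and group size 128g. Uses less VRAM than 64g, but with slightly lower accuracy.
gptq-4bit-32g-actorder_True	4	32	Yes	0.1	wikitext	32768	4.57 GB	Yes	4-bit, with Act Order and group size 32g. Gives the highest inference quality, at the cost of the highest VRAM usage.
gptq-8bit-128g-actorder_True	8	128	Yes	0.1	wikitext	32768	7.68 GB	Yes	8-bit, with group size 128g and Act Order for higher inference quality and accuracy.
gptq-8bit-32g-actorder_True	8	32	Yes	0.1	wikitext	32768	8.17 GB	Yes	8-bit, with group size 32g and Act Order for the highest inference quality.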
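One way to use the table is to pick the largest variant that fits a size budget. The sketch below copies the branch names and file sizes from the table; `pick_branch` is a hypothetical helper, and treating "largest file that fits" as "best quality" is a simplifying assumption. File size is only a lower bound on VRAM use, since inference also needs memory for activations and the KV cache.

```python
# (branch, bits, group_size, file_size_gb), copied from the table above.
BRANCHES = [
    ("main", 4, 128, 4.16),
    ("gptq-4bit-32g-actorder_True", 4, 32, 4.57),
    ("gptq-8bit-128g-actorder_True", 8, 128, 7.68),
    ("gptq-8bit-32g-actorder_True", 8, 32, 8.17),
]


def pick_branch(max_size_gb, branches=BRANCHES):
    """Return the branch with the largest file size within the budget, or None."""
    fitting = [b for b in branches if b[3] <= max_size_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda b: b[3])[0]


print(pick_branch(6.0))  # gptq-4bit-32g-actorder_True
```

The returned name can be passed as the `revision` argument when loading the model.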

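Model information

Property	Details
Model type	Mistral
Training data	See the original model repository, Mistral 7B Instruct v0.1, for details of the training datasets
Model creator	Mistral AI
Original model	Mistral 7B Instruct v0.1
Quantized by	TheBloke
License	Apache-2.0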