Mistral-7B-Instruct-v0.1-GPTQ open-source model - runs under two frameworks and handles a wide range of tasks efficiently

Mistral 7B Instruct V0.1 GPTQ

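Developed by TheBloke

A GPTQ-quantized version of Mistral 7B Instruct v0.1 that runs under ExLlama or Transformers

Large language model

Transformers

License: Apache-2.0 #instruction-tuned #4/8-bit quantization #long-sequence processing

Downloads: 7,481

Released: 9/28/2023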

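Model overview

This is a GPTQ-quantized model based on Mistral 7B Instruct v0.1. It is offered with several quantization parameter combinations to suit inference on different hardware.

Model features

Multiple quantization parameter options

Several quantization parameter combinations are provided, so users can pick the one that best fits their hardware and requirements

Multi-framework compatibility

The model runs under ExLlama or Transformers

Efficient inference

GPTQ quantization reduces model size and memory footprint while keeping inference quality high

Long-sequence support

Supports sequence lengths of up to 32768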

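Model capabilities

Instruction following

Text generation

Dialogue systems

Question answering

Use cases

Dialogue systems

Intelligent assistants

Build intelligent assistants that understand and respond to natural-language instructions

Content generation

Article writing

Generate coherent, logically structured articles from a prompt

Question answering

Knowledge Q&A

Answer users' knowledge-based questions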

🚀 Mistral 7B Instruct v0.1 - GPTQ

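This project provides the GPTQ-quantized version of Mistral AI's Mistral 7B Instruct v0.1 model. The model runs under ExLlama or Transformers, covering different inference needs.

🚀 Quick start

Environment setup

To use the model, install the following dependencies: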

pip3 install optimum
pip3 install git+https://github.com/huggingface/transformers.git@72958fcd3c98a7afdc61f953aa58c544ebda2f79
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # for CUDA 11.7, replace cu118 with cu117

If you run into problems installing AutoGPTQ, you can install it from source instead:

pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.4.2
pip3 install .
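
Before loading the model, it can help to confirm which of the packages above are actually installed. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_versions` is hypothetical, and the distribution names are taken from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names as used in the pip commands above
    report = installed_versions(["optimum", "transformers", "auto-gptq"])
    for name, ver in report.items():
        print(f"{name}: {ver or 'not installed'}")
```

A missing entry suggests the corresponding install step did not complete in the active environment.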

Code example

The following example calls the model from Python:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
# To use a different branch, change the revision argument
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''<s>[INST] {prompt} [/INST]
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be run through the transformers pipeline
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])

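✨ Key features

Multiple quantization parameter options: several quantization parameter combinations are provided, so users can pick the one that best fits their hardware and requirements.
Multi-framework compatibility: the model runs under ExLlama or Transformers.
One branch per variant: each quantized version lives on its own branch, so users can pick the one they need.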

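📦 Installation guide

Downloading in text-generation-webui

1. Click the Model tab.
2. Under Download custom model or LoRA, enter TheBloke/Mistral-7B-Instruct-v0.1-GPTQ.
- To download from a specific branch, append :branchname, for example TheBloke/Mistral-7B-Instruct-v0.1-GPTQ:gptq-4bit-32g-actorder_True.
- See the Provided files and GPTQ parameters section below for the list of branches.
3. Click Download.
4. The model starts downloading. When the download finishes, "Done" is shown.
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: Mistral-7B-Instruct-v0.1-GPTQ.
7. The model loads automatically and is then ready to use.
8. If you want custom settings, set them, click Save settings for this model, then click Reload the Model in the top right.
- Note: you do not need to set GPTQ parameters by hand. They are loaded automatically from the quantize_config.json file.
9. Once you are ready, click the Text Generation tab and enter a prompt to get started.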
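
Since the GPTQ parameters are read from quantize_config.json, it can be useful to inspect that file in a local copy of the model. The sketch below assumes the file follows the AutoGPTQ layout with the keys `bits`, `group_size`, `desc_act` and `damp_percent`; the helper name `read_gptq_settings` is hypothetical.

```python
import json
from pathlib import Path

# Keys assumed to follow the AutoGPTQ quantize_config.json layout
GPTQ_KEYS = ("bits", "group_size", "desc_act", "damp_percent")


def read_gptq_settings(model_dir):
    """Return the main GPTQ settings stored next to the model weights."""
    config_path = Path(model_dir) / "quantize_config.json"
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    # Missing keys are reported as None rather than raising
    return {key: config.get(key) for key in GPTQ_KEYS}
```

Pass the folder the model was downloaded into; the returned dictionary can then be compared with the branch you intended to fetch.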

Downloading from the command line

The huggingface-hub Python library is recommended for downloading:

pip3 install huggingface-hub

To download the main branch into a folder named Mistral-7B-Instruct-v0.1-GPTQ:

mkdir Mistral-7B-Instruct-v0.1-GPTQ
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --local-dir Mistral-7B-Instruct-v0.1-GPTQ --local-dir-use-symlinks False

To download from a different branch, add the --revision parameter:

mkdir Mistral-7B-Instruct-v0.1-GPTQ
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir Mistral-7B-Instruct-v0.1-GPTQ --local-dir-use-symlinks False
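
The same download can also be scripted from Python. The sketch below uses `snapshot_download` from the huggingface_hub library; the helpers `local_dir_for` and `download_branch`, and the one-folder-per-branch naming scheme, are assumptions made for illustration rather than part of the original instructions.

```python
from pathlib import Path

REPO_ID = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"


def local_dir_for(repo_id, revision="main"):
    """Hypothetical naming scheme: one folder per branch, so branches do not overwrite each other."""
    name = repo_id.split("/")[-1]
    return Path(name if revision == "main" else f"{name}_{revision}")


def download_branch(revision="main", repo_id=REPO_ID):
    """Download one branch of the repository and return the local path."""
    # Imported lazily so local_dir_for works without huggingface_hub installed
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, revision)
    return snapshot_download(repo_id=repo_id, revision=revision, local_dir=target)
```

Calling `download_branch("gptq-4bit-32g-actorder_True")` would fetch that branch into its own folder; nothing is downloaded until the function is called.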

Downloading with `git` (not recommended)

Clone a specific branch with the following command:

git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GPTQ

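Downloading with Git is not recommended: it is slower than huggingface-hub and uses twice the disk space, because the model files are stored both in the target folder and again inside the .git folder.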

💻 Usage examples

Basic usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''<s>[INST] {prompt} [/INST]
'''

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Advanced usage

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
# Use a specific branch
revision = "gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision=revision)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''<s>[INST] {prompt} [/INST]
'''

# Run inference through the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
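
By default, the `generated_text` returned by a text-generation pipeline contains the prompt followed by the completion. A minimal sketch for keeping only the model's reply is shown below; the helper name `extract_reply` is hypothetical, and it assumes the prompt ends with the `[/INST]` marker from the template used in these examples.

```python
def extract_reply(generated_text, end_marker="[/INST]"):
    """Return the text that follows the last instruction-closing marker."""
    # rpartition splits on the LAST occurrence of the marker; if the marker
    # is absent, the whole input ends up in the third element.
    _, _, reply = generated_text.rpartition(end_marker)
    return reply.strip()
```

For example, `extract_reply(pipe(prompt_template)[0]['generated_text'])` would drop the echoed prompt and surrounding whitespace.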

📚 Detailed documentation

Available model repositories

Prompt template

<s>[INST] {prompt} [/INST]
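
The template can be wrapped in a small helper so that every request is formatted the same way. This is a sketch for single-turn prompts only; the function name `build_prompt` is hypothetical, and the trailing newline mirrors the `prompt_template` strings in the usage examples.

```python
def build_prompt(instruction):
    """Wrap a single user instruction in the template shown above."""
    # Strip stray whitespace so the spacing around the markers stays fixed;
    # the trailing newline matches the prompt_template strings in the examples.
    return f"<s>[INST] {instruction.strip()} [/INST]\n"
```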

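Provided files and GPTQ parameters

Several quantization parameter sets are provided so that users can choose the best fit for their hardware and requirements. Each quantized version lives on its own branch, as detailed below.

GPTQ parameter reference

Bits: the bit width of the quantized model.
GS: GPTQ group size. Higher values use less VRAM but give lower quantization accuracy. "None" is the lowest possible value.
Act Order: a boolean, also known as desc_act. True gives better quantization accuracy. Some GPTQ clients have had problems with models that combine Act Order and Group Size, but this is now largely resolved.
Damp %: a GPTQ parameter that affects how samples are processed for quantization. The default is 0.01, but 0.1 gives slightly better accuracy.
GPTQ dataset: the calibration dataset used during quantization. A dataset closer to the model's training data can improve quantization accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model; refer to the original model repository for details of the training data.
Sequence Length: the length of the dataset sequences used during quantization. Ideally this matches the model's sequence length. For some very long-sequence models (16K and above), a shorter sequence length may have to be used. Note that a shorter sequence length does not limit the sequence length of the quantized model; it only affects quantization accuracy on longer inference sequences.
ExLlama Compatibility: whether the file can be loaded with ExLlama, which currently supports only Llama models in 4-bit.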
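
The trade-off described for Bits and GS can be made concrete with a rough estimate of storage per weight. The sketch below assumes that each group stores one 16-bit scale and one zero-point as wide as the quantized weights; that layout, and the function name `effective_bits_per_weight`, are assumptions for illustration, not a description of the exact file format.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16):
    """Approximate storage cost per weight, including per-group metadata.

    Assumes each group stores one scale (scale_bits wide) and one zero-point
    (bits wide); real files may add padding or other tensors.
    """
    if group_size is None:
        # "None": treat the per-group overhead as negligible
        return float(bits)
    return bits + (scale_bits + bits) / group_size


if __name__ == "__main__":
    for bits, gs in [(4, 128), (4, 32), (8, 128), (8, 32)]:
        estimate = effective_bits_per_weight(bits, gs)
        print(f"{bits}-bit, {gs}g: {estimate:.3f} bits/weight")
```

Under these assumptions a smaller group size adds more metadata per weight, which is why it costs more VRAM while improving accuracy.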

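Branch	Bits	GS	Act Order	Damp %	GPTQ dataset	Sequence Length	Size	ExLlama Compatibility	Description
main	4	128	Yes	0.1	wikitext	32768	4.16 GB	Yes	4-bit, with Act Order and group size 128g. Uses less VRAM than 64g, but with slightly lower accuracy.
gptq-4bit-32g-actorder_True	4	32	Yes	0.1	wikitext	32768	4.57 GB	Yes	4-bit, with Act Order and group size 32g. Gives the highest inference quality, with the highest VRAM usage.
gptq-8bit-128g-actorder_True	8	128	Yes	0.1	wikitext	32768	7.68 GB	Yes	8-bit, group size 128g, with Act Order for higher inference quality and accuracy.
gptq-8bit-32g-actorder_True	8	32	Yes	0.1	wikitext	32768	8.17 GB	Yes	8-bit, group size 32g, with Act Order for the highest inference quality.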
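
Choosing a branch from the table can be sketched as a simple lookup. The branch names and sizes below are copied from the table; the helper `pick_branch` and the rule "take the largest variant that fits" are assumptions for illustration. File size is only a lower bound on memory use, since inference also needs room for activations and the attention cache.

```python
# Branch names and file sizes (GB) copied from the table above
BRANCH_SIZES_GB = {
    "main": 4.16,
    "gptq-4bit-32g-actorder_True": 4.57,
    "gptq-8bit-128g-actorder_True": 7.68,
    "gptq-8bit-32g-actorder_True": 8.17,
}


def pick_branch(budget_gb, sizes=BRANCH_SIZES_GB):
    """Return the largest branch whose files fit the budget, or None."""
    fitting = {name: size for name, size in sizes.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)
```

The returned name can be passed as the `revision` argument when loading the model, or to `--revision` when downloading from the command line.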

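Model information table

Property	Details
Model type	Mistral
Training data	See the original model repository, Mistral 7B Instruct v0.1, for details of the training datasets
Model creator	Mistral AI
Original model	Mistral 7B Instruct v0.1
Quantized by	TheBloke
License	Apache-2.0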