Yarn-Mistral-7B-128k-AWQ open-source language model - supports a 128k long context window

Yarn Mistral 7B 128k AWQ

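Published by TheBloke (AWQ quantization of the NousResearch model)

Yarn Mistral 7B 128K is a language model optimized for long contexts. It was further pretrained on long-context data using the YaRN extension method and supports a context window of 128k tokens.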

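Large language model

Transformers

Language: English · License: Apache-2.0 · #128k long context #Efficient inference optimization #English text generation

Downloads: 483

Release date: 11/2/2023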

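Model Overview

A language model extended from Mistral-7B-v0.1 and optimized for long contexts, suited to natural language processing tasks that involve very long text.

Model Features

Very long context support

Supports a context window of 128k tokens, so it can process very long text.

Efficient quantization

An AWQ-quantized version is provided, which improves inference efficiency while preserving quality.

Optimized pretraining

Further pretrained on long-context data for 1500 steps using the YaRN method.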
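This page does not spell out what YaRN computes. As a rough sketch of the idea as described in the YaRN paper, the rotary position embedding (RoPE) frequencies are rescaled per dimension: high-frequency dimensions are left unchanged, low-frequency dimensions are divided by the context scale factor, and a linear ramp blends the two in between. The scale factor of 16 follows from the 8k to 128k extension listed in the benchmark tables; `head_dim`, `base`, `alpha`, and `beta` below are assumed values for illustration, and the function name is hypothetical rather than taken from the model's code.

```python
import math

def yarn_inv_freq(head_dim=128, base=10000.0, scale=16.0,
                  original_ctx=8192, alpha=1.0, beta=32.0):
    """Sketch of YaRN-style RoPE frequency scaling (assumed parameters).

    Returns the rescaled inverse frequencies and the attention scale factor.
    """
    inv_freq = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)        # original RoPE frequency
        wavelength = 2 * math.pi / theta
        r = original_ctx / wavelength          # rotations inside the original window
        # gamma = 1 keeps the frequency; gamma = 0 divides it by `scale`
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        inv_freq.append((1 - gamma) * theta / scale + gamma * theta)
    # Attention logits are additionally scaled by this factor
    attn_scale = 0.1 * math.log(scale) + 1.0
    return inv_freq, attn_scale

inv_freq, attn_scale = yarn_inv_freq()
print(f"fastest dim: {inv_freq[0]:.6f}, slowest dim: {inv_freq[-1]:.3e}")
print(f"attention scale: {attn_scale:.4f}")
```

The design intent is that dimensions encoding fine-grained local position keep their resolution, while only the slow dimensions are stretched to cover the longer window.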

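Model Capabilities

Long-form text generation

Context understanding

Text continuation

Question answering

Use Cases

Document processing

Long-document summarization

Summarize very long documents and extract key information.

Legal document analysis

Process and analyze complex legal contracts and clauses.

Code processing

Codebase analysis

Understand the structure and functionality of large codebases.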

🚀 Yarn Mistral 7B 128K - AWQ

Yarn Mistral 7B 128K - AWQ provides quantized model files for NousResearch's Yarn Mistral 7B 128K. AWQ quantization is efficient and accurate, gives fast inference, and is supported by several inference tools.

🚀 Quick Start

Model information

Property	Details
Model creator	NousResearch
Original model	Yarn Mistral 7B 128K
Model type	Mistral
Training data	emozilla/yarn-train-tokenized-16k-mistral
License	apache-2.0
Evaluation metric	Perplexity
Quantized by	TheBloke
Prompt template	`{prompt}`

Model repositories

Prompt template

{prompt}

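✨ Key Features

About AWQ

AWQ is an efficient, accurate, and very fast low-bit weight quantization method that currently supports 4-bit quantization. Compared with GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.

It is supported by:

Text Generation Webui - using Loader: AutoAWQ
vLLM - Llama and Mistral models only
Hugging Face Text Generation Inference (TGI)
AutoAWQ - for use from Python code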
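To make "4-bit quantization" concrete, here is a minimal sketch of group-wise round-to-nearest quantization, using the group size of 128 listed for the published files. It is not AWQ itself: AWQ additionally uses activation statistics to rescale important weight channels before quantizing, which this sketch leaves out. The function names are hypothetical, and the storage layout (one scale and one minimum per group) is an assumption chosen for clarity.

```python
import math

def quantize_group(weights, bits=4):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1                       # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0  # guard against constant groups
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]

def roundtrip(weights, group_size=128, bits=4):
    """Quantize and dequantize a flat weight list, one group at a time."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored.extend(dequantize_group(*quantize_group(group, bits)))
    return restored

# Toy "weights": a deterministic stand-in for a real weight row
weights = [math.sin(0.37 * i) for i in range(256)]
restored = roundtrip(weights)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"largest round-trip error: {worst:.4f}")
```

Each group keeps its own scale, so the rounding error in a group is bounded by half of that group's step size rather than by the range of the whole tensor.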

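📦 Installation Guide

Using the model in text-generation-webui

Make sure you are using the latest version of text-generation-webui. Using the text-generation-webui one-click installers is strongly recommended unless you are sure you know how to install manually.

Click the Model tab.
Under Download custom model or LoRA, enter TheBloke/Yarn-Mistral-7B-128k-AWQ.
Click Download.
The model will start downloading. When it finishes, it will show "Done".
In the top left, click the refresh icon next to Model.
In the Model dropdown, choose the model you just downloaded: Yarn-Mistral-7B-128k-AWQ.
Select Loader: AutoAWQ.
Click Load; the model will load and be ready for use.
If you want any custom settings, set them, then click Save settings for this model in the top right, followed by Reload the Model.
Once you are ready, click the Text Generation tab and enter a prompt to get started!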

Inference from Python code using AutoAWQ

Install the AutoAWQ package

Requires AutoAWQ 0.1.1 or later.

pip3 install autoawq

If you have trouble installing AutoAWQ from the prebuilt wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

💻 Usage Examples

Basic usage

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/Yarn-Mistral-7B-128k-AWQ"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
# Load the quantized model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=True, safetensors=True)

prompt = "Tell me about AI"
prompt_template = f'''{prompt}
'''

print("*** Running model.generate:")

# Tokenize the prompt and move the token IDs to the GPU
token_input = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    token_input,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

# Get the output tokens, decode them, and print the text
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("LLM output: ", text_output)
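With very long prompts, the prompt and the generated tokens must together fit in the context window. The helper below is a small sketch of that budget check; it assumes that "128k" means 128 × 1024 = 131072 tokens, and the function name is hypothetical.

```python
CONTEXT_WINDOW = 128 * 1024  # assumption: "128k" is taken to mean 131072 tokens

def max_new_tokens_allowed(prompt_tokens, requested, window=CONTEXT_WINDOW):
    """Clamp the requested generation length to what still fits in the window."""
    remaining = window - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, window is {window}"
        )
    return min(requested, remaining)

print(max_new_tokens_allowed(prompt_tokens=1_000, requested=512))
print(max_new_tokens_allowed(prompt_tokens=131_000, requested=512))
```

In the basic usage example, the prompt length would be `token_input.shape[1]`, and the result could be passed as `max_new_tokens`.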

Advanced usage

Serving multiple users with vLLM

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
# Plain string (not an f-string), so that .format() fills in {prompt} below
prompt_template = '''{prompt}
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/Yarn-Mistral-7B-128k-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Serving multiple users with Hugging Face Text Generation Inference (TGI)

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template = f'''{prompt}
'''

client = InferenceClient(endpoint_url)
# Send the templated prompt to the TGI endpoint
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output:", response)

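📚 Detailed Documentation

Provided files and AWQ parameters

For the first release of AWQ models, only 128g models are published. 32g models may be added if there is demand and once perplexity and evaluation comparisons have been completed; at present, 32g models have not been fully tested with AutoAWQ and vLLM.

The models are released as sharded safetensors files.

Branch	Bits	Group size (GS)	AWQ dataset	Sequence length	Size
main	4	128	wikitext	4096	4.15 GB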
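The choice between 128g and 32g is partly a storage trade-off: every group carries its own quantization metadata, so smaller groups cost more bits per weight. The sketch below estimates that overhead under the assumption that each group stores one 16-bit scale and one 4-bit zero point; the actual on-disk layout may differ, and the published file also contains tensors that are not covered by this estimate, so the result is not expected to reproduce the 4.15 GB figure. Function names and the parameter count are illustrative.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Bits per weight including per-group metadata (assumed layout)."""
    return bits + (scale_bits + zero_bits) / group_size

def quantized_size_gb(n_weights, **kwargs):
    """Approximate size in GB (10**9 bytes) of the quantized weights alone."""
    return n_weights * effective_bits_per_weight(**kwargs) / 8 / 1e9

for gs in (128, 32):
    bpw = effective_bits_per_weight(group_size=gs)
    # 7e9 is a round illustrative parameter count, not the exact model size
    size = quantized_size_gb(7_000_000_000, group_size=gs)
    print(f"group size {gs}: {bpw:.3f} bits/weight, ~{size:.2f} GB per 7e9 weights")
```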

Compatibility

The provided files have been tested to work with:

text-generation-webui, using Loader: AutoAWQ.
vLLM version 0.2.0 and later.
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later.
AutoAWQ version 0.1.1 and later.
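For the two tools that are installed as Python packages, the minimum versions above can be checked from code. This is a small sketch using only the standard library; it assumes the distributions are named `autoawq` and `vllm` (the names used with pip), and the helper names are hypothetical. text-generation-webui and TGI are separate applications and are not covered here.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the compatibility list
MINIMUMS = {"autoawq": "0.1.1", "vllm": "0.2.0"}

def parse(v):
    """Turn '0.2.1+cu118' into (0, 2, 1); missing parts count as 0."""
    nums = [int(n) for n in re.findall(r"\d+", v.split("+")[0])[:3]]
    return tuple(nums + [0] * (3 - len(nums)))

def meets_minimum(installed, minimum):
    return parse(installed) >= parse(minimum)

def check_installed(minimums=MINIMUMS):
    """Report, per package, whether it is installed and recent enough."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
        report[name] = f"{installed} ({status})"
    return report

for name, status in check_installed().items():
    print(f"{name}: {status}")
```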

🔧 Technical Details

Original model benchmarks

Long-context benchmarks

Model	Context window	8k perplexity	16k perplexity	32k perplexity	64k perplexity	128k perplexity
Mistral-7B-v0.1	8k	2.96	-	-	-	-
Yarn-Mistral-7b-64k	64k	3.04	2.65	2.44	2.20	-
Yarn-Mistral-7b-128k	128k	3.08	2.68	2.47	2.24	2.19
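Perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model assigns higher probability to the text. The sketch below shows the calculation from per-token log-probabilities (natural log). The evaluation setup behind the table (dataset, windowing) is not given here, so this only illustrates the metric; the function name is hypothetical.

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over a sequence of token log-probs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that gives every token probability 0.5 has perplexity 2
print(perplexity([math.log(0.5)] * 4))

# Conversely, a perplexity of 2.19 corresponds to this mean loss in nats per token
print(round(math.log(2.19), 3))
```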

Short-context benchmarks (showing minimal quality degradation)

Model	Context window	ARC-c	Hellaswag	MMLU	Truthful QA
Mistral-7B-v0.1	8k	59.98	83.31	64.16	42.15
Yarn-Mistral-7b-64k	64k	59.38	81.21	61.32	42.50
Yarn-Mistral-7b-128k	128k	58.87	80.58	60.64	42.46
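To see how large the short-context differences are, the scores of the 128k model can be subtracted from the Mistral-7B-v0.1 baseline. The snippet below does this with the values copied from the table; the helper name is hypothetical.

```python
# Scores copied from the short-context benchmark table
BASELINE = {"ARC-c": 59.98, "Hellaswag": 83.31, "MMLU": 64.16, "Truthful QA": 42.15}
YARN_128K = {"ARC-c": 58.87, "Hellaswag": 80.58, "MMLU": 60.64, "Truthful QA": 42.46}

def score_deltas(base, other):
    """Difference in points (other minus base) for each benchmark."""
    return {name: round(other[name] - base[name], 2) for name in base}

for name, delta in score_deltas(BASELINE, YARN_128K).items():
    print(f"{name}: {delta:+.2f} points")
```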

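Collaborators

bloc97: methods, paper, and evaluations
@theemozilla: methods, paper, model training, and evaluations
@EnricoShippole: model training
honglu2875: paper and evaluations

The authors thank LAION AI for supporting the compute for this model. The model was trained on the [JUWELS](https://www.fz-juelich.de/en/ias/jsc/systems/supercomputers/juwels) supercomputer.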

📄 License

This project is released under the Apache-2.0 license.

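Other information

Discord

For further support, and for discussion of these models and AI in general, join: TheBloke AI's Discord server

Thanks, and how to contribute

Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!

Many people have asked whether they can contribute. I enjoy providing models and helping people, and would love to be able to spend more time doing so, as well as expanding into new projects such as fine-tuning and training.

If you are able and willing to contribute, it will be most gratefully received and will help me keep providing more models and start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, and other benefits.

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Aemon Algiz.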

Patreon special mentions: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, S_X, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius