Dolphin 2.7 Mixtral 8X7B open-source large language model - free to deploy for efficient code generation and instruction following

Dolphin 2.7 Mixtral 8x7b AWQ

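Quantized and published by TheBloke (base model by Cognitive Computations)

Dolphin 2.7 Mixtral 8X7B is a large language model based on the Mixtral architecture, focused on code generation and instruction following.

Large language model

Transformers

Language: English · License: Apache-2.0 · #mixture-of-experts #long-context #code-generation

Downloads: 5,839

Published: 1/1/2024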

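Model overview

This model is a variant of the Mixtral 8x7B architecture, trained on several high-quality datasets. It is strong at code generation and general instruction following.

Model features

Efficient quantization

Supports AWQ 4-bit quantization, which speeds up inference while preserving output quality

Mixture-of-experts architecture

Uses the 8x7B mixture-of-experts architecture to handle a range of tasks efficiently

Code generation

Trained on code-related datasets, giving it strong code generation and comprehension

Model capabilities

Text generation

Code generation

Instruction following

Question answering

Use cases

Programming assistance

Code completion

Helps developers generate code snippets quickly

Code explanation

Explains what complex code does and how it works

Content creation

Technical documentation

Generates technical documentation and instructions automatically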

🚀 Dolphin 2.7 Mixtral 8X7B - AWQ

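Dolphin 2.7 Mixtral 8X7B - AWQ is a quantized build of Cognitive Computations' Dolphin 2.7 Mixtral 8X7B. It uses the AWQ quantization method, which gives faster inference while keeping quality at an acceptable level. The model suits inference workloads such as text generation and question answering.

🚀 Quick start

This repository contains AWQ model files for Cognitive Computations' Dolphin 2.7 Mixtral 8X7B. The files were quantized on hardware kindly provided by Massed Compute.

✨ Key features

Efficient quantization: AWQ is an efficient, accurate, and very fast low-bit weight quantization method that currently supports 4-bit quantization. Compared with GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.
Platform support: AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users should use GGUF models instead.
Tool compatibility: works with several inference tools, including Text Generation Webui, vLLM, Hugging Face Text Generation Inference (TGI), Transformers, and AutoAWQ.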

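📦 Installation guide

Installing AutoAWQ for inference

For AutoAWQ inference, install AutoAWQ 0.1.8 or later.

pip3 install autoawq

Support via Transformers

Support via Transformers is also available, but currently requires installing Transformers from GitHub:

pip3 install git+https://github.com/huggingface/transformers.git

vLLM support

vLLM version 0.2.6 is confirmed to support Mixtral AWQ models.

pip3 install vllm

TGI support

Version 1.3.3 was tested: the model loads, but no output comes back. Further testing and debugging is needed.

docker pull ghcr.io/huggingface/text-generation-inference:1.3.3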
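The version requirements above can be checked from Python before loading the model. The following is a minimal sketch, not part of the original instructions: the helper names (`parse_version`, `meets_minimum`, `check_installed`) are hypothetical, and the simple numeric comparison ignores pre-release ordering.

```python
from importlib import metadata

# Minimum versions quoted in the installation notes above.
MINIMUMS = {"autoawq": "0.1.8", "vllm": "0.2.6"}


def parse_version(text):
    """Turn '0.2.6' or '0.1.8+cu118' into a tuple of ints, e.g. (0, 2, 6)."""
    core = text.split("+")[0]  # drop local build tags such as '+cu118'
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at non-numeric segments such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(minimums=MINIMUMS):
    """Map each package to True (new enough), False (too old), or None (missing)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        report[name] = meets_minimum(installed, minimum)
    return report


if __name__ == "__main__":
    for name, ok in check_installed().items():
        status = "not installed" if ok is None else ("ok" if ok else "too old")
        print(f"{name}: {status}")
```

Running the script prints one status line per package, so a missing or outdated install shows up before a long model download starts.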

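💻 Usage examples

Downloading and using this model in text-generation-webui

Make sure you are using the latest version of text-generation-webui. Using the text-generation-webui one-click installers is strongly recommended unless you are sure you know how to install it manually.

Click the Model tab.
Under Download custom model or LoRA, enter TheBloke/dolphin-2.7-mixtral-8x7b-AWQ.
Click Download.
The model will start downloading. When it finishes, it will say "Done".
In the top left, click the refresh icon next to Model.
In the Model dropdown, choose the model you just downloaded: dolphin-2.7-mixtral-8x7b-AWQ.
Select Loader: AutoAWQ.
Click Load. The model will load and be ready for use.
If you want any custom settings, set them, click Save settings for this model, and then click Reload the Model in the top right.
Once you are ready, click the Text Generation tab and enter a prompt to get started!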

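Multi-user inference server: vLLM

See the vLLM documentation for installation and usage.

Use vLLM version 0.2 or later (0.2.6 is the version the installation guide above confirms for Mixtral AWQ models).
When running vLLM as a server, pass the --quantization awq parameter.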

python3 -m vllm.entrypoints.api_server --model TheBloke/dolphin-2.7-mixtral-8x7b-AWQ --quantization awq --dtype auto
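Once the server is running, a client can send it prompts over HTTP. The sketch below is an assumption-laden illustration rather than documented usage from this page: it assumes the server listens on `http://localhost:8000`, exposes the demo server's `/generate` route, accepts sampling options as extra JSON fields, and returns a JSON object with a `text` list. The helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumed default address of the demo API server; adjust to your deployment.
DEFAULT_URL = "http://localhost:8000/generate"


def build_request(formatted_prompt, url=DEFAULT_URL, max_tokens=256,
                  temperature=0.8, top_p=0.95):
    """Build a POST request carrying an already ChatML-formatted prompt."""
    body = {
        "prompt": formatted_prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(formatted_prompt, **kwargs):
    """Send the request and return the generated texts (assumed 'text' field)."""
    with urllib.request.urlopen(build_request(formatted_prompt, **kwargs)) as resp:
        return json.loads(resp.read())["text"]
```

The prompt passed in should already follow the ChatML template this model expects, since the server forwards it to the model unchanged.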

When using vLLM from Python code, likewise set quantization="awq".

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
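# Example system message (an assumption, not from the original); replace it with your own.
system_message = "You are Dolphin, a helpful AI assistant."
# Plain (non-f) string, so the placeholders are filled in by .format() below
prompt_template = '''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
'''

prompts = [prompt_template.format(system_message=system_message, prompt=prompt) for prompt in prompts]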

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Multi-user inference server: Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/dolphin-2.7-mixtral-8x7b-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
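With these parameters, a prompt of up to 3696 tokens leaves at least 4096 - 3696 = 400 tokens for the reply. The following sketch works out that budget for a given prompt length. The function name `max_new_tokens_budget` is hypothetical, and it assumes the total-token limit covers prompt plus generated tokens.

```python
def max_new_tokens_budget(input_tokens, max_input_length=3696, max_total_tokens=4096):
    """Return how many tokens can still be generated for a prompt of the given length.

    Defaults mirror the example Docker parameters above.
    """
    if input_tokens > max_input_length:
        raise ValueError(
            f"prompt has {input_tokens} tokens, limit is {max_input_length}"
        )
    return max_total_tokens - input_tokens


# A prompt at the input limit leaves the smallest generation budget.
print(max_new_tokens_budget(3696))
# A shorter prompt leaves more room for the reply.
print(max_new_tokens_budget(1000))
```

Requests that ask for more new tokens than this budget would exceed the configured total, so it is worth capping `max_new_tokens` on the client side accordingly.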

Example Python code for interacting with TGI (requires huggingface-hub 0.17.0 or later):

pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

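prompt = "Tell me about AI"
# Example system message (an assumption, not from the original); replace it with your own.
system_message = "You are Dolphin, a helpful AI assistant."
prompt_template = f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
'''

client = InferenceClient(endpoint_url)
# Send the fully formatted ChatML prompt, not the bare user prompt
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output: ", response)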

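Inference from Python code using Transformers

Install the necessary packages

Requires: Transformers 4.35.0 or later.
Requires: AutoAWQ 0.1.6 or later (the installation guide above asks for 0.1.8 or later for this Mixtral model).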

pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"

Note that if you are using PyTorch 2.0.1, the AutoAWQ command above will automatically upgrade you to PyTorch 2.1.0. If you are using CUDA 11.8 and want to stay on PyTorch 2.0.1, run this command instead:

pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

If you have problems installing AutoAWQ from the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

Transformers example code (requires Transformers 4.35.0 or later)

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "TheBloke/dolphin-2.7-mixtral-8x7b-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

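prompt = "Tell me about AI"
# Example system message (an assumption, not from the original); replace it with your own.
system_message = "You are Dolphin, a helpful AI assistant."
prompt_template = f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
'''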

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)

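📚 Detailed documentation

About AWQ

AWQ is an efficient, accurate, and very fast low-bit weight quantization method that currently supports 4-bit quantization. Compared with GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.

AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users should use GGUF models instead.

AWQ models are supported by the following tools (note that not all of them may support Mixtral models; see above):

Text Generation Webui - using Loader: AutoAWQ
vLLM - version 0.2.2 or later supports all model types.
Hugging Face Text Generation Inference (TGI)
Transformers version 4.35.0 and later, from any code or client that supports Transformers
AutoAWQ - for use from Python code

Available repositories

Prompt template: ChatML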

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
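The template above can be filled in with a small helper. This is a minimal sketch: the function name `build_chatml_prompt` and the default system message are illustrative assumptions, not something this page prescribes.

```python
def build_chatml_prompt(prompt, system_message="You are a helpful assistant."):
    """Fill the ChatML template shown above with a system message and a user prompt."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_chatml_prompt("Tell me about AI"))
```

The returned string ends with the opening of the assistant turn, so the model's completion continues from there.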

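Provided files and AWQ parameters

Only 128g GEMM models are released for now. Adding group size 32 models and GEMV kernel models is under active consideration.

Models are released as sharded safetensors files.

Branch	Bits	Group size	AWQ dataset	Sequence length	Size
main	4	128	VMware Open Instruct	8192	24.65 GB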
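As a rough sanity check on the size in the table, the sketch below estimates the file size from the bit width and group size. It rests on assumptions that are not stated on this page: a total of about 46.7 billion parameters for Mixtral 8x7B, and one 16-bit scale plus one 4-bit zero point stored per group of 128 weights. The function name is hypothetical.

```python
def estimate_awq_size_gb(n_params, bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Estimate quantized size in GB (10^9 bytes), assuming every weight is quantized."""
    # Each group of weights carries one scale and one zero point as overhead.
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


# Assumed total parameter count for Mixtral 8x7B (not given in this document).
MIXTRAL_8X7B_PARAMS = 46.7e9

estimate = estimate_awq_size_gb(MIXTRAL_8X7B_PARAMS)
print(f"{estimate:.2f} GB")
```

The estimate comes out slightly below the listed 24.65 GB, which is plausible if some tensors, such as embeddings, are kept at higher precision rather than quantized.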

Compatibility

The provided files are tested to work with:

text-generation-webui using Loader: AutoAWQ.
vLLM version 0.2.0 and later.
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later.
Transformers version 4.35.0 and later.
AutoAWQ version 0.1.1 and later.

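🔧 Technical details

Dolphin 2.7 Mixtral 8X7B is a retraining of Dolphin-2.5/2.6 with fixes in the transformers library applied, to see whether it performs better. The model is based on Mixtral-8x7b. The base model has a 32k context, and the authors fine-tuned it with 16k.

Training used qLoRA and Axolotl and took 3 days on 4 A100 GPUs to complete 1.5 epochs.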
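The training figures above translate into a GPU-hour budget. The short calculation below is only an illustration, and it assumes all 4 GPUs ran continuously for the full 3 days.

```python
gpus = 4      # A100 GPUs used for training
days = 3      # wall-clock training time
epochs = 1.5  # epochs completed in that time

# Total compute, assuming continuous use of every GPU.
gpu_hours = gpus * days * 24
gpu_hours_per_epoch = gpu_hours / epochs

print(f"total: {gpu_hours} GPU-hours, per epoch: {gpu_hours_per_epoch:.0f} GPU-hours")
```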

📄 License

This model is licensed under apache-2.0.

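Other information

Discord

For further support, and for discussion of these models and AI in general, join: TheBloke AI's Discord server

Thanks and how to contribute

Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!

If you are able and willing to contribute, it will be most gratefully received. It helps the author keep providing more models and start work on new AI projects. Donors get priority support on any AI/LLM/model questions and requests, access to a private Discord room, and other benefits.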

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Aemon Algiz.

Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, S_X, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros