Dolphin 2.7 Mixtral 8X7B Open-Source Large Model - Free Deployment for Efficient Code Generation and Instruction Following

Dolphin 2.7 Mixtral 8x7b AWQ

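Quantized and published by TheBloke

Dolphin 2.7 Mixtral 8X7B is a large language model based on the Mixtral architecture, focused on code generation and instruction following.

Large language model

Transformers

English · License: Apache-2.0 · #MixtureOfExperts #LongContext #CodeGeneration

Downloads: 5,839

Released: 1/1/2024

Model overview

This model is a variant of the Mixtral 8x7B architecture, trained on multiple high-quality datasets, and is strong at code generation and general instruction following.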

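Model features

Efficient quantization

Supports AWQ 4-bit quantization, which speeds up inference while preserving quality

Mixture-of-experts architecture

Uses the 8x7B mixture-of-experts architecture to handle different tasks efficiently

Code generation

Trained on code-related datasets, with strong code generation and comprehension

Model capabilities

Text generation

Code generation

Instruction following

Question answering

Use cases

Programming assistance

Code completion

Helps developers generate code snippets quickly

Code explanation

Explains the function and logic of complex code

Content creation

Technical documentation writing

Generates technical documentation and instructions automatically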

🚀 Dolphin 2.7 Mixtral 8X7B - AWQ

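Dolphin 2.7 Mixtral 8X7B - AWQ is a quantized version of Cognitive Computations' Dolphin 2.7 Mixtral 8X7B model. It uses the AWQ quantization method, which gives faster inference at a modest cost in quality. The model suits inference workloads such as text generation and question answering.

🚀 Quick start

This repository contains AWQ model files for Cognitive Computations' Dolphin 2.7 Mixtral 8X7B. The files were quantized on hardware kindly provided by Massed Compute.

✨ Key features

Efficient quantization: AWQ is an efficient, accurate, and very fast low-bit weight quantization method that currently supports 4-bit quantization. Compared with GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.
Multi-platform support: AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users should use GGUF models instead.
Tool compatibility: Works with several inference tools, including Text Generation Webui, vLLM, Hugging Face Text Generation Inference (TGI), Transformers, and AutoAWQ.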

📦 Installation guide

Installing AutoAWQ for inference

For AutoAWQ inference, install AutoAWQ 0.1.8 or later.

pip3 install autoawq

Transformers support

Transformers can also be used, but it currently has to be installed from GitHub:

pip3 install git+https://github.com/huggingface/transformers.git

vLLM support

vLLM version 0.2.6 is confirmed to support Mixtral AWQ models.

pip3 install vllm

TGI support

Version 1.3.3 was tested: the model loads, but no output could be obtained. Further testing and debugging are needed.

docker pull ghcr.io/huggingface/text-generation-inference:1.3.3
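
The version requirements above can be checked programmatically before loading the model. The following is a minimal sketch using only the Python standard library. It assumes the pip package names autoawq and vllm shown in the install commands; the helper names parse_version, meets_minimum, and check_environment are hypothetical, and the parser is deliberately simplified (it ignores pre-release ordering).

```python
from importlib import metadata
from itertools import takewhile

# Minimum versions quoted in this guide, keyed by pip package name.
MINIMUM_VERSIONS = {"autoawq": "0.1.8", "vllm": "0.2.6"}


def parse_version(text):
    """Turn '0.2.6' or '0.1.6+cu118' into a tuple of ints such as (0, 2, 6)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def check_environment(minimums=MINIMUM_VERSIONS):
    """Return a {package: status} report for the current environment."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[package] = f"{installed} ({'ok' if ok else 'needs >= ' + minimum})"
    return report
```

Calling check_environment() returns a dictionary that can be printed as a quick sanity check. Transformers is left out because this guide asks for a GitHub install rather than a specific release.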

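💻 Usage examples

How to easily download and use this model in text-generation-webui

Make sure you are using the latest version of text-generation-webui. Using the text-generation-webui one-click installers is strongly recommended, unless you are sure you know how to do a manual install.

Click the Model tab.
Under Download custom model or LoRA, enter TheBloke/dolphin-2.7-mixtral-8x7b-AWQ.
Click Download.
The model will start downloading. Once it has finished, it will say "Done".
In the top left, click the refresh icon next to Model.
In the Model dropdown, choose the model you just downloaded: dolphin-2.7-mixtral-8x7b-AWQ.
Select Loader: AutoAWQ.
Click Load, and the model will load and be ready for use.
If you want any custom settings, set them, then click Save settings for this model followed by Reload the Model in the top right.
Once you are ready, click the Text Generation tab and enter a prompt to get started!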

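Multi-user inference server: vLLM

See the vLLM documentation for installation and usage details.

Use vLLM version 0.2 or later; version 0.2.6 is the one confirmed above to support Mixtral AWQ models.
When running vLLM as a server, pass the --quantization awq parameter.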

python3 -m vllm.entrypoints.api_server --model TheBloke/dolphin-2.7-mixtral-8x7b-AWQ --quantization awq --dtype auto
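
Once the server started with the command above is running, it can be queried over HTTP. The following is a minimal sketch using only the Python standard library. It assumes the demo api_server listens on http://localhost:8000 and exposes a POST /generate endpoint that takes a JSON body with a prompt plus sampling parameters and returns a JSON object with a text list. That matches vLLM 0.2.x to the best of our knowledge, but check the vLLM documentation for your version. The helper names build_generate_request and generate are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8000", **sampling):
    """Build (but do not send) a POST request for the assumed /generate endpoint."""
    payload = {"prompt": prompt, **sampling}  # e.g. max_tokens=256, temperature=0.8
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the list of generated texts."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

The prompt passed to generate should already be wrapped in the ChatML template this model expects, for example generate(chatml_prompt, max_tokens=256, temperature=0.8).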

When using vLLM from Python code, likewise set quantization="awq".

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
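# Illustrative system message (an assumption); replace it with your own.
system_message = "You are a helpful AI assistant."

# Plain string, not an f-string: the placeholders are filled in by .format() below.
prompt_template = '''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
'''

prompts = [prompt_template.format(system_message=system_message, prompt=prompt) for prompt in prompts]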

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Multi-user inference server: Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/dolphin-2.7-mixtral-8x7b-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
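
The three length parameters above interact: the prompt may not exceed --max-input-length, and prompt tokens plus generated tokens may not exceed --max-total-tokens. The following sketch works out how many new tokens a request can ask for under those settings. It reflects our understanding of TGI's request validation rather than its actual code, and the function name max_new_tokens_allowed is hypothetical.

```python
# Values taken from the example Docker parameters above.
MAX_INPUT_LENGTH = 3696
MAX_TOTAL_TOKENS = 4096


def max_new_tokens_allowed(prompt_tokens,
                           max_input_length=MAX_INPUT_LENGTH,
                           max_total_tokens=MAX_TOTAL_TOKENS):
    """Return the largest max_new_tokens that fits alongside the prompt."""
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {max_input_length}"
        )
    return max_total_tokens - prompt_tokens
```

With these settings a prompt of the maximum length, 3696 tokens, leaves room for 4096 - 3696 = 400 generated tokens, and shorter prompts leave correspondingly more.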

Example Python code for interacting with TGI (requires huggingface-hub 0.17.0 or later):

pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

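# Illustrative system message (an assumption); replace it with your own.
system_message = "You are a helpful AI assistant."
prompt = "Tell me about AI"
prompt_template = f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
'''

client = InferenceClient(endpoint_url)
# Send the fully formatted ChatML prompt, not the bare user prompt.
response = client.text_generation(prompt_template,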
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output: ", response)

Inference from Python code using Transformers

Install the necessary packages

Requires: Transformers 4.35.0 or later.
Requires: AutoAWQ 0.1.6 or later.

pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"

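Note that if you are using PyTorch 2.0.1, the above AutoAWQ command will automatically upgrade you to PyTorch 2.1.0. If you are using CUDA 11.8 and want to continue using PyTorch 2.0.1, run this command instead: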

pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

Transformers example code (requires Transformers 4.35.0 and later)

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "TheBloke/dolphin-2.7-mixtral-8x7b-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

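# Illustrative system message (an assumption); replace it with your own.
system_message = "You are a helpful AI assistant."
prompt = "Tell me about AI"
prompt_template = f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
'''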

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)

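📚 Detailed documentation

About AWQ

AWQ is an efficient, accurate, and very fast low-bit weight quantization method that currently supports 4-bit quantization. Compared with GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.

AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users should use GGUF models instead.

AWQ models are supported by the following tools (note that not all of them may support Mixtral models; see above):

Text Generation Webui - using Loader: AutoAWQ
vLLM - version 0.2.2 or later supports all model types
Hugging Face Text Generation Inference (TGI)
Transformers version 4.35.0 and later, from any code or client that supports Transformers
AutoAWQ - for use from Python code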

Prompt template: ChatML

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
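
The template above can be filled in with a small helper so that every code path builds prompts the same way. This is a minimal sketch: the function name build_chatml_prompt and the default system message are hypothetical and not part of the model card.

```python
def build_chatml_prompt(prompt, system_message="You are a helpful AI assistant."):
    """Wrap a user prompt and system message in the ChatML template."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

The returned string ends with the opening of the assistant turn, so the model's completion is the assistant's reply. It can be passed directly to vLLM, TGI, or a Transformers tokenizer.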

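Provided files and AWQ parameters

Currently only 128g GEMM models are being released. The addition of group size 32 models and GEMV kernel models is being actively considered.

Models are released as sharded safetensors files.

Branch	Bits	Group size	AWQ dataset	Sequence length	Size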
main	4	128	VMware Open Instruct	8192	24.65 GB
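
As a rough plausibility check on the size in the table, the storage cost of group-wise 4-bit quantization can be estimated by hand. The sketch below rests on assumptions that are not stated in this document: a total parameter count of about 46.7 billion for Mixtral 8x7B, and one 16-bit scale plus one 4-bit zero point stored per group of 128 weights. It also treats every parameter as quantized, which is a simplification. The function names are hypothetical.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Weight bits plus the per-group scale and zero point, amortized per weight."""
    return bits + (scale_bits + zero_bits) / group_size


def estimated_size_gb(n_params, **kwargs):
    """Estimated checkpoint size in GB (10**9 bytes) if all weights are quantized."""
    return n_params * effective_bits_per_weight(**kwargs) / 8 / 1e9


# Assumed parameter count for Mixtral 8x7B (not given in this document).
ASSUMED_PARAMS = 46.7e9
estimate = estimated_size_gb(ASSUMED_PARAMS)
```

Under these assumptions each weight costs 4 + 20/128 = 4.15625 bits, which gives an estimate of about 24.26 GB, close to the 24.65 GB in the table. The remaining gap would be consistent with some layers being kept in 16-bit precision, though that has not been verified here.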

Compatibility

The provided files are tested to work with:

text-generation-webui using Loader: AutoAWQ.
vLLM version 0.2.0 and later.
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later.
Transformers version 4.35.0 and later.
AutoAWQ version 0.1.1 and later.

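🔧 Technical details

Dolphin 2.7 Mixtral 8X7B is a retraining of Dolphin-2.5/2.6 that incorporates fixes in the transformers library, done to see whether it performs better. The model is based on Mixtral-8x7b; the base model has a 32k context, and the author fine-tuned it with 16k.

Training used qLoRA and Axolotl and took 3 days for 1.5 epochs on 4 A100 GPUs.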

📄 License

This model is licensed under Apache-2.0.

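Other information

Discord

For further support, and for discussion of these models and AI in general, join: TheBloke AI's Discord server

Thanks and how to contribute

Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!

If you are able and willing to contribute, it will be most gratefully received and will help the author keep providing more models and start work on new AI projects. Donors get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, and other benefits.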

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Aemon Algiz.

Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, S_X, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros