DaringMaid-20B-GGUF open-source large language model: free, high-quality text generation

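DaringMaid 20B GGUF

GGUF files published by TheBloke

DaringMaid 20B is a large language model built on the Llama 2 architecture. It was created by Kooten and focuses on text generation.

Large language model | English | #large-language-model #text-generation #multi-turn-dialogue

Downloads: 1,003

Released: 12/20/2023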

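Model overview

DaringMaid 20B is a 20B-parameter large language model suited to a range of text generation tasks. It supports English.

Model features

Efficient quantization

Quantized versions from 2-bit to 8-bit are provided to suit different hardware.

Broad compatibility

Works with many clients and libraries, including llama.cpp and text-generation-webui.

High-quality text generation

With 20B parameters, the model can generate high-quality text.

Model capabilities

Text generation

Instruction following

Story writing

Use cases

Content creation

Story generation

Generates coherent, creative stories.

Instruction response

Generates appropriate text responses to user instructions.

Education

Study assistance

Generates study materials or answers study-related questions.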

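🚀 DaringMaid 20B - GGUF

This repository provides GGUF format model files for Kooten's DaringMaid 20B, ready for inference. The files were quantized using hardware provided by Massed Compute.

🚀 Quick start

Choose the quantized model file that fits your needs and download it. The following clients and libraries can download models for you automatically:

LM Studio
LoLLMS Web UI
Faraday.dev

In text-generation-webui, under Download Model, enter the model repo TheBloke/DaringMaid-20B-GGUF and a specific file name to download, such as daringmaid-20b.Q4_K_M.gguf. Then click Download.

On the command line, you can download with the huggingface-hub Python library: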

pip3 install huggingface-hub
huggingface-cli download TheBloke/DaringMaid-20B-GGUF daringmaid-20b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

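✨ Key features

Multiple quantization formats: 2, 3, 4, 5, 6 and 8-bit GGUF models are provided for CPU+GPU inference.
Broad compatibility: compatible with llama.cpp from August 27th 2023 onwards, as well as many third-party UIs and libraries.
Many supported clients: works with llama.cpp, text-generation-webui, KoboldCpp, GPT4All and other clients and libraries.

📦 Installation guide

Downloading GGUF files

You can download the GGUF files in the following ways:

Automatic download: clients and libraries such as LM Studio, LoLLMS Web UI and Faraday.dev show a list of available models to choose from.
Manual download: cloning the entire repo is not recommended. Multiple quantization formats are provided, and most users only need to pick and download a single file.

Downloading in text-generation-webui: under Download Model, enter the model repo TheBloke/DaringMaid-20B-GGUF and a specific file name to download, such as daringmaid-20b.Q4_K_M.gguf. Then click Download.

Downloading on the command line: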

pip3 install huggingface-hub
huggingface-cli download TheBloke/DaringMaid-20B-GGUF daringmaid-20b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
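
The same download can also be scripted from Python. The following is a minimal sketch, assuming the huggingface-hub package from the command above is installed. The helper names gguf_filename and download_quant are illustrative and not part of any library; only hf_hub_download comes from huggingface-hub.

```python
REPO_ID = "TheBloke/DaringMaid-20B-GGUF"


def gguf_filename(quant: str) -> str:
    """Build the file name for a quantization label such as "Q4_K_M"."""
    return f"daringmaid-20b.{quant}.gguf"


def download_quant(quant: str = "Q4_K_M", local_dir: str = ".") -> str:
    """Download a single GGUF file and return its local path."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )
```

Calling download_quant("Q4_K_M") fetches only that one file, which matches the advice above not to clone the whole repo.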

Installing dependencies

To use the model from Python code, install either llama-cpp-python or ctransformers. llama-cpp-python is recommended:

# Base llama-cpp-python with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

# On Windows, set the CMAKE_ARGS variable in PowerShell using this format, e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python

💻 Usage examples

Basic usage

Running in `llama.cpp`

Make sure you are using a llama.cpp version from August 27th 2023 or later:

./main -ngl 35 -m daringmaid-20b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{prompt}\n\n### Response:"

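-ngl 35: offloads 35 model layers to the GPU. Remove this option if you don't have GPU acceleration.
-c 4096: sets the desired sequence length. Longer sequence lengths require more resources.

Using `llama-cpp-python` from Python code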

from llama_cpp import Llama

# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
  model_path="./daringmaid-20b.Q4_K_M.gguf",  # Download the model file first
  n_ctx=4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8,            # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=35         # The number of layers to offload to GPU, if you have GPU acceleration available
)

# Simple inference example
output = llm(
  "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{prompt}\n\n### Response:", # Prompt
  max_tokens=512,  # Generate up to 512 tokens
  stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
  echo=True        # Whether to echo the prompt
)

# Chat Completion API
llm = Llama(model_path="./daringmaid-20b.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            "content": "Write a story about llamas."
        }
    ]
)
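
The calls above return results without printing them. Here is a small sketch for pulling the generated text out of each result, assuming non-streaming calls, which return OpenAI-style dictionaries. The helper names completion_text and chat_text are illustrative.

```python
def completion_text(output: dict) -> str:
    """Return the generated text from a plain completion result."""
    return output["choices"][0]["text"]


def chat_text(response: dict) -> str:
    """Return the assistant reply from a chat completion result."""
    return response["choices"][0]["message"]["content"]
```

For example, print(completion_text(output)) prints the first completion. For the chat call, assign the return value of llm.create_chat_completion(...) to a variable first and pass it to chat_text.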

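Advanced usage

Running in text-generation-webui: further instructions can be found in the text-generation-webui documentation: text-generation-webui/docs/04 ‐ Model Tab.md.

📚 Detailed documentation

About GGUF

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It replaces GGML, which is no longer supported by llama.cpp. Clients and libraries known to support GGUF include: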

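llama.cpp: the source project for GGUF. Offers a CLI and a server option.
text-generation-webui: the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
KoboldCpp: a fully featured web UI, with GPU acceleration across all platforms and GPU architectures. Especially good for storytelling.
GPT4All: a free and open-source GUI that runs locally, supporting Windows, Linux and macOS with full GPU acceleration.
LM Studio: an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. The Linux version was in beta as of November 27th 2023.
LoLLMS Web UI: a web UI with many interesting and unique features, including a full model library for easy model selection.
Faraday.dev: an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
llama-cpp-python: a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server.
candle: a Rust ML framework with a focus on performance, including GPU support, and ease of use.
ctransformers: a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server. As of November 27th 2023, ctransformers has not been updated in a long time and does not support many recent models.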

Prompt template

This model uses the Alpaca prompt template:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:
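
The usage examples pass the template with a literal {prompt} placeholder, which has to be replaced with a real instruction before inference. A minimal sketch of filling the template follows; ALPACA_TEMPLATE and build_alpaca_prompt are illustrative names, not part of any library.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{prompt}\n\n"
    "### Response:"
)


def build_alpaca_prompt(instruction: str) -> str:
    """Insert the user's instruction into the Alpaca template."""
    # str.replace is used instead of str.format so that instructions
    # containing curly braces do not raise a formatting error.
    return ALPACA_TEMPLATE.replace("{prompt}", instruction.strip())
```

The result of build_alpaca_prompt("Write a story about llamas.") can be passed as the prompt string to llm(...) or to the -p option of llama.cpp.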

🔧 Technical details

Explanation of quantization methods

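The new quantization methods are:

GGML_TYPE_Q2_K: "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
GGML_TYPE_Q3_K: "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
GGML_TYPE_Q4_K: "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
GGML_TYPE_Q5_K: "type-1" 5-bit quantization with the same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
GGML_TYPE_Q6_K: "type-0" 6-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.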
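
The bpw figures can be reproduced with simple arithmetic, shown here for the 3-bit to 6-bit types. This sketch rests on one assumption that the descriptions above do not state: each super-block also stores 16-bit floating-point fields, one (a scale) for "type-0" and two (a scale and a min) for "type-1". The function name bits_per_weight is illustrative.

```python
def bits_per_weight(weight_bits, blocks, weights_per_block,
                    scale_bits, has_min, fp16_fields):
    """Effective bits per weight for one super-block.

    has_min: True for "type-1" (scale and min per block), False for "type-0".
    fp16_fields: number of 16-bit super-block fields (an assumption, see above).
    """
    weights = blocks * weights_per_block
    per_block_meta = scale_bits * (2 if has_min else 1)
    total_bits = (
        weights * weight_bits        # the quantized weights themselves
        + blocks * per_block_meta    # per-block scales (and mins)
        + 16 * fp16_fields           # super-block fp16 fields
    )
    return total_bits / weights


print(bits_per_weight(3, 16, 16, 6, False, 1))  # Q3_K -> 3.4375
print(bits_per_weight(4, 8, 32, 6, True, 2))    # Q4_K -> 4.5
print(bits_per_weight(5, 8, 32, 6, True, 2))    # Q5_K -> 5.5
print(bits_per_weight(6, 16, 16, 8, False, 1))  # Q6_K -> 6.5625
```

Every super-block here holds 256 weights, so the overhead above the nominal bit width is just the metadata bits divided by 256.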

Refer to the Provided files table below to see which files use which methods, and how.

Provided files

Name	Quant method	Bits	Size	Max RAM required	Use case
daringmaid-20b.Q2_K.gguf	Q2_K	2	8.31 GB	10.81 GB	smallest, significant quality loss; not recommended for most purposes
daringmaid-20b.Q3_K_S.gguf	Q3_K_S	3	8.66 GB	11.16 GB	very small, high quality loss
daringmaid-20b.Q3_K_M.gguf	Q3_K_M	3	9.70 GB	12.20 GB	very small, high quality loss
daringmaid-20b.Q3_K_L.gguf	Q3_K_L	3	10.63 GB	13.13 GB	small, substantial quality loss
daringmaid-20b.Q4_0.gguf	Q4_0	4	11.29 GB	13.79 GB	legacy; small, very high quality loss; prefer Q3_K_M
daringmaid-20b.Q4_K_S.gguf	Q4_K_S	4	11.34 GB	13.84 GB	small, greater quality loss
daringmaid-20b.Q4_K_M.gguf	Q4_K_M	4	12.04 GB	14.54 GB	medium, balanced quality; recommended
daringmaid-20b.Q5_0.gguf	Q5_0	5	13.77 GB	16.27 GB	legacy; medium, balanced quality; prefer Q4_K_M
daringmaid-20b.Q5_K_S.gguf	Q5_K_S	5	13.77 GB	16.27 GB	large, low quality loss; recommended
daringmaid-20b.Q5_K_M.gguf	Q5_K_M	5	14.16 GB	16.66 GB	large, very low quality loss; recommended
daringmaid-20b.Q6_K.gguf	Q6_K	6	16.41 GB	18.91 GB	very large, extremely low quality loss
daringmaid-20b.Q8_0.gguf	Q8_0	8	21.25 GB	23.75 GB	very large, extremely low quality loss; not recommended

Note: the RAM figures above assume no GPU offloading. Offloading layers to the GPU reduces RAM usage and uses VRAM instead.
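
One way to use the table is to pick the largest file whose maximum RAM figure fits within a memory budget. The sketch below is a simple heuristic, not an official recommendation: the figures are copied from the table (no GPU offloading), the two legacy formats Q4_0 and Q5_0 are left out, and the function name largest_file_that_fits is illustrative.

```python
# (file name, max RAM required in GB), copied from the table above.
# The legacy Q4_0 and Q5_0 files are intentionally omitted.
FILES = [
    ("daringmaid-20b.Q2_K.gguf", 10.81),
    ("daringmaid-20b.Q3_K_S.gguf", 11.16),
    ("daringmaid-20b.Q3_K_M.gguf", 12.20),
    ("daringmaid-20b.Q3_K_L.gguf", 13.13),
    ("daringmaid-20b.Q4_K_S.gguf", 13.84),
    ("daringmaid-20b.Q4_K_M.gguf", 14.54),
    ("daringmaid-20b.Q5_K_S.gguf", 16.27),
    ("daringmaid-20b.Q5_K_M.gguf", 16.66),
    ("daringmaid-20b.Q6_K.gguf", 18.91),
    ("daringmaid-20b.Q8_0.gguf", 23.75),
]


def largest_file_that_fits(ram_gb: float):
    """Return the file with the highest RAM need that still fits, or None."""
    fitting = [entry for entry in FILES if entry[1] <= ram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda entry: entry[1])[0]


print(largest_file_that_fits(16.0))  # daringmaid-20b.Q4_K_M.gguf
```

The budget should leave headroom for the operating system and other processes, which this sketch does not account for.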

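📄 License

The creator of the source model has listed its license as cc-by-nc-4.0, and this quantization therefore uses the same license.

Because the model is based on Llama 2, it is also subject to the Meta Llama 2 license terms, and the license files for that are additionally included. It should therefore be considered as being claimed to be licensed under both licenses. I have contacted Hugging Face for clarification on dual licensing, but they do not yet have an official position. Should this change, or should Meta provide any feedback on this situation, I will update this section accordingly.

In the meantime, any questions regarding licensing, and in particular how these two licenses might interact, should be directed to the original model repository: Kooten's DaringMaid 20B.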