DaringMaid-20B-GGUF: an open-source large language model for text generation

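DaringMaid 20B GGUF

GGUF quantization by TheBloke

DaringMaid 20B is a large language model built on the Llama 2 architecture. The original model was created by Kooten and is aimed at text generation.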

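Large language model · English · #large-language-model #text-generation #multi-turn-dialogue

Downloads: 1,003

Released: 12/20/2023

Model overview

DaringMaid 20B is a 20B-parameter large language model suited to a range of text generation tasks. It supports English.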

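Model features

Efficient quantization

Quantized versions from 2-bit to 8-bit are provided to suit different hardware.

Broad compatibility

Works with many clients and libraries, including llama.cpp and text-generation-webui.

High-quality text generation

With 20B parameters, the model is intended for high-quality text generation.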

Model capabilities

Text generation

Instruction following

Story writing

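Use cases

Content creation

Story generation

Generate coherent, creative stories.

Instruction response

Generate text that responds appropriately to user instructions.

Education

Learning assistance

Generate study materials or answer study-related questions.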

🚀 DaringMaid 20B - GGUF

This project provides GGUF format files for Kooten's DaringMaid 20B model, ready for inference. The files were quantized on hardware provided by Massed Compute.

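🚀 Quick start

Choose the quantized model file that fits your needs and download it. The following clients and libraries can download models for you automatically:

LM Studio
LoLLMS Web UI
Faraday.dev

In text-generation-webui, under Download Model, enter the model repo TheBloke/DaringMaid-20B-GGUF and, below it, a specific filename to download, such as daringmaid-20b.Q4_K_M.gguf. Then click Download.

On the command line, you can download with the huggingface-hub Python library: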

pip3 install huggingface-hub
huggingface-cli download TheBloke/DaringMaid-20B-GGUF daringmaid-20b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

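✨ Key features

Multiple quantization formats: 2, 3, 4, 5, 6 and 8-bit GGUF models are provided for CPU+GPU inference.
Broad compatibility: the files are compatible with llama.cpp from August 27, 2023 onwards, as well as many third-party UIs and libraries.
Support for many clients: llama.cpp, text-generation-webui, KoboldCpp, GPT4All and other clients and libraries are supported.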

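📦 Installation guide

Downloading GGUF files

You can download the GGUF files in the following ways:

Automatic download: use a client or library such as LM Studio, LoLLMS Web UI or Faraday.dev, which will show a list of available models to choose from.
Manual download: cloning the entire repo is not recommended. Multiple quantization formats are provided, and most users only need to pick and download a single file.

Downloading in text-generation-webui: under Download Model, enter the model repo TheBloke/DaringMaid-20B-GGUF and, below it, a specific filename to download, such as daringmaid-20b.Q4_K_M.gguf. Then click Download.

Downloading on the command line: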

pip3 install huggingface-hub
huggingface-cli download TheBloke/DaringMaid-20B-GGUF daringmaid-20b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
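
The same single-file download can also be scripted from Python. The following is a minimal sketch that assumes the huggingface-hub package installed above and its hf_hub_download function; the helper names gguf_filename and download_quant are hypothetical, and the filename pattern is inferred from the file list in this document.

```python
REPO_ID = "TheBloke/DaringMaid-20B-GGUF"


def gguf_filename(quant: str) -> str:
    """Build a filename such as daringmaid-20b.Q4_K_M.gguf (pattern inferred from the file list)."""
    return f"daringmaid-20b.{quant}.gguf"


def download_quant(quant: str = "Q4_K_M", local_dir: str = ".") -> str:
    """Download one quantized file from the repo and return its local path."""
    # Imported lazily so gguf_filename() works even without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )
```

Calling download_quant("Q4_K_M") should fetch only that file into the current directory, which avoids cloning the whole repo.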

Installing dependencies

To use the model from Python code, install either the llama-cpp-python or the ctransformers library. llama-cpp-python is recommended:

# Base llama-cpp-python with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

# On Windows, set the CMAKE_ARGS variable in PowerShell using this format; e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python
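
If you script the installation, the backend-specific flags above can be kept in one lookup table. This is a minimal sketch: the CMAKE_FLAGS table, its keys and the pip_install_env helper are hypothetical names, and the flag values are copied from the commands above, so they may not match newer llama-cpp-python releases.

```python
import os

# Flag values copied from the install commands above; the keys are hypothetical labels.
CMAKE_FLAGS = {
    "cpu": "",
    "cuda": "-DLLAMA_CUBLAS=on",
    "openblas": "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS",
    "clblast": "-DLLAMA_CLBLAST=on",
    "rocm": "-DLLAMA_HIPBLAS=on",
    "metal": "-DLLAMA_METAL=on",
}


def pip_install_env(backend: str) -> dict:
    """Return a copy of the current environment with CMAKE_ARGS set for the chosen backend."""
    if backend not in CMAKE_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    env = dict(os.environ)
    flags = CMAKE_FLAGS[backend]
    if flags:
        env["CMAKE_ARGS"] = flags
    else:
        env.pop("CMAKE_ARGS", None)  # a plain CPU build needs no extra flags
    return env
```

The returned mapping can be passed as env= to subprocess.run([sys.executable, "-m", "pip", "install", "llama-cpp-python"], check=True), which works the same way on Linux, macOS and Windows.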

💻 Usage examples

Basic usage

Running in `llama.cpp`

Make sure you are using a llama.cpp version from August 27, 2023 or later:

./main -ngl 35 -m daringmaid-20b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{prompt}\n\n### Response:"

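-ngl 35: offloads 35 model layers to the GPU. Change the number to suit your hardware, or remove the option if you have no GPU acceleration.
-c 4096: sets the sequence length. Change it as needed; longer sequences require more resources.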
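
The effect of these two options can be sketched as a small command builder. This is an illustration only: build_main_command is a hypothetical helper, and it simply reproduces the flags from the command above, leaving out -ngl when no layers are offloaded.

```python
def build_main_command(model_path: str, prompt: str,
                       n_gpu_layers: int = 35, n_ctx: int = 4096) -> list:
    """Assemble the argument list for llama.cpp's ./main binary."""
    cmd = ["./main"]
    if n_gpu_layers > 0:
        # Only pass -ngl when there is a GPU to offload layers to.
        cmd += ["-ngl", str(n_gpu_layers)]
    cmd += [
        "-m", model_path,
        "--color",
        "-c", str(n_ctx),           # sequence length
        "--temp", "0.7",
        "--repeat_penalty", "1.1",
        "-n", "-1",
        "-p", prompt,
    ]
    return cmd


# CPU-only invocation with a shorter context
print(build_main_command("daringmaid-20b.Q4_K_M.gguf", "Hello", n_gpu_layers=0, n_ctx=2048))
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the llama.cpp directory.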

Using `llama-cpp-python` from Python code

from llama_cpp import Llama

# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
  model_path="./daringmaid-20b.Q4_K_M.gguf",  # Download the model file first
  n_ctx=4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8,            # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=35         # The number of layers to offload to GPU, if you have GPU acceleration available
)

# Simple inference example
output = llm(
  "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{prompt}\n\n### Response:", # Prompt
  max_tokens=512,  # Generate up to 512 tokens
  stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
  echo=True        # Whether to echo the prompt
)

# Chat Completion API
llm = Llama(model_path="./daringmaid-20b.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            "content": "Write a story about llamas."
        }
    ]
)
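
Both calls above return OpenAI-style dictionaries rather than plain strings. The sketch below shows one way to pull the generated text out of them; the helper names are hypothetical, and the sample dictionaries are hand-written stand-ins for real results, trimmed to the fields used here.

```python
def completion_text(result: dict) -> str:
    """Text of the first choice from a plain completion call such as llm(...)."""
    return result["choices"][0]["text"]


def chat_text(result: dict) -> str:
    """Assistant message of the first choice from create_chat_completion(...)."""
    return result["choices"][0]["message"]["content"]


# Hand-written stand-ins for real results (assumption: only the fields used here).
fake_completion = {"choices": [{"text": "Once upon a time...", "finish_reason": "stop"}]}
fake_chat = {"choices": [{"message": {"role": "assistant", "content": "Llamas roamed the hills."}}]}

print(completion_text(fake_completion))
print(chat_text(fake_chat))
```

Note that with echo=True, as in the simple inference example, the completion text begins with the prompt itself.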

Advanced usage

Running in text-generation-webui: further instructions can be found in the text-generation-webui documentation, at text-generation-webui/docs/04 ‐ Model Tab.md.

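📚 Detailed documentation

About GGUF

GGUF is a new format introduced by the llama.cpp team on August 21, 2023. It replaces GGML, which is no longer supported by llama.cpp. The following clients and libraries are known to support GGUF:

llama.cpp: the source project for GGUF. Offers a CLI and a server option.
text-generation-webui: the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
KoboldCpp: a fully featured web UI, with GPU acceleration across all platforms and GPU architectures. Especially good for storytelling.
GPT4All: a free and open-source GUI that runs locally, supporting Windows, Linux and macOS with full GPU acceleration.
LM Studio: an easy-to-use and powerful local GUI for Windows and macOS (Apple Silicon), with GPU acceleration. The Linux version was in beta as of November 27, 2023.
LoLLMS Web UI: a web UI with many interesting and unique features, including a full model library for easy model selection.
Faraday.dev: an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Apple Silicon and Intel), with GPU acceleration.
llama-cpp-python: a Python library with GPU acceleration, LangChain support and an OpenAI-compatible API server.
candle: a Rust ML framework with a focus on performance, including GPU support, and ease of use.
ctransformers: a Python library with GPU acceleration, LangChain support and an OpenAI-compatible AI server. As of November 27, 2023, ctransformers had not been updated in a long time and does not support many recent models.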

Prompt template

This model uses the Alpaca prompt template:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:
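
The usage examples in this document pass the template with a literal {prompt} placeholder, which has to be replaced with the actual instruction before the text is sent to the model. A minimal sketch of that substitution follows; the names ALPACA_TEMPLATE and build_prompt are hypothetical.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{prompt}\n\n"
    "### Response:"
)


def build_prompt(instruction: str) -> str:
    """Fill the Alpaca template with a user instruction."""
    return ALPACA_TEMPLATE.format(prompt=instruction.strip())


print(build_prompt("Write a story about llamas."))
```

The returned string can be used as the -p argument of llama.cpp or as the first argument of the llama-cpp-python call.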

🔧 Technical details

Explanation of quantization methods

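The new quantization methods are as follows:

GGML_TYPE_Q2_K: "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
GGML_TYPE_Q3_K: "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
GGML_TYPE_Q4_K: "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
GGML_TYPE_Q5_K: "type-1" 5-bit quantization with the same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
GGML_TYPE_Q6_K: "type-0" 6-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.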
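
The bpw figures for the 3-bit to 6-bit types can be reproduced with a little arithmetic. The sketch below is an illustration, not the llama.cpp implementation: it takes the block counts and scale widths from the list above and additionally assumes that each super-block stores one 16-bit float scale, plus one 16-bit float min for "type-1" formats. That assumption is not stated in this document.

```python
SUPER_BLOCK_WEIGHTS = 256  # 16 blocks x 16 weights, or 8 blocks x 32 weights


def bits_per_weight(weight_bits: int, blocks: int, scale_bits: int, type1: bool) -> float:
    """Effective bits per weight of one super-block."""
    fields = 2 if type1 else 1  # "type-1" stores a scale and a min, "type-0" only a scale
    total_bits = (
        weight_bits * SUPER_BLOCK_WEIGHTS  # the quantized weights themselves
        + scale_bits * fields * blocks     # per-block scales (and mins)
        + 16 * fields                      # assumed fp16 super-block scale (and min)
    )
    return total_bits / SUPER_BLOCK_WEIGHTS


Q3_K = bits_per_weight(3, blocks=16, scale_bits=6, type1=False)
Q4_K = bits_per_weight(4, blocks=8, scale_bits=6, type1=True)
Q5_K = bits_per_weight(5, blocks=8, scale_bits=6, type1=True)
Q6_K = bits_per_weight(6, blocks=16, scale_bits=8, type1=False)
print(Q3_K, Q4_K, Q5_K, Q6_K)  # 3.4375 4.5 5.5 6.5625
```

Q2_K is left out: under the same assumptions the sum comes to 2.625 rather than the quoted 2.5625, so its exact layout is not captured by this simple accounting.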

Refer to the Provided files table below to see which files use which methods, and how.

Provided files

Name	Quant method	Bits	Size	Max RAM required	Use case
daringmaid-20b.Q2_K.gguf	Q2_K	2	8.31 GB	10.81 GB	smallest, significant quality loss; not recommended for most purposes
daringmaid-20b.Q3_K_S.gguf	Q3_K_S	3	8.66 GB	11.16 GB	very small, high quality loss
daringmaid-20b.Q3_K_M.gguf	Q3_K_M	3	9.70 GB	12.20 GB	very small, high quality loss
daringmaid-20b.Q3_K_L.gguf	Q3_K_L	3	10.63 GB	13.13 GB	small, substantial quality loss
daringmaid-20b.Q4_0.gguf	Q4_0	4	11.29 GB	13.79 GB	legacy; small, very high quality loss; prefer Q3_K_M
daringmaid-20b.Q4_K_S.gguf	Q4_K_S	4	11.34 GB	13.84 GB	small, greater quality loss
daringmaid-20b.Q4_K_M.gguf	Q4_K_M	4	12.04 GB	14.54 GB	medium, balanced quality; recommended
daringmaid-20b.Q5_0.gguf	Q5_0	5	13.77 GB	16.27 GB	legacy; medium, balanced quality; prefer Q4_K_M
daringmaid-20b.Q5_K_S.gguf	Q5_K_S	5	13.77 GB	16.27 GB	large, low quality loss; recommended
daringmaid-20b.Q5_K_M.gguf	Q5_K_M	5	14.16 GB	16.66 GB	large, very low quality loss; recommended
daringmaid-20b.Q6_K.gguf	Q6_K	6	16.41 GB	18.91 GB	very large, extremely low quality loss
daringmaid-20b.Q8_0.gguf	Q8_0	8	21.25 GB	23.75 GB	very large, extremely low quality loss; not recommended

Note: the RAM figures above assume no GPU offloading. If layers are offloaded to the GPU, RAM usage goes down and VRAM is used instead.
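
In the table, every "Max RAM required" value equals the file size plus 2.50 GB. Under that observation, a rough way to pick a file for a given amount of RAM can be sketched as follows. The names FILES, max_ram_gb and largest_fit are hypothetical, and the estimate ignores GPU offloading.

```python
# (quant method, file size in GB), copied from the Provided files table
FILES = [
    ("Q2_K", 8.31), ("Q3_K_S", 8.66), ("Q3_K_M", 9.70), ("Q3_K_L", 10.63),
    ("Q4_0", 11.29), ("Q4_K_S", 11.34), ("Q4_K_M", 12.04), ("Q5_0", 13.77),
    ("Q5_K_S", 13.77), ("Q5_K_M", 14.16), ("Q6_K", 16.41), ("Q8_0", 21.25),
]
OVERHEAD_GB = 2.5  # difference between "Max RAM required" and "Size" in every row


def max_ram_gb(size_gb: float) -> float:
    """Estimated RAM needed with no GPU offloading."""
    return round(size_gb + OVERHEAD_GB, 2)


def largest_fit(ram_budget_gb: float):
    """Return the largest quant method whose RAM estimate fits the budget, or None."""
    fitting = [(size, name) for name, size in FILES if max_ram_gb(size) <= ram_budget_gb]
    return max(fitting)[1] if fitting else None


print(largest_fit(16.0))  # Q4_K_M
```

Size is only a proxy for quality here; the Use case column should still be checked, since the legacy Q4_0 and Q5_0 files are listed as less preferable than their K-quant neighbours.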

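📄 License

The creator of the source model has listed its license as cc-by-nc-4.0, and this quantization therefore uses the same license.

Because the model is based on Llama 2, it is also subject to the Meta Llama 2 license terms, and the license files for that are additionally included. It should therefore be considered as being claimed to be licensed under both licenses. I have contacted Hugging Face for clarification on dual licensing, but they do not yet have an official position. Should this change, or should Meta provide any feedback on this situation, I will update this section accordingly.

In the meantime, any questions regarding licensing, and in particular how these two licenses might interact, should be directed to the original model repository: Kooten's DaringMaid 20B.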