DeepSeek-V2-Chat-GGUF
This is a quantized version of DeepSeek-V2-Chat, offering various quantization options for different usage scenarios.

Quantized from https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
using llama.cpp b3026. Given the rapid release cadence of llama.cpp builds, the build used will likely change over time.
⚠️ Important Note
Please set the metadata KV overrides listed in the Documentation section below.
🚀 Quick Start
✨ Features
- Quantized from the original DeepSeek-V2-Chat model.
- Supports multiple quantization options with different quality and size trade-offs.
- Can be run in command-line chat mode or through an OpenAI-compatible server using llama.cpp.
📦 Installation
Downloading the bf16
- Find the relevant directory.
- Download all files.
- Run merge.py.
- The merged GGUF should appear (see the sketch below).
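A minimal sketch of the process, assuming the huggingface-cli tool is installed and that the bf16 splits live in a bf16/ directory of this repo; the repo ID and directory pattern are placeholders, and any arguments merge.py may need are documented alongside the script:
# download the bf16 split files (repo ID and pattern are placeholders)
huggingface-cli download {repo_id} --include "bf16/*" --local-dir .
# merge the downloaded splits into a single GGUF
python merge.py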
Downloading the quantizations
- Find the relevant directory.
- Download all files.
- Point to the first split; most programs now load all of the splits automatically (see the example below).
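For example, assuming the splits follow llama.cpp's usual naming convention (a -00001-of-000NN.gguf suffix, which is an assumption here), loading the first split is enough and the remaining splits in the same directory are picked up automatically:
main -m DeepSeek-V2-Chat.{quant}-00001-of-000NN.gguf -c {context_length} --color -i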
💻 Usage Examples
Basic Usage
To start in command-line chat mode (chat completion):
main -m DeepSeek-V2-Chat.{quant}.gguf -c {context_length} --color (-i)
Advanced Usage
To use llama.cpp's OpenAI-compatible server:
server \
-m DeepSeek-V2-Chat.{quant}.gguf \
-c {context_length} \
(--color [recommended: colored output in supported terminals]) \
(-i [note: interactive mode]) \
(--mlock [note: avoid using swap]) \
(--verbose) \
(--log-disable [note: disable logging to file, may be useful for prod]) \
(--metrics [note: prometheus compatible monitoring endpoint]) \
(--api-key [string]) \
(--port [int]) \
(--flash-attn [note: must be fully offloaded to supported GPU])
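Once the server is up, it can be queried like any OpenAI-compatible endpoint. A minimal sketch using curl, assuming llama.cpp's default port of 8080 and no --api-key (add an Authorization: Bearer header if you set one):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V2-Chat",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'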
Making an importance matrix:
imatrix \
-m DeepSeek-V2-Chat.{quant}.gguf \
-f groups_merged.txt \
--verbosity [0, 1, 2] \
-ngl {number of layers to offload to GPU; requires a CUDA build} \
--ofreq {recommended: 1}
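For example, a concrete run against the bf16 model, fully offloaded to GPU; the output filename imatrix.dat is an assumption, and -o simply sets where the matrix is written:
imatrix \
  -m DeepSeek-V2-Chat.bf16.gguf \
  -f groups_merged.txt \
  -o imatrix.dat \
  --verbosity 1 \
  -ngl 99 \
  --ofreq 1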
Making a quant:
quantize \
DeepSeek-V2-Chat.bf16.gguf \
DeepSeek-V2-Chat.{quant}.gguf \
{quant} \
(--imatrix [file])
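For example, producing a weighted IQ3_XS quant from the bf16 model using the importance matrix generated above (filenames are illustrative):
quantize \
  --imatrix imatrix.dat \
  DeepSeek-V2-Chat.bf16.gguf \
  DeepSeek-V2-Chat.IQ3_XS.gguf \
  IQ3_XS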
⚠️ Important Note
Use iMatrix quants only if you can fully offload the model to GPU; otherwise, speed will suffer.
📚 Documentation
Quants
| Quant | Status | Size | Quality / Notes | Version | Weighted (imatrix) |
|-------|--------|------|-----------------|---------|---------------------|
| BF16 | Available | 439 GB | Lossless :); Q8_0 is sufficient for most cases | Old | No |
| Q8_0 | Available | 233.27 GB | High quality, recommended | Updated | Yes |
| Q8_0 | Available | ~110 GB | High quality, recommended | Updated | Yes |
| Q5_K_M | Available | 155 GB | Medium-high quality, recommended | Updated | Yes |
| Q4_K_M | Available | 132 GB | Medium quality, recommended | Old | No |
| Q3_K_M | Available | 104 GB | Medium-low quality | Updated | Yes |
| IQ3_XS | Available | 89.6 GB | Better than Q3_K_M | Old | Yes |
| Q2_K | Available | 80.0 GB | Low quality, not recommended | Old | No |
| IQ2_XXS | Available | 61.5 GB | Lower quality, not recommended | Old | Yes |
| IQ1_M | Uploading | 27.3 GB | Extremely low quality, not recommended; for testing purposes only, use at least IQ2 | Old | Yes |
Planned Quants (weighted/iMatrix)
| Planned Quant | Notes |
|---------------|-------|
| Q5_K_S | |
| Q4_K_S | |
| Q3_K_S | |
| IQ4_XS | |
| IQ2_XS | |
| IQ2_S | |
| IQ2_M | |
Metadata KV overrides (pass each one with --override-kv; the flag can be specified multiple times):
deepseek2.attention.q_lora_rank=int:1536
deepseek2.attention.kv_lora_rank=int:512
deepseek2.expert_shared_count=int:2
deepseek2.expert_feed_forward_length=int:1536
deepseek2.expert_weights_scale=float:16
deepseek2.leading_dense_block_count=int:1
deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
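For example, a full chat invocation with all of the overrides applied (the quant and context length are illustrative):
main -m DeepSeek-V2-Chat.Q2_K.gguf -c 4096 --color -i \
  --override-kv deepseek2.attention.q_lora_rank=int:1536 \
  --override-kv deepseek2.attention.kv_lora_rank=int:512 \
  --override-kv deepseek2.expert_shared_count=int:2 \
  --override-kv deepseek2.expert_feed_forward_length=int:1536 \
  --override-kv deepseek2.expert_weights_scale=float:16 \
  --override-kv deepseek2.leading_dense_block_count=int:1 \
  --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707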
🔧 Technical Details
📄 License
- DeepSeek license for the model weights, which can be found in the LICENSE file in the root of this repo.
- MIT license for any repo code.
Censorship
This model is somewhat censored; fine-tuning on a toxic DPO dataset might help.