DeepSeek-V2-Chat-GGUF
This is a quantized version of DeepSeek-V2-Chat, offering various quantization options for different usage scenarios.

Quantized from https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
using llama.cpp b3026. Given the rapid release cadence of llama.cpp builds, the build used will likely change over time.
⚠️ Important Note
Please set the metadata KV overrides listed in the Documentation section below.
🚀 Quick Start
✨ Features
- Quantized from the original DeepSeek-V2-Chat model.
- Supports multiple quantization options with different quality and size trade-offs.
- Can be run in command-line chat mode or through an OpenAI-compatible server using llama.cpp.
📦 Installation
Downloading the bf16
- Find the relevant directory.
- Download all files.
- Run merge.py.
- The merged GGUF should appear (see the sketch below).
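A minimal sketch of the process, assuming the huggingface-cli tool is installed and that the bf16 splits live in a bf16/ directory of this repo; the repo ID and directory pattern are placeholders, and any arguments merge.py may need are documented alongside the script:
# download the bf16 split files (repo ID and pattern are placeholders)
huggingface-cli download {repo_id} --include "bf16/*" --local-dir .
# merge the downloaded splits into a single GGUF
python merge.py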
Downloading the quantizations
- Find the relevant directory.
- Download all files.
- Point to the first split; most programs now load all of the splits automatically (see the example below).
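For example, assuming the splits follow llama.cpp's usual naming convention (a -00001-of-000NN.gguf suffix, which is an assumption here), loading the first split is enough and the remaining splits in the same directory are picked up automatically:
main -m DeepSeek-V2-Chat.{quant}-00001-of-000NN.gguf -c {context_length} --color -i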
💻 Usage Examples
Basic Usage
To start in command-line chat mode (chat completion):
main -m DeepSeek-V2-Chat.{quant}.gguf -c {context_length} --color (-i)
Advanced Usage
To use llama.cpp's OpenAI-compatible server:
server \
-m DeepSeek-V2-Chat.{quant}.gguf \
-c {context_length} \
(--color [recommended: colored output in supported terminals]) \
(-i [note: interactive mode]) \
(--mlock [note: avoid using swap]) \
(--verbose) \
(--log-disable [note: disable logging to file, may be useful for prod]) \
(--metrics [note: prometheus compatible monitoring endpoint]) \
(--api-key [string]) \
(--port [int]) \
(--flash-attn [note: must be fully offloaded to supported GPU])
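Once the server is up, it can be queried like any OpenAI-compatible endpoint. A minimal sketch using curl, assuming llama.cpp's default port of 8080 and no --api-key (add an Authorization: Bearer header if you set one):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V2-Chat",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'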
Making an importance matrix:
imatrix \
-m DeepSeek-V2-Chat.{quant}.gguf \
-f groups_merged.txt \
--verbosity [0, 1, 2] \
-ngl {number of layers to offload to GPU; requires a CUDA build} \
--ofreq {recommended: 1}
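For example, a concrete run against the bf16 model, fully offloaded to GPU; the output filename imatrix.dat is an assumption, and -o simply sets where the matrix is written:
imatrix \
  -m DeepSeek-V2-Chat.bf16.gguf \
  -f groups_merged.txt \
  -o imatrix.dat \
  --verbosity 1 \
  -ngl 99 \
  --ofreq 1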
Making a quant:
quantize \
DeepSeek-V2-Chat.bf16.gguf \
DeepSeek-V2-Chat.{quant}.gguf \
{quant} \
(--imatrix [file])
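For example, producing a weighted IQ3_XS quant from the bf16 model using the importance matrix generated above (filenames are illustrative):
quantize \
  --imatrix imatrix.dat \
  DeepSeek-V2-Chat.bf16.gguf \
  DeepSeek-V2-Chat.IQ3_XS.gguf \
  IQ3_XS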
⚠️ Important Note
Use iMatrix quants only if you can fully offload the model to GPU; otherwise, speed will suffer.
📚 Documentation
Quants
| Quant | Status | Size | Quality / Notes | Version | Weighted (imatrix) |
|-------|--------|------|-----------------|---------|---------------------|
| BF16 | Available | 439 GB | Lossless :); Q8_0 is sufficient for most cases | Old | No |
| Q8_0 | Available | 233.27 GB | High quality, recommended | Updated | Yes |
| Q8_0 | Available | ~110 GB | High quality, recommended | Updated | Yes |
| Q5_K_M | Available | 155 GB | Medium-high quality, recommended | Updated | Yes |
| Q4_K_M | Available | 132 GB | Medium quality, recommended | Old | No |
| Q3_K_M | Available | 104 GB | Medium-low quality | Updated | Yes |
| IQ3_XS | Available | 89.6 GB | Better than Q3_K_M | Old | Yes |
| Q2_K | Available | 80.0 GB | Low quality, not recommended | Old | No |
| IQ2_XXS | Available | 61.5 GB | Lower quality, not recommended | Old | Yes |
| IQ1_M | Uploading | 27.3 GB | Extremely low quality, not recommended; for testing purposes only, use at least IQ2 | Old | Yes |
Planned Quants (weighted/iMatrix)
| Planned Quant | Notes |
|---------------|-------|
| Q5_K_S | |
| Q4_K_S | |
| Q3_K_S | |
| IQ4_XS | |
| IQ2_XS | |
| IQ2_S | |
| IQ2_M | |
Metadata KV overrides (pass each one with --override-kv; the flag can be specified multiple times):
deepseek2.attention.q_lora_rank=int:1536
deepseek2.attention.kv_lora_rank=int:512
deepseek2.expert_shared_count=int:2
deepseek2.expert_feed_forward_length=int:1536
deepseek2.expert_weights_scale=float:16
deepseek2.leading_dense_block_count=int:1
deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
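For example, a full chat invocation with all of the overrides applied (the quant and context length are illustrative):
main -m DeepSeek-V2-Chat.Q2_K.gguf -c 4096 --color -i \
  --override-kv deepseek2.attention.q_lora_rank=int:1536 \
  --override-kv deepseek2.attention.kv_lora_rank=int:512 \
  --override-kv deepseek2.expert_shared_count=int:2 \
  --override-kv deepseek2.expert_feed_forward_length=int:1536 \
  --override-kv deepseek2.expert_weights_scale=float:16 \
  --override-kv deepseek2.leading_dense_block_count=int:1 \
  --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707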
🔧 Technical Details
📄 License
- DeepSeek license for the model weights, which can be found in the LICENSE file in the root of this repo.
- MIT license for any repo code.
Censorship
This model is somewhat censored; fine-tuning on a toxic DPO dataset might help.