🚀 ik_llama.cpp imatrix Quantizations of Kimi-Dev-72B
This project offers quantized versions of the Kimi-Dev-72B model, focusing on high-performance quantizations that require a specific fork of ik_llama.cpp to run.
✨ Features
Quantization Requirements
This quant collection REQUIRES the ik_llama.cpp fork to support the advanced non-linear SotA quants and Multi-Head Latent Attention (MLA). Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.! They might work in Nexesenex's croco.cpp kobold fork, though that is untested.
smol-IQ3_K Quantization Details
- Size: 32.273 GiB (3.813 BPW)
- Tensor Types:
  - f32: 401 tensors
  - q4_K: 1 tensor (token_embd)
  - q6_K: 1 tensor (output "head")
  - iq4_nl: 80 tensors (ffn_down)
  - iq3_k: 320 tensors (q|o and gate|up projections)
  - iq4_k: 160 tensors (k|v projections)
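If you want to double-check that breakdown yourself, the gguf-py tooling can list every tensor with its quant type. A sketch, assuming you run it from an ik_llama.cpp checkout (see Quickstart below), since mainline gguf-py may not recognize the ik-specific iq*_k types:

```bash
# Install the gguf-py package bundled with ik_llama.cpp, then dump per-tensor info.
pip install ./gguf-py
gguf-dump /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf | less
```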
📦 Installation
Quickstart
```bash
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

./build/bin/llama-server \
    --model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    --ctx-size 8192 \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    --no-mmap \
    -ngl 48 \
    --threads 16 \
    --parallel 1 \
    --host 127.0.0.1 \
    --port 8080
```
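Once the server is up, you can smoke-test it through llama-server's OpenAI-compatible chat endpoint; the prompt and token limit here are just placeholder values:

```bash
# Minimal request against the llama-server started above.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a hello world in Rust."}],
        "max_tokens": 128
      }'
```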
📚 Documentation
Benchmarks
Speed
- Hardware Setup:
  - AMD 9950X
  - Overclocked Infinity Fabric "gear 1" clocks
  - 2x 48GB DDR5@6400 RAM (~87 GB/s benchmarked)
  - 3090 TI FE 24GB VRAM @ 450 watts (uncapped)
- Performance Metrics:
  - PP ~500 tok/sec with 2048-token batches
  - TG ~5 tok/sec, limited by RAM I/O bandwidth (see the back-of-envelope check below)
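As a rough sanity check on that bandwidth-limited claim (my own back-of-envelope arithmetic, not a measured figure): with -ngl 48, 32 of the model's 80 layers stay in system RAM, so each generated token streams roughly 32/80 of the 32.3 GiB of weights through the ~87 GB/s memory bus:

```bash
# Theoretical TG ceiling from RAM bandwidth alone, assuming 32/80 layers in RAM:
# ~32.3 GiB * 32/80 ≈ 12.9 GiB read per token; 87 / 12.9 ≈ 6.7 tok/s upper bound.
echo "scale=1; 87 / (32.3 * 32 / 80)" | bc   # measured ~5 tok/s is in the right ballpark
```

The numbers in the table below were collected with llama-sweep-bench: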
```bash
./build/bin/llama-sweep-bench \
    --model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    --ctx-size 6144 \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    --no-mmap \
    -ub 2048 -b 2048 \
    -ngl 48 \
    --warmup-batch \
    --threads 16
```
ubergarm/Kimi-Dev-72B-smol-IQ3_K Performance Table

(PP = prompt tokens processed per batch, TG = tokens generated, N_KV = tokens already in the KV cache; T_* columns are times in seconds, S_* columns are speeds in tokens/second.)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 3.925 | 521.77 | 103.624 | 4.94 |
| 2048 | 512 | 2048 | 4.058 | 504.63 | 105.265 | 4.86 |
Quality
I tested perplexity for a bunch of experimental quants and decided this one was a decent trade-off between quality and speed.

FAQ
- Why is it `smol`?
  - I ran out of names making a bunch of similar-sized quants for the perplexity graph above, lol.
- Will you make larger GGUFs?
  - Naw, you can already get good mainline llama.cpp GGUFs from others like bartowski and bullerwins.
- Where can I get those hot new EXL3 quants?
- What about the new `iqK_kt` QTIP Trellis-style quants?
  - I may release something eventually, but they are still pretty fresh, so I'm gonna wait a minute and see if any breaking changes happen before releasing.
  - Also, the column dimension of the `ffn_down` tensor is not divisible by 256, so I had to use iq4_nl for it unless something changes (see the quick check below).
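For context on that block-size constraint (a sketch, assuming Kimi-Dev-72B keeps its Qwen2.5-72B base's FFN hidden size of 29568): the 256-element superblock quants like iq3_k can't tile the ffn_down rows, while 32-element-block iq4_nl can:

```bash
# ffn_down input dimension (29568, assumed from the Qwen2.5-72B base config)
# modulo the two quant block sizes:
echo "29568 % 256 = $((29568 % 256))"  # 128 -> 256-block quants don't fit
echo "29568 % 32  = $((29568 % 32))"   # 0   -> 32-block iq4_nl fits
```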
References
Information Table

| Property | Details |
|---|---|
| Quantized By | ubergarm |
| Pipeline Tag | text-generation |
| Base Model | moonshotai/Kimi-Dev-72B |
| License | mit |
| Base Model Relation | quantized |
| Tags | code, imatrix, ik_llama.cpp |