🚀 ik_llama.cpp imatrix Quantizations of Kimi-Dev-72B
This project offers quantized versions of the Kimi-Dev-72B model, focusing on high-performance quantizations that require a specific fork of ik_llama.cpp to run.
✨ Features
Quantization Requirements
This quant collection REQUIRES the ik_llama.cpp fork to support the advanced non-linear SotA quants and Multi-Head Latent Attention (MLA). Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.! They might work in Nexesenex's croco.cpp kobold fork, though that is untested.
smol-IQ3_K Quantization Details
- Size: 32.273 GiB (3.813 BPW)
- Tensor Types:
  - f32: 401 tensors
  - q4_K: 1 tensor (token_embd)
  - q6_K: 1 tensor (output "head")
  - iq4_nl: 80 tensors (ffn_down)
  - iq3_k: 320 tensors (q|o and gate|up projections)
  - iq4_k: 160 tensors (k|v projections)
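If you want to double-check that breakdown yourself, the gguf-py tooling can list every tensor with its quant type. A sketch, assuming you run it from an ik_llama.cpp checkout (see Quickstart below), since mainline gguf-py may not recognize the ik-specific iq*_k types:

```bash
# Install the gguf-py package bundled with ik_llama.cpp, then dump per-tensor info.
pip install ./gguf-py
gguf-dump /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf | less
```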
📦 Installation
Quickstart
```bash
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

./build/bin/llama-server \
    --model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    --ctx-size 8192 \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    --no-mmap \
    -ngl 48 \
    --threads 16 \
    --parallel 1 \
    --host 127.0.0.1 \
    --port 8080
```
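Once the server is up, you can smoke-test it through llama-server's OpenAI-compatible chat endpoint; the prompt and token limit here are just placeholder values:

```bash
# Minimal request against the llama-server started above.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a hello world in Rust."}],
        "max_tokens": 128
      }'
```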
📚 Documentation
Benchmarks
Speed
- Hardware Setup:
  - AMD 9950X
  - Overclocked Infinity Fabric "gear 1" clocks
  - 2x 48GB DDR5@6400 RAM (~87 GB/s benchmarked)
  - 3090 TI FE 24GB VRAM @ 450 watts (uncapped)
- Performance Metrics:
  - PP ~500 tok/sec with 2048-token batches
  - TG ~5 tok/sec, limited by RAM I/O bandwidth (see the back-of-envelope check below)
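As a rough sanity check on that bandwidth-limited claim (my own back-of-envelope arithmetic, not a measured figure): with -ngl 48, 32 of the model's 80 layers stay in system RAM, so each generated token streams roughly 32/80 of the 32.3 GiB of weights through the ~87 GB/s memory bus:

```bash
# Theoretical TG ceiling from RAM bandwidth alone, assuming 32/80 layers in RAM:
# ~32.3 GiB * 32/80 ≈ 12.9 GiB read per token; 87 / 12.9 ≈ 6.7 tok/s upper bound.
echo "scale=1; 87 / (32.3 * 32 / 80)" | bc   # measured ~5 tok/s is in the right ballpark
```

The numbers in the table below were collected with llama-sweep-bench: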
```bash
./build/bin/llama-sweep-bench \
    --model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    --ctx-size 6144 \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    --no-mmap \
    -ub 2048 -b 2048 \
    -ngl 48 \
    --warmup-batch \
    --threads 16
```
ubergarm/Kimi-Dev-72B-smol-IQ3_K Performance Table

(PP = prompt tokens processed per batch, TG = tokens generated, N_KV = tokens already in the KV cache; T_* columns are times in seconds, S_* columns are speeds in tokens/second.)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 3.925 | 521.77 | 103.624 | 4.94 |
| 2048 | 512 | 2048 | 4.058 | 504.63 | 105.265 | 4.86 |
Quality
I tested perplexity for a bunch of experimental quants and decided this one was a decent trade-off between quality and speed.

FAQ
- Why is it `smol`?
  - I ran out of names making a bunch of similar-sized quants for the perplexity graph above, lol.
- Will you make larger GGUFs?
  - Naw, you can already get good mainline llama.cpp GGUFs from others like bartowski and bullerwins.
- Where can I get those hot new EXL3 quants?
- What about the new `iqK_kt` QTIP Trellis-style quants?
  - I may release something eventually, but they are still pretty fresh, so I'm gonna wait a minute and see if any breaking changes happen before releasing.
  - Also, the column dimension of the `ffn_down` tensor is not divisible by 256, so I had to use iq4_nl for it unless something changes (see the quick check below).
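For context on that block-size constraint (a sketch, assuming Kimi-Dev-72B keeps its Qwen2.5-72B base's FFN hidden size of 29568): the 256-element superblock quants like iq3_k can't tile the ffn_down rows, while 32-element-block iq4_nl can:

```bash
# ffn_down input dimension (29568, assumed from the Qwen2.5-72B base config)
# modulo the two quant block sizes:
echo "29568 % 256 = $((29568 % 256))"  # 128 -> 256-block quants don't fit
echo "29568 % 32  = $((29568 % 32))"   # 0   -> 32-block iq4_nl fits
```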
References
Information Table

| Property | Details |
|---|---|
| Quantized By | ubergarm |
| Pipeline Tag | text-generation |
| Base Model | moonshotai/Kimi-Dev-72B |
| License | mit |
| Base Model Relation | quantized |
| Tags | code, imatrix, ik_llama.cpp |