🚀 Llamacpp imatrix Quantizations of 72B-Qwen2.5-Kunou-v1
This project provides llama.cpp imatrix quantizations of the 72B-Qwen2.5-Kunou-v1 model, enabling efficient deployment across a range of environments, including LM Studio.
🚀 Quick Start
Quantization Process
We use llama.cpp release b4273 for quantization. The original model can be found at Sao10K/72B-Qwen2.5-Kunou-v1. All quantizations are made using the imatrix option with a dataset from here.
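For reference, the overall workflow corresponds roughly to llama.cpp's bundled llama-imatrix and llama-quantize tools. The sketch below uses placeholder file names rather than the exact invocation used for these quants:
./llama-imatrix -m 72B-Qwen2.5-Kunou-v1-f16.gguf -f calibration-dataset.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat 72B-Qwen2.5-Kunou-v1-f16.gguf 72B-Qwen2.5-Kunou-v1-Q4_K_M.gguf Q4_K_M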
Running the Model
You can run the quantized models in LM Studio.
Prompt Format
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
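For example, with a concrete system prompt and user message, the template expands to:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Summarize the plot of Hamlet in two sentences.<|im_end|>
<|im_start|>assistant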
✨ Features
Multiple Quantization Types
We offer a wide range of quantization types, each with different file sizes and quality levels. You can choose the one that best suits your needs based on your available resources and performance requirements.
Embed/Output Weights
Some of the quants (e.g., Q3_K_XL, Q4_K_L, Q2_K_L) use the standard quantization method with the embedding and output weights quantized to Q8_0 instead of their usual default type.
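As a rough sketch of what this override looks like with llama.cpp's llama-quantize (the exact invocation used for these files may differ), the per-tensor types can be set like this:
./llama-quantize --imatrix imatrix.dat --token-embedding-type Q8_0 --output-tensor-type Q8_0 72B-Qwen2.5-Kunou-v1-f16.gguf 72B-Qwen2.5-Kunou-v1-Q4_K_L.gguf Q4_K_M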
Online Repacking for ARM
Thanks to recent efforts, you can use Q4_0 or IQ4_NL for better performance on ARM devices: their weights are repacked online at load time, provided your llama.cpp has been compiled with the relevant support.
📦 Installation
Downloading using huggingface-cli
First, make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
To download a specific file:
huggingface-cli download bartowski/72B-Qwen2.5-Kunou-v1-GGUF --include "72B-Qwen2.5-Kunou-v1-Q4_K_M.gguf" --local-dir ./
If the model is bigger than 50GB, it has been split into multiple files. To download them all to a local folder, run:
huggingface-cli download bartowski/72B-Qwen2.5-Kunou-v1-GGUF --include "72B-Qwen2.5-Kunou-v1-Q8_0/*" --local-dir ./
You can either specify a new local directory or download them all in the current directory (./).
💻 Usage Examples
Downloading a Single File
huggingface-cli download bartowski/72B-Qwen2.5-Kunou-v1-GGUF --include "72B-Qwen2.5-Kunou-v1-Q4_K_M.gguf" --local-dir ./
Downloading Split Files
huggingface-cli download bartowski/72B-Qwen2.5-Kunou-v1-GGUF --include "72B-Qwen2.5-Kunou-v1-Q8_0/*" --local-dir ./
📚 Documentation
File Download Options
You can download individual files or split files depending on the model size. Refer to the "Downloading using huggingface-cli" section for detailed instructions.
Q4_0_X_X Information
New: if your llama.cpp has been compiled with support for your ARM device, you can simply use Q4_0 and the weights will be repacked online at load time, giving speedups comparable to the dedicated Q4_0_X_X files. Similarly, IQ4_NL can provide slightly better performance on ARM thanks to recent improvements.
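As a quick check (a sketch, assuming a local llama.cpp build with the relevant ARM features enabled), you can benchmark the Q4_0 file with llama-bench; the repacking happens automatically at load time, so no extra flag is needed:
./llama-bench -m ./72B-Qwen2.5-Kunou-v1-Q4_0.gguf -t 8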
Choosing the Right File
A great write-up with charts comparing the various quant types is provided by Artefact2 here. First, work out how much RAM and/or VRAM you have available to determine which model sizes you can run. If you want the model to run as fast as possible, choose a quant with a file size 1-2GB smaller than your GPU's total VRAM so the whole model fits on the GPU.
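For example, with 32GB of VRAM (or combined RAM and VRAM), aim for a file around 30-31GB, such as Q2_K (29.81GB) or IQ2_M (29.34GB) from the table below; with 48GB, Q4_K_S (43.89GB) fits comfortably while Q4_K_M (47.42GB) is borderline.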
🔧 Technical Details
Quantization Types and Performance
The following table provides an overview of the available quantization types, their file sizes, split status, and descriptions:
Filename | Quant type | File Size | Split | Description |
---|---|---|---|---|
72B-Qwen2.5-Kunou-v1-Q8_0.gguf | Q8_0 | 77.26GB | true | Extremely high quality, generally unneeded but max available quant. |
72B-Qwen2.5-Kunou-v1-Q6_K.gguf | Q6_K | 64.35GB | true | Very high quality, near perfect, recommended. |
72B-Qwen2.5-Kunou-v1-Q5_K_M.gguf | Q5_K_M | 54.45GB | true | High quality, recommended. |
72B-Qwen2.5-Kunou-v1-Q5_K_S.gguf | Q5_K_S | 51.38GB | true | High quality, recommended. |
72B-Qwen2.5-Kunou-v1-Q4_K_L.gguf | Q4_K_L | 48.34GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
72B-Qwen2.5-Kunou-v1-Q4_K_M.gguf | Q4_K_M | 47.42GB | false | Good quality, default size for most use cases, recommended. |
72B-Qwen2.5-Kunou-v1-Q4_K_S.gguf | Q4_K_S | 43.89GB | false | Slightly lower quality with more space savings, recommended. |
72B-Qwen2.5-Kunou-v1-Q4_0.gguf | Q4_0 | 41.38GB | false | Legacy format, offers online repacking for ARM CPU inference. |
72B-Qwen2.5-Kunou-v1-IQ4_NL.gguf | IQ4_NL | 41.32GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
72B-Qwen2.5-Kunou-v1-Q4_0_8_8.gguf | Q4_0_8_8 | 41.23GB | false | Optimized for ARM and AVX inference. Requires 'sve' support for ARM (see details below). Don't use on Mac. |
72B-Qwen2.5-Kunou-v1-Q4_0_4_8.gguf | Q4_0_4_8 | 41.23GB | false | Optimized for ARM inference. Requires 'i8mm' support (see details below). Don't use on Mac. |
72B-Qwen2.5-Kunou-v1-Q4_0_4_4.gguf | Q4_0_4_4 | 41.23GB | false | Optimized for ARM inference. Should work well on all ARM chips, not for use with GPUs. Don't use on Mac. |
72B-Qwen2.5-Kunou-v1-Q3_K_XL.gguf | Q3_K_XL | 40.60GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
72B-Qwen2.5-Kunou-v1-IQ4_XS.gguf | IQ4_XS | 39.71GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
72B-Qwen2.5-Kunou-v1-Q3_K_L.gguf | Q3_K_L | 39.51GB | false | Lower quality but usable, good for low RAM availability. |
72B-Qwen2.5-Kunou-v1-Q3_K_M.gguf | Q3_K_M | 37.70GB | false | Low quality. |
72B-Qwen2.5-Kunou-v1-IQ3_M.gguf | IQ3_M | 35.50GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
72B-Qwen2.5-Kunou-v1-Q3_K_S.gguf | Q3_K_S | 34.49GB | false | Low quality, not recommended. |
72B-Qwen2.5-Kunou-v1-IQ3_XXS.gguf | IQ3_XXS | 31.85GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
72B-Qwen2.5-Kunou-v1-Q2_K_L.gguf | Q2_K_L | 31.03GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
72B-Qwen2.5-Kunou-v1-Q2_K.gguf | Q2_K | 29.81GB | false | Very low quality but surprisingly usable. |
72B-Qwen2.5-Kunou-v1-IQ2_M.gguf | IQ2_M | 29.34GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
72B-Qwen2.5-Kunou-v1-IQ2_S.gguf | IQ2_S | 27.94GB | false | Low quality, uses SOTA techniques to be usable. |
72B-Qwen2.5-Kunou-v1-IQ2_XS.gguf | IQ2_XS | 27.06GB | false | Low quality, uses SOTA techniques to be usable. |
72B-Qwen2.5-Kunou-v1-IQ2_XXS.gguf | IQ2_XXS | 25.49GB | false | Very low quality, uses SOTA techniques to be usable. |
72B-Qwen2.5-Kunou-v1-IQ1_M.gguf | IQ1_M | 23.74GB | false | Extremely low quality, not recommended. |
Q4_0_X_X Performance on ARM and AVX
The Q4_0_X_X quantizations are optimized for ARM and certain AVX2/AVX512 CPUs. They are not suitable for Metal (Apple) or GPU (NVIDIA/AMD/Intel) offloading. Check out the original pull request for Q4_0_4_4 speed comparisons.
Benchmarks on an AVX2 System (EPYC7702)
model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
---|---|---|---|---|---|---|---|
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4_0_8_8 offers a significant improvement in prompt processing and a small improvement in text generation.
📄 License
This project uses the Qwen license.

