Llamacpp imatrix Quantizations of Qwen2.5-Coder-14B-Instruct-abliterated
This project provides llama.cpp imatrix quantizations of the Qwen2.5-Coder-14B-Instruct-abliterated model, enabling efficient use in various environments.
Quick Start
- Quantization Tool: Use llama.cpp release b4058 for quantization.
- Original Model: You can access the original model at https://huggingface.co/huihui-ai/Qwen2.5-Coder-14B-Instruct-abliterated.
- Quantization Dataset: All quants are made using the imatrix option with the dataset from here.
- Running Environment: Run these quantized models in LM Studio.
Features
- Multiple Quantization Types: Offer a variety of quantized models, including f16, Q8_0, Q6_K_L, etc., to meet different performance and quality requirements.
- Specific Prompt Format: Define a specific prompt format for interaction with the model.
- Download Options: Provide different download methods and options for different file sizes and scenarios.
Installation
Install huggingface-cli
First, ensure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
Download a Specific File
You can target a specific file you want:
huggingface-cli download bartowski/Qwen2.5-Coder-14B-Instruct-abliterated-GGUF --include "Qwen2.5-Coder-14B-Instruct-abliterated-Q4_K_M.gguf" --local-dir ./
Download Split Files
If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:
huggingface-cli download bartowski/Qwen2.5-Coder-14B-Instruct-abliterated-GGUF --include "Qwen2.5-Coder-14B-Instruct-abliterated-Q8_0/*" --local-dir ./
You can either specify a new local directory or download them all in place.
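The same downloads can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package installed above. The helper names `include_pattern` and `download_quant` are hypothetical, and the folder layout for split quants is inferred from the `--include` patterns in the commands above.

```python
REPO_ID = "bartowski/Qwen2.5-Coder-14B-Instruct-abliterated-GGUF"
MODEL_NAME = "Qwen2.5-Coder-14B-Instruct-abliterated"


def include_pattern(quant: str, split: bool = False) -> str:
    """Return the pattern matching one quant, mirroring the CLI --include values."""
    if split:
        # Assumption: split quants live in a folder named after the quant.
        return f"{MODEL_NAME}-{quant}/*"
    return f"{MODEL_NAME}-{quant}.gguf"


def download_quant(quant: str, split: bool = False, local_dir: str = "./") -> str:
    """Download the file(s) for one quant and return the local folder path."""
    # Imported lazily so include_pattern works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[include_pattern(quant, split)],
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(include_pattern("Q4_K_M"))
    # Uncomment to actually download (roughly 9GB):
    # download_quant("Q4_K_M")
```

Restricting `allow_patterns` to a single pattern keeps the download to the one quant you want rather than the whole repository.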
Usage Examples
Prompt Format
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
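If you are building prompts by hand rather than relying on a chat template, the format above can be filled in with a small helper. This is a sketch only: `build_prompt` is a hypothetical name, and the trailing newline after the assistant tag is an assumption about where generation should begin.

```python
def build_prompt(system_prompt: str, prompt: str) -> str:
    """Fill the prompt template, ending with the open assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        # Assumption: the model starts generating on the line after this tag.
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    print(build_prompt("You are a helpful coding assistant.", "Reverse a string in Python."))
```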
Documentation
Download Table
Filename | Quant type | File Size | Split | Description |
---|---|---|---|---|
Qwen2.5-Coder-14B-Instruct-abliterated-f16.gguf | f16 | 29.55GB | false | Full F16 weights. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q8_0.gguf | Q8_0 | 15.70GB | false | Extremely high quality, generally unneeded but max available quant. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q6_K_L.gguf | Q6_K_L | 12.50GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q6_K.gguf | Q6_K | 12.12GB | false | Very high quality, near perfect, recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q5_K_L.gguf | Q5_K_L | 10.99GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q5_K_M.gguf | Q5_K_M | 10.51GB | false | High quality, recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q5_K_S.gguf | Q5_K_S | 10.27GB | false | High quality, recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_K_L.gguf | Q4_K_L | 9.57GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_K_M.gguf | Q4_K_M | 8.99GB | false | Good quality, default size for most use cases, recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q3_K_XL.gguf | Q3_K_XL | 8.61GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_K_S.gguf | Q4_K_S | 8.57GB | false | Slightly lower quality with more space savings, recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_0.gguf | Q4_0 | 8.54GB | false | Legacy format, generally not worth using over similarly sized formats. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_0_8_8.gguf | Q4_0_8_8 | 8.52GB | false | Optimized for ARM inference. Requires 'sve' support (see link below). Don't use on Mac or Windows. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_0_4_8.gguf | Q4_0_4_8 | 8.52GB | false | Optimized for ARM inference. Requires 'i8mm' support (see link below). Don't use on Mac or Windows. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_0_4_4.gguf | Q4_0_4_4 | 8.52GB | false | Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. Don't use on Mac or Windows. |
Qwen2.5-Coder-14B-Instruct-abliterated-IQ4_XS.gguf | IQ4_XS | 8.12GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q3_K_L.gguf | Q3_K_L | 7.92GB | false | Lower quality but usable, good for low RAM availability. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q3_K_M.gguf | Q3_K_M | 7.34GB | false | Low quality. |
Qwen2.5-Coder-14B-Instruct-abliterated-IQ3_M.gguf | IQ3_M | 6.92GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q3_K_S.gguf | Q3_K_S | 6.66GB | false | Low quality, not recommended. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q2_K_L.gguf | Q2_K_L | 6.53GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
Qwen2.5-Coder-14B-Instruct-abliterated-IQ3_XS.gguf | IQ3_XS | 6.38GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
Qwen2.5-Coder-14B-Instruct-abliterated-Q2_K.gguf | Q2_K | 5.77GB | false | Very low quality but surprisingly usable. |
Qwen2.5-Coder-14B-Instruct-abliterated-IQ2_M.gguf | IQ2_M | 5.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
Qwen2.5-Coder-14B-Instruct-abliterated-IQ2_S.gguf | IQ2_S | 5.00GB | false | Low quality, uses SOTA techniques to be usable. |
Qwen2.5-Coder-14B-Instruct-abliterated-IQ2_XS.gguf | IQ2_XS | 4.70GB | false | Low quality, uses SOTA techniques to be usable. |
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L etc) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models, please comment with your findings. The author would like feedback that these are actually used and useful so as not to keep uploading quants no one is using.
Q4_0_X_X
These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).
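On Linux, one way to see which of these variants your ARM chip can use is to read the `Features` line of `/proc/cpuinfo`. The sketch below follows the requirements listed in the download table ('sve' for Q4_0_8_8, 'i8mm' for Q4_0_4_8, Q4_0_4_4 otherwise). The function names are hypothetical, and the preference order is an assumption, not a benchmark result.

```python
def read_cpu_features(path: str = "/proc/cpuinfo") -> set[str]:
    """Collect the flags from the 'Features' lines of a Linux cpuinfo file."""
    features: set[str] = set()
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                key, _, value = line.partition(":")
                if key.strip().lower() == "features":
                    features.update(value.split())
    except OSError:
        pass  # Not Linux, or the file is unreadable: report no features.
    return features


def pick_arm_quant(features: set[str]) -> str:
    """Map CPU feature flags to a Q4_0_X_X variant, per the download table."""
    if "sve" in features:
        return "Q4_0_8_8"
    if "i8mm" in features:
        return "Q4_0_4_8"
    # The table describes Q4_0_4_4 as the safe choice for any ARM chip.
    return "Q4_0_4_4"


if __name__ == "__main__":
    print(pick_arm_quant(read_cpu_features()))
```

On a non-ARM machine the feature set comes back empty and the function falls through to Q4_0_4_4, so only trust the result when you know you are on an ARM chip.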
Model Selection
A great write-up with charts comparing the performance of the various quants is provided by Artefact2 here.
- Determine Model Size: First, figure out how much RAM and/or VRAM you have. If you want the model to run as fast as possible, aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM, so that the whole model fits on the GPU. If you want the maximum quality, add your system RAM and your GPU's VRAM together and choose a quant 1-2GB smaller than that total.
- Choose between 'I-quant' and 'K-quant': If you don't want to think too much, grab one of the K-quants (format 'QX_K_X', like Q5_K_M). If you want more details, check out the llama.cpp feature matrix. Generally, if you're aiming for below Q4 and running cuBLAS (Nvidia) or rocBLAS (AMD), look towards the I-quants (format IQX_X, like IQ3_M). These are newer and offer better performance for their size. Note that I-quants can be used on CPU and Apple Metal but are slower than K-quants there, and they are not compatible with Vulkan.
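The sizing rule above can be sketched as a small lookup. This is illustrative only: `pick_quant` is a hypothetical helper, the list holds just a subset of the download table, and the 2GB default headroom is the upper end of the 1-2GB guideline. It does not encode the I-quant versus K-quant backend considerations.

```python
# (quant type, file size in GB) for a subset of the download table, largest first.
QUANT_SIZES_GB = [
    ("Q8_0", 15.70),
    ("Q6_K", 12.12),
    ("Q5_K_M", 10.51),
    ("Q4_K_M", 8.99),
    ("IQ4_XS", 8.12),
    ("Q3_K_M", 7.34),
    ("IQ3_M", 6.92),
    ("Q2_K", 5.77),
    ("IQ2_M", 5.36),
]


def pick_quant(vram_gb, ram_gb=0.0, headroom_gb=2.0, max_quality=False):
    """Return the largest listed quant that fits the memory budget, or None."""
    # Speed: fit entirely in VRAM. Maximum quality: count system RAM as well.
    budget = vram_gb + (ram_gb if max_quality else 0.0) - headroom_gb
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb <= budget:
            return name
    return None


if __name__ == "__main__":
    print(pick_quant(12.0))                                   # speed-first on a 12GB GPU
    print(pick_quant(8.0, ram_gb=16.0, max_quality=True))     # quality-first
```

Because the list is ordered from largest to smallest, the first entry that fits is also the highest-quality one within the budget.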
Technical Details
The project uses the llama.cpp tool for quantization. The specific release version is b4058. The quantization process is based on the imatrix option and a specific dataset. Different quantization types have different impacts on model quality, performance, and file size. For example, some quantizations use Q8_0 for embed and output weights, which may affect the model's quality.
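To get a rough feel for how file size relates to precision, you can compare each file against the f16 file from the download table, which stores 16 bits per weight. The sketch below is a back-of-the-envelope estimate only: `approx_bits_per_weight` is a hypothetical helper, and it ignores file metadata and the fact that a quantized file mixes several tensor types.

```python
F16_SIZE_GB = 29.55  # size of the f16 file in the download table


def approx_bits_per_weight(file_size_gb: float) -> float:
    """Estimate the average bits per weight by scaling against the f16 file."""
    return 16.0 * file_size_gb / F16_SIZE_GB


if __name__ == "__main__":
    for name, size_gb in [("Q8_0", 15.70), ("Q4_K_M", 8.99), ("IQ2_XS", 4.70)]:
        print(f"{name}: ~{approx_bits_per_weight(size_gb):.1f} bits per weight")
```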
License
This project is licensed under the Apache-2.0 license.

