Qwen2.5-Coder-14B-Instruct-abliterated-GGUF Open-source Coding Model with Multiple Quantization Types Compatible with Different Hardware

Qwen2.5 Coder 14B Instruct Abliterated GGUF

Developed by bartowski

A quantized version of Qwen2.5-Coder-14B-Instruct-abliterated, supporting multiple quantization types and suitable for different hardware conditions.

Category: Large Language Model. Open Source License: Apache-2.0. Tags: #Code generation optimization #Multi-quantization adaptation #Low-resource deployment

Downloads 1,240

Release Time: 11/13/2024

Model Overview

This is a set of quantized builds of the Qwen2.5-Coder-14B-Instruct-abliterated model. Each quantization method trades file size and speed against output quality, so the model can run efficiently on a range of hardware.

Model Features

Multiple quantization options

Provides multiple quantization types, from f16 down to IQ2_XS, to meet different hardware requirements.

Optimize embedding/output weights

Some quantized models use Q8_0 to quantize embedding and output weights, which may improve the model quality.

ARM chip optimization

The Q4_0_X_X quantization type is optimized for ARM chips, significantly improving the running speed.

Model Capabilities

Code generation

Code understanding

Text generation

Use Cases

Software development

Code completion

Provide code completion suggestions in the development environment.

Improve development efficiency

Code explanation

Explain the function and logic of complex code snippets.

Help understand existing code

🚀 Llamacpp imatrix Quantizations of Qwen2.5-Coder-14B-Instruct-abliterated

This project provides llama.cpp imatrix quantizations of the Qwen2.5-Coder-14B-Instruct-abliterated model, enabling efficient use in various environments.

🚀 Quick Start

Quantization Tool: All quants were made with llama.cpp release b4058.
Original Model: You can access the original model at https://huggingface.co/huihui-ai/Qwen2.5-Coder-14B-Instruct-abliterated.
Quantization Dataset: All quants are made using the imatrix option with the dataset from here.
Running Environment: Run these quantized models in LM Studio.

✨ Features

Multiple Quantization Types: Offer a variety of quantized models, including f16, Q8_0, Q6_K_L, etc., to meet different performance and quality requirements.
Specific Prompt Format: Define a specific prompt format for interaction with the model.
Download Options: Provide different download methods and options for different file sizes and scenarios.

📦 Installation

Install huggingface-cli

First, ensure you have huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Download a Specific File

You can target a specific file you want:

huggingface-cli download bartowski/Qwen2.5-Coder-14B-Instruct-abliterated-GGUF --include "Qwen2.5-Coder-14B-Instruct-abliterated-Q4_K_M.gguf" --local-dir ./
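The same single-file download can also be sketched from Python. This is a minimal sketch, assuming the `huggingface_hub` package installed above and its `hf_hub_download` function; the helper names `gguf_filename` and `download_quant` are hypothetical and only illustrate the file-naming pattern used in the download table.

```python
REPO_ID = "bartowski/Qwen2.5-Coder-14B-Instruct-abliterated-GGUF"
MODEL_NAME = "Qwen2.5-Coder-14B-Instruct-abliterated"


def gguf_filename(quant_type: str) -> str:
    """Build a file name such as '<model>-Q4_K_M.gguf' (hypothetical helper)."""
    return f"{MODEL_NAME}-{quant_type}.gguf"


def download_quant(quant_type: str, local_dir: str = "./") -> str:
    """Download one GGUF file and return its local path (needs network access)."""
    # Imported lazily so gguf_filename() works even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant_type),
        local_dir=local_dir,
    )
```

Calling `download_quant("Q4_K_M")` should fetch the same file as the command above, provided the quant type matches a row in the download table.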

Download Split Files

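Models bigger than 50GB are split into multiple files. None of the files in this repository is split (see the Split column in the download table), so the command below only illustrates the pattern. To download all parts of a split quant to a local folder, run: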

huggingface-cli download bartowski/Qwen2.5-Coder-14B-Instruct-abliterated-GGUF --include "Qwen2.5-Coder-14B-Instruct-abliterated-Q8_0/*" --local-dir ./

You can either pass a new directory to --local-dir or download everything in place (./).

💻 Usage Examples

Prompt Format

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
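
When calling the model through a raw completion interface, the template above has to be filled in by hand. The following is a minimal sketch of that string assembly; the function name `build_prompt` is hypothetical, and the trailing newline after `assistant` is an assumption about where generation should start.

```python
def build_prompt(system_prompt: str, prompt: str) -> str:
    """Fill the prompt template shown above with a system and a user message."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"  # assumption: generation continues after this line
    )


text = build_prompt("You are a helpful coding assistant.", "Reverse a string in Python.")
```

Chat front ends such as LM Studio normally apply this template automatically, so manual assembly is only needed for raw text completion.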

📚 Documentation

Download Table

Filename	Quant type	File Size	Split	Description
Qwen2.5-Coder-14B-Instruct-abliterated-f16.gguf	f16	29.55GB	false	Full F16 weights.
Qwen2.5-Coder-14B-Instruct-abliterated-Q8_0.gguf	Q8_0	15.70GB	false	Extremely high quality, generally unneeded but max available quant.
Qwen2.5-Coder-14B-Instruct-abliterated-Q6_K_L.gguf	Q6_K_L	12.50GB	false	Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q6_K.gguf	Q6_K	12.12GB	false	Very high quality, near perfect, recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q5_K_L.gguf	Q5_K_L	10.99GB	false	Uses Q8_0 for embed and output weights. High quality, recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q5_K_M.gguf	Q5_K_M	10.51GB	false	High quality, recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q5_K_S.gguf	Q5_K_S	10.27GB	false	High quality, recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_K_L.gguf	Q4_K_L	9.57GB	false	Uses Q8_0 for embed and output weights. Good quality, recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_K_M.gguf	Q4_K_M	8.99GB	false	Good quality, default size for most use cases, recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q3_K_XL.gguf	Q3_K_XL	8.61GB	false	Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability.
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_K_S.gguf	Q4_K_S	8.57GB	false	Slightly lower quality with more space savings, recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_0.gguf	Q4_0	8.54GB	false	Legacy format, generally not worth using over similarly sized formats.
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_0_8_8.gguf	Q4_0_8_8	8.52GB	false	Optimized for ARM inference. Requires 'sve' support (see link below). Don't use on Mac or Windows.
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_0_4_8.gguf	Q4_0_4_8	8.52GB	false	Optimized for ARM inference. Requires 'i8mm' support (see link below). Don't use on Mac or Windows.
Qwen2.5-Coder-14B-Instruct-abliterated-Q4_0_4_4.gguf	Q4_0_4_4	8.52GB	false	Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. Don't use on Mac or Windows.
Qwen2.5-Coder-14B-Instruct-abliterated-IQ4_XS.gguf	IQ4_XS	8.12GB	false	Decent quality, smaller than Q4_K_S with similar performance, recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q3_K_L.gguf	Q3_K_L	7.92GB	false	Lower quality but usable, good for low RAM availability.
Qwen2.5-Coder-14B-Instruct-abliterated-Q3_K_M.gguf	Q3_K_M	7.34GB	false	Low quality.
Qwen2.5-Coder-14B-Instruct-abliterated-IQ3_M.gguf	IQ3_M	6.92GB	false	Medium-low quality, new method with decent performance comparable to Q3_K_M.
Qwen2.5-Coder-14B-Instruct-abliterated-Q3_K_S.gguf	Q3_K_S	6.66GB	false	Low quality, not recommended.
Qwen2.5-Coder-14B-Instruct-abliterated-Q2_K_L.gguf	Q2_K_L	6.53GB	false	Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable.
Qwen2.5-Coder-14B-Instruct-abliterated-IQ3_XS.gguf	IQ3_XS	6.38GB	false	Lower quality, new method with decent performance, slightly better than Q3_K_S.
Qwen2.5-Coder-14B-Instruct-abliterated-Q2_K.gguf	Q2_K	5.77GB	false	Very low quality but surprisingly usable.
Qwen2.5-Coder-14B-Instruct-abliterated-IQ2_M.gguf	IQ2_M	5.36GB	false	Relatively low quality, uses SOTA techniques to be surprisingly usable.
Qwen2.5-Coder-14B-Instruct-abliterated-IQ2_S.gguf	IQ2_S	5.00GB	false	Low quality, uses SOTA techniques to be usable.
Qwen2.5-Coder-14B-Instruct-abliterated-IQ2_XS.gguf	IQ2_XS	4.70GB	false	Low quality, uses SOTA techniques to be usable.

Embed/Output Weights

Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method, but with the embedding and output weights quantized to Q8_0 instead of their usual default. Some say that this improves the quality, others don't notice any difference. If you use these models, please comment with your findings. The author would like feedback that these are actually used and useful, so as not to keep uploading quants no one is using.
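
The size cost of the Q8_0 embed/output variants can be read off the download table. The sketch below computes it, assuming each variant is paired with the base quant listed in `PAIRS`; that pairing is an assumption and is not stated in the table itself.

```python
# File sizes in GB, copied from the download table.
SIZES_GB = {
    "Q6_K_L": 12.50, "Q6_K": 12.12,
    "Q5_K_L": 10.99, "Q5_K_M": 10.51,
    "Q4_K_L": 9.57, "Q4_K_M": 8.99,
    "Q3_K_XL": 8.61, "Q3_K_L": 7.92,
    "Q2_K_L": 6.53, "Q2_K": 5.77,
}

# ASSUMPTION: which standard quant each Q8_0-embed variant is derived from.
PAIRS = [
    ("Q6_K_L", "Q6_K"),
    ("Q5_K_L", "Q5_K_M"),
    ("Q4_K_L", "Q4_K_M"),
    ("Q3_K_XL", "Q3_K_L"),
    ("Q2_K_L", "Q2_K"),
]


def q8_embed_overhead_gb(variant: str, base: str) -> float:
    """Extra file size in GB of the Q8_0 embed/output variant over its base quant."""
    return round(SIZES_GB[variant] - SIZES_GB[base], 2)


for variant, base in PAIRS:
    print(f"{variant} vs {base}: +{q8_embed_overhead_gb(variant, base):.2f} GB")
```

Under that pairing, each variant costs well under 1GB of extra space, which is the price to weigh against any quality gain you observe.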

Q4_0_X_X

These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).
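
The choice among the three ARM quants can be sketched as a lookup on CPU feature flags, using the requirements in the download table. This sketch assumes a Linux system where `/proc/cpuinfo` exposes a `Features` line. The function names are hypothetical, and the preference order when a chip supports both `sve` and `i8mm` is an assumption, not something the source states.

```python
from typing import Set


def parse_cpu_features(cpuinfo_text: str) -> Set[str]:
    """Collect the flags from every 'Features' line of /proc/cpuinfo-style text."""
    features: Set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            features.update(value.split())
    return features


def pick_arm_quant(features: Set[str]) -> str:
    """Map ARM feature flags to a Q4_0_X_X quant type from the download table."""
    if "sve" in features:   # Q4_0_8_8 requires 'sve'
        return "Q4_0_8_8"   # ASSUMPTION: preferred over Q4_0_4_8 when both apply
    if "i8mm" in features:  # Q4_0_4_8 requires 'i8mm'
        return "Q4_0_4_8"
    return "Q4_0_4_4"       # should work on all ARM chips
```

On an ARM Linux machine, `pick_arm_quant(parse_cpu_features(open("/proc/cpuinfo").read()))` would give a starting point; the speed comparisons mentioned above remain the better guide.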

Model Selection

A great write-up with charts showing various performances is provided by Artefact2 here.

Determine Model Size: First, figure out how much RAM and/or VRAM you have. If you want the model to run as fast as possible, fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the maximum quality, add your system RAM and your GPU's VRAM together and choose a quant 1-2GB smaller than that total.
Choose between I-quant and K-quant: If you don't want to think too much, grab one of the K-quants (format 'QX_K_X', like Q5_K_M). If you want more details, check out the llama.cpp feature matrix. Generally, if you're aiming for below Q4 and running cuBLAS (Nvidia) or rocBLAS (AMD), look towards the I-quants (format IQX_X, like IQ3_M). These are newer and offer better performance for their size. Note that I-quants can be used on CPU and Apple Metal but are slower there than the equivalent K-quants, and they are not compatible with Vulkan.

🔧 Technical Details

The project uses the llama.cpp tool for quantization. The specific release version is b4058. The quantization process is based on the imatrix option and a specific dataset. Different quantization types have different impacts on model quality, performance, and file size. For example, some quantizations use Q8_0 for embed and output weights, which may affect the model's quality.
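
The relationship between quantization type and file size can be made concrete by estimating the average bits stored per weight. The sketch below assumes that the f16 file stores 16 bits per weight and that metadata overhead is negligible, so a quant's average is roughly 16 times its size relative to the f16 file. The function name is hypothetical and the result is an approximation, not an official figure.

```python
F16_SIZE_GB = 29.55  # size of the f16 file, from the download table


def approx_bits_per_weight(file_size_gb: float) -> float:
    """Estimate average bits per weight by scaling against the 16-bit f16 file."""
    return 16.0 * file_size_gb / F16_SIZE_GB


for name, size_gb in [("Q8_0", 15.70), ("Q6_K", 12.12), ("Q4_K_M", 8.99), ("IQ2_XS", 4.70)]:
    print(f"{name}: ~{approx_bits_per_weight(size_gb):.2f} bits per weight")
```

By this estimate Q8_0 comes out at about 8.5 bits per weight and Q4_K_M a little under 5, which is why the lower quants fit on much smaller GPUs.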

📄 License

This project is licensed under the Apache-2.0 license.
