Llamacpp imatrix Quantizations of gemma-2-9b-it-abliterated
This project provides llama.cpp imatrix quantizations of the gemma-2-9b-it-abliterated model. It offers various quantization types to meet different performance and quality requirements, and provides guidance on downloading and using these quantized models.
Quick Start
Prerequisites
- Ensure you have huggingface-cli installed. You can install it using the following command:
pip install -U "huggingface_hub[cli]"
Downloading a Specific File
To download a specific quantized model file, use the following command. For example, to download gemma-2-9b-it-abliterated-Q4_K_M.gguf:
huggingface-cli download bartowski/gemma-2-9b-it-abliterated-GGUF --include "gemma-2-9b-it-abliterated-Q4_K_M.gguf" --local-dir ./
Downloading Split Files
If the model is split into multiple files (models larger than 50GB), you can download all the files to a local folder using the following command:
huggingface-cli download bartowski/gemma-2-9b-it-abliterated-GGUF --include "gemma-2-9b-it-abliterated-Q8_0/*" --local-dir ./
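Recent llama.cpp builds load the remaining shards automatically when you point them at the first one, so after downloading you only need to reference the file whose name ends in -00001-of-0000N.gguf (the exact shard names depend on how the repository was split).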
Running the Model
You can run these quantized models in LM Studio.
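You can also run the GGUF files directly with a llama.cpp build. A minimal sketch, assuming the Q4_K_M file downloaded above and a recent release where the chat binary is named llama-cli (binary names and flags vary between releases): -m selects the model file, -cnv starts an interactive chat that applies the model's chat template, -ngl 99 offloads all layers to the GPU if one is available, and -c 4096 sets the context length.
./llama-cli -m ./gemma-2-9b-it-abliterated-Q4_K_M.gguf -cnv -ngl 99 -c 4096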
Features
- Multiple Quantization Types: Offers a wide range of quantization types, including f32, Q8_0, Q6_K_L, Q5_K_L, etc., to balance between model quality and file size.
- Optimized for Different Hardware: Some quantization types are optimized for ARM chips, providing significant speed improvements.
- Embed/Output Weights Option: Some quantizations use Q8_0 for embed and output weights, which may improve model quality.
Installation
Installing huggingface-cli
pip install -U "huggingface_hub[cli]"
Downloading Specific Files
huggingface-cli download bartowski/gemma-2-9b-it-abliterated-GGUF --include "gemma-2-9b-it-abliterated-Q4_K_M.gguf" --local-dir ./
Downloading Split Files
huggingface-cli download bartowski/gemma-2-9b-it-abliterated-GGUF --include "gemma-2-9b-it-abliterated-Q8_0/*" --local-dir ./
Usage Examples
Prompt Format
<bos><start_of_turn>system
{system_prompt}<end_of_turn>
<start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
<end_of_turn>
<start_of_turn>model
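If you drive llama.cpp manually with -p instead of its chat mode, you have to build a prompt string in this layout yourself. A minimal sketch, assuming a llama-cli binary; -e turns the \n escapes into real newlines, -n 256 caps the response length, and the system and user texts are placeholder examples:
./llama-cli -m ./gemma-2-9b-it-abliterated-Q4_K_M.gguf -e -n 256 -p "<bos><start_of_turn>system\nYou are a helpful assistant.<end_of_turn>\n<start_of_turn>user\nWrite a haiku about autumn.<end_of_turn>\n<start_of_turn>model\n"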
Documentation
Model Information
Property | Details |
---|---|
Base Model | IlyaGusev/gemma-2-9b-it-abliterated |
Language | en |
License | gemma |
Pipeline Tag | text-generation |
Quantized By | bartowski |
Downloadable Files
Filename | Quant type | File Size | Split | Description |
---|---|---|---|---|
gemma-2-9b-it-abliterated-f32.gguf | f32 | 36.97GB | false | Full F32 weights. |
gemma-2-9b-it-abliterated-Q8_0.gguf | Q8_0 | 9.83GB | false | Extremely high quality, generally unneeded but max available quant. |
gemma-2-9b-it-abliterated-Q6_K_L.gguf | Q6_K_L | 7.81GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
gemma-2-9b-it-abliterated-Q6_K.gguf | Q6_K | 7.59GB | false | Very high quality, near perfect, recommended. |
gemma-2-9b-it-abliterated-Q5_K_L.gguf | Q5_K_L | 6.87GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
gemma-2-9b-it-abliterated-Q5_K_M.gguf | Q5_K_M | 6.65GB | false | High quality, recommended. |
gemma-2-9b-it-abliterated-Q5_K_S.gguf | Q5_K_S | 6.48GB | false | High quality, recommended. |
gemma-2-9b-it-abliterated-Q4_K_L.gguf | Q4_K_L | 5.98GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
gemma-2-9b-it-abliterated-Q4_K_M.gguf | Q4_K_M | 5.76GB | false | Good quality, default size for most use cases, recommended. |
gemma-2-9b-it-abliterated-Q4_K_S.gguf | Q4_K_S | 5.48GB | false | Slightly lower quality with more space savings, recommended. |
gemma-2-9b-it-abliterated-Q4_0.gguf | Q4_0 | 5.46GB | false | Legacy format, offers online repacking for ARM and AVX inference. |
gemma-2-9b-it-abliterated-Q4_0_8_8.gguf | Q4_0_8_8 | 5.44GB | false | Optimized for ARM inference. Requires 'sve' support (see link below). |
gemma-2-9b-it-abliterated-Q4_0_4_8.gguf | Q4_0_4_8 | 5.44GB | false | Optimized for ARM inference. Requires 'i8mm' support (see link below). |
gemma-2-9b-it-abliterated-Q4_0_4_4.gguf | Q4_0_4_4 | 5.44GB | false | Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. |
gemma-2-9b-it-abliterated-Q3_K_XL.gguf | Q3_K_XL | 5.35GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
gemma-2-9b-it-abliterated-IQ4_XS.gguf | IQ4_XS | 5.18GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
gemma-2-9b-it-abliterated-Q3_K_L.gguf | Q3_K_L | 5.13GB | false | Lower quality but usable, good for low RAM availability. |
gemma-2-9b-it-abliterated-Q3_K_M.gguf | Q3_K_M | 4.76GB | false | Low quality. |
gemma-2-9b-it-abliterated-IQ3_M.gguf | IQ3_M | 4.49GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
gemma-2-9b-it-abliterated-Q3_K_S.gguf | Q3_K_S | 4.34GB | false | Low quality, not recommended. |
gemma-2-9b-it-abliterated-IQ3_XS.gguf | IQ3_XS | 4.14GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
gemma-2-9b-it-abliterated-Q2_K_L.gguf | Q2_K_L | 4.03GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
gemma-2-9b-it-abliterated-Q2_K.gguf | Q2_K | 3.81GB | false | Very low quality but surprisingly usable. |
gemma-2-9b-it-abliterated-IQ2_M.gguf | IQ2_M | 3.43GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method with the embedding and output weights quantized to Q8_0 instead of the default. Some users claim this improves the quality, while others notice no difference. Please share your findings if you use these models.
Q4_0_X_X
These quantizations are optimized for ARM chips, not for Metal (Apple) offloading. They can provide a substantial speedup on ARM chips. Check the AArch64 SoC features to find the best option for your ARM chip.
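On Linux you can check which of these features your ARM CPU exposes before picking a file; a quick sketch that searches the Features line of /proc/cpuinfo for the relevant flags:
grep -o -E 'i8mm|sve|asimddp' /proc/cpuinfo | sort -u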
File Selection
A detailed analysis with performance charts is available here. Consider your available RAM and VRAM when choosing a model. For maximum speed, select a quantized model with a file size 1-2GB smaller than your GPU's VRAM. For maximum quality, combine your system RAM and GPU's VRAM and choose a model 1-2GB smaller than the total. You can also choose between 'K-quants' (e.g. Q5_K_M), which are the straightforward, widely supported option, and 'I-quants' (e.g. IQ3_M), which are newer and generally offer better quality for their size but can run slower on CPU.
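As a worked example using the sizes in the table above: with an 8GB GPU and speed as the priority, a file 1-2GB below 8GB such as Q5_K_M (6.65GB) or Q5_K_L (6.87GB) fits comfortably, while with 8GB of VRAM plus 16GB of system RAM and quality as the priority, even Q8_0 (9.83GB) sits well under the combined total.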
Technical Details
Quantization Method
The quantizations are performed using llama.cpp release b3878. All quants are made using the imatrix option with a dataset from here.
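For reference, imatrix quantization in llama.cpp is roughly a two-step process: compute an importance matrix from a calibration text file, then pass it to the quantizer. A minimal sketch, assuming the llama-imatrix and llama-quantize tools from a recent llama.cpp build; these commands and file names are illustrative, not the exact ones used for this repository:
./llama-imatrix -m ./gemma-2-9b-it-abliterated-f32.gguf -f calibration.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat ./gemma-2-9b-it-abliterated-f32.gguf ./gemma-2-9b-it-abliterated-Q4_K_M.gguf Q4_K_M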
ARM Optimization
The Q4_0_X_X quants are optimized for ARM chips and can provide significant speed improvements. Check the original pull request for speed comparisons.
License
The project uses the gemma license.
Credits
- Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
- Thank you ZeroWw for the inspiration to experiment with embed/output.
If you want to support the developer's work, visit the ko-fi page.

