# Llamacpp imatrix Quantizations of gemma-2-2b-it-abliterated

This project provides llama.cpp imatrix quantizations of the gemma-2-2b-it-abliterated model. It offers various quantization types to meet different performance and quality requirements, allowing users to run the model efficiently on different hardware configurations.
## Quick Start

### Prerequisites

Make sure you have the `huggingface-cli` installed. You can install it using the following command:

```
pip install -U "huggingface_hub[cli]"
```
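
To confirm the CLI is available, and optionally to authenticate with a Hugging Face access token (only needed for gated or private repositories, not for this one), you can run:

```
# Check that the CLI is installed and on your PATH
huggingface-cli --help

# Optional: log in with a Hugging Face access token (gated/private repos only)
huggingface-cli login
```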
### Downloading a Specific File

To download a specific file, use the following command. For example, to download the `gemma-2-2b-it-abliterated-Q4_K_M.gguf` file:

```
huggingface-cli download bartowski/gemma-2-2b-it-abliterated-GGUF --include "gemma-2-2b-it-abliterated-Q4_K_M.gguf" --local-dir ./
```
### Downloading Split Files

If the model is split into multiple files (models larger than 50GB), you can download all the files of a specific split to a local folder using the following command:

```
huggingface-cli download bartowski/gemma-2-2b-it-abliterated-GGUF --include "gemma-2-2b-it-abliterated-Q8_0/*" --local-dir ./
```

You can either specify a new local directory or download them all into the current directory (`./`).
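
This 2B model does not actually need split downloads, but for genuinely large repositories you can optionally speed things up with the `hf_transfer` backend that ships alongside `huggingface_hub`. A minimal sketch (the extra package and environment variable belong to `huggingface_hub`, not to this model card):

```
# Optional: accelerated downloads via the hf_transfer backend
pip install -U hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download bartowski/gemma-2-2b-it-abliterated-GGUF --include "gemma-2-2b-it-abliterated-Q8_0/*" --local-dir ./
```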
### Running the Model

You can run the quantized models in LM Studio.
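
As an alternative to LM Studio, any llama.cpp-based runtime can load these GGUF files. The sketch below assumes you have built a recent llama.cpp and that the `llama-server` binary is on your PATH:

```
# Serve the Q4_K_M quant over llama.cpp's built-in HTTP server
# -c sets the context size, -ngl offloads layers to the GPU (use 0 for CPU-only)
llama-server -m ./gemma-2-2b-it-abliterated-Q4_K_M.gguf -c 4096 -ngl 99 --port 8080
```

Once the server is running, requests can be sent to `http://localhost:8080`.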
## Features

- Multiple Quantization Types: Offers a wide range of quantization types, including f32, Q8_0, Q6_K_L, etc., to balance model quality and file size.
- Embed/Output Weights Optimization: Some quantizations use Q8_0 for the embedding and output weights, which may improve model quality.
- Easy Download: Supports downloading specific files or split files using the `huggingface-cli`.
## Installation

The installation mainly involves installing the `huggingface-cli` and downloading the desired quantized model files. Refer to the "Quick Start" section for detailed steps.
## Usage Examples

### Prompt Format

The following is the prompt format for this model. Note that this model does not support a system prompt.

```
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
<end_of_turn>
<start_of_turn>model
```
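
As an illustration (not part of the original card), here is that template filled in with a concrete question and passed to llama.cpp's `llama-cli`; the binary name and flags assume a recent llama.cpp build:

```
# Single-turn generation using the Gemma chat format (no system prompt)
llama-cli -m ./gemma-2-2b-it-abliterated-Q4_K_M.gguf -n 256 \
  -p $'<bos><start_of_turn>user\nWhy is the sky blue?<end_of_turn>\n<start_of_turn>model\n'
```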
## Documentation

### Model Information

| Property | Details |
|----------|---------|
| Base Model | IlyaGusev/gemma-2-2b-it-abliterated |
| Language | en |
| License | gemma |
| Pipeline Tag | text-generation |
| Quantized By | bartowski |
### Downloadable Files

| Filename | Quant type | File Size | Split | Description |
|----------|------------|-----------|-------|-------------|
| gemma-2-2b-it-abliterated-f32.gguf | f32 | 10.46GB | false | Full F32 weights. |
| gemma-2-2b-it-abliterated-Q8_0.gguf | Q8_0 | 2.78GB | false | Extremely high quality, generally unneeded but max available quant. |
| gemma-2-2b-it-abliterated-Q6_K_L.gguf | Q6_K_L | 2.29GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| gemma-2-2b-it-abliterated-Q6_K.gguf | Q6_K | 2.15GB | false | Very high quality, near perfect, recommended. |
| gemma-2-2b-it-abliterated-Q5_K_L.gguf | Q5_K_L | 2.07GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| gemma-2-2b-it-abliterated-Q5_K_M.gguf | Q5_K_M | 1.92GB | false | High quality, recommended. |
| gemma-2-2b-it-abliterated-Q5_K_S.gguf | Q5_K_S | 1.88GB | false | High quality, recommended. |
| gemma-2-2b-it-abliterated-Q4_K_L.gguf | Q4_K_L | 1.85GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| gemma-2-2b-it-abliterated-Q4_K_M.gguf | Q4_K_M | 1.71GB | false | Good quality, default size for most use cases, recommended. |
| gemma-2-2b-it-abliterated-Q3_K_XL.gguf | Q3_K_XL | 1.69GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| gemma-2-2b-it-abliterated-Q4_K_S.gguf | Q4_K_S | 1.64GB | false | Slightly lower quality with more space savings, recommended. |
| gemma-2-2b-it-abliterated-IQ4_XS.gguf | IQ4_XS | 1.57GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| gemma-2-2b-it-abliterated-Q3_K_L.gguf | Q3_K_L | 1.55GB | false | Lower quality but usable, good for low RAM availability. |
| gemma-2-2b-it-abliterated-IQ3_M.gguf | IQ3_M | 1.39GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| gemma-2-2b-it-abliterated-Q2_K_L.gguf | Q2_K_L | 1.37GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
### Embed/Output Weights

Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method, but with the embedding and output weights quantized to Q8_0 instead of the normal default. Some users claim that this improves quality, while others don't notice any difference. If you use these models, please comment with your findings.
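
If you want to verify which tensor types a particular file actually uses, the `gguf` Python package (from the llama.cpp project) includes a dump script; this is a sketch assuming that package and its `gguf-dump` entry point are installed:

```
pip install gguf
# Print per-tensor info and keep only the embedding/output rows
gguf-dump ./gemma-2-2b-it-abliterated-Q4_K_L.gguf | grep -iE "token_embd|output"
```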
## Technical Details

### Quantization Process

The quantization is performed using llama.cpp release b3496. All quants are made using the imatrix option with the dataset from here.
### File Selection

A great write-up with charts comparing the performance of the various quants is provided by Artefact2 here. To choose the appropriate file, first determine how much RAM and/or VRAM you have. If you want the model to run as fast as possible, aim to fit the whole model in your GPU's VRAM. If you want maximum quality, add your system RAM and your GPU's VRAM together, then choose a quant with a file size 1-2GB smaller than that total.
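
As a rough way to check those numbers on a Linux machine with an NVIDIA GPU (other platforms need different commands), you could run:

```
# Total and available system RAM
free -h
# Total VRAM per NVIDIA GPU
nvidia-smi --query-gpu=name,memory.total --format=csv
```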
## License

The model is licensed under the gemma license.
## Credits

- Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
- Thank you ZeroWw for the inspiration to experiment with embed/output weights.
## Usage Tip

If you want to support the developer's work, you can visit the ko-fi page: https://ko-fi.com/bartowski