Llamacpp imatrix Quantizations of WizardLM-2-7B-abliterated
This project provides quantized versions of the WizardLM-2-7B-abliterated model using llama.cpp, offering various quantization types to meet different performance and quality requirements.
Quick Start
Prerequisites
First, make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
Download a Specific File
You can target the specific file you want:
huggingface-cli download bartowski/WizardLM-2-7B-abliterated-GGUF --include "WizardLM-2-7B-abliterated-Q4_K_M.gguf" --local-dir ./
Download Split Files
If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:
huggingface-cli download bartowski/WizardLM-2-7B-abliterated-GGUF --include "WizardLM-2-7B-abliterated-Q8_0.gguf/*" --local-dir WizardLM-2-7B-abliterated-Q8_0
You can either specify a new local-dir (e.g., WizardLM-2-7B-abliterated-Q8_0) or download them all in place (./).
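None of the quants in this repo are large enough to be split, but for sharded downloads in general, llama.cpp can load the model when pointed at the first shard and will find the remaining parts on its own. The shard name below is only a placeholder, not a file in this repo:

```bash
./main -m ./WizardLM-2-7B-abliterated-Q8_0/WizardLM-2-7B-abliterated-Q8_0-00001-of-00002.gguf -p "Hello"
```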
Features
- Quantization: Performed with llama.cpp release b2965.
- Multiple Quantization Types: All quants are made using the imatrix option with a calibration dataset from here (a sketch of this workflow is shown below).
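The exact commands used for this repo are not reproduced here, but a rough sketch of the usual imatrix workflow with llama.cpp's bundled tools looks like the following; the fp16 input name and calibration.txt are placeholders:

```bash
# From a llama.cpp checkout built at release b2965:

# 1. Compute an importance matrix over a calibration text file
#    (calibration.txt stands in for the calibration dataset).
./imatrix -m WizardLM-2-7B-abliterated-f16.gguf -f calibration.txt -o imatrix.dat

# 2. Quantize the fp16 GGUF, weighting the quantization with that matrix.
./quantize --imatrix imatrix.dat \
  WizardLM-2-7B-abliterated-f16.gguf \
  WizardLM-2-7B-abliterated-Q4_K_M.gguf \
  Q4_K_M
```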
Documentation
Original Model
The original model can be found at: https://huggingface.co/fearlessdots/WizardLM-2-7B-abliterated
Prompt Format
{system_prompt} USER: {prompt} ASSISTANT: </s>
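As a minimal sketch of how to use this template (the file name, prompt, and flag values below are illustrative, not prescriptive), it can be passed straight to llama.cpp's main binary:

```bash
./main -m ./WizardLM-2-7B-abliterated-Q4_K_M.gguf \
  -p "You are a helpful assistant. USER: Explain what an imatrix quant is. ASSISTANT:" \
  -n 256 -ngl 99   # -n limits generated tokens, -ngl offloads layers to the GPU
```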
Download Options
You can download a file (not the whole branch) from the following table:
| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| WizardLM-2-7B-abliterated-Q8_0.gguf | Q8_0 | 7.69GB | Extremely high quality, generally unneeded but max available quant. |
| WizardLM-2-7B-abliterated-Q6_K.gguf | Q6_K | 5.94GB | Very high quality, near perfect, recommended. |
| WizardLM-2-7B-abliterated-Q5_K_M.gguf | Q5_K_M | 5.13GB | High quality, recommended. |
| WizardLM-2-7B-abliterated-Q5_K_S.gguf | Q5_K_S | 4.99GB | High quality, recommended. |
| WizardLM-2-7B-abliterated-Q4_K_M.gguf | Q4_K_M | 4.36GB | Good quality, uses about 4.83 bits per weight, recommended. |
| WizardLM-2-7B-abliterated-Q4_K_S.gguf | Q4_K_S | 4.14GB | Slightly lower quality with more space savings, recommended. |
| WizardLM-2-7B-abliterated-IQ4_NL.gguf | IQ4_NL | 4.12GB | Decent quality, slightly smaller than Q4_K_S with similar performance, recommended. |
| WizardLM-2-7B-abliterated-IQ4_XS.gguf | IQ4_XS | 3.90GB | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| WizardLM-2-7B-abliterated-Q3_K_L.gguf | Q3_K_L | 3.82GB | Lower quality but usable, good for low RAM availability. |
| WizardLM-2-7B-abliterated-Q3_K_M.gguf | Q3_K_M | 3.51GB | Even lower quality. |
| WizardLM-2-7B-abliterated-IQ3_M.gguf | IQ3_M | 3.28GB | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| WizardLM-2-7B-abliterated-IQ3_S.gguf | IQ3_S | 3.18GB | Lower quality, new method with decent performance, recommended over Q3_K_S quant, same size with better performance. |
| WizardLM-2-7B-abliterated-Q3_K_S.gguf | Q3_K_S | 3.16GB | Low quality, not recommended. |
| WizardLM-2-7B-abliterated-IQ3_XS.gguf | IQ3_XS | 3.01GB | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| WizardLM-2-7B-abliterated-IQ3_XXS.gguf | IQ3_XXS | 2.82GB | Lower quality, new method with decent performance, comparable to Q3 quants. |
| WizardLM-2-7B-abliterated-Q2_K.gguf | Q2_K | 2.71GB | Very low quality but surprisingly usable. |
| WizardLM-2-7B-abliterated-IQ2_M.gguf | IQ2_M | 2.50GB | Very low quality, uses SOTA techniques to also be surprisingly usable. |
| WizardLM-2-7B-abliterated-IQ2_S.gguf | IQ2_S | 2.31GB | Very low quality, uses SOTA techniques to be usable. |
| WizardLM-2-7B-abliterated-IQ2_XS.gguf | IQ2_XS | 2.19GB | Very low quality, uses SOTA techniques to be usable. |
| WizardLM-2-7B-abliterated-IQ2_XXS.gguf | IQ2_XXS | 1.99GB | Lower quality, uses SOTA techniques to be usable. |
| WizardLM-2-7B-abliterated-IQ1_M.gguf | IQ1_M | 1.75GB | Extremely low quality, not recommended. |
| WizardLM-2-7B-abliterated-IQ1_S.gguf | IQ1_S | 1.61GB | Extremely low quality, not recommended. |
Choosing the Right File
A great write-up with charts showing various performances is provided by Artefact2 here.
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
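As a rough illustration of the sizing rule above: with 8GB of VRAM, Q6_K (5.94GB) or Q5_K_M (5.13GB) leave roughly 2-3GB of headroom for context and overhead, while Q8_0 (7.69GB) would only make sense if you're also willing to spill into system RAM.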
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M.
If you want to get more into the weeds, you can check out this extremely useful feature chart:
llama.cpp feature matrix
But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better performance for their size.
These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.
The I-quants are not compatible with Vulkan, which also targets AMD cards, so if you have an AMD card, double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
License
This project is licensed under the Apache 2.0 license.