Llamacpp imatrix Quantizations of glm-4-9b-chat-abliterated
This project provides quantized versions of the glm-4-9b-chat-abliterated model using llama.cpp. It offers various quantization types to suit different hardware and performance requirements.
Model Information
| Property | Details |
| --- | --- |
| Base Model | byroneverson/glm-4-9b-chat-abliterated |
| Language | zh, en |
| Library Name | transformers |
| License | other (glm-4) |
| License Link | https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/LICENSE |
| Pipeline Tag | text-generation |
| Tags | glm, chatglm, thudm, chat, abliterated |
| Quantized By | bartowski |
Quick Start
Quantization Process
We used llama.cpp release b3634 for quantization. The original model can be found at https://huggingface.co/byroneverson/glm-4-9b-chat-abliterated. All quantizations were made using the imatrix option with a dataset from here.
Running the Model
You can run these quantized models in LM Studio.
Usage Examples
Prompt Format
[gMASK] <sop> <|system|>
{system_prompt} <|user|>
{prompt} <|assistant|>
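As a minimal sketch, the template above could be filled in from Python as follows. The helper name `build_prompt` is hypothetical, and the spacing and line breaks simply mirror the template as printed here; they have not been checked against the model's tokenizer configuration.

```python
def build_prompt(system_prompt: str, prompt: str) -> str:
    """Fill the chat template shown above.

    Assumption: whitespace and newlines copy the template exactly as it
    is printed in this README.
    """
    return (
        "[gMASK] <sop> <|system|>\n"
        f"{system_prompt} <|user|>\n"
        f"{prompt} <|assistant|>"
    )


# Example: text = build_prompt("You are a helpful assistant.", "Hello!")
```

Most front ends, LM Studio included, normally apply a chat template for you, so a helper like this is mainly useful when sending raw completion requests.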
Installation
Downloading a Single File
First, make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
Then, you can target the specific file you want:
huggingface-cli download bartowski/glm-4-9b-chat-abliterated-GGUF --include "glm-4-9b-chat-abliterated-Q4_K_M.gguf" --local-dir ./
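The same single-file download can also be done from Python. The sketch below assumes the `huggingface_hub` package installed by the pip command above and uses its `hf_hub_download` function; the helper names `gguf_filename` and `download_quant` are hypothetical.

```python
REPO_ID = "bartowski/glm-4-9b-chat-abliterated-GGUF"


def gguf_filename(quant: str) -> str:
    """Build the file name for a quant type such as 'Q4_K_M'."""
    return f"glm-4-9b-chat-abliterated-{quant}.gguf"


def download_quant(quant: str, local_dir: str = "./") -> str:
    """Download one GGUF file and return its local path."""
    # Imported here so gguf_filename() works even without the package.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )


# Example (downloads about 6.25GB): path = download_quant("Q4_K_M")
```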
Downloading Split Files
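If the model is bigger than 50GB, it will have been split into multiple files. None of the files in the table below are split (the largest, f16, is 18.81GB), so the following command only illustrates the pattern. To download every part of a split quant to a local folder, run: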
huggingface-cli download bartowski/glm-4-9b-chat-abliterated-GGUF --include "glm-4-9b-chat-abliterated-Q8_0/*" --local-dir ./
You can either specify a new local-dir (e.g., glm-4-9b-chat-abliterated-Q8_0) or download them all in place (./).
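A rough Python equivalent of the split-file command, assuming `huggingface_hub` is installed and using its `snapshot_download` function with an `allow_patterns` filter. The helper names `split_pattern` and `download_split` are hypothetical, and the subfolder layout is inferred from the `--include` pattern in the command above.

```python
REPO_ID = "bartowski/glm-4-9b-chat-abliterated-GGUF"


def split_pattern(quant: str) -> str:
    """Glob that matches every part inside a quant's subfolder."""
    return f"glm-4-9b-chat-abliterated-{quant}/*"


def download_split(quant: str, local_dir: str = "./") -> str:
    """Download all files matching the quant's subfolder pattern."""
    # Imported here so split_pattern() works even without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[split_pattern(quant)],
        local_dir=local_dir,
    )


# Example: folder = download_split("Q8_0", local_dir="glm-4-9b-chat-abliterated-Q8_0")
```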
Documentation
File Download Table
| Filename | Quant type | File Size | Split | Description |
| --- | --- | --- | --- | --- |
| glm-4-9b-chat-abliterated-f16.gguf | f16 | 18.81GB | false | Full F16 weights. |
| glm-4-9b-chat-abliterated-Q8_0.gguf | Q8_0 | 9.99GB | false | Extremely high quality, generally unneeded but max available quant. |
| glm-4-9b-chat-abliterated-Q6_K_L.gguf | Q6_K_L | 8.56GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| glm-4-9b-chat-abliterated-Q6_K.gguf | Q6_K | 8.26GB | false | Very high quality, near perfect, recommended. |
| glm-4-9b-chat-abliterated-Q5_K_L.gguf | Q5_K_L | 7.53GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| glm-4-9b-chat-abliterated-Q5_K_M.gguf | Q5_K_M | 7.14GB | false | High quality, recommended. |
| glm-4-9b-chat-abliterated-Q4_K_L.gguf | Q4_K_L | 6.71GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| glm-4-9b-chat-abliterated-Q5_K_S.gguf | Q5_K_S | 6.69GB | false | High quality, recommended. |
| glm-4-9b-chat-abliterated-Q4_K_M.gguf | Q4_K_M | 6.25GB | false | Good quality, default size for most use cases, recommended. |
| glm-4-9b-chat-abliterated-Q3_K_XL.gguf | Q3_K_XL | 5.82GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| glm-4-9b-chat-abliterated-Q4_K_S.gguf | Q4_K_S | 5.75GB | false | Slightly lower quality with more space savings, recommended. |
| glm-4-9b-chat-abliterated-Q4_0.gguf | Q4_0 | 5.47GB | false | Legacy format, generally not worth using over similarly sized formats. |
| glm-4-9b-chat-abliterated-Q4_0_8_8.gguf | Q4_0_8_8 | 5.46GB | false | Optimized for ARM and CPU inference, much faster than Q4_0 at similar quality. |
| glm-4-9b-chat-abliterated-Q4_0_4_8.gguf | Q4_0_4_8 | 5.46GB | false | Optimized for ARM and CPU inference, much faster than Q4_0 at similar quality. |
| glm-4-9b-chat-abliterated-Q4_0_4_4.gguf | Q4_0_4_4 | 5.46GB | false | Optimized for ARM and CPU inference, much faster than Q4_0 at similar quality. |
| glm-4-9b-chat-abliterated-Q3_K_L.gguf | Q3_K_L | 5.28GB | false | Lower quality but usable, good for low RAM availability. |
| glm-4-9b-chat-abliterated-IQ4_XS.gguf | IQ4_XS | 5.25GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| glm-4-9b-chat-abliterated-Q3_K_M.gguf | Q3_K_M | 5.06GB | false | Low quality. |
| glm-4-9b-chat-abliterated-IQ3_M.gguf | IQ3_M | 4.81GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| glm-4-9b-chat-abliterated-Q2_K_L.gguf | Q2_K_L | 4.60GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| glm-4-9b-chat-abliterated-Q3_K_S.gguf | Q3_K_S | 4.59GB | false | Low quality, not recommended. |
| glm-4-9b-chat-abliterated-IQ3_XS.gguf | IQ3_XS | 4.43GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| glm-4-9b-chat-abliterated-Q2_K.gguf | Q2_K | 3.99GB | false | Very low quality but surprisingly usable. |
| glm-4-9b-chat-abliterated-IQ2_M.gguf | IQ2_M | 3.93GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
Embed/Output Weights
Some of these quantizations (e.g., Q3_K_XL, Q4_K_L) use the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of the normal default. Some users claim that this improves the quality, while others don't notice any difference. If you use these models, please comment with your findings. I'd like feedback to ensure these quantizations are actually useful.
Model Selection
A great write-up with charts showing various performances is provided by Artefact2 here.
The first step is to determine the size of the model you can run. You'll need to know how much RAM and/or VRAM you have.
- If you want your model to run as fast as possible, aim to fit the whole model on your GPU's VRAM. Choose a quantization with a file size 1-2GB smaller than your GPU's total VRAM.
- If you want the highest possible quality, add your system RAM and your GPU's VRAM together, then select a quantization with a file size 1-2GB smaller than that total.
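The sizing rule in the two bullets above can be sketched as a small lookup. The file sizes are copied from the File Download Table; the function name `largest_fitting_quant` and the default 2GB headroom are assumptions for illustration, and the choice is made on size alone, ignoring the quality notes in the table.

```python
# (quant type, file size in GB), copied from the File Download Table.
QUANT_SIZES_GB = [
    ("f16", 18.81), ("Q8_0", 9.99), ("Q6_K_L", 8.56), ("Q6_K", 8.26),
    ("Q5_K_L", 7.53), ("Q5_K_M", 7.14), ("Q4_K_L", 6.71), ("Q5_K_S", 6.69),
    ("Q4_K_M", 6.25), ("Q3_K_XL", 5.82), ("Q4_K_S", 5.75), ("Q4_0", 5.47),
    ("Q4_0_8_8", 5.46), ("Q4_0_4_8", 5.46), ("Q4_0_4_4", 5.46),
    ("Q3_K_L", 5.28), ("IQ4_XS", 5.25), ("Q3_K_M", 5.06), ("IQ3_M", 4.81),
    ("Q2_K_L", 4.60), ("Q3_K_S", 4.59), ("IQ3_XS", 4.43), ("Q2_K", 3.99),
    ("IQ2_M", 3.93),
]


def largest_fitting_quant(vram_gb, ram_gb=0.0, headroom_gb=2.0):
    """Return the largest (quant, size) that fits, or None if nothing does.

    For maximum speed pass only vram_gb; for maximum quality also pass
    ram_gb so that system RAM counts toward the budget.
    """
    budget = vram_gb + ram_gb - headroom_gb
    fitting = [q for q in QUANT_SIZES_GB if q[1] <= budget]
    # max() by file size picks the biggest file that still fits.
    return max(fitting, key=lambda q: q[1]) if fitting else None


# Example: largest_fitting_quant(8.0) considers files of at most 6.0GB.
```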
Next, decide whether you want to use an 'I-quant' or a 'K-quant'.
- If you don't want to think too much, choose one of the K-quants (e.g., Q5_K_M).
- If you want more detailed information, check out the llama.cpp feature matrix. Generally, if you're aiming for below Q4 and using cuBLAS (Nvidia) or rocBLAS (AMD), consider the I-quants (e.g., IQ3_M). These are newer and offer better performance for their size.
Note that the I-quants can be used on CPU and Apple Metal, but they'll be slower than their K-quant equivalents. Also, the I-quants are not compatible with Vulkan (AMD). Make sure to double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
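The I-quant versus K-quant guidance above can be condensed into a rule-of-thumb function. This is only a sketch of the advice in this section: the function name, the backend labels, and the `prefer_speed` switch are hypothetical, and the CPU/Metal branch encodes the speed trade-off described above rather than any measured result.

```python
KNOWN_BACKENDS = {"cublas", "rocblas", "vulkan", "cpu", "metal"}


def recommend_family(backend: str, below_q4: bool, prefer_speed: bool = True) -> str:
    """Return 'I-quant' or 'K-quant' following the guidance in this section."""
    backend = backend.lower()
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    # I-quants are described as incompatible with the Vulkan build, and
    # they are only suggested when aiming below Q4.
    if backend == "vulkan" or not below_q4:
        return "K-quant"
    # cuBLAS (Nvidia) and rocBLAS (AMD): I-quants are the suggested pick.
    if backend in ("cublas", "rocblas"):
        return "I-quant"
    # CPU and Apple Metal: I-quants work but run slower than K-quants.
    return "K-quant" if prefer_speed else "I-quant"


# Example: recommend_family("rocBLAS", below_q4=True)
```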
Technical Details
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method, but with the embeddings and output weights quantized to Q8_0 instead of their normal default.
License
This project uses the glm-4 license. You can find the full license details here.
Credits
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
Thank you ZeroWw for the inspiration to experiment with embed/output.
If you'd like to support my work, visit my ko-fi page here.