GLM-4-9B-Chat-Abliterated GGUF Open-Source Chat Model - Supports Chinese and English Conversations, Compatible with Diverse Hardware

glm-4-9b-chat-abliterated GGUF

Developed by bartowski

A 9B-parameter chat model based on GLM-4 architecture, supporting Chinese and English dialogues, quantized for various hardware environments

Large Language Model; Supports Multiple Languages; Open Source License: Other; #Bilingual Chinese-English Dialogue #High-precision Quantization #Low Memory Usage

Downloads 2,676

Release Time: 4/25/2025

Model Overview

This is a 9B-parameter chat model based on the GLM-4 architecture that supports both Chinese and English dialogue. It is published at multiple quantization levels, so a suitable file can be chosen for different hardware, including resource-constrained devices.

Model Features

Multiple Quantization Versions

Offers various quantization versions from F16 to IQ2_M to meet different hardware requirements

Bilingual Chinese-English Support

Specially optimized for dialogue exchanges in both Chinese and English

Efficient Inference

Optimized to run efficiently on resource-constrained devices

imatrix Quantization

Uses llama.cpp's imatrix option for quantization to improve quality

Model Capabilities

Text Generation

Dialogue System

Bilingual Chinese-English Processing

Chat Applications

Use Cases

Intelligent Assistant

Daily Q&A

Answers various daily questions from users

Provides accurate and fluent responses

Language Learning

Assists in bilingual Chinese-English learning

Offers a natural language exchange experience

Embedded Applications

Localized Chat Applications

Deploys chat functionality on resource-constrained devices

Achieves smooth dialogue under limited resources

🚀 llama.cpp imatrix Quantizations of glm-4-9b-chat-abliterated

This project provides quantized versions of the glm-4-9b-chat-abliterated model using llama.cpp. It offers various quantization types to suit different hardware and performance requirements.

Model Information

Property	Details
Base Model	byroneverson/glm-4-9b-chat-abliterated
Language	zh, en
Library Name	transformers
License	other (glm-4)
License Link	https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/LICENSE
Pipeline Tag	text-generation
Tags	glm, chatglm, thudm, chat, abliterated
Quantized By	bartowski

🚀 Quick Start

Quantization Process

We used llama.cpp release b3634 for quantization. The original model can be found at https://huggingface.co/byroneverson/glm-4-9b-chat-abliterated. All quantizations were made using the imatrix option with a calibration dataset (linked from the original model card).

Running the Model

You can run these quantized models in LM Studio.

💻 Usage Examples

Prompt Format

[gMASK] <sop> <|system|> 
{system_prompt} <|user|> 
{prompt} <|assistant|>
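
As a minimal sketch of how the template above might be filled in from code, the helper below substitutes the two placeholders. The function name `build_glm4_prompt` is hypothetical, and the exact whitespace (single spaces between tokens, a newline after `<|system|>` and `<|user|>`) is an assumption read off the template as printed; the source does not say whether the tokenizer is sensitive to it.

```python
def build_glm4_prompt(user_prompt: str, system_prompt: str = "") -> str:
    """Fill the three-line prompt template with a system and a user message.

    Assumption: whitespace follows the template as printed, without
    trailing spaces at the end of each line.
    """
    return (
        "[gMASK] <sop> <|system|>\n"
        f"{system_prompt} <|user|>\n"
        f"{user_prompt} <|assistant|>"
    )


if __name__ == "__main__":
    # Print a filled-in prompt for a quick visual check.
    print(build_glm4_prompt("What is GGUF?", "You are a helpful assistant."))
```

The returned string ends at `<|assistant|>`, so the model's reply is generated as the continuation.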

📦 Installation

Downloading a Single File

First, make sure you have huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Then, you can target the specific file you want:

huggingface-cli download bartowski/glm-4-9b-chat-abliterated-GGUF --include "glm-4-9b-chat-abliterated-Q4_K_M.gguf" --local-dir ./
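
The same single-file download can probably also be done from Python. The sketch below assumes the `huggingface_hub` package installed by the command above and uses its `hf_hub_download` function; the helper names `gguf_filename` and `download_quant` are hypothetical, and the filename pattern is inferred from the files listed in the download table.

```python
REPO_ID = "bartowski/glm-4-9b-chat-abliterated-GGUF"


def gguf_filename(quant: str) -> str:
    """Map a quant type from the download table (e.g. "Q4_K_M") to its filename."""
    return f"glm-4-9b-chat-abliterated-{quant}.gguf"


def download_quant(quant: str, local_dir: str = ".") -> str:
    """Download one GGUF file and return its local path (needs network access)."""
    # Imported lazily so gguf_filename() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )


# Example (performs a multi-gigabyte download, so it is left commented out):
# path = download_quant("Q4_K_M")
```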

Downloading Split Files

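If a model is bigger than 50GB, it will have been split into multiple files. None of the files listed in the download table for this model is split, so the command below only illustrates the pattern. To download all parts of a split model to a local folder, run: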

huggingface-cli download bartowski/glm-4-9b-chat-abliterated-GGUF --include "glm-4-9b-chat-abliterated-Q8_0/*" --local-dir ./

You can either specify a new local-dir (e.g., glm-4-9b-chat-abliterated-Q8_0) or download them all in place (./).
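
For completeness, here is a hedged Python counterpart to the folder download, using `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter that mirrors the `--include` glob. The helper names are hypothetical, and the folder layout is assumed to match the glob in the command above.

```python
REPO_ID = "bartowski/glm-4-9b-chat-abliterated-GGUF"


def split_pattern(quant: str) -> str:
    """Glob matching every file inside the per-quant folder of a split model."""
    return f"glm-4-9b-chat-abliterated-{quant}/*"


def download_split(quant: str, local_dir: str = ".") -> str:
    """Download all parts of a split quant; returns the local folder path."""
    # Imported lazily so split_pattern() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[split_pattern(quant)],
        local_dir=local_dir,
    )


# Example (needs network access, so it is left commented out):
# folder = download_split("Q8_0", local_dir="glm-4-9b-chat-abliterated-Q8_0")
```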

📚 Documentation

File Download Table

Filename	Quant type	File Size	Split	Description
glm-4-9b-chat-abliterated-f16.gguf	f16	18.81GB	false	Full F16 weights.
glm-4-9b-chat-abliterated-Q8_0.gguf	Q8_0	9.99GB	false	Extremely high quality, generally unneeded but max available quant.
glm-4-9b-chat-abliterated-Q6_K_L.gguf	Q6_K_L	8.56GB	false	Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended.
glm-4-9b-chat-abliterated-Q6_K.gguf	Q6_K	8.26GB	false	Very high quality, near perfect, recommended.
glm-4-9b-chat-abliterated-Q5_K_L.gguf	Q5_K_L	7.53GB	false	Uses Q8_0 for embed and output weights. High quality, recommended.
glm-4-9b-chat-abliterated-Q5_K_M.gguf	Q5_K_M	7.14GB	false	High quality, recommended.
glm-4-9b-chat-abliterated-Q4_K_L.gguf	Q4_K_L	6.71GB	false	Uses Q8_0 for embed and output weights. Good quality, recommended.
glm-4-9b-chat-abliterated-Q5_K_S.gguf	Q5_K_S	6.69GB	false	High quality, recommended.
glm-4-9b-chat-abliterated-Q4_K_M.gguf	Q4_K_M	6.25GB	false	Good quality, default size for most use cases, recommended.
glm-4-9b-chat-abliterated-Q3_K_XL.gguf	Q3_K_XL	5.82GB	false	Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability.
glm-4-9b-chat-abliterated-Q4_K_S.gguf	Q4_K_S	5.75GB	false	Slightly lower quality with more space savings, recommended.
glm-4-9b-chat-abliterated-Q4_0.gguf	Q4_0	5.47GB	false	Legacy format, generally not worth using over similarly sized formats.
glm-4-9b-chat-abliterated-Q4_0_8_8.gguf	Q4_0_8_8	5.46GB	false	Optimized for ARM and CPU inference, much faster than Q4_0 at similar quality.
glm-4-9b-chat-abliterated-Q4_0_4_8.gguf	Q4_0_4_8	5.46GB	false	Optimized for ARM and CPU inference, much faster than Q4_0 at similar quality.
glm-4-9b-chat-abliterated-Q4_0_4_4.gguf	Q4_0_4_4	5.46GB	false	Optimized for ARM and CPU inference, much faster than Q4_0 at similar quality.
glm-4-9b-chat-abliterated-Q3_K_L.gguf	Q3_K_L	5.28GB	false	Lower quality but usable, good for low RAM availability.
glm-4-9b-chat-abliterated-IQ4_XS.gguf	IQ4_XS	5.25GB	false	Decent quality, smaller than Q4_K_S with similar performance, recommended.
glm-4-9b-chat-abliterated-Q3_K_M.gguf	Q3_K_M	5.06GB	false	Low quality.
glm-4-9b-chat-abliterated-IQ3_M.gguf	IQ3_M	4.81GB	false	Medium-low quality, new method with decent performance comparable to Q3_K_M.
glm-4-9b-chat-abliterated-Q2_K_L.gguf	Q2_K_L	4.60GB	false	Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable.
glm-4-9b-chat-abliterated-Q3_K_S.gguf	Q3_K_S	4.59GB	false	Low quality, not recommended.
glm-4-9b-chat-abliterated-IQ3_XS.gguf	IQ3_XS	4.43GB	false	Lower quality, new method with decent performance, slightly better than Q3_K_S.
glm-4-9b-chat-abliterated-Q2_K.gguf	Q2_K	3.99GB	false	Very low quality but surprisingly usable.
glm-4-9b-chat-abliterated-IQ2_M.gguf	IQ2_M	3.93GB	false	Relatively low quality, uses SOTA techniques to be surprisingly usable.

Embed/Output Weights

Some of these quantizations (e.g., Q3_K_XL, Q4_K_L) use the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of the normal default. Some users claim that this improves the quality, while others don't notice any difference. If you use these models, please comment with your findings. I'd like feedback to ensure these quantizations are actually useful.

Model Selection

Artefact2 has published a detailed write-up with charts comparing the performance of the various quantization types.

The first step is to determine the size of the model you can run. You'll need to know how much RAM and/or VRAM you have.

If you want your model to run as fast as possible, aim to fit the whole model on your GPU's VRAM. Choose a quantization with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the highest possible quality, add your system RAM and your GPU's VRAM together, then select a quantization with a file size 1-2GB smaller than that total.
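
The sizing rule above can be sketched as a small lookup. The sizes are copied from the download table; the function name `pick_quant` and the default 2GB headroom (the conservative end of the "1-2GB" guideline) are assumptions. The sketch picks purely by file size, which is only a rough proxy for quality: a slightly smaller file of a higher-bit type may still be the better choice.

```python
from typing import Optional

# Quant type -> file size in GB, copied from the download table.
QUANT_SIZES_GB = {
    "f16": 18.81, "Q8_0": 9.99, "Q6_K_L": 8.56, "Q6_K": 8.26,
    "Q5_K_L": 7.53, "Q5_K_M": 7.14, "Q4_K_L": 6.71, "Q5_K_S": 6.69,
    "Q4_K_M": 6.25, "Q3_K_XL": 5.82, "Q4_K_S": 5.75, "Q4_0": 5.47,
    "Q4_0_8_8": 5.46, "Q4_0_4_8": 5.46, "Q4_0_4_4": 5.46,
    "Q3_K_L": 5.28, "IQ4_XS": 5.25, "Q3_K_M": 5.06, "IQ3_M": 4.81,
    "Q2_K_L": 4.60, "Q3_K_S": 4.59, "IQ3_XS": 4.43, "Q2_K": 3.99,
    "IQ2_M": 3.93,
}


def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quant whose file fits in memory_gb minus headroom.

    Pass VRAM alone for maximum speed, or RAM + VRAM for maximum quality.
    Returns None when nothing fits.
    """
    budget = memory_gb - headroom_gb
    fitting = {q: size for q, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    print(pick_quant(8.0))        # 8GB of VRAM, 2GB headroom
    print(pick_quant(8.0 + 16.0)) # 8GB VRAM plus 16GB system RAM
```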

Next, decide whether you want to use an 'I-quant' or a 'K-quant'.

If you don't want to think too much, choose one of the K-quants (e.g., Q5_K_M).
If you want more detailed information, check out the llama.cpp feature matrix. Generally, if you're aiming for below Q4 and using cuBLAS (Nvidia) or rocBLAS (AMD), consider the I-quants (e.g., IQ3_M). These are newer and offer better performance for their size.

Note that the I-quants can be used on CPU and Apple Metal, but they'll be slower than their K-quant equivalents. Also, the I-quants are not compatible with Vulkan (AMD). Make sure to double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
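
The I-quant versus K-quant guidance can be condensed into a few lines of logic. This is only a sketch of the rules of thumb stated above, not an official compatibility check; the function name and the backend labels (`"cublas"`, `"rocblas"`, `"vulkan"`, `"metal"`, `"cpu"`) are hypothetical strings chosen for illustration.

```python
def recommend_family(backend: str, below_q4: bool) -> str:
    """Suggest "I-quant" or "K-quant" from the rules of thumb in this section."""
    backend = backend.lower()
    if backend == "vulkan":
        # The text states that I-quants are not compatible with this build.
        return "K-quant"
    if below_q4 and backend in {"cublas", "rocblas"}:
        # Below Q4 on these GPU backends, I-quants are the suggested choice.
        return "I-quant"
    # CPU and Metal can run I-quants, but more slowly than the K-quant
    # equivalents, so this sketch defaults to K-quants everywhere else.
    return "K-quant"


if __name__ == "__main__":
    print(recommend_family("cuBLAS", below_q4=True))
    print(recommend_family("Metal", below_q4=True))
```

Defaulting to K-quants on CPU and Metal trades some quality per byte for speed; anyone who prefers the smaller I-quant files on those backends can still use them.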

🔧 Technical Details

Some of these quants (e.g., Q3_K_XL, Q4_K_L) use the standard quantization method, with the embeddings and output weights quantized to Q8_0 instead of their normal default.

📄 License

This project uses the glm-4 license. The full license text is available at https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/LICENSE.

👏 Credits

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

If you'd like to support my work, visit my ko-fi page here.
