🚀 Llamacpp imatrix Quantizations of gemma-3-4b-it-qat by google
These quantizations are derived from the QAT (quantization-aware training) weights provided by Google. They offer different levels of quality and performance for various use cases.
🚀 Quick Start
Run in LM Studio
You can run these quantized models in LM Studio.
Run with llama.cpp
Run them directly with llama.cpp, or any other llama.cpp-based project.
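As a minimal sketch (mirroring the Python usage examples further below), you can invoke the llama-cli binary from Python; the binary path, model filename, prompt, and token count here are assumptions — adjust them to your own build and download location:

```python
import subprocess

# Assumes llama.cpp has been built and the Q4_K_M GGUF was downloaded
# into the current directory (see the Installation section below).
subprocess.run([
    "./llama-cli",                                 # path to your llama.cpp binary (assumption)
    "-m", "google_gemma-3-4b-it-qat-Q4_K_M.gguf",  # model file
    "-p", "Why is the sky blue?",                  # prompt
    "-n", "128",                                   # max tokens to generate
])
```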
Prompt Format
```
<bos><start_of_turn>user
{system_prompt}
{prompt}<end_of_turn>
<start_of_turn>model
```
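As a small illustration, the template above can be filled in by hand when sending raw prompts; note that LM Studio and llama.cpp's chat/conversation mode apply the model's built-in chat template automatically, so manual formatting like this is only needed for raw-prompt workflows (the example values are placeholders, not part of the model card):

```python
# Fill in the Gemma turn markers by hand for a raw (non-chat-mode) prompt.
system_prompt = "You are a concise assistant."
prompt = "Summarize what online repacking does."

formatted = (
    "<bos><start_of_turn>user\n"
    f"{system_prompt}\n"
    f"{prompt}<end_of_turn>\n"
    "<start_of_turn>model\n"
)
print(formatted)
```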
✨ Features
- Multiple Quantization Types: A wide range of quantization types (Q4_0, Q8_0, Q6_K_L, etc.) to meet different quality and performance requirements.
- Online Repacking: Some quantizations, such as Q4_0, support online repacking, which automatically optimizes weights for ARM and AVX CPU inference.
- Flexible Deployment: Can be run in LM Studio or directly with llama.cpp and other llama.cpp-based projects.
📦 Installation
Prerequisites
Make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```
Download a Specific File
huggingface-cli download bartowski/google_gemma-3-4b-it-qat-GGUF --include "google_gemma-3-4b-it-qat-Q4_K_M.gguf" --local-dir ./
Download Split Files
If the model is bigger than 50GB and split into multiple files, run:
huggingface-cli download bartowski/google_gemma-3-4b-it-qat-GGUF --include "google_gemma-3-4b-it-qat-Q8_0/*" --local-dir ./
You can either specify a new local-dir (e.g., `google_gemma-3-4b-it-qat-Q8_0`) or download them all in place (`./`).
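Once downloaded, the parts do not need to be merged by hand: pointing llama.cpp at the first shard is enough, since the remaining shards are loaded automatically. The sketch below uses an illustrative shard name (this 4B model's quants currently fit in single files, so no splits actually exist in this repo):

```python
import subprocess

# Hypothetical split layout; real shard names follow the
# <name>-00001-of-0000N.gguf pattern produced by llama-gguf-split.
subprocess.run([
    "./llama-cli",
    "-m", "google_gemma-3-4b-it-qat-Q8_0/google_gemma-3-4b-it-qat-Q8_0-00001-of-00002.gguf",
    "-p", "Hello",
])
```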
💻 Usage Examples
Downloading Files
```python
import os

os.system('huggingface-cli download bartowski/google_gemma-3-4b-it-qat-GGUF --include "google_gemma-3-4b-it-qat-Q4_K_M.gguf" --local-dir ./')
os.system('huggingface-cli download bartowski/google_gemma-3-4b-it-qat-GGUF --include "google_gemma-3-4b-it-qat-Q8_0/*" --local-dir ./')
```
📚 Documentation
Quantization Files Information
| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| gemma-3-4b-it-qat-bf16.gguf | bf16 | 7.77GB | false | Full BF16 weights. |
| gemma-3-4b-it-qat-Q8_0.gguf | Q8_0 | 4.13GB | false | Extremely high quality, generally unneeded but max available quant. |
| gemma-3-4b-it-qat-Q6_K_L.gguf | Q6_K_L | 3.35GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| gemma-3-4b-it-qat-Q6_K.gguf | Q6_K | 3.19GB | false | Very high quality, near perfect, recommended. |
| gemma-3-4b-it-qat-Q5_K_L.gguf | Q5_K_L | 2.99GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| gemma-3-4b-it-qat-Q5_K_M.gguf | Q5_K_M | 2.83GB | false | High quality, recommended. |
| gemma-3-4b-it-qat-Q5_K_S.gguf | Q5_K_S | 2.76GB | false | High quality, recommended. |
| gemma-3-4b-it-qat-Q4_K_L.gguf | Q4_K_L | 2.65GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| gemma-3-4b-it-qat-Q4_1.gguf | Q4_1 | 2.56GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| gemma-3-4b-it-qat-Q4_K_M.gguf | Q4_K_M | 2.49GB | false | Good quality, default size for most use cases, recommended. |
| gemma-3-4b-it-qat-Q3_K_XL.gguf | Q3_K_XL | 2.40GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| gemma-3-4b-it-qat-Q4_K_S.gguf | Q4_K_S | 2.38GB | false | Slightly lower quality with more space savings, recommended. |
| gemma-3-4b-it-qat-Q4_0.gguf | Q4_0 | 2.37GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| gemma-3-4b-it-qat-IQ4_NL.gguf | IQ4_NL | 2.36GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| gemma-3-4b-it-qat-IQ4_XS.gguf | IQ4_XS | 2.26GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| gemma-3-4b-it-qat-Q3_K_L.gguf | Q3_K_L | 2.24GB | false | Lower quality but usable, good for low RAM availability. |
| gemma-3-4b-it-qat-Q3_K_M.gguf | Q3_K_M | 2.10GB | false | Low quality. |
| gemma-3-4b-it-qat-IQ3_M.gguf | IQ3_M | 1.99GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| gemma-3-4b-it-qat-Q3_K_S.gguf | Q3_K_S | 1.94GB | false | Low quality, not recommended. |
| gemma-3-4b-it-qat-Q2_K_L.gguf | Q2_K_L | 1.89GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| gemma-3-4b-it-qat-IQ3_XS.gguf | IQ3_XS | 1.86GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| gemma-3-4b-it-qat-Q2_K.gguf | Q2_K | 1.73GB | false | Very low quality but surprisingly usable. |
| gemma-3-4b-it-qat-IQ3_XXS.gguf | IQ3_XXS | 1.69GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| gemma-3-4b-it-qat-IQ2_M.gguf | IQ2_M | 1.54GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method but with the embeddings and output weights quantized to Q8_0 instead of their usual default.
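For reference, a variant of this kind could be produced with llama.cpp's llama-quantize tool by overriding the tensor types for the token embedding and output tensors. The sketch below is an assumption-laden illustration, not the exact command used for this repo; the flag names and file paths should be checked against `llama-quantize --help` for your build:

```python
import subprocess

# Sketch only: producing a "_L" style quant (Q8_0 embed/output tensors)
# from the BF16 weights. Flags and paths are assumptions.
subprocess.run([
    "./llama-quantize",
    "--imatrix", "imatrix.dat",          # importance matrix (see Technical Details below)
    "--token-embedding-type", "q8_0",    # keep token embeddings at Q8_0
    "--output-tensor-type", "q8_0",      # keep the output tensor at Q8_0
    "gemma-3-4b-it-qat-bf16.gguf",       # input model
    "gemma-3-4b-it-qat-Q4_K_L.gguf",     # output model
    "Q4_K_M",                            # base quant type for the remaining tensors
])
```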
ARM/AVX Information
Previously, Q4_0_4_4/4_8/8_8 were used, which interleaved weights in memory for better performance on ARM and AVX machines. Now, some quantizations like Q4_0 support online repacking. As of llama.cpp build b4282, you should use Q4_0 instead of Q4_0_X_X.
Additionally, IQ4_NL can provide slightly better quality and also repacks weights for ARM, though currently only for the 4_4 case. Loading may be slower, but it results in an overall speed increase.
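If you want to see what repacking buys you on your own CPU, llama.cpp's llama-bench tool can compare throughput between, say, Q4_0 and IQ4_NL. A rough sketch (binary path, file names, and thread count are assumptions):

```python
import subprocess

# Compare CPU throughput of two quants; requires both files to be downloaded.
for model in ("google_gemma-3-4b-it-qat-Q4_0.gguf",
              "google_gemma-3-4b-it-qat-IQ4_NL.gguf"):
    subprocess.run(["./llama-bench", "-m", model, "-t", "8"])  # -t: CPU threads
```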
Which File to Choose
A great write-up with charts comparing the performance of the various quant types is provided by Artefact2 here. The first thing to figure out is...
🔧 Technical Details
Quantization Process
These quantizations are made using the imatrix option with a dataset from here, with llama.cpp release b5147 used for quantization.
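For context, an importance matrix is generated by running llama.cpp's llama-imatrix tool over a calibration text before quantizing. The sketch below uses placeholder file names (not the actual dataset used for this repo); verify flags against your llama.cpp build:

```python
import subprocess

# Generate an importance matrix from a calibration text; the result can be
# passed to llama-quantize via --imatrix (see the sketch above).
subprocess.run([
    "./llama-imatrix",
    "-m", "gemma-3-4b-it-qat-bf16.gguf",   # high-precision input model
    "-f", "calibration.txt",               # placeholder calibration dataset
    "-o", "imatrix.dat",                   # output importance matrix
])
```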
Online Repacking
Online repacking is a feature introduced in this PR. It can automatically optimize weights for ARM and AVX CPU inference.
📄 License
The license for this project is gemma. To access Gemma on Hugging Face, you're required to review and agree to Google's usage license. To do this, please ensure you're logged in to Hugging Face and acknowledge the license on the model page; requests are processed immediately.
| Property | Details |
| -------- | ------- |
| Quantized By | bartowski |
| Pipeline Tag | image-text-to-text |
| Tags | gemma3, gemma, google |
| License | gemma |
| Base Model | https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-unquantized |
| Base Model Relation | quantized |
⚠️ Important Note
As of llama.cpp build b4282, you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.
💡 Usage Tip
If you want slightly better quality on ARM, you can use IQ4_NL thanks to this PR; loading may be slower, but it will result in an overall speed increase.