🚀 Llamacpp imatrix Quantizations of gemma-3-1b-it-qat by google
This project provides quantized versions of Google's gemma-3-1b-it-qat model. These quantizations are derived from the QAT (quantization-aware training) weights released by Google. Only the Q4_0 quantization is expected to be better than its non-QAT equivalent, but other quantization types are also provided to explore different performance and quality trade-offs.
Key Information
- Quantized By: bartowski
- Pipeline Tag: text-generation
- Tags: gemma3, gemma, google
- License: gemma
- Base Model: google/gemma-3-1b-it-qat-q4_0-unquantized
Access Information
To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click the button below. Requests are processed immediately.
✨ Features
- Multiple Quantization Types: Offers a wide range of quantization types (e.g., Q4_0, Q8_0, Q6_K_L, etc.) to meet different performance and quality requirements.
- Online Repacking: Some quantizations (e.g., Q4_0) support online repacking for ARM and AVX CPU inference, which can improve performance.
- Compatibility: Can be run in LM Studio, directly with llama.cpp, or with any other llama.cpp-based project.
📦 Installation
Prerequisites
Make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
Downloading a Specific File
To download a specific file, use the following command:
huggingface-cli download bartowski/google_gemma-3-1b-it-qat-GGUF --include "google_gemma-3-1b-it-qat-Q4_K_M.gguf" --local-dir ./
Downloading Split Files
If the model is bigger than 50GB, it will have been split into multiple files. Use the following command to download them all to a local folder:
huggingface-cli download bartowski/google_gemma-3-1b-it-qat-GGUF --include "google_gemma-3-1b-it-qat-Q8_0/*" --local-dir ./
💻 Usage Examples
Prompt Format
<bos><start_of_turn>user
{system_prompt}
{prompt}<end_of_turn>
<start_of_turn>model
<end_of_turn>
<start_of_turn>model
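As a minimal sketch (assuming you have downloaded the Q4_K_M file as shown above and have a llama.cpp build available), the filled-in template can be passed to llama-cli as a raw prompt. Note that llama.cpp normally prepends <bos> itself, so it is omitted from the prompt string here:

```bash
# One-shot generation with the Gemma chat template filled in manually.
# <bos> is left out because llama-cli usually adds it automatically.
./llama-cli -m google_gemma-3-1b-it-qat-Q4_K_M.gguf -n 128 \
  -p "<start_of_turn>user
You are a helpful assistant.
Write a haiku about quantization.<end_of_turn>
<start_of_turn>model
"
```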
Running in LM Studio
You can run the quantized models in LM Studio.
Running with llama.cpp
You can also run the models directly with llama.cpp or any other llama.cpp-based project.
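For example (a sketch, assuming a recent llama.cpp build and the Q4_K_M file from the download step above), you can chat interactively with llama-cli or serve the model over an OpenAI-compatible HTTP API with llama-server:

```bash
# Interactive chat; conversation mode applies the model's built-in chat template.
./llama-cli -m google_gemma-3-1b-it-qat-Q4_K_M.gguf -cnv

# Serve an OpenAI-compatible API on port 8080.
./llama-server -m google_gemma-3-1b-it-qat-Q4_K_M.gguf --port 8080
```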
📚 Documentation
Downloadable Files
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method but with the embedding and output weights quantized to Q8_0 instead of their usual default.
ARM/AVX Information
Previously, you would download Q4_0_4_4/4_8/8_8 files, whose weights were interleaved in memory to improve performance on ARM and AVX machines. Now there is "online repacking" of weights, detailed in this PR: if you use Q4_0 and your hardware would benefit from repacked weights, it is done automatically on the fly.
As of llama.cpp build b4282, you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.
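If you want to check whether Q4_0 with online repacking actually helps on your hardware, one option is a quick side-by-side run with llama-bench (a sketch; the Q4_0 file name assumes the repo's usual naming pattern):

```bash
# Benchmark prompt processing and token generation for two quant files.
./llama-bench -m google_gemma-3-1b-it-qat-Q4_0.gguf
./llama-bench -m google_gemma-3-1b-it-qat-Q4_K_M.gguf
```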
Which File to Choose?
A great write-up with charts comparing the performance of the various quantization types is provided by Artefact2 here. The first thing to figure out is how big a model you can run, which comes down to how much RAM and/or VRAM you have available.
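As a quick starting point on Linux (other platforms have equivalents), you can check how much system RAM and GPU VRAM you have; as a rough rule of thumb, pick a quant whose file size is a GB or two smaller than that, so there is room left for the context/KV cache:

```bash
# Total and available system RAM.
free -h

# Total and used VRAM per NVIDIA GPU, if present.
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
```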
🔧 Technical Details
The quantizations were created using llama.cpp release b5147. All quants are made using the imatrix option with a calibration dataset from here.
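For reference, the general imatrix quantization workflow in llama.cpp looks roughly like the sketch below; the file names are placeholders, and this is not necessarily the exact invocation or calibration data used for this repo:

```bash
# 1. Compute an importance matrix from a calibration text file.
./llama-imatrix -m gemma-3-1b-it-qat-f16.gguf -f calibration.txt -o imatrix.dat

# 2. Quantize the full-precision GGUF using that importance matrix.
./llama-quantize --imatrix imatrix.dat gemma-3-1b-it-qat-f16.gguf gemma-3-1b-it-qat-Q4_K_M.gguf Q4_K_M
```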
📄 License
This project is licensed under the Gemma license.
⚠️ Important Note
To access Gemma on Hugging Face, you need to review and agree to Google’s usage license.
💡 Usage Tip
If you want slightly better quality, you can use IQ4_NL (thanks to this PR), which will also repack the weights for ARM, though currently only the 4_4 variant. Loading may be slower, but it will result in an overall speed increase.
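Assuming the IQ4_NL file follows the same naming pattern as the other quants in this repo, it can be downloaded the same way as any other single file:

```bash
# File name is assumed from the repo's usual naming convention.
huggingface-cli download bartowski/google_gemma-3-1b-it-qat-GGUF \
  --include "google_gemma-3-1b-it-qat-IQ4_NL.gguf" --local-dir ./
```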