🚀 Llamacpp imatrix Quantizations of gemma-3-27b-it-qat by google
These are quantized versions of Google's gemma-3-27b-it-qat model, derived from the QAT (quantization-aware training) weights provided by Google. Only the Q4_0 quantization is expected to show improved performance from the QAT weights; the other quantization types were created for experimentation.
✨ Features
- Multiple Quantization Types: Offers a wide range of quantization types (e.g., Q4_0, Q6_K, Q5_K, etc.) to suit different hardware and performance requirements.
- Online Repacking: Q4_0 supports online repacking for ARM and AVX CPU inference, improving performance on compatible hardware.
- Easy to Use: Can be run in LM Studio or directly with llama.cpp and other llama.cpp-based projects.
📦 Installation
Prerequisites
First, make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
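If you want to confirm the CLI is available on your PATH before downloading anything, you can ask it for its help text:
huggingface-cli --help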
Downloading a Specific File
You can target the specific file you want:
huggingface-cli download bartowski/google_gemma-3-27b-it-qat-GGUF --include "google_gemma-3-27b-it-qat-Q4_K_M.gguf" --local-dir ./
Downloading Split Files
If the model is bigger than 50GB and has been split into multiple files, run:
huggingface-cli download bartowski/google_gemma-3-27b-it-qat-GGUF --include "google_gemma-3-27b-it-qat-Q8_0/*" --local-dir ./
You can either specify a new local-dir (e.g., google_gemma-3-27b-it-qat-Q8_0) or download them all in place (./).
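A recent llama.cpp build can usually load a split model by pointing it at the first shard, so merging is optional. If you do want a single file, llama.cpp ships a llama-gguf-split tool; the shard filename below is a placeholder, so substitute the first part you actually downloaded:
./llama-gguf-split --merge ./google_gemma-3-27b-it-qat-Q8_0/<first-shard>.gguf ./google_gemma-3-27b-it-qat-Q8_0.gguf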
💻 Usage Examples
Running in LM Studio
You can run the quantized models in LM Studio.
Running with llama.cpp
Run the models directly with llama.cpp, or any other llama.cpp-based project.
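For example, a minimal llama-cli invocation might look like the following (the filename and flags are illustrative; adjust -ngl to how many layers fit on your GPU and -c to the context size you need):
./llama-cli -m ./google_gemma-3-27b-it-qat-Q4_K_M.gguf -ngl 99 -c 8192 -p "Why is the sky blue?" -n 256
The same file can also be served over HTTP with llama-server, e.g.:
./llama-server -m ./google_gemma-3-27b-it-qat-Q4_K_M.gguf -c 8192 --port 8080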
Prompt Format
<bos><start_of_turn>user
{system_prompt}
{prompt}<end_of_turn>
<start_of_turn>model
<end_of_turn>
<start_of_turn>model
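As a concrete illustration, with a system prompt of "You are a concise assistant." and a user prompt of "Why is the sky blue?", the rendered prompt up to the point where the model's reply begins would look like:
<bos><start_of_turn>user
You are a concise assistant.
Why is the sky blue?<end_of_turn>
<start_of_turn>model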
📚 Documentation
Model Quantization Details
These quantizations are derived from the QAT weights provided by Google. Some of the quants (e.g., Q3_K_XL, Q4_K_L) use Q8_0 for the embedding and output weights instead of the default quantization.
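As a rough sketch of how such a variant is typically produced with llama.cpp's llama-quantize (filenames are placeholders, and this is not necessarily the exact command used for these files):
./llama-quantize --imatrix imatrix.dat --token-embedding-type Q8_0 --output-tensor-type Q8_0 ./gemma-3-27b-it-qat-f16.gguf ./gemma-3-27b-it-qat-Q4_K_L.gguf Q4_K_M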
Downloadable Files
| Filename | Quant type | File Size | Split | Description |
| --- | --- | --- | --- | --- |
| gemma-3-27b-it-qat-Q4_0.gguf | Q4_0 | 15.62GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| gemma-3-27b-it-qat-Q8_0.gguf | Q8_0 | 28.71GB | false | Extremely high quality, generally unneeded but max available quant. |
| ... (other files as in the original) | ... | ... | ... | ... |
ARM/AVX Information
Previously, Q4_0_4_4/4_8/8_8 were used with interleaved weights for better performance on ARM and AVX machines. Now, "online repacking" is available for Q4_0 weights. As of llama.cpp build b4282, Q4_0_X_X files are no longer supported, and Q4_0 should be used instead. Additionally, IQ4_NL can provide slightly better quality and also repacks weights for ARM.
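If you want to see what repacking does for you in practice, llama.cpp includes a benchmarking tool; something like the following reports prompt-processing and generation speed for a given file (the path is illustrative):
./llama-bench -m ./google_gemma-3-27b-it-qat-Q4_0.gguf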
Which File to Choose
A detailed guide on choosing the right quantization file is provided. Consider your available RAM/VRAM and your desired balance of speed and quality. You can choose between 'K-quants' (e.g., QX_K_X) and 'I-quants' (e.g., IQX_X). For more information, refer to the llama.cpp feature matrix.
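As a rough rule of thumb, pick a quant whose file size is a couple of GB smaller than your available RAM/VRAM so there is headroom for the KV cache; on Linux you can check what you have with commands like:
free -h
nvidia-smi --query-gpu=memory.total --format=csv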
🔧 Technical Details
The quantizations are based on the QAT weights from Google. The llama.cpp release b5147 is used for quantization, and all quants are made using the imatrix option with a dataset from here.
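As a rough sketch of that pipeline using llama.cpp's tools (filenames and the calibration file are placeholders; this is not necessarily the exact invocation used for these quants):
./llama-imatrix -m ./gemma-3-27b-it-qat-f16.gguf -f calibration_data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat ./gemma-3-27b-it-qat-f16.gguf ./gemma-3-27b-it-qat-Q4_K_M.gguf Q4_K_M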
📄 License
The model is licensed under the gemma license. To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click the "Acknowledge license" button. Requests are processed immediately.
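If a download is gated behind that acknowledgement, authenticate the CLI with your Hugging Face access token first:
huggingface-cli login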
Credits
- Thank you to kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
- Thank you ZeroWw for the inspiration to experiment with embed/output.
- Thank you to LM Studio for sponsoring the work.
If you want to support the work, visit the ko-fi page: https://ko-fi.com/bartowski