🚀 Llamacpp imatrix Quantizations of gemma-3-27b-it-qat by google
These are quantized versions of Google's gemma-3-27b-it-qat model, derived from the QAT (quantization-aware training) weights provided by Google. Only the Q4_0 quantization is expected to show improved performance from the QAT weights; the other quantization types were created for experimentation.
✨ Features
- Multiple Quantization Types: Offers a wide range of quantization types (e.g., Q4_0, Q6_K, Q5_K, etc.) to suit different hardware and performance requirements.
- Online Repacking: Q4_0 supports online repacking for ARM and AVX CPU inference, improving performance on compatible hardware.
- Easy to Use: Can be run in LM Studio or directly with llama.cpp and other llama.cpp-based projects.
📦 Installation
Prerequisites
First, make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
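If you want to confirm the CLI is available on your PATH before downloading anything, you can ask it for its help text:
huggingface-cli --help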
Downloading a Specific File
You can target the specific file you want:
huggingface-cli download bartowski/google_gemma-3-27b-it-qat-GGUF --include "google_gemma-3-27b-it-qat-Q4_K_M.gguf" --local-dir ./
Downloading Split Files
If the model is bigger than 50GB and has been split into multiple files, run:
huggingface-cli download bartowski/google_gemma-3-27b-it-qat-GGUF --include "google_gemma-3-27b-it-qat-Q8_0/*" --local-dir ./
You can either specify a new local-dir (e.g., google_gemma-3-27b-it-qat-Q8_0) or download them all in place (./).
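A recent llama.cpp build can usually load a split model by pointing it at the first shard, so merging is optional. If you do want a single file, llama.cpp ships a llama-gguf-split tool; the shard filename below is a placeholder, so substitute the first part you actually downloaded:
./llama-gguf-split --merge ./google_gemma-3-27b-it-qat-Q8_0/<first-shard>.gguf ./google_gemma-3-27b-it-qat-Q8_0.gguf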
💻 Usage Examples
Running in LM Studio
You can run the quantized models in LM Studio.
Running with llama.cpp
Run the models directly with llama.cpp, or any other llama.cpp-based project.
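For example, a minimal llama-cli invocation might look like the following (the filename and flags are illustrative; adjust -ngl to how many layers fit on your GPU and -c to the context size you need):
./llama-cli -m ./google_gemma-3-27b-it-qat-Q4_K_M.gguf -ngl 99 -c 8192 -p "Why is the sky blue?" -n 256
The same file can also be served over HTTP with llama-server, e.g.:
./llama-server -m ./google_gemma-3-27b-it-qat-Q4_K_M.gguf -c 8192 --port 8080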
Prompt Format
<bos><start_of_turn>user
{system_prompt}
{prompt}<end_of_turn>
<start_of_turn>model
<end_of_turn>
<start_of_turn>model
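As a concrete illustration, with a system prompt of "You are a concise assistant." and a user prompt of "Why is the sky blue?", the rendered prompt up to the point where the model's reply begins would look like:
<bos><start_of_turn>user
You are a concise assistant.
Why is the sky blue?<end_of_turn>
<start_of_turn>model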
📚 Documentation
Model Quantization Details
These quantizations are derived from the QAT weights provided by Google. Some of the quants (e.g., Q3_K_XL, Q4_K_L) use Q8_0 for the embedding and output weights instead of the default quantization.
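As a rough sketch of how such a variant is typically produced with llama.cpp's llama-quantize (filenames are placeholders, and this is not necessarily the exact command used for these files):
./llama-quantize --imatrix imatrix.dat --token-embedding-type Q8_0 --output-tensor-type Q8_0 ./gemma-3-27b-it-qat-f16.gguf ./gemma-3-27b-it-qat-Q4_K_L.gguf Q4_K_M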
Downloadable Files
| Filename | Quant type | File Size | Split | Description |
| --- | --- | --- | --- | --- |
| gemma-3-27b-it-qat-Q4_0.gguf | Q4_0 | 15.62GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| gemma-3-27b-it-qat-Q8_0.gguf | Q8_0 | 28.71GB | false | Extremely high quality, generally unneeded but max available quant. |
| ... (other files as in the original) | ... | ... | ... | ... |
ARM/AVX Information
Previously, Q4_0_4_4/4_8/8_8 were used with interleaved weights for better performance on ARM and AVX machines. Now, "online repacking" is available for Q4_0 weights. As of llama.cpp build b4282, Q4_0_X_X files are no longer supported, and Q4_0 should be used instead. Additionally, IQ4_NL can provide slightly better quality and also repacks weights for ARM.
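If you want to see what repacking does for you in practice, llama.cpp includes a benchmarking tool; something like the following reports prompt-processing and generation speed for a given file (the path is illustrative):
./llama-bench -m ./google_gemma-3-27b-it-qat-Q4_0.gguf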
Which File to Choose
A detailed guide on choosing the right quantization file is provided. Consider your available RAM/VRAM and your desired balance of speed and quality. You can choose between 'K-quants' (e.g., QX_K_X) and 'I-quants' (e.g., IQX_X). For more information, refer to the llama.cpp feature matrix.
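As a rough rule of thumb, pick a quant whose file size is a couple of GB smaller than your available RAM/VRAM so there is headroom for the KV cache; on Linux you can check what you have with commands like:
free -h
nvidia-smi --query-gpu=memory.total --format=csv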
🔧 Technical Details
The quantizations are based on the QAT weights from Google. The llama.cpp release b5147 is used for quantization, and all quants are made using the imatrix option with a dataset from here.
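As a rough sketch of that pipeline using llama.cpp's tools (filenames and the calibration file are placeholders; this is not necessarily the exact invocation used for these quants):
./llama-imatrix -m ./gemma-3-27b-it-qat-f16.gguf -f calibration_data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat ./gemma-3-27b-it-qat-f16.gguf ./gemma-3-27b-it-qat-Q4_K_M.gguf Q4_K_M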
📄 License
The model is licensed under the gemma license. To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click the "Acknowledge license" button. Requests are processed immediately.
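If a download is gated behind that acknowledgement, authenticate the CLI with your Hugging Face access token first:
huggingface-cli login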
Credits
- Thank you to kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
- Thank you ZeroWw for the inspiration to experiment with embed/output.
- Thank you to LM Studio for sponsoring the work.
If you want to support the work, visit the ko-fi page: https://ko-fi.com/bartowski