🚀 Llamacpp imatrix Quantizations of gemma-3-4b-it-qat by google
These quantizations are derived from the QAT (quantization-aware training) weights provided by Google. They offer different levels of quality and performance for various use cases.
🚀 Quick Start
Run in LM Studio
You can run these quantized models in LM Studio.
Run with llama.cpp
Run them directly with llama.cpp, or any other llama.cpp-based project.
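As a minimal sketch (mirroring the Python usage examples further below), you can invoke the llama-cli binary from Python; the binary path, model filename, prompt, and token count here are assumptions — adjust them to your own build and download location:

```python
import subprocess

# Assumes llama.cpp has been built and the Q4_K_M GGUF was downloaded
# into the current directory (see the Installation section below).
subprocess.run([
    "./llama-cli",                                 # path to your llama.cpp binary (assumption)
    "-m", "google_gemma-3-4b-it-qat-Q4_K_M.gguf",  # model file
    "-p", "Why is the sky blue?",                  # prompt
    "-n", "128",                                   # max tokens to generate
])
```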
Prompt Format
```
<bos><start_of_turn>user
{system_prompt}
{prompt}<end_of_turn>
<start_of_turn>model
```
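As a small illustration, the template above can be filled in by hand when sending raw prompts; note that LM Studio and llama.cpp's chat/conversation mode apply the model's built-in chat template automatically, so manual formatting like this is only needed for raw-prompt workflows (the example values are placeholders, not part of the model card):

```python
# Fill in the Gemma turn markers by hand for a raw (non-chat-mode) prompt.
system_prompt = "You are a concise assistant."
prompt = "Summarize what online repacking does."

formatted = (
    "<bos><start_of_turn>user\n"
    f"{system_prompt}\n"
    f"{prompt}<end_of_turn>\n"
    "<start_of_turn>model\n"
)
print(formatted)
```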
✨ Features
- Multiple Quantization Types: A wide range of quantization types (Q4_0, Q8_0, Q6_K_L, etc.) to meet different quality and performance requirements.
- Online Repacking: Some quantizations, such as Q4_0, support online repacking, which automatically optimizes weights for ARM and AVX CPU inference.
- Flexible Deployment: Can be run in LM Studio or directly with llama.cpp and other llama.cpp-based projects.
📦 Installation
Prerequisites
Make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```
Download a Specific File
huggingface-cli download bartowski/google_gemma-3-4b-it-qat-GGUF --include "google_gemma-3-4b-it-qat-Q4_K_M.gguf" --local-dir ./
Download Split Files
If the model is bigger than 50GB and split into multiple files, run:
huggingface-cli download bartowski/google_gemma-3-4b-it-qat-GGUF --include "google_gemma-3-4b-it-qat-Q8_0/*" --local-dir ./
You can either specify a new local-dir (e.g., `google_gemma-3-4b-it-qat-Q8_0`) or download them all in place (`./`).
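Once downloaded, the parts do not need to be merged by hand: pointing llama.cpp at the first shard is enough, since the remaining shards are loaded automatically. The sketch below uses an illustrative shard name (this 4B model's quants currently fit in single files, so no splits actually exist in this repo):

```python
import subprocess

# Hypothetical split layout; real shard names follow the
# <name>-00001-of-0000N.gguf pattern produced by llama-gguf-split.
subprocess.run([
    "./llama-cli",
    "-m", "google_gemma-3-4b-it-qat-Q8_0/google_gemma-3-4b-it-qat-Q8_0-00001-of-00002.gguf",
    "-p", "Hello",
])
```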
💻 Usage Examples
Downloading Files
```python
import os

os.system('huggingface-cli download bartowski/google_gemma-3-4b-it-qat-GGUF --include "google_gemma-3-4b-it-qat-Q4_K_M.gguf" --local-dir ./')
os.system('huggingface-cli download bartowski/google_gemma-3-4b-it-qat-GGUF --include "google_gemma-3-4b-it-qat-Q8_0/*" --local-dir ./')
```
📚 Documentation
Quantization Files Information
| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| gemma-3-4b-it-qat-bf16.gguf | bf16 | 7.77GB | false | Full BF16 weights. |
| gemma-3-4b-it-qat-Q8_0.gguf | Q8_0 | 4.13GB | false | Extremely high quality, generally unneeded but max available quant. |
| gemma-3-4b-it-qat-Q6_K_L.gguf | Q6_K_L | 3.35GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| gemma-3-4b-it-qat-Q6_K.gguf | Q6_K | 3.19GB | false | Very high quality, near perfect, recommended. |
| gemma-3-4b-it-qat-Q5_K_L.gguf | Q5_K_L | 2.99GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| gemma-3-4b-it-qat-Q5_K_M.gguf | Q5_K_M | 2.83GB | false | High quality, recommended. |
| gemma-3-4b-it-qat-Q5_K_S.gguf | Q5_K_S | 2.76GB | false | High quality, recommended. |
| gemma-3-4b-it-qat-Q4_K_L.gguf | Q4_K_L | 2.65GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| gemma-3-4b-it-qat-Q4_1.gguf | Q4_1 | 2.56GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| gemma-3-4b-it-qat-Q4_K_M.gguf | Q4_K_M | 2.49GB | false | Good quality, default size for most use cases, recommended. |
| gemma-3-4b-it-qat-Q3_K_XL.gguf | Q3_K_XL | 2.40GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| gemma-3-4b-it-qat-Q4_K_S.gguf | Q4_K_S | 2.38GB | false | Slightly lower quality with more space savings, recommended. |
| gemma-3-4b-it-qat-Q4_0.gguf | Q4_0 | 2.37GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| gemma-3-4b-it-qat-IQ4_NL.gguf | IQ4_NL | 2.36GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| gemma-3-4b-it-qat-IQ4_XS.gguf | IQ4_XS | 2.26GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| gemma-3-4b-it-qat-Q3_K_L.gguf | Q3_K_L | 2.24GB | false | Lower quality but usable, good for low RAM availability. |
| gemma-3-4b-it-qat-Q3_K_M.gguf | Q3_K_M | 2.10GB | false | Low quality. |
| gemma-3-4b-it-qat-IQ3_M.gguf | IQ3_M | 1.99GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| gemma-3-4b-it-qat-Q3_K_S.gguf | Q3_K_S | 1.94GB | false | Low quality, not recommended. |
| gemma-3-4b-it-qat-Q2_K_L.gguf | Q2_K_L | 1.89GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| gemma-3-4b-it-qat-IQ3_XS.gguf | IQ3_XS | 1.86GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| gemma-3-4b-it-qat-Q2_K.gguf | Q2_K | 1.73GB | false | Very low quality but surprisingly usable. |
| gemma-3-4b-it-qat-IQ3_XXS.gguf | IQ3_XXS | 1.69GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| gemma-3-4b-it-qat-IQ2_M.gguf | IQ2_M | 1.54GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method but with the embeddings and output weights quantized to Q8_0 instead of their usual default.
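For reference, a variant of this kind could be produced with llama.cpp's llama-quantize tool by overriding the tensor types for the token embedding and output tensors. The sketch below is an assumption-laden illustration, not the exact command used for this repo; the flag names and file paths should be checked against `llama-quantize --help` for your build:

```python
import subprocess

# Sketch only: producing a "_L" style quant (Q8_0 embed/output tensors)
# from the BF16 weights. Flags and paths are assumptions.
subprocess.run([
    "./llama-quantize",
    "--imatrix", "imatrix.dat",          # importance matrix (see Technical Details below)
    "--token-embedding-type", "q8_0",    # keep token embeddings at Q8_0
    "--output-tensor-type", "q8_0",      # keep the output tensor at Q8_0
    "gemma-3-4b-it-qat-bf16.gguf",       # input model
    "gemma-3-4b-it-qat-Q4_K_L.gguf",     # output model
    "Q4_K_M",                            # base quant type for the remaining tensors
])
```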
ARM/AVX Information
Previously, Q4_0_4_4/4_8/8_8 were used, which interleaved weights in memory for better performance on ARM and AVX machines. Now, some quantizations like Q4_0 support online repacking. As of llama.cpp build b4282, you should use Q4_0 instead of Q4_0_X_X.
Additionally, IQ4_NL can provide slightly better quality and also repacks weights for ARM, though currently only for the 4_4 case. Loading may be slower, but it results in an overall speed increase.
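If you want to see what repacking buys you on your own CPU, llama.cpp's llama-bench tool can compare throughput between, say, Q4_0 and IQ4_NL. A rough sketch (binary path, file names, and thread count are assumptions):

```python
import subprocess

# Compare CPU throughput of two quants; requires both files to be downloaded.
for model in ("google_gemma-3-4b-it-qat-Q4_0.gguf",
              "google_gemma-3-4b-it-qat-IQ4_NL.gguf"):
    subprocess.run(["./llama-bench", "-m", model, "-t", "8"])  # -t: CPU threads
```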
Which File to Choose
A great write-up with charts comparing the performance of the various quant types is provided by Artefact2 here. The first thing to figure out is...
🔧 Technical Details
Quantization Process
These quantizations are made using the imatrix option with a dataset from here, with llama.cpp release b5147 used for quantization.
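For context, an importance matrix is generated by running llama.cpp's llama-imatrix tool over a calibration text before quantizing. The sketch below uses placeholder file names (not the actual dataset used for this repo); verify flags against your llama.cpp build:

```python
import subprocess

# Generate an importance matrix from a calibration text; the result can be
# passed to llama-quantize via --imatrix (see the sketch above).
subprocess.run([
    "./llama-imatrix",
    "-m", "gemma-3-4b-it-qat-bf16.gguf",   # high-precision input model
    "-f", "calibration.txt",               # placeholder calibration dataset
    "-o", "imatrix.dat",                   # output importance matrix
])
```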
Online Repacking
Online repacking is a feature introduced in this PR. It can automatically optimize weights for ARM and AVX CPU inference.
📄 License
The license for this project is gemma. To access Gemma on Hugging Face, you're required to review and agree to Google's usage license. To do this, please ensure you're logged in to Hugging Face and acknowledge the license on the model page; requests are processed immediately.
| Property | Details |
| -------- | ------- |
| Quantized By | bartowski |
| Pipeline Tag | image-text-to-text |
| Tags | gemma3, gemma, google |
| License | gemma |
| Base Model | https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-unquantized |
| Base Model Relation | quantized |
⚠️ Important Note
As of llama.cpp build b4282, you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.
💡 Usage Tip
If you want slightly better quality on ARM, you can use IQ4_NL thanks to this PR; loading may be slower, but it will result in an overall speed increase.