🚀 Llamacpp imatrix Quantizations of gemma-3-1b-it-qat by google
This project provides quantized versions of Google's gemma-3-1b-it-qat model. These quantizations are derived from the QAT (quantization-aware training) weights released by Google. Only the Q4_0 quantization is expected to be better than its non-QAT equivalent, but other quantization types are also provided to explore different performance and quality trade-offs.
Key Information
- Quantized By: bartowski
- Pipeline Tag: text-generation
- Tags: gemma3, gemma, google
- License: gemma
- Base Model: google/gemma-3-1b-it-qat-q4_0-unquantized
Access Information
To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click the button below. Requests are processed immediately.
✨ Features
- Multiple Quantization Types: Offers a wide range of quantization types (e.g., Q4_0, Q8_0, Q6_K_L, etc.) to meet different performance and quality requirements.
- Online Repacking: Some quantizations (e.g., Q4_0) support online repacking for ARM and AVX CPU inference, which can improve performance.
- Compatibility: Can be run in LM Studio, directly with llama.cpp, or with any other llama.cpp-based project.
📦 Installation
Prerequisites
Make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
Downloading a Specific File
To download a specific file, use the following command:
huggingface-cli download bartowski/google_gemma-3-1b-it-qat-GGUF --include "google_gemma-3-1b-it-qat-Q4_K_M.gguf" --local-dir ./
Downloading Split Files
If the model is bigger than 50GB, it will have been split into multiple files. Use the following command to download them all to a local folder:
huggingface-cli download bartowski/google_gemma-3-1b-it-qat-GGUF --include "google_gemma-3-1b-it-qat-Q8_0/*" --local-dir ./
💻 Usage Examples
Prompt Format
<bos><start_of_turn>user
{system_prompt}
{prompt}<end_of_turn>
<start_of_turn>model
<end_of_turn>
<start_of_turn>model
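As a minimal sketch (assuming you have downloaded the Q4_K_M file as shown above and have a llama.cpp build available), the filled-in template can be passed to llama-cli as a raw prompt. Note that llama.cpp normally prepends <bos> itself, so it is omitted from the prompt string here:

```bash
# One-shot generation with the Gemma chat template filled in manually.
# <bos> is left out because llama-cli usually adds it automatically.
./llama-cli -m google_gemma-3-1b-it-qat-Q4_K_M.gguf -n 128 \
  -p "<start_of_turn>user
You are a helpful assistant.
Write a haiku about quantization.<end_of_turn>
<start_of_turn>model
"
```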
Running in LM Studio
You can run the quantized models in LM Studio.
Running with llama.cpp
You can also run the models directly with llama.cpp or any other llama.cpp-based project.
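For example (a sketch, assuming a recent llama.cpp build and the Q4_K_M file from the download step above), you can chat interactively with llama-cli or serve the model over an OpenAI-compatible HTTP API with llama-server:

```bash
# Interactive chat; conversation mode applies the model's built-in chat template.
./llama-cli -m google_gemma-3-1b-it-qat-Q4_K_M.gguf -cnv

# Serve an OpenAI-compatible API on port 8080.
./llama-server -m google_gemma-3-1b-it-qat-Q4_K_M.gguf --port 8080
```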
📚 Documentation
Downloadable Files
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method but with the embedding and output weights quantized to Q8_0 instead of their usual default.
ARM/AVX Information
Previously, you would download Q4_0_4_4/4_8/8_8 files, whose weights were interleaved in memory to improve performance on ARM and AVX machines. Now there is "online repacking" of weights, detailed in this PR: if you use Q4_0 and your hardware would benefit from repacked weights, it is done automatically on the fly.
As of llama.cpp build b4282, you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.
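If you want to check whether Q4_0 with online repacking actually helps on your hardware, one option is a quick side-by-side run with llama-bench (a sketch; the Q4_0 file name assumes the repo's usual naming pattern):

```bash
# Benchmark prompt processing and token generation for two quant files.
./llama-bench -m google_gemma-3-1b-it-qat-Q4_0.gguf
./llama-bench -m google_gemma-3-1b-it-qat-Q4_K_M.gguf
```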
Which File to Choose?
A great write-up with charts comparing the performance of the various quantization types is provided by Artefact2 here. The first thing to figure out is how big a model you can run, which comes down to how much RAM and/or VRAM you have available.
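As a quick starting point on Linux (other platforms have equivalents), you can check how much system RAM and GPU VRAM you have; as a rough rule of thumb, pick a quant whose file size is a GB or two smaller than that, so there is room left for the context/KV cache:

```bash
# Total and available system RAM.
free -h

# Total and used VRAM per NVIDIA GPU, if present.
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
```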
🔧 Technical Details
The quantizations were created using llama.cpp release b5147. All quants are made using the imatrix option with a calibration dataset from here.
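For reference, the general imatrix quantization workflow in llama.cpp looks roughly like the sketch below; the file names are placeholders, and this is not necessarily the exact invocation or calibration data used for this repo:

```bash
# 1. Compute an importance matrix from a calibration text file.
./llama-imatrix -m gemma-3-1b-it-qat-f16.gguf -f calibration.txt -o imatrix.dat

# 2. Quantize the full-precision GGUF using that importance matrix.
./llama-quantize --imatrix imatrix.dat gemma-3-1b-it-qat-f16.gguf gemma-3-1b-it-qat-Q4_K_M.gguf Q4_K_M
```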
📄 License
This project is licensed under the Gemma license.
⚠️ Important Note
To access Gemma on Hugging Face, you need to review and agree to Google’s usage license.
💡 Usage Tip
If you want slightly better quality, you can use IQ4_NL (thanks to this PR), which will also repack the weights for ARM, though currently only the 4_4 variant. Loading may be slower, but it will result in an overall speed increase.
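Assuming the IQ4_NL file follows the same naming pattern as the other quants in this repo, it can be downloaded the same way as any other single file:

```bash
# File name is assumed from the repo's usual naming convention.
huggingface-cli download bartowski/google_gemma-3-1b-it-qat-GGUF \
  --include "google_gemma-3-1b-it-qat-IQ4_NL.gguf" --local-dir ./
```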