🚀 Gemma 3 model card
Gemma 3 is a multimodal model from Google, capable of handling text and image input and generating text output. It offers a large context window and multilingual support, and is suitable for a variety of text generation and image understanding tasks.
Model Page: Gemma
⚠️ Important Note
This repository corresponds to the 12B instruction-tuned version of the Gemma 3 model in GGUF format, quantized to Q4_0 using Quantization Aware Training (QAT). Thanks to QAT, the model preserves quality similar to `bfloat16` while significantly reducing the memory required to load it. You can find the half-precision version here.
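As a rough sketch of what that saving means for this 12B checkpoint (assuming ~4.5 bits per weight for Q4_0 once per-block scales are counted; actual GGUF files also carry metadata, so exact sizes differ):

```sh
# Back-of-the-envelope memory estimate; numbers are illustrative only.
awk 'BEGIN {
  p = 12e9                                            # ~12B parameters
  printf "bfloat16: ~%.1f GiB\n", p * 16  / 8 / 2^30  # 16 bits/weight
  printf "Q4_0:     ~%.1f GiB\n", p * 4.5 / 8 / 2^30  # ~4.5 bits/weight incl. scales
}'
```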
Resources and Technical Documentation:
- Gemma 3 Technical Report
- Responsible Generative AI Toolkit
- Gemma on Kaggle
- Gemma on Vertex Model Garden
Terms of Use: Terms
Authors: Google DeepMind
✨ Features
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained and instruction-tuned variants. Key features include:
- A large, 128K-token context window.
- Multilingual support in over 140 languages.
- Availability in more sizes than previous versions.
- Suitability for a variety of text generation and image understanding tasks, such as question answering, summarization, and reasoning.
- Ability to be deployed in environments with limited resources like laptops, desktops, or personal cloud infrastructure.
📦 Installation
To access Gemma on Hugging Face, you're required to review and agree to Google's usage license. Make sure you're logged in to Hugging Face, then accept the license on the model page; requests are processed immediately.
Property | Details |
---|---|
Model Type | gemma |
Training Data | A diverse dataset including web documents, code, mathematics, and images |
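Because the repository is gated, downloads must be authenticated. A minimal sketch using the Hugging Face CLI (accept the license first; check each tool's docs for how it picks up the token):

```sh
huggingface-cli login           # stores an access token locally
export HF_TOKEN="<your token>"  # many tools also read the token from this variable
```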
💻 Usage Examples
Basic Usage
llama.cpp (text-only)
```sh
./llama-cli -hf google/gemma-3-12b-it-qat-q4_0-gguf -p "Write a poem about the Kraken."
```
llama.cpp (image input)
```sh
wget https://github.com/bebechien/gemma/blob/main/surprise.png?raw=true -O ~/Downloads/surprise.png
./llama-gemma3-cli -hf google/gemma-3-12b-it-qat-q4_0-gguf -p "Describe this image." --image ~/Downloads/surprise.png
```
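llama.cpp also ships llama-server, an OpenAI-compatible HTTP server. Assuming a recent build in which the `-hf` download flag is available, serving this repo locally looks roughly like this sketch (not an officially documented command for this model):

```sh
./llama-server -hf google/gemma-3-12b-it-qat-q4_0-gguf --port 8080
```

Once running, any OpenAI-compatible client should be able to point at http://localhost:8080.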
ollama (text only)
Ollama does not currently support image inputs for GGUFs pulled from Hugging Face. Since this is a gated repository, please check the docs on running gated repositories.
```sh
ollama run hf.co/google/gemma-3-12b-it-qat-q4_0-gguf
```
📚 Documentation
Model Information
Description
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. They are well-suited for a variety of text generation and image understanding tasks, and their relatively small size allows for deployment in resource-limited environments.
Inputs and outputs
- Input:
- Text string, such as a question, a prompt, or a document to be summarized.
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each.
- Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size (see the sketch after this list for what this budget allows).
- Output:
- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document.
- Total output context of 8192 tokens.
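These budgets make the token arithmetic easy to check; the sketch below assumes a hypothetical 1,024-token text prompt, purely for illustration:

```sh
awk 'BEGIN {
  ctx   = 131072  # 128K-token input context (4B, 12B, and 27B sizes)
  image = 256     # tokens per 896 x 896 image
  text  = 1024    # hypothetical text-prompt budget
  printf "Images that fit alongside the prompt: %d\n", (ctx - text) / image
}'
```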
Model Data
Training Dataset
These models were trained on a diverse dataset of text data from various sources. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, and the 1B model with 2 trillion tokens. Key components include:
- Web Documents: A diverse collection of web text in over 140 languages, exposing the model to a broad range of linguistic styles, topics, and vocabulary.
- Code: Helps the model learn programming language syntax and patterns, improving its code generation and understanding abilities.
- Mathematics: Enables the model to learn logical reasoning, symbolic representation, and address mathematical queries.
- Images: Allows the model to perform image analysis and visual data extraction tasks.
Data Preprocessing
Key data cleaning and filtering methods applied to the training data include:
- CSAM Filtering: Rigorous filtering at multiple stages to exclude harmful and illegal content.
- Sensitive Data Filtering: Automated techniques to filter out personal information and other sensitive data.
- Additional methods: Filtering based on content quality and safety in line with our policies.
Implementation Information
Hardware
Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p, and TPUv5e). TPUs offer several advantages for training vision-language models:
- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs, speeding up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing them to handle large models and batch sizes during training, which can lead to better model quality.
- Scalability: TPU Pods provide a scalable solution for handling large foundation models, enabling distributed training across multiple devices.
- Cost-effectiveness: In many scenarios, TPUs can offer a more cost-effective solution for training large models, especially considering the time and resources saved.
- Sustainability: These advantages are aligned with Google's commitments to operate sustainably.
Software
Training was done using JAX and ML Pathways. JAX allows for fast, efficient training of large models on the latest hardware, including TPUs. ML Pathways is Google's effort to build artificially intelligent systems capable of generalizing across multiple tasks, which makes it well suited to foundation models such as these.
Evaluation
Benchmark Results
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:
Reasoning and factuality
Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|---|
HellaSwag | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
BoolQ | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
PIQA | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
SocialIQA | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
TriviaQA | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
Natural Questions | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
ARC-c | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
ARC-e | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
WinoGrande | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
BIG-Bench Hard | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
DROP | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |
STEM and code
Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
MMLU | 5-shot | 59.6 | 74.5 | 78.6 |
MMLU (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
AGIEval | 3-5-shot | 42.1 | 57.4 | 66.2 |
MATH | 4-shot | 24.2 | 43.3 | 50.0 |
GSM8K | 8-shot | 38.4 | 71.0 | 82.6 |
GPQA | 5-shot | 15.0 | 25.4 | 24.3 |
MBPP | 3-shot | 46.0 | 60.4 | 65.6 |
HumanEval | 0-shot | 36.0 | 45.7 | 48.8 |
Multilingual
Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
MGSM | 2.04 | 34.7 | 64.3 | 74.3 |
Global-MMLU-Lite | 24.9 | 57.0 | 69.4 | 75.7 |
WMT24++ (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
FloRes | 29.5 | 39.2 | 46.0 | 48.8 |
XQuAD (all) | 43.9 | 68.0 | 74.5 | 76.8 |
ECLeKTic | 4.69 | 11.0 | 17.2 | 24.4 |
IndicGenBench | 41.4 | 57.2 | 61.7 | 63.4 |
Multimodal
Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|
COCOcap | 102 | 111 | 116 |
DocVQA (val) | 72.8 | 82.3 | 85.6 |
InfoVQA (val) | 44.1 | 54.8 | 59.4 |
MMMU (pt) | 39.2 | 50.3 | 56.1 |
TextVQA (val) | 58.9 | 66.5 | 68.6 |
RealWorldQA | 45.5 | 52.2 | 53.9 |
ReMI | 27.3 | 38.5 | 44.8 |
AI2D | 63.2 | 75.2 | 79.0 |
ChartQA | 63.6 | 74.7 | 76.3 |
VQAv2 | 63.9 | 71.2 | 72.9 |
BLINK | 38.0 | 35.9 | 39.6 |
OKVQA | 51.0 | 58.7 | 60.2 |
TallyQA | 42.5 | 51.8 | 54.3 |
SpatialSense VQA | 50.9 | 60.0 | 59.4 |
CountBenchQA | 26.1 | 17.8 | 68.0 |
Ethics and Safety
Evaluation Approach
Our evaluation methods include structured evaluations and internal red-teaming of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. Categories evaluated for ethics and safety include:
- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies such as harassment, violence, gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies.
In addition to development-level evaluations, we conduct "assurance evaluations" to inform responsibility governance decisions. These evaluations are conducted separately from the model development team and inform release decisions. High-level findings are fed back to the model team, and prompt sets are held out to prevent overfitting.
Evaluation Results
For all areas of safety testing, we saw major improvements in child safety, content safety, and representational harms compared to previous Gemma models. All testing was conducted without safety filters to evaluate model capabilities and behaviors. The model produced minimal policy violations and showed significant improvements over previous models' performance with respect to ungrounded inferences. A limitation of our evaluations was the use of only English language prompts.
Usage and Limitations
Intended Usage
Open vision-language models (VLMs) have a wide range of applications across various industries and domains. Potential uses include:
- Content Creation and Communication
- Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
- Text Summarization: Generate concise summaries of text.
- Image Data Extraction: Extract, interpret, and summarize visual data for text communications.
- Research and Education
- Natural Language Processing (NLP) and VLM Research: Serve as a foundation for research and algorithm development.
- Language Learning Tools: Support interactive language learning experiences.
- Knowledge Exploration: Assist in exploring large bodies of text.
Limitations
- Training Data: The quality and diversity of training data can influence model capabilities, and biases or gaps in the data can lead to limitations in responses.
- Context and Task Complexity: Models perform better on tasks with clear prompts, and performance can be affected by the amount of context provided.
- Language Ambiguity and Nuance: Natural language complexity can make it difficult for models to grasp subtle nuances.
- Factual Accuracy: Models may generate incorrect or outdated factual statements as they are not knowledge bases.
- Common Sense: Models may lack the ability to apply common sense reasoning in certain situations.
Citation
```bibtex
@article{gemma_2025,
  title={Gemma 3},
  url={https://goo.gle/Gemma3Report},
  publisher={Kaggle},
  author={Gemma Team},
  year={2025}
}
```
📄 License
The license for this model is gemma.