🚀 Gemma 3 model card
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. The Gemma 3 models are multimodal, capable of handling text and image input and generating text output.
🚀 Quick Start
This repository corresponds to the 4B instruction-tuned version of the Gemma 3 model using Quantization Aware Training (QAT).
Note:
- The checkpoint in this repository is unquantized; make sure to quantize it to int4 with your preferred tool (see the sketch below).
- Thanks to QAT, the model preserves quality similar to bfloat16 while significantly reducing the memory required to load it.
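The card does not name a specific quantization tool. Below is a minimal, hedged sketch of one option: loading the checkpoint in int4 with the bitsandbytes backend through transformers' `BitsAndBytesConfig`. The repository id is a placeholder, and other int4 toolchains (for example llama.cpp) are equally valid.

```python
# Minimal int4 loading sketch (assumption: bitsandbytes backend via a recent
# transformers release). REPO_ID is a placeholder; point it at this
# repository's unquantized QAT checkpoint.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

REPO_ID = "google/gemma-3-4b-it"  # placeholder id; replace as needed

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in bfloat16, per the note above
)

model = AutoModelForImageTextToText.from_pretrained(
    REPO_ID,
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(REPO_ID)
```

This path requires the `bitsandbytes` package and a CUDA GPU; on other setups, a different int4 tool is more appropriate.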
✨ Features
- Multimodal Capability: Handles both text and image input, generating text output.
- Large Context Window: Has a 128K context window, enabling it to process long sequences of text.
- Multilingual Support: Supports over 140 languages.
- Multiple Sizes Available: Comes in various sizes to suit different resource requirements.
- Suitable for Diverse Tasks: Well-suited for text generation and image understanding tasks such as question answering, summarization, and reasoning.
📦 Installation
No installation steps are provided in the original document.
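As a hedged note (not from the original card): given the `transformers` library name and `image-text-to-text` pipeline tag listed under Technical Details, a recent transformers release together with torch and accelerate is typically sufficient, e.g. `pip install -U transformers accelerate torch` (add `bitsandbytes` if you use the int4 sketch in the Quick Start section). Exact minimum versions are not stated in the card; Gemma 3 support requires a recent transformers version.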
💻 Usage Examples
No code examples are provided in the original document.
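As an illustration only (none of this appears in the original card): a minimal multimodal inference sketch using the transformers `image-text-to-text` pipeline with chat-style messages. The model id is the base model listed under Technical Details and should be swapped for this repository's checkpoint id; the image URL and prompt are placeholders.

```python
# Minimal multimodal inference sketch (assumed API: transformers'
# "image-text-to-text" pipeline with chat-style messages).
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",  # base model id; swap in this repo's checkpoint
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder; use a real URL or local path
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=128)
# With chat-style input the pipeline returns the whole conversation;
# the last message contains the model's reply.
print(result[0]["generated_text"][-1]["content"])
```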
📚 Documentation
Model Information
- Model Page: Gemma
- Resources and Technical Documentation:
- Terms of Use: Terms
- Authors: Google DeepMind
Description
Gemma is a family of lightweight, state-of-the-art open models from Google. Built on the same research and technology as the Gemini models, Gemma 3 models are multimodal, handling text and image input and generating text output. They have open weights for both pre-trained and instruction-tuned variants. With a large 128K context window, multilingual support in over 140 languages, and more size options than previous versions, Gemma 3 models are suitable for a variety of text generation and image understanding tasks, such as question answering, summarization, and reasoning. Their relatively small size allows for deployment in resource-limited environments like laptops, desktops, or personal cloud infrastructure, promoting wider access to advanced AI models and fostering innovation.
Inputs and outputs
Property | Details |
---|---|
Input | Text string, such as a question, a prompt, or a document to be summarized; images, normalized to 896 x 896 resolution and encoded to 256 tokens each; total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size |
Output | Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document; total output context of 8192 tokens |
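To make the table concrete, the sketch below does the rough budgeting arithmetic implied by the figures above (256 tokens per image, 128K-token input context for the 4B, 12B, and 27B sizes). It is illustrative only and ignores special and formatting tokens added during tokenization.

```python
# Rough input-budget arithmetic from the table above; assumes 128K = 128 * 1024
# and ignores special/formatting tokens added during tokenization.
TOKENS_PER_IMAGE = 256          # each 896 x 896 image encodes to 256 tokens
MAX_INPUT_TOKENS = 128 * 1024   # input context for the 4B, 12B, and 27B sizes

def remaining_text_tokens(num_images: int) -> int:
    """Approximate text tokens left after reserving space for the images."""
    return MAX_INPUT_TOKENS - num_images * TOKENS_PER_IMAGE

print(remaining_text_tokens(4))  # 131072 - 1024 = 130048
```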
Citation
@article{gemma_2025,
title={Gemma 3},
url={https://goo.gle/Gemma3Report},
publisher={Kaggle},
author={Gemma Team},
year={2025}
}
Model Data
Training Dataset
These models were trained on a dataset drawn from diverse sources. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, and the 1B model with 2 trillion tokens. Key components include:
- Web Documents: A diverse collection of web text exposes the model to a wide range of linguistic styles, topics, and vocabularies. The training dataset contains content in over 140 languages.
- Code: Exposing the model to code helps it learn programming language syntax and patterns, improving its code generation and code-related question understanding abilities.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and handle mathematical queries.
- Images: A wide variety of images enables the model to perform image analysis and visual data extraction tasks.
The combination of these diverse data sources is crucial for training a powerful multimodal model capable of handling various tasks and data formats.
Data Preprocessing
Key data cleaning and filtering methods applied to the training data include:
- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to exclude harmful and illegal content.
- Sensitive Data Filtering: Automated techniques were used to filter out certain personal information and other sensitive data from the training sets to ensure the safety and reliability of Gemma pre-trained models.
- Additional methods: Filtering based on content quality and safety in line with our policies.
Implementation Information
Hardware
Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p, and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed for matrix operations common in machine learning, offer several advantages:
- Performance: TPUs are designed to handle the massive computations involved in training VLMs, speeding up training compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training, which can improve model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. Training can be distributed across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially considering the time and resources saved due to faster training. These advantages align with Google's commitments to operate sustainably.
Software
Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks, making it suitable for foundation models like these. Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models: "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."
Evaluation
⚠️ Important Note
The evaluation in this section corresponds to the original checkpoint, not the QAT checkpoint.
Benchmark Results
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:
Reasoning and factuality
Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|---|
HellaSwag | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
BoolQ | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
PIQA | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
SocialIQA | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
TriviaQA | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
Natural Questions | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
ARC-c | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
ARC-e | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
WinoGrande | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
BIG-Bench Hard | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
DROP | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |
STEM and code
Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
MMLU | 5-shot | 59.6 | 74.5 | 78.6 |
MMLU (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
AGIEval | 3-5-shot | 42.1 | 57.4 | 66.2 |
MATH | 4-shot | 24.2 | 43.3 | 50.0 |
GSM8K | 8-shot | 38.4 | 71.0 | 82.6 |
GPQA | 5-shot | 15.0 | 25.4 | 24.3 |
MBPP | 3-shot | 46.0 | 60.4 | 65.6 |
HumanEval | 0-shot | 36.0 | 45.7 | 48.8 |
Multilingual
Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
MGSM | 2.04 | 34.7 | 64.3 | 74.3 |
Global-MMLU-Lite | 24.9 | 57.0 | 69.4 | 75.7 |
WMT24++ (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
FloRes | 29.5 | 39.2 | 46.0 | 48.8 |
XQuAD (all) | 43.9 | 68.0 | 74.5 | 76.8 |
ECLeKTic | 4.69 | 11.0 | 17.2 | 24.4 |
IndicGenBench | 41.4 | 57.2 | 61.7 | 63.4 |
Multimodal
Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|
COCOcap | 102 | 111 | 116 |
DocVQA (val) | 72.8 | 82.3 | 85.6 |
InfoVQA (val) | 44.1 | 54.8 | 59.4 |
MMMU (pt) | 39.2 | 50.3 | 56.1 |
TextVQA (val) | 58.9 | 66.5 | 68.6 |
RealWorldQA | 45.5 | 52.2 | 53.9 |
ReMI | 27.3 | 38.5 | 44.8 |
AI2D | 63.2 | 75.2 | 79.0 |
ChartQA | 63.6 | 74.7 | 76.3 |
VQAv2 | 63.9 | 71.2 | 72.9 |
BLINK | 38.0 | 35.9 | 39.6 |
OKVQA | 51.0 | 58.7 | 60.2 |
TallyQA | 42.5 | 51.8 | 54.3 |
SpatialSense VQA | 50.9 | 60.0 | 59.4 |
CountBenchQA | 26.1 | 17.8 | 68.0 |
Ethics and Safety
Evaluation Approach
Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by multiple teams with different goals and human evaluation metrics. These models were evaluated against several ethics and safety categories, including:
- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence, gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.
In addition to development-level evaluations, we conduct "assurance evaluations" for responsibility governance decision-making. These evaluations are conducted separately from the model development team to inform release decisions. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision-making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of the release review.
Evaluation Results
For all safety testing areas, we observed significant improvements in child safety, content safety, and representational harms compared to previous Gemma models. All testing was conducted without safety filters to evaluate the model's capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations and showed significant improvements over previous Gemma models' performance in terms of ungrounded inferences. A limitation of our evaluations was that they only included English language prompts.
Usage and Limitations
Intended Usage
Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not exhaustive but provides contextual information about the possible use-cases considered by the model creators during training and development:
- Content Creation and Communication
- Text Generation: These models can generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
- Text Summarization: Generate concise summaries of text corpora, research papers, or reports.
- Image Data Extraction: Extract, interpret, and summarize visual data for text communications.
- Research and Education
- Natural Language Processing (NLP) and VLM Research: Serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
- Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
- Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
Limitations
- Training Data
- The quality and diversity of the training data significantly affect the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
- The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
- Models perform better on tasks with clear prompts and instructions. Open-ended or highly complex tasks may be challenging.
- A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
- Natural language is complex. Models may struggle to understand subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
- Models generate responses based on information from their training datasets but are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
- Models rely on statistical patterns in language and may lack the ability to apply common sense reasoning in certain situations.
Ethical Considerations and Risks
The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:
- Bias and Fairness
- VLMs trained on large-scale real-world text and image data can reflect socio-cultural biases in the training material. These models have undergone careful scrutiny, input data pre-processing, and subsequent evaluations, as reported in this card.
- Misinformation and Misuse
- VLMs can be misused to generate false, misleading, or harmful text.
- Guidelines for responsible use are provided with the model, see the Responsible Generative AI Toolkit.
🔧 Technical Details
- Base Model: google/gemma-3-4b-it
- License: gemma
- Tags: gemma3, gemma, google
- Pipeline Tag: image-text-to-text
- Library Name: transformers
📄 License
To access Gemma on Hugging Face, you're required to review and agree to Google's usage license. To do this, please ensure you're logged in to Hugging Face and acknowledge the license on the model page. Requests are processed immediately.






