Gemma 3 model card
Gemma 3 is a multimodal model from Google, capable of handling text and image input and generating text output. It offers a large context window and multilingual support, and is available in multiple sizes, making it suitable for a wide range of text generation and image understanding tasks.
Quick Start
To quickly get started with the Gemma 3 model, you can refer to the official Gemma page.
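For local experimentation, a minimal sketch along the following lines should work, assuming a recent `transformers` release with Gemma 3 support and access to a Gemma checkpoint (the model id and generation settings below are illustrative):

```python
# Minimal sketch: text generation with a Gemma 3 instruction-tuned
# checkpoint via Hugging Face transformers. Assumes a transformers
# version with Gemma 3 support; the model id is illustrative.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",  # illustrative checkpoint id
    device_map="auto",             # requires accelerate; or pass device=0
)

messages = [{"role": "user", "content": "Summarize Gemma 3 in one sentence."}]
out = pipe(messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # assistant reply
```

For image input with the larger sizes, the analogous `image-text-to-text` pipeline task can be used.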
Features
- Multimodal Capability: Handles both text and image input, generating text output.
- Large Context Window: Up to 128K context window for the 4B, 12B, and 27B sizes, and 32K for the 1B size.
- Multilingual Support: Supports over 140 languages.
- Diverse Sizes: Available in multiple sizes to suit different resource requirements.
- Suitable for Various Tasks: Well-suited for text generation, image analysis, question answering, summarization, and reasoning tasks.
Documentation
Model Information
Description
Gemma is a family of lightweight, state-of-the-art open models from Google, built using the same research and technology as the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output. They have open weights for both pre-trained and instruction-tuned variants, a large 128K context window (32K for the 1B size), multilingual support in over 140 languages, and are available in more sizes than previous versions. These models are suitable for a variety of text generation and image understanding tasks, and their relatively small size allows for deployment in resource-limited environments.
Inputs and outputs
- Input:
- Text string, such as a question, a prompt, or a document to be summarized.
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each.
- Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size (see the budgeting sketch after this list).
- Output:
- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document.
- Total output context of 8192 tokens.
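As a back-of-the-envelope illustration of how these budgets interact (a sketch only; the exact token accounting may differ from this arithmetic):

```python
# Illustrative context budgeting: each image is encoded to 256 tokens
# after 896x896 normalization. Exact accounting may vary in practice.
IMAGE_TOKENS = 256
CONTEXT_WINDOW = 128 * 1024   # 4B/12B/27B sizes; use 32 * 1024 for 1B

num_images = 8
text_budget = CONTEXT_WINDOW - num_images * IMAGE_TOKENS
print(text_budget)  # 129024 tokens remaining for text input
```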
Citation
@article{gemma_2025,
title={Gemma 3},
url={https://goo.gle/Gemma3Report},
publisher={Kaggle},
author={Gemma Team},
year={2025}
}
Model Data
Training Dataset
This particular checkpoint was fine-tuned on the openai/gsm8k dataset using the GRPO method (a sketch of such a setup follows the list below). The base Gemma 3 models were trained on a diverse dataset of text data, including web documents, code, mathematics, and images: the 27B model on 14 trillion tokens, the 12B model on 12 trillion, the 4B model on 4 trillion, and the 1B model on 2 trillion.
- Web Documents: Ensure the model is exposed to a broad range of linguistic styles, topics, and vocabulary, with content in over 140 languages.
- Code: Helps the model learn programming language syntax and patterns, improving its code generation and understanding abilities.
- Mathematics: Enables the model to learn logical reasoning, symbolic representation, and handle mathematical queries.
- Images: Allows the model to perform image analysis and visual data extraction tasks.
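A fine-tuning setup in the spirit of the GRPO run described above can be sketched with `trl`'s `GRPOTrainer`. This is an illustration under assumptions, not the actual recipe: the reward function, hyperparameters, and model id are placeholders.

```python
# Sketch of GRPO fine-tuning on openai/gsm8k with trl's GRPOTrainer.
# The reward function and settings are illustrative placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda row: {"prompt": row["question"]})

def reward_len(completions, **kwargs):
    # Placeholder reward favoring short completions; a real setup would
    # score completions against the GSM8K reference answers instead.
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model="google/gemma-3-1b-it",  # illustrative checkpoint id
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="gemma3-grpo-gsm8k"),
    train_dataset=dataset,
)
trainer.train()
```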
Data Preprocessing
- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to exclude harmful and illegal content.
- Sensitive Data Filtering: Automated techniques were used to filter out certain personal information and other sensitive data from training sets to make Gemma pre-trained models safe and reliable (a toy illustration follows this list).
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].
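The pipelines above are internal to Google; purely as a toy illustration of what a rule-based sensitive-data pass can look like (the patterns below are hypothetical and far simpler than production systems):

```python
# Toy sensitive-data filter (not Google's pipeline): drop documents
# that match simple PII-like patterns.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def keep_document(text: str) -> bool:
    """Return False if the document trips any PII pattern."""
    return not any(p.search(text) for p in PII_PATTERNS)

docs = ["Open models are useful.", "Reach me at jane.doe@example.com"]
print([d for d in docs if keep_document(d)])  # ['Open models are useful.']
```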
Implementation Information
Hardware
Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p, and TPUv5e). TPUs offer several advantages for training vision-language models:
- Performance: Specifically designed to handle the massive computations involved in training VLMs, speeding up training compared to CPUs.
- Memory: Often come with large amounts of high-bandwidth memory, allowing for handling large models and batch sizes during training, which can lead to better model quality.
- Scalability: TPU Pods provide a scalable solution for handling the growing complexity of large foundation models, enabling distributed training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, more cost-effective for training large models compared to CPU-based infrastructure, considering the time and resources saved due to faster training.
- Sustainability: Aligned with [Google's commitments to operate sustainably][sustainability].
Software
Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks, which is suitable for foundation models like Gemma.
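To give a flavor of the JAX programming model (a toy example, not Gemma's actual training code): computations are written as pure functions and jit-compiled so XLA can target accelerators such as TPUs.

```python
# Toy JAX example: a jit-compiled SGD step for linear regression.
import jax
import jax.numpy as jnp

@jax.jit
def sgd_step(w, x, y, lr=0.1):
    loss_fn = lambda w_: jnp.mean((x @ w_ - y) ** 2)
    loss, grad = jax.value_and_grad(loss_fn)(w)
    return w - lr * grad, loss

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 8))
y = x @ jnp.ones((8,))   # targets from a known weight vector
w = jnp.zeros((8,))
for _ in range(100):
    w, loss = sgd_step(w, x, y)
print(float(loss))  # approaches 0 as w converges
```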
Evaluation
Benchmark Results
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:
Reasoning and factuality
Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|---|
[HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
[BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
[PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
[SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
[TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
[Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
[ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
[ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
[WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
[BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
[DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |
STEM and code
Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
[MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
[MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
[AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
[MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
[GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
[GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
[MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
[HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |
Multilingual
Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
[MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
[Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
[WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
[FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
[XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
[ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
[IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |
Multimodal
Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|
[COCOcap][coco-cap] | 102 | 111 | 116 |
[DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
[InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
[MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
[TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
[RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
[ReMI][remi] | 27.3 | 38.5 | 44.8 |
[AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
[ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
[VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
[BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
[OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
[TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
[SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
[CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |
Ethics and Safety
Evaluation Approach
Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by multiple teams with different goals and human evaluation metrics. The models were evaluated against several ethics and safety categories:
- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies such as harassment, violence, gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies.
In addition to development-level evaluations, we conduct "assurance evaluations" for responsibility governance decision-making. These evaluations are conducted separately from the model development team, and high-level findings are fed back to the model team while prompt sets are held out to prevent overfitting. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.
Evaluation Results
For all areas of safety testing, significant improvements were observed in child safety, content safety, and representational harms compared to previous Gemma models. All testing was conducted without safety filters to evaluate the model's capabilities and behaviors. Across all model sizes and for both text-to-text and image-to-text, the model produced minimal policy violations and showed significant improvements over previous Gemma models' performance regarding ungrounded inferences. A limitation of our evaluations was that they included only English language prompts.
Usage and Limitations
Intended Usage
Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not exhaustive:
- Content Creation and Communication
- Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
- Text Summarization: Generate concise summaries of text corpora, research papers, or reports.
- Image Data Extraction: Extract, interpret, and summarize visual data for text communications.
- Research and Education
- Natural Language Processing (NLP) and VLM Research: Serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
- Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
- Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
Limitations
- Training Data: The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses, and the scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity: Models are better at tasks with clear prompts and instructions. Open-ended or highly complex tasks may be challenging, and a model's performance can be influenced by the amount of context provided.
- Language Ambiguity and Nuance: Natural language is complex, and models may struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy: Models generate responses based on training data, but they are not knowledge bases and may generate incorrect or outdated factual statements.
- Common Sense: Models rely on statistical patterns in language and may lack the ability to apply common sense reasoning in certain situations.
Ethical Considerations and Risks
- Bias and Fairness: VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases in the training material. These models underwent careful scrutiny, input data pre-processing, and posterior evaluations.
- Misinformation and Misuse: VLMs can be misused to generate false, misleading, or harmful text. Guidelines for responsible use are provided in the [Responsible Generative AI Toolkit][rai-toolkit].
- Transparency and Accountability: This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.
Risks identified and mitigations:
- Perpetuation of biases: Continuous monitoring (using evaluation metrics, human review) and exploration of de-biasing techniques during model training, fine-tuning, and other use cases are encouraged.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential.
License
The model is under the [Gemma](https://example.com/gemma-license) license.
[g3-tech-report]: https://example.com/gemma3-technical-report
[rai-toolkit]: https://example.com/responsible-generative-ai-toolkit
[kaggle-gemma]: https://www.kaggle.com/gemma
[vertex-mg-gemma3]: https://example.com/vertex-model-garden-gemma3
[terms]: https://example.com/terms-of-use
[tpu]: https://en.wikipedia.org/wiki/Tensor_Processing_Unit
[jax]: https://github.com/google/jax
[ml-pathways]: https://example.com/ml-pathways
[gemini-2-paper]: https://example.com/gemini-2-paper
[safety-policies]: https://example.com/safety-policies
[sustainability]: https://example.com/sustainability-commitments
[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161
[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[gpqa]: https://arxiv.org/abs/2311.12022
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816
[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/