🚀 Gemma 3 model card
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. The Gemma 3 models are multimodal, capable of handling text and image input and generating text output.
🚀 Quick Start
This repository corresponds to the 4B instruction-tuned version of the Gemma 3 model using Quantization Aware Training (QAT).
Note:
- The checkpoint in this repository is unquantized; make sure to quantize it to int4 with your preferred tool (see the sketch below).
- Thanks to QAT, the model preserves quality similar to bfloat16 while significantly reducing the memory required to load it.
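The card does not name a specific quantization tool. Below is a minimal, hedged sketch of one option: loading the checkpoint in int4 with the bitsandbytes backend through transformers' `BitsAndBytesConfig`. The repository id is a placeholder, and other int4 toolchains (for example llama.cpp) are equally valid.

```python
# Minimal int4 loading sketch (assumption: bitsandbytes backend via a recent
# transformers release). REPO_ID is a placeholder; point it at this
# repository's unquantized QAT checkpoint.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

REPO_ID = "google/gemma-3-4b-it"  # placeholder id; replace as needed

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in bfloat16, per the note above
)

model = AutoModelForImageTextToText.from_pretrained(
    REPO_ID,
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(REPO_ID)
```

This path requires the `bitsandbytes` package and a CUDA GPU; on other setups, a different int4 tool is more appropriate.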
✨ Features
- Multimodal Capability: Handles both text and image input, generating text output.
- Large Context Window: Has a 128K context window, enabling it to process long sequences of text.
- Multilingual Support: Supports over 140 languages.
- Multiple Sizes Available: Comes in various sizes to suit different resource requirements.
- Suitable for Diverse Tasks: Well-suited for text generation and image understanding tasks such as question answering, summarization, and reasoning.
📦 Installation
No installation steps are provided in the original document.
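As a hedged note (not from the original card): given the `transformers` library name and `image-text-to-text` pipeline tag listed under Technical Details, a recent transformers release together with torch and accelerate is typically sufficient, e.g. `pip install -U transformers accelerate torch` (add `bitsandbytes` if you use the int4 sketch in the Quick Start section). Exact minimum versions are not stated in the card; Gemma 3 support requires a recent transformers version.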
💻 Usage Examples
No code examples are provided in the original document.
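As an illustration only (none of this appears in the original card): a minimal multimodal inference sketch using the transformers `image-text-to-text` pipeline with chat-style messages. The model id is the base model listed under Technical Details and should be swapped for this repository's checkpoint id; the image URL and prompt are placeholders.

```python
# Minimal multimodal inference sketch (assumed API: transformers'
# "image-text-to-text" pipeline with chat-style messages).
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",  # base model id; swap in this repo's checkpoint
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder; use a real URL or local path
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=128)
# With chat-style input the pipeline returns the whole conversation;
# the last message contains the model's reply.
print(result[0]["generated_text"][-1]["content"])
```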
📚 Documentation
Model Information
- Model Page: Gemma
- Resources and Technical Documentation:
- Terms of Use: Terms
- Authors: Google DeepMind
Description
Gemma is a family of lightweight, state-of-the-art open models from Google. Built on the same research and technology as the Gemini models, Gemma 3 models are multimodal, handling text and image input and generating text output. They have open weights for both pre-trained and instruction-tuned variants. With a large 128K context window, multilingual support in over 140 languages, and more size options than previous versions, Gemma 3 models are suitable for a variety of text generation and image understanding tasks, such as question answering, summarization, and reasoning. Their relatively small size allows for deployment in resource-limited environments like laptops, desktops, or personal cloud infrastructure, promoting wider access to advanced AI models and fostering innovation.
Inputs and outputs
Property | Details |
---|---|
Input | Text string, such as a question, a prompt, or a document to be summarized; images, normalized to 896 x 896 resolution and encoded to 256 tokens each; total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size |
Output | Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document; total output context of 8192 tokens |
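To make the table concrete, the sketch below does the rough budgeting arithmetic implied by the figures above (256 tokens per image, 128K-token input context for the 4B, 12B, and 27B sizes). It is illustrative only and ignores special and formatting tokens added during tokenization.

```python
# Rough input-budget arithmetic from the table above; assumes 128K = 128 * 1024
# and ignores special/formatting tokens added during tokenization.
TOKENS_PER_IMAGE = 256          # each 896 x 896 image encodes to 256 tokens
MAX_INPUT_TOKENS = 128 * 1024   # input context for the 4B, 12B, and 27B sizes

def remaining_text_tokens(num_images: int) -> int:
    """Approximate text tokens left after reserving space for the images."""
    return MAX_INPUT_TOKENS - num_images * TOKENS_PER_IMAGE

print(remaining_text_tokens(4))  # 131072 - 1024 = 130048
```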
Citation
@article{gemma_2025,
title={Gemma 3},
url={https://goo.gle/Gemma3Report},
publisher={Kaggle},
author={Gemma Team},
year={2025}
}
Model Data
Training Dataset
These models were trained on a dataset drawn from diverse sources. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, and the 1B model with 2 trillion tokens. Key components include:
- Web Documents: A diverse collection of web text exposes the model to a wide range of linguistic styles, topics, and vocabularies. The training dataset contains content in over 140 languages.
- Code: Exposing the model to code helps it learn programming language syntax and patterns, improving its code generation and code-related question understanding abilities.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and handle mathematical queries.
- Images: A wide variety of images enables the model to perform image analysis and visual data extraction tasks.
The combination of these diverse data sources is crucial for training a powerful multimodal model capable of handling various tasks and data formats.
Data Preprocessing
Key data cleaning and filtering methods applied to the training data include:
- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to exclude harmful and illegal content.
- Sensitive Data Filtering: Automated techniques were used to filter out certain personal information and other sensitive data from the training sets to ensure the safety and reliability of Gemma pre-trained models.
- Additional methods: Filtering based on content quality and safety in line with our policies.
Implementation Information
Hardware
Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p, and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed for matrix operations common in machine learning, offer several advantages:
- Performance: TPUs are designed to handle the massive computations involved in training VLMs, speeding up training compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training, which can improve model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. Training can be distributed across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially considering the time and resources saved due to faster training. These advantages align with Google's commitments to operate sustainably.
Software
Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks, making it suitable for foundation models like these. Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models: "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."
Evaluation
⚠️ Important Note
The evaluation in this section corresponds to the original checkpoint, not the QAT checkpoint.
Benchmark Results
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:
Reasoning and factuality
Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|---|
HellaSwag | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
BoolQ | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
PIQA | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
SocialIQA | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
TriviaQA | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
Natural Questions | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
ARC-c | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
ARC-e | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
WinoGrande | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
BIG-Bench Hard | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
DROP | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |
STEM and code
Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
MMLU | 5-shot | 59.6 | 74.5 | 78.6 |
MMLU (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
AGIEval | 3-5-shot | 42.1 | 57.4 | 66.2 |
MATH | 4-shot | 24.2 | 43.3 | 50.0 |
GSM8K | 8-shot | 38.4 | 71.0 | 82.6 |
GPQA | 5-shot | 15.0 | 25.4 | 24.3 |
MBPP | 3-shot | 46.0 | 60.4 | 65.6 |
HumanEval | 0-shot | 36.0 | 45.7 | 48.8 |
Multilingual
Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
MGSM | 2.04 | 34.7 | 64.3 | 74.3 |
Global-MMLU-Lite | 24.9 | 57.0 | 69.4 | 75.7 |
WMT24++ (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
FloRes | 29.5 | 39.2 | 46.0 | 48.8 |
XQuAD (all) | 43.9 | 68.0 | 74.5 | 76.8 |
ECLeKTic | 4.69 | 11.0 | 17.2 | 24.4 |
IndicGenBench | 41.4 | 57.2 | 61.7 | 63.4 |
Multimodal
Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|
COCOcap | 102 | 111 | 116 |
DocVQA (val) | 72.8 | 82.3 | 85.6 |
InfoVQA (val) | 44.1 | 54.8 | 59.4 |
MMMU (pt) | 39.2 | 50.3 | 56.1 |
TextVQA (val) | 58.9 | 66.5 | 68.6 |
RealWorldQA | 45.5 | 52.2 | 53.9 |
ReMI | 27.3 | 38.5 | 44.8 |
AI2D | 63.2 | 75.2 | 79.0 |
ChartQA | 63.6 | 74.7 | 76.3 |
VQAv2 | 63.9 | 71.2 | 72.9 |
BLINK | 38.0 | 35.9 | 39.6 |
OKVQA | 51.0 | 58.7 | 60.2 |
TallyQA | 42.5 | 51.8 | 54.3 |
SpatialSense VQA | 50.9 | 60.0 | 59.4 |
CountBenchQA | 26.1 | 17.8 | 68.0 |
Ethics and Safety
Evaluation Approach
Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by multiple teams with different goals and human evaluation metrics. These models were evaluated against several ethics and safety categories, including:
- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence, gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.
In addition to development-level evaluations, we conduct "assurance evaluations" for responsibility governance decision-making. These evaluations are conducted separately from the model development team to inform release decisions. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision-making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of the release review.
Evaluation Results
For all safety testing areas, we observed significant improvements in child safety, content safety, and representational harms compared to previous Gemma models. All testing was conducted without safety filters to evaluate the model's capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations and showed significant improvements over previous Gemma models' performance in terms of ungrounded inferences. A limitation of our evaluations was that they only included English language prompts.
Usage and Limitations
Intended Usage
Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not exhaustive but provides contextual information about the possible use-cases considered by the model creators during training and development:
- Content Creation and Communication
- Text Generation: These models can generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
- Text Summarization: Generate concise summaries of text corpora, research papers, or reports.
- Image Data Extraction: Extract, interpret, and summarize visual data for text communications.
- Research and Education
- Natural Language Processing (NLP) and VLM Research: Serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
- Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
- Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
Limitations
- Training Data
- The quality and diversity of the training data significantly affect the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
- The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
- Models perform better on tasks with clear prompts and instructions. Open-ended or highly complex tasks may be challenging.
- A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
- Natural language is complex. Models may struggle to understand subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
- Models generate responses based on information from their training datasets but are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
- Models rely on statistical patterns in language and may lack the ability to apply common sense reasoning in certain situations.
Ethical Considerations and Risks
The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:
- Bias and Fairness
- VLMs trained on large-scale real-world text and image data can reflect socio-cultural biases in the training material. These models have undergone careful scrutiny, input data pre-processing, and subsequent evaluations, as reported in this card.
- Misinformation and Misuse
- VLMs can be misused to generate false, misleading, or harmful text.
- Guidelines for responsible use are provided with the model, see the Responsible Generative AI Toolkit.
🔧 Technical Details
- Base Model: google/gemma-3-4b-it
- License: gemma
- Tags: gemma3, gemma, google
- Pipeline Tag: image-text-to-text
- Library Name: transformers
📄 License
To access Gemma on Hugging Face, you're required to review and agree to Google's usage license. To do this, please ensure you're logged in to Hugging Face and acknowledge the license on the model page. Requests are processed immediately.






