Gemma 3 model card
Gemma 3 is a multimodal model from Google, capable of handling text and image input and generating text output. It offers a large context window and multilingual support, and is available in multiple sizes, making it suitable for a wide range of text generation and image understanding tasks.
Quick Start
To quickly get started with the Gemma 3 model, you can refer to the official Gemma page.
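For local experimentation, a minimal sketch along the following lines should work, assuming a recent `transformers` release with Gemma 3 support and access to a Gemma checkpoint (the model id and generation settings below are illustrative):

```python
# Minimal sketch: text generation with a Gemma 3 instruction-tuned
# checkpoint via Hugging Face transformers. Assumes a transformers
# version with Gemma 3 support; the model id is illustrative.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",  # illustrative checkpoint id
    device_map="auto",             # requires accelerate; or pass device=0
)

messages = [{"role": "user", "content": "Summarize Gemma 3 in one sentence."}]
out = pipe(messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # assistant reply
```

For image input with the larger sizes, the analogous `image-text-to-text` pipeline task can be used.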
Features
- Multimodal Capability: Handles both text and image input, generating text output.
- Large Context Window: Up to 128K context window for the 4B, 12B, and 27B sizes, and 32K for the 1B size.
- Multilingual Support: Supports over 140 languages.
- Diverse Sizes: Available in multiple sizes to suit different resource requirements.
- Suitable for Various Tasks: Well-suited for text generation, image analysis, question answering, summarization, and reasoning tasks.
Documentation
Model Information
Description
Gemma is a family of lightweight, state-of-the-art open models from Google, built using the same research and technology as the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output. They have open weights for both pre-trained and instruction-tuned variants, a large 128K context window (32K for the 1B size), multilingual support in over 140 languages, and are available in more sizes than previous versions. These models are suitable for a variety of text generation and image understanding tasks, and their relatively small size allows for deployment in resource-limited environments.
Inputs and outputs
- Input:
- Text string, such as a question, a prompt, or a document to be summarized.
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each.
- Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size (see the budgeting sketch after this list).
- Output:
- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document.
- Total output context of 8192 tokens.
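As a back-of-the-envelope illustration of how these budgets interact (a sketch only; the exact token accounting may differ from this arithmetic):

```python
# Illustrative context budgeting: each image is encoded to 256 tokens
# after 896x896 normalization. Exact accounting may vary in practice.
IMAGE_TOKENS = 256
CONTEXT_WINDOW = 128 * 1024   # 4B/12B/27B sizes; use 32 * 1024 for 1B

num_images = 8
text_budget = CONTEXT_WINDOW - num_images * IMAGE_TOKENS
print(text_budget)  # 129024 tokens remaining for text input
```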
Citation
@article{gemma_2025,
title={Gemma 3},
url={https://goo.gle/Gemma3Report},
publisher={Kaggle},
author={Gemma Team},
year={2025}
}
Model Data
Training Dataset
This particular checkpoint was fine-tuned on the openai/gsm8k dataset using the GRPO method (a sketch of such a setup follows the list below). The base Gemma 3 models were trained on a diverse dataset of text data, including web documents, code, mathematics, and images: the 27B model on 14 trillion tokens, the 12B model on 12 trillion, the 4B model on 4 trillion, and the 1B model on 2 trillion.
- Web Documents: Ensure the model is exposed to a broad range of linguistic styles, topics, and vocabulary, with content in over 140 languages.
- Code: Helps the model learn programming language syntax and patterns, improving its code generation and understanding abilities.
- Mathematics: Enables the model to learn logical reasoning, symbolic representation, and handle mathematical queries.
- Images: Allows the model to perform image analysis and visual data extraction tasks.
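A fine-tuning setup in the spirit of the GRPO run described above can be sketched with `trl`'s `GRPOTrainer`. This is an illustration under assumptions, not the actual recipe: the reward function, hyperparameters, and model id are placeholders.

```python
# Sketch of GRPO fine-tuning on openai/gsm8k with trl's GRPOTrainer.
# The reward function and settings are illustrative placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda row: {"prompt": row["question"]})

def reward_len(completions, **kwargs):
    # Placeholder reward favoring short completions; a real setup would
    # score completions against the GSM8K reference answers instead.
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model="google/gemma-3-1b-it",  # illustrative checkpoint id
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="gemma3-grpo-gsm8k"),
    train_dataset=dataset,
)
trainer.train()
```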
Data Preprocessing
- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to exclude harmful and illegal content.
- Sensitive Data Filtering: Automated techniques were used to filter out certain personal information and other sensitive data from training sets to make Gemma pre-trained models safe and reliable (a toy illustration follows this list).
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].
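The pipelines above are internal to Google; purely as a toy illustration of what a rule-based sensitive-data pass can look like (the patterns below are hypothetical and far simpler than production systems):

```python
# Toy sensitive-data filter (not Google's pipeline): drop documents
# that match simple PII-like patterns.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def keep_document(text: str) -> bool:
    """Return False if the document trips any PII pattern."""
    return not any(p.search(text) for p in PII_PATTERNS)

docs = ["Open models are useful.", "Reach me at jane.doe@example.com"]
print([d for d in docs if keep_document(d)])  # ['Open models are useful.']
```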
Implementation Information
Hardware
Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p, and TPUv5e). TPUs offer several advantages for training vision-language models:
- Performance: Specifically designed to handle the massive computations involved in training VLMs, speeding up training compared to CPUs.
- Memory: Often come with large amounts of high-bandwidth memory, allowing for handling large models and batch sizes during training, which can lead to better model quality.
- Scalability: TPU Pods provide a scalable solution for handling the growing complexity of large foundation models, enabling distributed training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, more cost-effective for training large models compared to CPU-based infrastructure, considering the time and resources saved due to faster training.
- Sustainability: Aligned with [Google's commitments to operate sustainably][sustainability].
Software
Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks, which is suitable for foundation models like Gemma.
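To give a flavor of the JAX programming model (a toy example, not Gemma's actual training code): computations are written as pure functions and jit-compiled so XLA can target accelerators such as TPUs.

```python
# Toy JAX example: a jit-compiled SGD step for linear regression.
import jax
import jax.numpy as jnp

@jax.jit
def sgd_step(w, x, y, lr=0.1):
    loss_fn = lambda w_: jnp.mean((x @ w_ - y) ** 2)
    loss, grad = jax.value_and_grad(loss_fn)(w)
    return w - lr * grad, loss

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 8))
y = x @ jnp.ones((8,))   # targets from a known weight vector
w = jnp.zeros((8,))
for _ in range(100):
    w, loss = sgd_step(w, x, y)
print(float(loss))  # approaches 0 as w converges
```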
Evaluation
Benchmark Results
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:
Reasoning and factuality
Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|---|
[HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
[BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
[PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
[SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
[TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
[Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
[ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
[ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
[WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
[BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
[DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |
STEM and code
Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
[MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
[MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
[AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
[MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
[GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
[GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
[MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
[HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |
Multilingual
Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
[MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
[Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
[WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
[FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
[XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
[ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
[IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |
Multimodal
Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|
[COCOcap][coco-cap] | 102 | 111 | 116 |
[DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
[InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
[MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
[TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
[RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
[ReMI][remi] | 27.3 | 38.5 | 44.8 |
[AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
[ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
[VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
[BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
[OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
[TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
[SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
[CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |
Ethics and Safety
Evaluation Approach
Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by multiple teams with different goals and human evaluation metrics. The models were evaluated against several ethics and safety categories:
- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies such as harassment, violence, gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies.
In addition to development-level evaluations, we conduct "assurance evaluations" for responsibility governance decision-making. These evaluations are conducted separately from the model development team, and high-level findings are fed back to the model team while prompt sets are held out to prevent overfitting. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.
Evaluation Results
For all areas of safety testing, significant improvements were observed in child safety, content safety, and representational harms compared to previous Gemma models. All testing was conducted without safety filters to evaluate the model's capabilities and behaviors. Across all model sizes and for both text-to-text and image-to-text, the model produced minimal policy violations and showed significant improvements over previous Gemma models' performance regarding ungrounded inferences. A limitation of our evaluations was that they included only English language prompts.
Usage and Limitations
Intended Usage
Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not exhaustive:
- Content Creation and Communication
- Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
- Text Summarization: Generate concise summaries of text corpora, research papers, or reports.
- Image Data Extraction: Extract, interpret, and summarize visual data for text communications.
- Research and Education
- Natural Language Processing (NLP) and VLM Research: Serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
- Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
- Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
Limitations
- Training Data: The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses, and the scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity: Models are better at tasks with clear prompts and instructions. Open-ended or highly complex tasks may be challenging, and a model's performance can be influenced by the amount of context provided.
- Language Ambiguity and Nuance: Natural language is complex, and models may struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy: Models generate responses based on training data, but they are not knowledge bases and may generate incorrect or outdated factual statements.
- Common Sense: Models rely on statistical patterns in language and may lack the ability to apply common sense reasoning in certain situations.
Ethical Considerations and Risks
- Bias and Fairness: VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases in the training material. These models underwent careful scrutiny, input data pre-processing, and posterior evaluations.
- Misinformation and Misuse: VLMs can be misused to generate false, misleading, or harmful text. Guidelines for responsible use are provided in the [Responsible Generative AI Toolkit][rai-toolkit].
- Transparency and Accountability: This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.
Risks identified and mitigations:
- Perpetuation of biases: Continuous monitoring (using evaluation metrics, human review) and exploration of de-biasing techniques during model training, fine-tuning, and other use cases are encouraged.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential.
License
The model is under the [Gemma](https://example.com/gemma-license) license.
[g3-tech-report]: https://example.com/gemma3-technical-report
[rai-toolkit]: https://example.com/responsible-generative-ai-toolkit
[kaggle-gemma]: https://www.kaggle.com/gemma
[vertex-mg-gemma3]: https://example.com/vertex-model-garden-gemma3
[terms]: https://example.com/terms-of-use
[tpu]: https://en.wikipedia.org/wiki/Tensor_Processing_Unit
[jax]: https://github.com/google/jax
[ml-pathways]: https://example.com/ml-pathways
[gemini-2-paper]: https://example.com/gemini-2-paper
[safety-policies]: https://example.com/safety-policies
[sustainability]: https://example.com/sustainability-commitments
[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161
[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[gpqa]: https://arxiv.org/abs/2311.12022
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816
[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/