🚀 Gemma 3 model card
Gemma 3 is a multimodal model from Google, capable of handling text and image input and generating text output. It offers a large context window and multilingual support, and is suitable for a variety of text generation and image understanding tasks.
Model Page: Gemma
⚠️ Important Note
This repository corresponds to the 12B instruction-tuned version of the Gemma 3 model in GGUF format, quantized to Q4_0 using Quantization Aware Training (QAT). Thanks to QAT, the model preserves quality similar to `bfloat16` while significantly reducing the memory required to load it. You can find the half-precision version here.
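As a rough sketch of what that saving means for this 12B checkpoint (assuming ~4.5 bits per weight for Q4_0 once per-block scales are counted; actual GGUF files also carry metadata, so exact sizes differ):

```sh
# Back-of-the-envelope memory estimate; numbers are illustrative only.
awk 'BEGIN {
  p = 12e9                                            # ~12B parameters
  printf "bfloat16: ~%.1f GiB\n", p * 16  / 8 / 2^30  # 16 bits/weight
  printf "Q4_0:     ~%.1f GiB\n", p * 4.5 / 8 / 2^30  # ~4.5 bits/weight incl. scales
}'
```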
Resources and Technical Documentation:
- Gemma 3 Technical Report
- Responsible Generative AI Toolkit
- Gemma on Kaggle
- Gemma on Vertex Model Garden
Terms of Use: Terms
Authors: Google DeepMind
✨ Features
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained and instruction-tuned variants. Key features include:
- A large, 128K-token context window.
- Multilingual support in over 140 languages.
- Availability in more sizes than previous versions.
- Suitability for a variety of text generation and image understanding tasks, such as question answering, summarization, and reasoning.
- Ability to be deployed in environments with limited resources like laptops, desktops, or personal cloud infrastructure.
📦 Installation
To access Gemma on Hugging Face, you're required to review and agree to Google's usage license. Make sure you're logged in to Hugging Face, then accept the license on the model page; requests are processed immediately.
Property | Details |
---|---|
Model Type | gemma |
Training Data | A diverse dataset including web documents, code, mathematics, and images |
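Because the repository is gated, downloads must be authenticated. A minimal sketch using the Hugging Face CLI (accept the license first; check each tool's docs for how it picks up the token):

```sh
huggingface-cli login           # stores an access token locally
export HF_TOKEN="<your token>"  # many tools also read the token from this variable
```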
💻 Usage Examples
Basic Usage
llama.cpp (text-only)
```sh
./llama-cli -hf google/gemma-3-12b-it-qat-q4_0-gguf -p "Write a poem about the Kraken."
```
llama.cpp (image input)
```sh
wget https://github.com/bebechien/gemma/blob/main/surprise.png?raw=true -O ~/Downloads/surprise.png
./llama-gemma3-cli -hf google/gemma-3-12b-it-qat-q4_0-gguf -p "Describe this image." --image ~/Downloads/surprise.png
```
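llama.cpp also ships llama-server, an OpenAI-compatible HTTP server. Assuming a recent build in which the `-hf` download flag is available, serving this repo locally looks roughly like this sketch (not an officially documented command for this model):

```sh
./llama-server -hf google/gemma-3-12b-it-qat-q4_0-gguf --port 8080
```

Once running, any OpenAI-compatible client should be able to point at http://localhost:8080.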
ollama (text only)
Ollama does not currently support image inputs for GGUFs pulled from Hugging Face. Since this is a gated repository, please check the docs on running gated repositories.
```sh
ollama run hf.co/google/gemma-3-12b-it-qat-q4_0-gguf
```
📚 Documentation
Model Information
Description
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. They are well-suited for a variety of text generation and image understanding tasks, and their relatively small size allows for deployment in resource-limited environments.
Inputs and outputs
- Input:
- Text string, such as a question, a prompt, or a document to be summarized.
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each.
- Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size (see the sketch after this list for what this budget allows).
- Output:
- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document.
- Total output context of 8192 tokens.
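These budgets make the token arithmetic easy to check; the sketch below assumes a hypothetical 1,024-token text prompt, purely for illustration:

```sh
awk 'BEGIN {
  ctx   = 131072  # 128K-token input context (4B, 12B, and 27B sizes)
  image = 256     # tokens per 896 x 896 image
  text  = 1024    # hypothetical text-prompt budget
  printf "Images that fit alongside the prompt: %d\n", (ctx - text) / image
}'
```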
Model Data
Training Dataset
These models were trained on a diverse dataset of text data from various sources. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, and the 1B model with 2 trillion tokens. Key components include:
- Web Documents: A diverse collection of web text in over 140 languages, exposing the model to a broad range of linguistic styles, topics, and vocabulary.
- Code: Helps the model learn programming language syntax and patterns, improving its code generation and understanding abilities.
- Mathematics: Enables the model to learn logical reasoning, symbolic representation, and address mathematical queries.
- Images: Allows the model to perform image analysis and visual data extraction tasks.
Data Preprocessing
Key data cleaning and filtering methods applied to the training data include:
- CSAM Filtering: Rigorous filtering at multiple stages to exclude harmful and illegal content.
- Sensitive Data Filtering: Automated techniques to filter out personal information and other sensitive data.
- Additional methods: Filtering based on content quality and safety in line with our policies.
Implementation Information
Hardware
Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p, and TPUv5e). TPUs offer several advantages for training vision-language models:
- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs, speeding up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing them to handle large models and batch sizes during training, which can lead to better model quality.
- Scalability: TPU Pods provide a scalable solution for handling large foundation models, enabling distributed training across multiple devices.
- Cost-effectiveness: In many scenarios, TPUs can offer a more cost-effective solution for training large models, especially considering the time and resources saved.
- Sustainability: These advantages are aligned with Google's commitments to operate sustainably.
Software
Training was done using JAX and ML Pathways. JAX allows for fast, efficient training of large models on the latest hardware, including TPUs. ML Pathways is Google's effort to build artificially intelligent systems capable of generalizing across multiple tasks, which makes it well suited to foundation models such as these.
Evaluation
Benchmark Results
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:
Reasoning and factuality
Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|---|
HellaSwag | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
BoolQ | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
PIQA | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
SocialIQA | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
TriviaQA | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
Natural Questions | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
ARC-c | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
ARC-e | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
WinoGrande | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
BIG-Bench Hard | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
DROP | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |
STEM and code
Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
MMLU | 5-shot | 59.6 | 74.5 | 78.6 |
MMLU (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
AGIEval | 3-5-shot | 42.1 | 57.4 | 66.2 |
MATH | 4-shot | 24.2 | 43.3 | 50.0 |
GSM8K | 8-shot | 38.4 | 71.0 | 82.6 |
GPQA | 5-shot | 15.0 | 25.4 | 24.3 |
MBPP | 3-shot | 46.0 | 60.4 | 65.6 |
HumanEval | 0-shot | 36.0 | 45.7 | 48.8 |
Multilingual
Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|---|
MGSM | 2.04 | 34.7 | 64.3 | 74.3 |
Global-MMLU-Lite | 24.9 | 57.0 | 69.4 | 75.7 |
WMT24++ (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
FloRes | 29.5 | 39.2 | 46.0 | 48.8 |
XQuAD (all) | 43.9 | 68.0 | 74.5 | 76.8 |
ECLeKTic | 4.69 | 11.0 | 17.2 | 24.4 |
IndicGenBench | 41.4 | 57.2 | 61.7 | 63.4 |
Multimodal
Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
---|---|---|---|
COCOcap | 102 | 111 | 116 |
DocVQA (val) | 72.8 | 82.3 | 85.6 |
InfoVQA (val) | 44.1 | 54.8 | 59.4 |
MMMU (pt) | 39.2 | 50.3 | 56.1 |
TextVQA (val) | 58.9 | 66.5 | 68.6 |
RealWorldQA | 45.5 | 52.2 | 53.9 |
ReMI | 27.3 | 38.5 | 44.8 |
AI2D | 63.2 | 75.2 | 79.0 |
ChartQA | 63.6 | 74.7 | 76.3 |
VQAv2 | 63.9 | 71.2 | 72.9 |
BLINK | 38.0 | 35.9 | 39.6 |
OKVQA | 51.0 | 58.7 | 60.2 |
TallyQA | 42.5 | 51.8 | 54.3 |
SpatialSense VQA | 50.9 | 60.0 | 59.4 |
CountBenchQA | 26.1 | 17.8 | 68.0 |
Ethics and Safety
Evaluation Approach
Our evaluation methods include structured evaluations and internal red-teaming of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. Categories evaluated for ethics and safety include:
- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies such as harassment, violence, gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies.
In addition to development-level evaluations, we conduct "assurance evaluations" to inform responsibility governance decisions. These evaluations are conducted separately from the model development team and inform release decisions. High-level findings are fed back to the model team, and prompt sets are held out to prevent overfitting.
Evaluation Results
For all areas of safety testing, we saw major improvements in child safety, content safety, and representational harms compared to previous Gemma models. All testing was conducted without safety filters to evaluate model capabilities and behaviors. The model produced minimal policy violations and showed significant improvements over previous models' performance with respect to ungrounded inferences. A limitation of our evaluations was the use of only English language prompts.
Usage and Limitations
Intended Usage
Open vision-language models (VLMs) have a wide range of applications across various industries and domains. Potential uses include:
- Content Creation and Communication
- Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
- Text Summarization: Generate concise summaries of text.
- Image Data Extraction: Extract, interpret, and summarize visual data for text communications.
- Research and Education
- Natural Language Processing (NLP) and VLM Research: Serve as a foundation for research and algorithm development.
- Language Learning Tools: Support interactive language learning experiences.
- Knowledge Exploration: Assist in exploring large bodies of text.
Limitations
- Training Data: The quality and diversity of training data can influence model capabilities, and biases or gaps in the data can lead to limitations in responses.
- Context and Task Complexity: Models perform better on tasks with clear prompts, and performance can be affected by the amount of context provided.
- Language Ambiguity and Nuance: Natural language complexity can make it difficult for models to grasp subtle nuances.
- Factual Accuracy: Models may generate incorrect or outdated factual statements as they are not knowledge bases.
- Common Sense: Models may lack the ability to apply common sense reasoning in certain situations.
Citation
```bibtex
@article{gemma_2025,
  title={Gemma 3},
  url={https://goo.gle/Gemma3Report},
  publisher={Kaggle},
  author={Gemma Team},
  year={2025}
}
```
📄 License
The license for this model is gemma.