Gemma 3 model card
Gemma 3 is a multimodal model from Google, capable of handling text and image input and generating text output. It has a large context window, multilingual support, and is available in various sizes, making it suitable for a wide range of text generation and image understanding tasks.
Quick Start
This repository corresponds to the 4B instruction-tuned version of the Gemma 3 model in GGUF format, produced with Quantization Aware Training (QAT). The GGUF uses Q4_0 quantization. Thanks to QAT, the model preserves quality comparable to bfloat16 while significantly reducing the memory required to load it.
You can find the half-precision version here.
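If you prefer to work from a local copy of the weights, the Hugging Face CLI can download the repository and llama.cpp can load the file directly. A minimal sketch; the exact .gguf filename inside the repository is an assumption here, so check the downloaded folder:

```bash
# Download the repository contents (requires: pip install -U "huggingface_hub[cli]")
huggingface-cli download google/gemma-3-4b-it-qat-q4_0-gguf --local-dir ./gemma-3-4b-it-qat-q4_0-gguf

# Point llama.cpp at the downloaded file (filename below is illustrative; check the folder)
./llama-cli -m ./gemma-3-4b-it-qat-q4_0-gguf/gemma-3-4b-it-q4_0.gguf -p "Hello!"
```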
Features
- Multimodal Capability: Handles both text and image input, generating text output.
- Large Context Window: Has a 128K-token context window, enabling better handling of long-form content.
- Multilingual Support: Supports over 140 languages.
- Multiple Sizes: Available in different sizes (1B, 4B, 12B, 27B) to suit various resource requirements.
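As a quick illustration of the multilingual support listed above, the Ollama command shown in the usage examples accepts prompts in other languages directly:

```bash
# Prompt the model in French; any of the supported languages works the same way
ollama run hf.co/google/gemma-3-4b-it-qat-q4_0-gguf "Résume en deux phrases ce qu'est la quantification Q4_0."
```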
Installation
The original README describes no specific installation process. To use the model, however, you need a compatible runtime such as llama.cpp or Ollama installed; a minimal setup sketch follows.
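The sketch below assumes a Linux or macOS machine with git, CMake, and a C++ toolchain; adjust for your platform:

```bash
# Build llama.cpp from source (provides llama-cli, llama-server, and related tools)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Alternatively, install Ollama via its convenience script (Linux; see ollama.com for other platforms)
curl -fsSL https://ollama.com/install.sh | sh
```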
Usage Examples
Basic Usage
Run the model directly with llama.cpp or pull it through Ollama:

```bash
# llama.cpp
./llama-cli -hf google/gemma-3-4b-it-qat-q4_0-gguf -p "Write a poem about the Kraken."

# Ollama
ollama run hf.co/google/gemma-3-4b-it-qat-q4_0-gguf
```
Advanced Usage
Multimodal input: download an example image, then ask the model to describe it:

```bash
wget https://github.com/bebechien/gemma/blob/main/surprise.png?raw=true -O ~/Downloads/surprise.png
./llama-gemma3-cli -hf google/gemma-3-4b-it-qat-q4_0-gguf -p "Describe this image." --image ~/Downloads/surprise.png
```
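For programmatic access, llama.cpp also ships llama-server, which exposes an OpenAI-compatible HTTP endpoint. A minimal sketch, assuming your build includes llama-server and supports the same -hf flag used above:

```bash
# Start the server on port 8080, pulling the model from the Hugging Face repo
./llama-server -hf google/gemma-3-4b-it-qat-q4_0-gguf --port 8080

# In another shell: query the OpenAI-compatible chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a poem about the Kraken."}]}'
```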
Documentation
Model Information
- Description: Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants.
- Inputs and outputs:
- Input:
- Text string, such as a question, a prompt, or a document to be summarized.
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each.
- Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size.
- Output:
- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document.
- Total output context of 8192 tokens.
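These limits map directly onto runtime settings. For example, with llama.cpp the context window and the response length are set explicitly; a minimal sketch with illustrative values:

```bash
# -c sets the context size in tokens (the 4B model supports up to 128K),
# -n limits how many tokens are generated for the response (the model's output limit is 8192)
./llama-cli -hf google/gemma-3-4b-it-qat-q4_0-gguf -c 8192 -n 512 -p "Summarize this document: ..."
```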
Model Data
- Training Dataset: These models were trained on a dataset of text drawn from a wide variety of sources. The 27B model was trained on 14 trillion tokens, the 12B model on 12 trillion, the 4B model on 4 trillion, and the 1B model on 2 trillion. Key components include web documents, code, mathematics, and images.
- Data Preprocessing:
- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process.
- Sensitive Data Filtering: Automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].
Implementation Information
- Hardware: Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). TPUs offer advantages in performance, memory, scalability, and cost-effectiveness, and are aligned with [Google's commitments to operate sustainably][sustainability].
- Software: Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. JAX allows for faster and more efficient training on hardware like TPUs, and ML Pathways is suitable for building foundation models.
Evaluation
The evaluation in this section corresponds to the original checkpoint, not the QAT checkpoint.
- Benchmark Results:
- Reasoning and factuality: Evaluated on benchmarks like [HellaSwag][hellaswag], [BoolQ][boolq], etc.
- STEM and code: Tested on benchmarks such as [MMLU][mmlu], [AGIEval][agieval], etc.
- Multilingual: Benchmarked on [MGSM][mgsm], [Global-MMLU-Lite][global-mmlu-lite], etc.
- Multimodal: Evaluated using benchmarks like [COCOcap][coco-cap], [DocVQA][docvqa], etc.
Ethics and Safety
- Evaluation Approach: Includes structured evaluations and internal red-teaming testing of relevant content policies. Categories evaluated are child safety, content safety, and representational harms. Assurance evaluations are also conducted for responsibility governance decision-making.
- Evaluation Results: Major improvements were seen in child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters. A limitation is that only English language prompts were included.
Usage and Limitations
- Intended Usage:
- Content Creation and Communication: Text generation, chatbots, text summarization, image data extraction.
- Research and Education: NLP and VLM research, language learning tools, knowledge exploration.
- Limitations:
- Training Data: Quality and diversity of training data can affect model capabilities.
- Context and Task Complexity: Models may struggle with open-ended or highly complex tasks.
- Language Ambiguity and Nuance: Difficulty in grasping subtle language nuances.
- Factual Accuracy: May generate incorrect or outdated factual statements.
- Common Sense: Lack of common sense reasoning in certain situations.
- Ethical Considerations and Risks: Concerns about bias and fairness, privacy and security, and misuse of the model.
Technical Details
- Model Architecture: Based on the same research and technology as the Gemini models, enabling multimodal processing.
- Quantization: Uses Quantization Aware Training (QAT) with Q4_0 quantization in the GGUF format, reducing memory requirements while maintaining quality.
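The published GGUF already ships in Q4_0, so no conversion step is needed to use it. Purely for reference, this is roughly how a Q4_0 GGUF is produced from a half-precision GGUF with llama.cpp's quantization tool; it is a sketch with placeholder filenames and does not reproduce the QAT checkpoint, which was quantized by the model authors:

```bash
# Post-training Q4_0 quantization of a bf16/f16 GGUF with llama.cpp (illustrative filenames)
./llama-quantize gemma-3-4b-it-bf16.gguf gemma-3-4b-it-q4_0.gguf Q4_0
```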
License
The model is under the [Gemma][terms] license.
Citation
```bibtex
@article{gemma_2025,
  title={Gemma 3},
  url={https://goo.gle/Gemma3Report},
  publisher={Kaggle},
  author={Gemma Team},
  year={2025}
}
```
Additional Resources
- Model Page: Gemma
- Resources and Technical Documentation:
- [Gemma 3 Technical Report][g3-tech-report]
- [Responsible Generative AI Toolkit][rai-toolkit]
- [Gemma on Kaggle][kaggle-gemma]
- [Gemma on Vertex Model Garden][vertex-mg-gemma3]
- Terms of Use: [Terms][terms]
- Authors: Google DeepMind
[g3-tech-report]: https://example.com/g3-tech-report
[rai-toolkit]: https://example.com/rai-toolkit
[kaggle-gemma]: https://example.com/kaggle-gemma
[vertex-mg-gemma3]: https://example.com/vertex-mg-gemma3
[terms]: https://example.com/terms
[tpu]: https://example.com/tpu
[jax]: https://example.com/jax
[ml-pathways]: https://example.com/ml-pathways
[gemini-2-paper]: https://example.com/gemini-2-paper
[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161
[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[gpqa]: https://arxiv.org/abs/2311.12022
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816
[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/
[safety-policies]: https://example.com/safety-policies
[sustainability]: https://example.com/sustainability