Gemma 3 4B Instruction-tuned QAT AutoAWQ
Gemma 3 4B Instruction-tuned QAT AutoAWQ is a checkpoint converted from google/gemma-3-4b-it-qat-q4_0-gguf to AutoAWQ format and BF16 dtype. The vision tower was transplanted from google/gemma-3-4b-it.
Below is the original model card.
Quick Start
Model Information
- Base Model: google/gemma-3-4b-it
- License: gemma
- Tags: gemma3, gemma, google
- Pipeline Tag: image-text-to-text
Usage Examples
Basic Usage
llama.cpp (text-only)
./llama-cli -hf google/gemma-3-4b-it-qat-q4_0-gguf -p "Write a poem about the Kraken."
llama.cpp (image input)
wget https://github.com/bebechien/gemma/blob/main/surprise.png?raw=true -O ~/Downloads/surprise.png
./llama-gemma3-cli -hf google/gemma-3-4b-it-qat-q4_0-gguf -p "Describe this image." --image ~/Downloads/surprise.png
ollama (text only)
ollama run hf.co/google/gemma-3-4b-it-qat-q4_0-gguf
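The llama.cpp and ollama commands above target the original GGUF release. Since this repository holds the AutoAWQ-format / BF16 conversion, it can also be driven through Hugging Face transformers. The snippet below is only a sketch: the repository id is a placeholder, and it assumes a recent transformers release with Gemma 3 support (plus autoawq installed if the AWQ weights are to be loaded directly).
```python
# Minimal sketch: loading the converted checkpoint with transformers.
# "REPO_ID" is a placeholder, not a confirmed repository name.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

REPO_ID = "path/to/gemma-3-4b-it-qat-autoawq"  # placeholder

processor = AutoProcessor.from_pretrained(REPO_ID)
model = Gemma3ForConditionalGeneration.from_pretrained(
    REPO_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://github.com/bebechien/gemma/blob/main/surprise.png?raw=true"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
The chat-template pattern mirrors the upstream google/gemma-3-4b-it card; adjust max_new_tokens and the image URL as needed.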
Features
- Multimodal Capability: Handles text and image input, generating text output.
- Large Context Window: Supports a 128K context window (32K for the 1B size), enabling more comprehensive input.
- Multilingual Support: Offers support for over 140 languages.
- Open Weights: Both pre-trained and instruction-tuned variants have open weights.
Installation
No specific installation steps are provided in the original document.
Documentation
Model Information
Description
Gemma is a family of lightweight, state-of-the-art open models from Google, built on the same research and technology as the Gemini models. Gemma 3 models are multimodal, capable of handling text and image input and generating text output. They have open weights for both pre-trained and instruction-tuned variants. With a large 128K context window, multilingual support in over 140 languages, and more size options than previous versions, Gemma 3 models are suitable for various text generation and image understanding tasks. Their relatively small size allows deployment in resource-limited environments, democratizing access to advanced AI models.
Inputs and outputs
| Input | Output |
| --- | --- |
| - Text string (e.g., question, prompt, document to summarize)<br>- Images (normalized to 896 x 896 resolution and encoded to 256 tokens each)<br>- Total input context of 128K tokens for 4B, 12B, and 27B sizes; 32K tokens for 1B size | - Generated text (e.g., answer to a question, analysis of image content, summary of a document)<br>- Total output context of 8192 tokens |
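For a rough sense of the budget these figures imply, here is a back-of-the-envelope check (the 4,000-token prompt length is an assumed figure for the example, not from the original card):
```python
# Back-of-the-envelope input budget for the 4B/12B/27B sizes (128K context).
context_window = 128 * 1024   # total input tokens
tokens_per_image = 256        # each image is encoded to 256 tokens
text_tokens = 4_000           # assumed prompt/document length for this example

max_images = (context_window - text_tokens) // tokens_per_image
print(max_images)  # 496 images still fit alongside ~4K tokens of text
```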
Model Data
Training Dataset
These models were trained on a diverse text dataset, including web documents, code, mathematics, and images. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, and the 1B model with 2 trillion tokens.
Data Preprocessing
- CSAM Filtering: Rigorous filtering to exclude child sexual abuse material.
- Sensitive Data Filtering: Automated techniques to filter out personal and sensitive data.
- Additional Methods: Filtering based on content quality and safety according to our policies.
Implementation Information
Hardware
Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p, and TPUv5e). TPUs offer advantages in performance, memory, scalability, and cost-effectiveness, and their use aligns with Google's sustainability commitments.
Software
Training was done using JAX and ML Pathways. JAX enables efficient use of hardware, while ML Pathways is suitable for building foundation models.
Evaluation
Important Note
The evaluation in this section corresponds to the original checkpoint, not the QAT checkpoint.
Benchmark Results
The models were evaluated on various datasets and metrics for different aspects of text generation, including reasoning, STEM and code, multilingual, and multimodal tasks.
| Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| --- | --- | --- | --- | --- | --- |
| HellaSwag | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| BoolQ | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| ... | ... | ... | ... | ... | ... |
Ethics and Safety
Evaluation Approach
The evaluation methods include structured evaluations and internal red-teaming testing. The models were evaluated across categories such as child safety, content safety, and representational harms. Assurance evaluations are also conducted to inform responsibility governance decision making.
Evaluation Results
Significant improvements were observed in child safety, content safety, and representational harms compared to previous Gemma models. All testing was done without safety filters. However, the evaluations only included English language prompts.
Usage and Limitations
Intended Usage
- Content Creation and Communication: Text generation, chatbots, text summarization, and image data extraction.
- Research and Education: NLP and VLM research, language learning tools, and knowledge exploration.
Limitations
- Training Data: Quality and diversity of training data can affect model capabilities.
- Context and Task Complexity: Models perform better with clear prompts and instructions.
- Language Ambiguity and Nuance: May struggle with subtle language nuances.
- Factual Accuracy: May generate incorrect or outdated factual statements.
- Common Sense: May lack the ability to apply common sense reasoning in some situations.
Technical Details
Model Conversion
This checkpoint was converted from google/gemma-3-4b-it-qat-q4_0-gguf to AutoAWQ format and BF16 dtype.
Vision Tower Transplant
The vision tower was transplanted from google/gemma-3-4b-it.
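For illustration only, a transplant of this kind can be sketched by copying the donor's vision-tower parameters into the converted model by name. This is not the conversion script actually used for this checkpoint; the converted-model and output paths are placeholders, and the name filter assumes the standard vision_tower prefix used by Gemma 3 in transformers.
```python
# Minimal sketch of a vision-tower transplant (not the actual conversion script).
import torch
from transformers import Gemma3ForConditionalGeneration

# Donor model carrying the original (non-quantized) vision tower.
donor = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-4b-it", torch_dtype=torch.bfloat16
)
# Target: the checkpoint converted from the QAT GGUF weights (placeholder path).
target = Gemma3ForConditionalGeneration.from_pretrained(
    "path/to/converted-gemma-3-4b-it-qat", torch_dtype=torch.bfloat16
)

# Collect every donor parameter/buffer that belongs to the vision tower.
vision_weights = {
    name: tensor
    for name, tensor in donor.state_dict().items()
    if "vision_tower" in name
}
# strict=False leaves all non-vision weights in the target untouched.
missing, unexpected = target.load_state_dict(vision_weights, strict=False)
assert not unexpected, "vision-tower parameter names did not line up"

target.save_pretrained("gemma-3-4b-it-qat-autoawq-bf16")  # placeholder output dir
```
Matching by parameter name keeps the quantized language-model weights as they are and replaces only the vision stack.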
License
The model is distributed under the Gemma license.
Citation
@article{gemma_2025,
title={Gemma 3},
url={https://goo.gle/Gemma3Report},
publisher={Kaggle},
author={Gemma Team},
year={2025}
}