🚀 Gemma 3 27B Instruction-tuned INT4
This is a QAT INT4 Flax checkpoint (from Kaggle) converted to GGUF format for easy use. The conversion script is available on GitHub. Note that this is not the same as the official QAT INT4 GGUFs released by Google. Below is the original model card for Google Gemma 3 27B IT.
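The GGUF file can be loaded with llama.cpp or its Python bindings. Below is a minimal sketch using llama-cpp-python; the model filename and the generation settings are assumptions, so substitute the actual .gguf file shipped in this repo.

```python
# Minimal sketch: running the converted GGUF with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-27b-it-int4.gguf",  # hypothetical filename; use the file from this repo
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain INT4 quantization in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```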
🚀 Quick Start
Access Gemma on Hugging Face
To access Gemma on Hugging Face, you're required to review and agree to Google's usage license. To do this, please ensure you're logged in to Hugging Face and acknowledge the license on the model page. Requests are processed immediately.
Installation
First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0.
```shell
$ pip install -U transformers
```
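You can verify that the installed version is new enough:

```python
import transformers

# Gemma 3 support requires transformers >= 4.50.0
print(transformers.__version__)
```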
Usage Examples
Basic Usage
You can initialize the model and processor for inference with pipeline as follows.
```python
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-27b-it",
    device="cuda",
    torch_dtype=torch.bfloat16
)
```
With instruction-tuned models, you need to use chat templates to process your inputs first. Then, pass them to the pipeline.
```python
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```
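The last entry of generated_text is the assistant's reply as a chat message, so you can continue the conversation by appending it along with a new user turn. A minimal sketch reusing pipe, messages, and output from above; the follow-up question is illustrative:

```python
# Append the assistant's reply, then ask a follow-up question.
messages.append(output[0]["generated_text"][-1])
messages.append({
    "role": "user",
    "content": [{"type": "text", "text": "What colors appear on the candy?"}]
})

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```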
Advanced Usage
Running the model on a single/multi GPU
```python
# Requires accelerate for device_map="auto": pip install accelerate
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/gemma-3-27b-it"

# Load the model in bfloat16 so it matches the dtype the inputs are cast to below.
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```
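For interactive use, you can stream tokens to the console as they are generated instead of waiting for the full completion. A minimal sketch using transformers' TextStreamer, reusing model, processor, and inputs from the example above:

```python
from transformers import TextStreamer

# Print decoded tokens as they are generated; skip echoing the prompt.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=100, do_sample=False, streamer=streamer)
```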
✨ Features
Model Information
- Summary: Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants.
- Inputs and Outputs:
- Input: Text string, images (normalized to 896 x 896 resolution and encoded to 256 tokens each), with a total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size (see the context-budget sketch below).
- Output: Generated text in response to the input, with a total output context of 8192 tokens.
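Because each image has a fixed 256-token cost, budgeting the input context is simple arithmetic. Below is an illustrative sketch; treating "128K" as exactly 128,000 tokens is an assumption made for readability.

```python
# Illustrative context-budget arithmetic for the 4B/12B/27B sizes.
CONTEXT_WINDOW = 128_000  # "128K" input context; exact value assumed here
IMAGE_TOKENS = 256        # each image is normalized to 896x896 and encoded to 256 tokens

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after accounting for the encoded images."""
    return CONTEXT_WINDOW - num_images * IMAGE_TOKENS

print(remaining_text_budget(4))  # 128000 - 4*256 = 126976 tokens left for text
```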
Model Data
- Training Dataset: These models were trained on a dataset that includes web documents, code, mathematics, and images. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, and the 1B model with 2 trillion tokens.
- Data Preprocessing: Key data cleaning and filtering methods include CSAM filtering, sensitive data filtering, and additional filtering based on content quality and safety.
Implementation Information
📚 Documentation
- Model Page: Gemma
- Resources and Technical Documentation
- Terms of Use: Terms
- Authors: Google DeepMind
Citation
```
@article{gemma_2025,
    title={Gemma 3},
    url={https://goo.gle/Gemma3Report},
    publisher={Kaggle},
    author={Gemma Team},
    year={2025}
}
```
🔧 Technical Details
Benchmark Results
Reasoning and factuality
| Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| --- | --- | --- | --- | --- | --- |
| HellaSwag | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| BoolQ | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| PIQA | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| SocialIQA | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| TriviaQA | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| Natural Questions | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| ARC-c | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| ARC-e | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| WinoGrande | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| BIG-Bench Hard | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| DROP | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |
STEM and code
| Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| --- | --- | --- | --- | --- |
| MMLU | 5-shot | 59.6 | 74.5 | 78.6 |
| MMLU (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| AGIEval | 3-5-shot | 42.1 | 57.4 | 66.2 |
| MATH | 4-shot | 24.2 | 43.3 | 50.0 |
| GSM8K | 8-shot | 38.4 | 71.0 | 82.6 |
| GPQA | 5-shot | 15.0 | 25.4 | 24.3 |
| MBPP | 3-shot | 46.0 | 60.4 | 65.6 |
| HumanEval | 0-shot | 36.0 | 45.7 | 48.8 |
Multilingual
Multimodal
Ethics and Safety
- Evaluation Approach: Our evaluation methods include structured evaluations and internal red-team testing of relevant content policies. These models were evaluated against categories such as child safety, content safety, and representational harms.
- Evaluation Results: For all areas of safety testing, we saw major improvements relative to previous Gemma models. All testing was conducted without safety filters. A limitation was that only English language prompts were included.
Usage and Limitations
- Intended Usage: Open vision-language models (VLMs) have a wide range of applications, including content creation, chatbots, and text summarization.
- Limitations: Users should be aware of limitations common to large vision-language models, such as occasional factual inaccuracies, sensitivity to prompt phrasing, and biases inherited from training data.
📄 License
This model is released under the Gemma Terms of Use.