
Gemma-3-12b-it-quantized.w8a8

Developed by RedHatAI
An INT8-quantized version of google/gemma-3-12b-it that accepts vision and text input and produces text output, suitable for efficient inference deployment
Downloads 237
Release Date: 6/4/2025

Model Overview

This is a quantized multimodal model: an INT8 weight- and activation-quantized version of Gemma-3-12b-it that can be deployed efficiently with vLLM. It is suited to scenarios with vision and text input and text output.
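As a minimal sketch, the checkpoint can be served through vLLM's OpenAI-compatible server; the model ID (`RedHatAI/gemma-3-12b-it-quantized.w8a8`) and the flags shown are illustrative assumptions, to be adjusted to your hardware:

```shell
# Launch an OpenAI-compatible endpoint for the INT8 checkpoint.
# Model ID and flags are illustrative; adjust to your GPU and context needs.
vllm serve RedHatAI/gemma-3-12b-it-quantized.w8a8 \
  --max-model-len 4096 \
  --port 8000
```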

Model Features

Efficient quantization
Uses INT8 weight and INT8 activation quantization to significantly reduce model size and memory usage
Multimodal support
Supports joint image and text input for cross-modal understanding and generation
Efficient inference
Deploys efficiently on the vLLM backend, with support for batching and streaming output
Accuracy preservation
The quantized model stays close to the original model across multiple benchmarks
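The efficient-inference feature above can be sketched with vLLM's offline API. This is a sketch under assumptions: the model ID (`RedHatAI/gemma-3-12b-it-quantized.w8a8`) and the OpenAI-style multimodal message shape are taken from the description above, not verified behavior, and loading the 12B checkpoint requires a suitable GPU.

```python
def build_messages(image_url: str, question: str) -> list:
    """Build an OpenAI-style multimodal message list: one user turn
    containing an image reference followed by a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }
    ]


if __name__ == "__main__":
    # Assumed model ID; loading requires vLLM and a GPU with enough memory.
    from vllm import LLM, SamplingParams

    llm = LLM(model="RedHatAI/gemma-3-12b-it-quantized.w8a8")
    params = SamplingParams(temperature=0.2, max_tokens=256)
    messages = build_messages(
        "https://example.com/chart.png", "What does this chart show?"
    )
    outputs = llm.chat(messages, params)
    print(outputs[0].outputs[0].text)
```

The heavy model load is kept under the `__main__` guard so the message-building helper can be reused or tested without a GPU.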

Model Capabilities

Image content understanding
Multimodal dialogue
Text generation
Visual question answering

Use Cases

Content understanding
Image description generation
Generates natural language descriptions of input images
Can accurately describe the main content and scene of an image
Visual question answering
Answers natural-language questions about image content
Performs well on the MMMU and ChartQA benchmarks
Intelligent assistant
Multimodal dialogue
Conducts natural conversations combining image and text input
Can understand the image context and generate relevant responses
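The multimodal-dialogue use case above can be sketched as a request to a running vLLM server through its OpenAI-compatible chat endpoint, using only the Python standard library. The endpoint URL and model ID are assumptions, and a server must already be listening:

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/gemma-3-12b-it-quantized.w8a8"  # assumed model ID
ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed local server


def build_payload(image_url: str, question: str) -> dict:
    """Build an OpenAI-style chat-completions payload that combines an
    image reference and a text question in a single user turn."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,
    }


if __name__ == "__main__":
    # Requires a running server at ENDPOINT (see the serve sketch above).
    payload = build_payload("https://example.com/photo.jpg", "Describe this image.")
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
```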