gemma-3-12b-it-GPTQ-4b-128g: an open-source INT4 model that reduces disk and memory requirements and simplifies deployment

Gemma 3 12b It GPTQ 4b 128g

Developed by ISTA-DASLab

This model is an INT4 quantized version of google/gemma-3-12b-it. It uses the GPTQ algorithm to reduce the weights from 16 bits to 4 bits per parameter, significantly decreasing disk space and GPU memory requirements.

Image-to-Text

Transformers

#INT4 quantization #Multimodal dialogue #High compression rate

Downloads 1,175

Release Time: 4/11/2025

Model Overview

This is an INT4 quantized version of Gemma-3-12b-it, suitable for text generation and multimodal tasks. It retains most of the original model's benchmark performance while significantly reducing resource demands.

Model Features

Efficient INT4 Quantization

Uses the GPTQ algorithm to reduce the weights from 16 bits to 4 bits per parameter, cutting storage and memory requirements by approximately 75%.

Performance Retention

Recovers 98.42% of the original model's average score on the OpenLLM v1 benchmarks.

Multimodal Support

Supports joint processing of images and text, capable of understanding and describing image content.

Model Capabilities

Text generation

Image content understanding

Multimodal task processing

Dialogue systems

Use Cases

Content generation

Image caption generation

Generates detailed descriptions based on input images

Accurately identifies objects and scenes in images and generates fluent descriptions

Intelligent assistant

Multimodal dialogue

Engages in dialogue combining image and text inputs

Understands image content and answers related questions

🚀 gemma-3-12b-it-GPTQ-4b-128g

This model is a quantized version of gemma-3-12b-it that reduces disk space and GPU memory usage.

✨ Features

Quantization Optimization: The model is created by quantizing the weights of gemma-3-12b-it to the INT4 data type. This reduces the bits per parameter from 16 to 4, cutting down the disk size and GPU memory requirements by about 75%.
Selective Quantization: Only the weights of the linear operators within the language_model transformer blocks are quantized. The vision model and the multimodal projection keep their original precision.
Quantization Scheme: Weights are quantized with the GPTQ algorithm, using a symmetric per-group scheme with a group size of 128.
Checkpoint Format: The model checkpoint is saved in compressed_tensors format.

📚 Documentation

Model Overview

This model was obtained by quantizing the weights of gemma-3-12b-it to the INT4 data type. This reduces the number of bits per parameter from 16 to 4, cutting the disk size and GPU memory requirements by approximately 75%.
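As a rough check on the 75% figure, the sketch below estimates the storage cost of one quantized weight. It assumes that each group of 128 weights shares a single 16-bit scale and that the symmetric scheme stores no zero-point; the actual compressed_tensors layout may differ, and the function names are illustrative only.

```python
# Back-of-the-envelope storage cost per quantized weight.
# Assumptions (not taken from the model card): one 16-bit scale is stored
# per group, and the symmetric scheme needs no zero-point.

BASELINE_BITS = 16   # original precision per parameter
WEIGHT_BITS = 4      # INT4 weights
GROUP_SIZE = 128     # weights sharing one scale
SCALE_BITS = 16      # assumed precision of each per-group scale


def effective_bits_per_weight(weight_bits=WEIGHT_BITS,
                              group_size=GROUP_SIZE,
                              scale_bits=SCALE_BITS):
    """Bits per weight once the shared scale is amortized over the group."""
    return weight_bits + scale_bits / group_size


def reduction(baseline_bits, quantized_bits):
    """Fraction of storage saved relative to the baseline precision."""
    return 1.0 - quantized_bits / baseline_bits


bits = effective_bits_per_weight()
print(f"effective bits per weight: {bits}")                          # 4.125
print(f"reduction vs 16-bit: {reduction(BASELINE_BITS, bits):.1%}")  # 74.2%
```

Under these assumptions the quantized layers shrink by a little under 75%. Because some components stay in their original precision, the saving for the checkpoint as a whole is likely somewhat smaller than for the quantized layers alone.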

Only the weights of the linear operators within the language_model transformer blocks are quantized. The vision model and the multimodal projection are kept in their original precision. Weights are quantized with the GPTQ algorithm, using a symmetric per-group scheme with a group size of 128.
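To illustrate what a symmetric per-group scheme means, here is a minimal sketch that quantizes one weight row in groups of 128 with plain round-to-nearest. This is not GPTQ itself: GPTQ additionally uses calibration data to compensate for rounding error, which this example omits. The integer range [-7, 7] and all function names are assumptions made for the illustration.

```python
import random

GROUP_SIZE = 128
QMAX = 7  # assumed symmetric INT4 range [-7, 7]; some schemes use [-8, 7]


def quantize_group(weights):
    """Quantize one group with a single shared scale and no zero-point."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the INT4 range.
    codes = [max(-QMAX, min(QMAX, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into consecutive groups and quantize each one."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g) for g in groups]


def dequantize_row(quantized):
    """Reconstruct approximate weights: code * scale for every element."""
    return [code * scale for codes, scale in quantized for code in codes]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # synthetic weights
quantized = quantize_row(row)
restored = dequantize_row(quantized)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(quantized)} groups, max abs error {max_err:.6f}")
```

Each group stores 128 small integer codes plus one scale, so an outlier only degrades the resolution of its own group rather than the whole row.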

The model checkpoint is saved in the compressed_tensors format.

Evaluation

This model was evaluated on the OpenLLM v1 benchmarks. Model outputs were generated with the vLLM engine.

Model	ArcC	GSM8k	Hellaswag	MMLU	TruthfulQA-mc2	Winogrande	Average	Recovery
gemma-3-12b-it	0.7125	0.8719	0.8377	0.7230	0.5798	0.7893	0.7524	1.0000
gemma-3-12b-it-INT4 (this)	0.6988	0.8643	0.8254	0.7078	0.5638	0.7830	0.7405	0.9842
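The Average and Recovery columns can be re-derived from the per-task scores. The sketch below assumes that Average is the unweighted mean of the six tasks and that Recovery is the quantized average divided by the baseline average; the table does not state these definitions, but the numbers agree with them.

```python
# Per-task scores copied from the table above.
TASKS = ["ArcC", "GSM8k", "Hellaswag", "MMLU", "TruthfulQA-mc2", "Winogrande"]
baseline_scores = [0.7125, 0.8719, 0.8377, 0.7230, 0.5798, 0.7893]
int4_scores = [0.6988, 0.8643, 0.8254, 0.7078, 0.5638, 0.7830]


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


baseline_avg = mean(baseline_scores)
int4_avg = mean(int4_scores)
# Assumed definition: recovery = quantized average / baseline average.
recovery = int4_avg / baseline_avg

print(f"baseline average: {baseline_avg:.4f}")  # 0.7524
print(f"INT4 average:     {int4_avg:.4f}")      # 0.7405
print(f"recovery:         {recovery:.4f}")      # 0.9842

# Per-task ratios show where the quantized model loses the most.
for task, base, quant in zip(TASKS, baseline_scores, int4_scores):
    print(f"{task}: {quant / base:.4f}")
```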

Reproduction

The results were obtained using the following commands:

MODEL=ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g
MODEL_ARGS="pretrained=$MODEL,max_model_len=4096,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.80"

lm_eval \
  --model vllm \
  --model_args "$MODEL_ARGS" \
  --tasks openllm \
  --batch_size auto

💻 Usage Examples

Basic Usage

To use the model, first update the relevant packages:

To use the model in transformers, install the stable Gemma 3 release: pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
To use the model in vLLM, update the package to a version released after this PR.

Here is an example of inference via transformers:

# pip install accelerate

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

# **Overall Impression:** The image is a close-up shot of a vibrant garden scene, 
# focusing on a cluster of pink cosmos flowers and a busy bumblebee. 
# It has a slightly soft, natural feel, likely captured in daylight.

📄 License

This model is released under the gemma license.

Property	Details
Model Type	gemma-3-12b-it-GPTQ-4b-128g
Base Model	google/gemma-3-12b-it
Pipeline Tag	image-text-to-text
Tags	int4, vllm, llmcompressor
