Gemma-3-4B-IT-GPTQ-4B-128G Open Source Model - A Tool to Reduce Storage and Computing Resource Requirements

Gemma 3 4b It GPTQ 4b 128g

Developed by ISTA-DASLab

An INT4-quantized version of the gemma-3-4b-it model that significantly reduces storage and computational resource requirements.

Image-to-Text

Transformers

#INT4 Quantization #Multimodal Understanding #Efficient Inference

Downloads 502

Release Time: 4/11/2025

Model Overview

This model was obtained by quantizing the weights of gemma-3-4b-it to INT4, which reduces disk space and GPU memory requirements by approximately 75% while recovering 96.35% of the original model's average score on the OpenLLM v1 benchmarks.

Model Features

Efficient Quantization

Quantizes the language model weights to INT4 with the GPTQ algorithm, reducing disk size and GPU memory requirements by approximately 75%

Performance Retention

Recovers 96.35% of the original model's average score on the OpenLLM v1 benchmarks (0.6295 vs. 0.6534)

Vision-Language Capabilities

Supports multimodal input (image and text) with text output

Model Capabilities

Multimodal Understanding

Text Generation

Image Captioning

Dialogue Systems

Use Cases

Content Generation

Image Caption Generation

Generates detailed textual descriptions based on input images

Capable of accurately describing image content and scenes

Intelligent Assistants

Multimodal Dialogue

Engages in natural conversations combining image and text inputs

Provides context-aware responses

🚀 gemma-3-4b-it-GPTQ-4b-128g

This model is obtained by quantizing the weights of gemma-3-4b-it to INT4 data type, significantly reducing disk size and GPU memory requirements.

🚀 Quick Start

Model Information

Property	Details
License	gemma
Library Name	transformers
Pipeline Tag	image-text-to-text
Tags	int4, vllm, llmcompressor
Base Model	google/gemma-3-4b-it

✨ Features

This model was obtained by quantizing the weights of gemma-3-4b-it to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
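The 75% figure follows directly from the bit widths. The sketch below works through that arithmetic; the helper names are illustrative, and the optional per-group overhead assumes one 16-bit scale stored per group of 128 weights, which is an assumption about the storage layout rather than something this card states.

```python
def bits_per_weight(weight_bits, group_size=None, scale_bits=16):
    """Effective bits stored per weight.

    If group_size is given, add the amortized cost of one scale per group
    (assumed here to be a 16-bit value; the card does not specify this).
    """
    overhead = scale_bits / group_size if group_size else 0.0
    return weight_bits + overhead


def reduction(original_bits, new_bits):
    """Fraction of storage saved when going from original_bits to new_bits."""
    return 1.0 - new_bits / original_bits


# Ideal case: 16 bits -> 4 bits per quantized weight.
ideal = reduction(16, bits_per_weight(4))

# Same case, including the assumed per-group scale overhead (group size 128).
with_scales = reduction(16, bits_per_weight(4, group_size=128))

print(f"ideal reduction:       {ideal:.2%}")
print(f"with per-group scales: {with_scales:.2%}")
```

The figure applies to the quantized weights only, so the saving for a whole checkpoint depends on how much of it is left in the original precision.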

Only the weights of the linear operators within the language_model transformer blocks are quantized. The vision model and the multimodal projection are kept in their original precision. Weights are quantized using a symmetric per-group scheme with a group size of 128. The GPTQ algorithm is applied for quantization.

The model checkpoint is saved in the compressed_tensors format.
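To make the symmetric per-group scheme concrete, here is a minimal pure-Python sketch. It uses plain round-to-nearest, not GPTQ: GPTQ additionally uses second-order information from calibration data to compensate for rounding error, which is omitted here. The function names and the toy weights are hypothetical and are not part of llmcompressor or compressed_tensors.

```python
import math


def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1      # 7 for INT4
    qmin = -(2 ** (bits - 1))       # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    # Symmetric scheme: one scale per group, zero point fixed at 0.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return quantized, scale


def quantize_per_group(weights, group_size=128, bits=4):
    """Split a flat weight list into groups and quantize each independently."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate weights from (integers, scale) pairs."""
    return [q * scale for quantized, scale in groups for q in quantized]


# Toy data: 256 deterministic "weights" -> two groups of 128.
weights = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_per_group(weights, group_size=128)
restored = dequantize(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.5f}")
```

Because each group has its own scale, an outlier weight only degrades the resolution of the 128 weights that share its group rather than of the whole tensor.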

📚 Documentation

Evaluation

This model was evaluated on the OpenLLM v1 benchmarks. Model outputs were generated with the vLLM engine.

Model	ArcC	GSM8k	Hellaswag	MMLU	TruthfulQA-mc2	Winogrande	Average	Recovery
gemma-3-4b-it	0.6084	0.7528	0.7497	0.5832	0.5189	0.7072	0.6534	1.0000
gemma-3-4b-it-INT4 (this)	0.5879	0.7210	0.7358	0.5650	0.4863	0.6811	0.6295	0.9635
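The Average and Recovery columns can be re-derived from the per-task scores. The sketch below assumes that Average is the unweighted mean of the six tasks and that Recovery is the ratio of the two averages; this reading reproduces the numbers in the table, but the card does not spell out the formula.

```python
# Per-task scores copied from the table above.
base = {
    "ArcC": 0.6084, "GSM8k": 0.7528, "Hellaswag": 0.7497,
    "MMLU": 0.5832, "TruthfulQA-mc2": 0.5189, "Winogrande": 0.7072,
}
int4 = {
    "ArcC": 0.5879, "GSM8k": 0.7210, "Hellaswag": 0.7358,
    "MMLU": 0.5650, "TruthfulQA-mc2": 0.4863, "Winogrande": 0.6811,
}


def average(scores):
    """Unweighted mean over all tasks (assumed definition of 'Average')."""
    return sum(scores.values()) / len(scores)


# Assumed definition of 'Recovery': quantized average / baseline average.
recovery = average(int4) / average(base)

print(f"baseline average: {average(base):.4f}")
print(f"INT4 average:     {average(int4):.4f}")
print(f"recovery:         {recovery:.4f}")
```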

Reproduction

The results were obtained using the following commands:

MODEL=ISTA-DASLab/gemma-3-4b-it-GPTQ-4b-128g
MODEL_ARGS="pretrained=$MODEL,max_model_len=4096,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.80"

lm_eval \
  --model vllm \
  --model_args "$MODEL_ARGS" \
  --tasks openllm \
  --batch_size auto
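The same evaluation can also be launched from Python. The following is a sketch, assuming lm-evaluation-harness is installed with vLLM support and using its `simple_evaluate` entry point; the helpers `build_model_args` and `run_openllm` are hypothetical names introduced only for this example.

```python
def build_model_args(**kwargs):
    """Join keyword arguments into the comma-separated key=value string
    that lm_eval accepts as model_args."""
    return ",".join(f"{key}={value}" for key, value in kwargs.items())


MODEL = "ISTA-DASLab/gemma-3-4b-it-GPTQ-4b-128g"
MODEL_ARGS = build_model_args(
    pretrained=MODEL,
    max_model_len=4096,
    tensor_parallel_size=1,
    dtype="auto",
    gpu_memory_utilization="0.80",
)


def run_openllm(model_args=MODEL_ARGS):
    """Run the openllm task group through the vLLM backend.

    lm_eval is imported lazily so this module can be loaded
    (e.g. to inspect MODEL_ARGS) without the harness installed.
    """
    import lm_eval

    return lm_eval.simple_evaluate(
        model="vllm",
        model_args=model_args,
        tasks=["openllm"],
        batch_size="auto",
    )


print(MODEL_ARGS)
```

Calling `run_openllm()` requires a GPU and downloads the checkpoint, so it is left out of the snippet above.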

💻 Usage Examples

Basic Usage

# pip install accelerate  (needed for device_map="auto")

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "ISTA-DASLab/gemma-3-4b-it-GPTQ-4b-128g"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

# **Overall Impression:** The image is a close-up shot of a vibrant garden scene, 
# focusing on a cluster of pink cosmos flowers and a busy bumblebee. 
# It has a slightly soft, natural feel, likely captured in daylight.

Advanced Usage

To use the model in transformers, update the package to the stable Gemma 3 release:

pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

To use the model in vLLM, update the package to a version released after this PR.
