# gemma-3-27b-it-GPTQ-4b-128g
This model was obtained by quantizing the weights of gemma-3-27b-it to the INT4 data type, reducing disk size and GPU memory requirements.
## Quick Start
This model is a quantized version of gemma-3-27b-it, which significantly reduces disk size and GPU memory requirements. You can follow the steps below to use this model.
## Features
- Quantization Optimization: The weights of gemma-3-27b-it are quantized to the INT4 data type, reducing the number of bits per parameter from 16 to 4 and cutting disk size and GPU memory requirements by approximately 75%.
- Partial Quantization: Only the weights of the linear operators within the `language_model` transformer blocks are quantized; the vision model and the multimodal projector are kept in the original precision.
- Symmetric Per-Group Scheme: Weights are quantized using a symmetric per-group scheme with group size 128 (see the sketch after this list).
- GPTQ Algorithm: The GPTQ algorithm is applied for quantization.
- Compressed Format: The model checkpoint is saved in the compressed_tensors format.
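The symmetric per-group scheme can be made concrete with a short sketch. The snippet below is a plain round-to-nearest version of the scheme (one scale per 128 weights, no zero point) and is for illustration only; it is not the GPTQ procedure, which additionally minimizes layer-wise reconstruction error when choosing the rounded values.

```python
import torch

def quantize_symmetric_per_group(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
    """Round-to-nearest symmetric per-group quantization of a 2-D weight matrix."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    qmax = 2 ** (num_bits - 1) - 1                            # 7 for INT4
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    scales = groups.abs().amax(dim=-1, keepdim=True) / qmax   # one FP scale per group of 128 weights
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

w = torch.randn(4096, 4096)
q, scales = quantize_symmetric_per_group(w)
w_hat = (q.float() * scales).reshape(w.shape)                 # dequantize to check the rounding error
print(f"max abs error: {(w - w_hat).abs().max():.4f}")
```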
## Installation
- To use the model in `transformers`, update the package to the stable Gemma 3 release:

  ```
  pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
  ```

- To use the model in vLLM, update the package to a version released after this PR. A minimal loading sketch follows this list.
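Once vLLM is updated, the checkpoint can be loaded like any other Hugging Face model. The following is a minimal offline-inference sketch; the `max_model_len` and sampling settings are illustrative choices, not requirements.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g", max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=128)

# Text-only chat request (image inputs require vLLM's multimodal message format).
outputs = llm.chat(
    [{"role": "user", "content": "Summarize what INT4 weight quantization changes about a model."}],
    params,
)
print(outputs[0].outputs[0].text)
```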
## Usage Examples
### Basic Usage
```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g"

# Load the quantized checkpoint; device_map="auto" places the weights on the available GPU(s).
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Build the multimodal prompt and move the tensors to the model device.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

# Decode only the newly generated tokens.
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```
## Documentation
### Model Overview
This model was obtained by quantizing the weights of gemma-3-27b-it to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights of the linear operators within the `language_model` transformer blocks are quantized; the vision model and the multimodal projector are kept in the original precision. Weights are quantized using a symmetric per-group scheme with group size 128. The GPTQ algorithm is applied for quantization.

The model checkpoint is saved in the compressed_tensors format.
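To verify the scheme without downloading the full weights, one option is to read the quantization section of the checkpoint's config.json. This is a sketch and assumes the compressed_tensors settings are exposed under `quantization_config`, as is usual for such checkpoints; the exact keys depend on the compressed_tensors version.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g")
# Expected to list the 4-bit, group-size-128, symmetric settings and the modules kept in full precision.
print(config.quantization_config)
```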
### Evaluation
This model was evaluated on the OpenLLM v1 benchmarks. Model outputs were generated with the vLLM engine.
| Model | ArcC | GSM8k | Hellaswag | MMLU | TruthfulQA-mc2 | Winogrande | Average | Recovery |
|---|---|---|---|---|---|---|---|---|
| gemma-3-27b-it | 0.7491 | 0.9181 | 0.8582 | 0.7742 | 0.6222 | 0.7908 | 0.7854 | 1.0000 |
| gemma-3-27b-it-INT4 (this) | 0.7415 | 0.9174 | 0.8496 | 0.7662 | 0.6160 | 0.7956 | 0.7810 | 0.9944 |
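The Average and Recovery columns follow from the per-task scores: Average is the mean of the six benchmark results, and Recovery is consistent with the quantized average divided by the baseline average. A quick check:

```python
baseline  = [0.7491, 0.9181, 0.8582, 0.7742, 0.6222, 0.7908]
quantized = [0.7415, 0.9174, 0.8496, 0.7662, 0.6160, 0.7956]

avg_base  = sum(baseline) / len(baseline)       # ~0.7854
avg_quant = sum(quantized) / len(quantized)     # ~0.7810
print(f"recovery: {avg_quant / avg_base:.4f}")  # 0.9944
```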
### Reproduction
The results were obtained using the following commands:
```
MODEL=ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g
MODEL_ARGS="pretrained=$MODEL,max_model_len=4096,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.80"
lm_eval \
  --model vllm \
  --model_args $MODEL_ARGS \
  --tasks openllm \
  --batch_size auto
```
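The command assumes both the evaluation harness and vLLM are already installed; if they are not, one possible route (package names as published on PyPI, shown here as a sketch) is:

```
pip install lm_eval vllm
```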
## License
This model is distributed under the gemma license.
| Property | Details |
|---|---|
| Model Type | gemma-3-27b-it-GPTQ-4b-128g |
| Training Data | Not mentioned in the original README |
| Library Name | transformers |
| Pipeline Tag | image-text-to-text |
| Tags | int4, vllm, llmcompressor |
| Base Model | google/gemma-3-27b-it |
| License | gemma |