# gemma-3-27b-it-GPTQ-4b-128g
This model was obtained by quantizing the weights of gemma-3-27b-it to the INT4 data type, reducing disk size and GPU memory requirements.
## Quick Start
This model is a quantized version of gemma-3-27b-it, which significantly reduces disk size and GPU memory requirements. You can follow the steps below to use this model.
## Features
- Quantization Optimization: The weights of gemma-3-27b-it are quantized to the INT4 data type, reducing the number of bits per parameter from 16 to 4 and cutting disk size and GPU memory requirements by approximately 75%.
- Partial Quantization: Only the weights of the linear operators within the `language_model` transformer blocks are quantized; the vision model and the multimodal projector are kept in the original precision.
- Symmetric Per-Group Scheme: Weights are quantized using a symmetric per-group scheme with group size 128 (see the sketch after this list).
- GPTQ Algorithm: The GPTQ algorithm is applied for quantization.
- Compressed Format: The model checkpoint is saved in the compressed_tensors format.
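The symmetric per-group scheme can be made concrete with a short sketch. The snippet below is a plain round-to-nearest version of the scheme (one scale per 128 weights, no zero point) and is for illustration only; it is not the GPTQ procedure, which additionally minimizes layer-wise reconstruction error when choosing the rounded values.

```python
import torch

def quantize_symmetric_per_group(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
    """Round-to-nearest symmetric per-group quantization of a 2-D weight matrix."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    qmax = 2 ** (num_bits - 1) - 1                            # 7 for INT4
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    scales = groups.abs().amax(dim=-1, keepdim=True) / qmax   # one FP scale per group of 128 weights
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

w = torch.randn(4096, 4096)
q, scales = quantize_symmetric_per_group(w)
w_hat = (q.float() * scales).reshape(w.shape)                 # dequantize to check the rounding error
print(f"max abs error: {(w - w_hat).abs().max():.4f}")
```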
## Installation
- To use the model in `transformers`, update the package to the stable Gemma 3 release:

  ```
  pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
  ```

- To use the model in vLLM, update the package to a version released after this PR. A minimal loading sketch follows this list.
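Once vLLM is updated, the checkpoint can be loaded like any other Hugging Face model. The following is a minimal offline-inference sketch; the `max_model_len` and sampling settings are illustrative choices, not requirements.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g", max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=128)

# Text-only chat request (image inputs require vLLM's multimodal message format).
outputs = llm.chat(
    [{"role": "user", "content": "Summarize what INT4 weight quantization changes about a model."}],
    params,
)
print(outputs[0].outputs[0].text)
```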
## Usage Examples
### Basic Usage
```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g"

# Load the quantized checkpoint; device_map="auto" places the weights on the available GPU(s).
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Build the multimodal prompt and move the tensors to the model device.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

# Decode only the newly generated tokens.
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```
## Documentation
### Model Overview
This model was obtained by quantizing the weights of gemma-3-27b-it to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights of the linear operators within the `language_model` transformer blocks are quantized; the vision model and the multimodal projector are kept in the original precision. Weights are quantized using a symmetric per-group scheme with group size 128. The GPTQ algorithm is applied for quantization.

The model checkpoint is saved in the compressed_tensors format.
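To verify the scheme without downloading the full weights, one option is to read the quantization section of the checkpoint's config.json. This is a sketch and assumes the compressed_tensors settings are exposed under `quantization_config`, as is usual for such checkpoints; the exact keys depend on the compressed_tensors version.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g")
# Expected to list the 4-bit, group-size-128, symmetric settings and the modules kept in full precision.
print(config.quantization_config)
```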
### Evaluation
This model was evaluated on the OpenLLM v1 benchmarks. Model outputs were generated with the vLLM engine.
| Model | ArcC | GSM8k | Hellaswag | MMLU | TruthfulQA-mc2 | Winogrande | Average | Recovery |
|---|---|---|---|---|---|---|---|---|
| gemma-3-27b-it | 0.7491 | 0.9181 | 0.8582 | 0.7742 | 0.6222 | 0.7908 | 0.7854 | 1.0000 |
| gemma-3-27b-it-INT4 (this) | 0.7415 | 0.9174 | 0.8496 | 0.7662 | 0.6160 | 0.7956 | 0.7810 | 0.9944 |
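The Average and Recovery columns follow from the per-task scores: Average is the mean of the six benchmark results, and Recovery is consistent with the quantized average divided by the baseline average. A quick check:

```python
baseline  = [0.7491, 0.9181, 0.8582, 0.7742, 0.6222, 0.7908]
quantized = [0.7415, 0.9174, 0.8496, 0.7662, 0.6160, 0.7956]

avg_base  = sum(baseline) / len(baseline)       # ~0.7854
avg_quant = sum(quantized) / len(quantized)     # ~0.7810
print(f"recovery: {avg_quant / avg_base:.4f}")  # 0.9944
```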
### Reproduction
The results were obtained using the following commands:
```
MODEL=ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g
MODEL_ARGS="pretrained=$MODEL,max_model_len=4096,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.80"
lm_eval \
  --model vllm \
  --model_args $MODEL_ARGS \
  --tasks openllm \
  --batch_size auto
```
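The command assumes both the evaluation harness and vLLM are already installed; if they are not, one possible route (package names as published on PyPI, shown here as a sketch) is:

```
pip install lm_eval vllm
```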
## License
This model is distributed under the gemma license.
| Property | Details |
|---|---|
| Model Type | gemma-3-27b-it-GPTQ-4b-128g |
| Training Data | Not mentioned in the original README |
| Library Name | transformers |
| Pipeline Tag | image-text-to-text |
| Tags | int4, vllm, llmcompressor |
| Base Model | google/gemma-3-27b-it |
| License | gemma |