🚀 PaliGemma model card
PaliGemma is a versatile and lightweight vision-language model (VLM). It takes both image and text as input and generates text as output, supporting multiple languages. The model is intended mainly for research purposes, and its weights are available in float32, bfloat16, and float16 formats.
Model page: PaliGemma
The Transformers PaliGemma 3B weights are fine-tuned with 224×224 input images on the RSVQA-LR dataset. The fine-tune config is available at big_vision.
Resources and technical documentation:
Terms of Use: Terms
Authors: Google
✨ Features
Model information
Model summary
PaliGemma is inspired by PaLI-3 and based on open components such as the SigLIP vision model and the Gemma language model.
| Property | Details |
| --- | --- |
| Model Type | A composition of a Transformer decoder and a Vision Transformer image encoder, with a total of 3 billion parameters. The text decoder is initialized from Gemma-2B, and the image encoder is initialized from SigLIP-So400m/14. |
| Input | Image and text string, such as a prompt to caption the image, or a question. |
| Output | Generated text in response to the input, such as a caption of the image, an answer to a question, a list of object bounding box coordinates, or segmentation codewords. |
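As a quick sanity check of this composition, you can inspect the checkpoint's configuration without downloading any weights (a minimal sketch; the model_type values mentioned in the comments are an assumption about what current Transformers releases report):
from transformers import AutoConfig

# Fetches only the config JSON from the Hub, not the weights
config = AutoConfig.from_pretrained("google/paligemma-3b-mix-224")
print(config.vision_config.model_type)  # the SigLIP vision tower
print(config.text_config.model_type)    # the Gemma text decoder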
Model data
PaliGemma is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L/VQG-CC3M-35L, OpenImages, and WIT. Several data responsibility filters are applied to WebLI to ensure clean training data.
How to Use
PaliGemma is a single-turn vision-language model not suitable for conversational use, and it works best when fine-tuned to a specific use case. You can configure the task it solves by using task prefixes, as illustrated below. For interactive testing, you can use the "mix" family of models.
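For illustration, a prompt consists of a task prefix followed by any free-text arguments. A few commonly documented prefixes are sketched below; the exact set a checkpoint accepts depends on how it was trained, so treat these as examples rather than an exhaustive list:
# Example task-prefix prompts (the "mix" checkpoints accept these directly;
# fine-tuned checkpoints may only support the task they were trained on):
prompts = [
    "caption en",                        # image captioning in English
    "ocr",                               # read text appearing in the image
    "answer en what color is the car?",  # visual question answering
    "detect car",                        # bounding-box tokens for "car"
    "segment car",                       # segmentation codewords for "car"
]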
Use in Transformers
You can use the following code snippets to run PaliGemma in different scenarios, such as running on CPU with the default precision, running other precisions on CUDA, and loading in 4-bit/8-bit.
Implementation information
PaliGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e) and software including JAX, Flax, TFDS, and big_vision.
Evaluation information
The transferability of PaliGemma to a wide variety of academic tasks is verified by fine-tuning on each task and by training a mix model. Benchmark results are reported at different resolutions for both the mix model and the single-task fine-tuned models.
📦 Installation
To run inference in 8-bit or 4-bit precision, you need to install bitsandbytes and accelerate:
pip install bitsandbytes accelerate
💻 Usage Examples
Basic Usage
Running the default precision (float32) on CPU:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"

# Load an example image from the Hub
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Default precision (float32) on CPU
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to caption the image in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]

# Greedy decoding; strip the prompt tokens from the output before decoding
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
Advanced Usage
Running other precisions on CUDA:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

# Load an example image from the Hub
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Load the bfloat16 revision of the weights directly onto the GPU
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to caption the image in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
Loading in 4-bit/8-bit:
from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"

# Load an example image from the Hub
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Quantize the weights to 8-bit on load (requires bitsandbytes and a CUDA GPU)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to caption the image in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
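The snippet above loads the model in 8-bit. For 4-bit, swap in a 4-bit quantization config; a minimal sketch (the bfloat16 compute dtype here is an assumption, not prescribed by this card):
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: compute in bfloat16
)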
🔧 Technical Details
Benchmark results
Mix model (fine-tune on mixture of transfer tasks)
| Benchmark | Metric (split) | mix-224 | mix-448 |
| --- | --- | --- | --- |
| MMVP | Paired Accuracy | 46.00 | 45.33 |
| POPE | Accuracy (random/popular/adversarial) | 88.00 / 86.63 / 85.67 | 89.37 / 88.40 / 87.47 |
| GQA | Accuracy (test) | 65.20 | 65.47 |
Single task (fine-tune on single task)
Benchmark results for single-task fine-tuned models cover captioning, question answering, segmentation, and video tasks at different resolutions (pt-224, pt-448, pt-896).
📄 License
The license for this model is gemma.
⚠️ Important Note
To access PaliGemma on Hugging Face, you are required to review and agree to Google's usage license. To do this, make sure you are logged in to Hugging Face and accept the license on the model page. Requests are processed immediately.
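Once access is granted, authenticate your local environment so that from_pretrained can download the gated weights; a minimal sketch using huggingface_hub:
from huggingface_hub import login

# Prompts for a User Access Token with read permission
login()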
💡 Usage Tip
The model in the repo you are browsing may have been trained for other tasks. Please make sure you use appropriate inputs for the task at hand.