🚀 PaliGemma Model Card
PaliGemma is a vision-language model that takes both image and text as input and generates text output. It is fine-tuned on the SciCap dataset and offers weights in multiple formats for research purposes.
🚀 Quick Start
To access PaliGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click the "Acknowledge license" button below. Requests are processed immediately.
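Once access has been granted, you also need to authenticate your local environment before the weights can be downloaded. A minimal sketch using the huggingface_hub library (any method of supplying a read-access token works; the interactive prompt shown here is just one option):

from huggingface_hub import login

# Prompts for a Hugging Face access token so gated repositories
# such as google/paligemma-3b-mix-224 can be downloaded.
login()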
✨ Features
- Versatile Input: Accepts both image and text as input.
- Multiple Formats: Available in float32, bfloat16, and float16 formats.
- Rich Capabilities: Capable of tasks like image captioning, question answering, object detection, and segmentation.
📦 Installation
To automatically run inference using 8-bit or 4-bit precision, you need to install bitsandbytes and accelerate:
pip install bitsandbytes accelerate
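A quick way to confirm both packages are available in the current environment (a trivial check, not something the model card requires):

from importlib.metadata import version

# Print the installed versions of the optional quantization dependencies.
print("bitsandbytes", version("bitsandbytes"))
print("accelerate", version("accelerate"))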
💻 Usage Examples
Basic Usage
Running the default precision (float32) on CPU:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
Advanced Usage
Running other precisions on CUDA:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
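The example above pulls weights from the "bfloat16" revision of the repository. If a "float16" revision is published as well (the feature list above mentions float32, bfloat16, and float16 weights), the same call can be pointed at it; a sketch under that assumption:

# Hypothetical variant of the call above: load half-precision float16 weights
# from a "float16" revision, assuming the repository publishes one.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=device,
    revision="float16",
).eval()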
Loading in 4-bit / 8-bit:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
from transformers import BitsAndBytesConfig  # BitsAndBytesConfig is provided by transformers, not bitsandbytes.nn
model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
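The snippet above exercises the 8-bit path; 4-bit loading follows the same pattern with a different BitsAndBytesConfig. A minimal sketch (the nf4 quantization type and bfloat16 compute dtype are common bitsandbytes settings, not something mandated by this model card):

# 4-bit quantization sketch: NF4 weights with bfloat16 compute.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()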
📚 Documentation
Model information
Model summary
PaliGemma is a versatile and lightweight vision-language model (VLM) inspired by PaLI-3 and based on open components such as the SigLIP vision model and the Gemma language model. It takes both image and text as input and generates text as output, supporting multiple languages.
- Model architecture: It is the composition of a Transformer decoder and a Vision Transformer image encoder, with a total of 3 billion parameters. The text decoder is initialized from Gemma-2B, and the image encoder is initialized from SigLIP-So400m/14 (see the short inspection sketch after this list).
- Inputs and outputs:
- Input: Image and text string, such as a prompt to caption the image, or a question.
- Output: Generated text in response to the input, such as a caption of the image, an answer to a question, a list of object bounding box coordinates, or segmentation codewords.
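As a quick sanity check of this composition, the loaded transformers model exposes both sub-configurations and its parameters can be counted directly; a small sketch, assuming the model from the usage examples above is already loaded:

# The config is split into a vision part (SigLIP encoder) and a text part (Gemma decoder).
print(model.config.vision_config.model_type)
print(model.config.text_config.model_type)
# The total parameter count should come out at roughly 3 billion.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")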
Model data
- Pre-training datasets: PaliGemma is pre-trained on a mixture of datasets including WebLI, CC3M-35L, VQ²A-CC3M-35L/VQG-CC3M-35L, OpenImages, and WIT.
- Data responsibility filtering: Filters are applied to WebLI to ensure clean training data, including pornographic image filtering, text safety filtering, text toxicity filtering, text personal information filtering, and additional content-based filtering.
How to Use
PaliGemma is a single-turn vision-language model not meant for conversational use, and it works best when fine-tuned to a specific use case. You can configure which task the model will solve by conditioning it with task prefixes, such as “detect” or “segment”.
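For example, the same image can be routed to different tasks simply by changing the prefix; a sketch reusing the processor, model, and image from the examples above (the exact list of supported prefixes is documented with the checkpoints, and the ones below are typical):

# Each prefix conditions the model on a different task.
prompts = [
    "caption en",                          # English image captioning
    "answer en What color is the car?",    # visual question answering
    "detect car",                          # detection: output contains location tokens
    "segment car",                         # segmentation: output contains codeword tokens
]
for prompt in prompts:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    print(prompt, "->", processor.decode(output[0][input_len:], skip_special_tokens=True))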
Implementation information
- Hardware: PaliGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e).
- Software: Training was done using JAX, Flax, TFDS, and big_vision.
Evaluation information
Benchmark results
- Mix model (fine-tuned on a mixture of transfer tasks): Results are reported on benchmarks such as MMVP, POPE, and GQA at different resolutions (mix-224 and mix-448).
- Single task (fine-tuned on a single task): Results are presented for various tasks including captioning, question answering, segmentation, and video tasks at different resolutions (pt-224, pt-448, pt-896).
| Benchmark Type | Benchmark | Metric (split) | mix-224 | mix-448 | pt-224 | pt-448 | pt-896 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mix model | MMVP | Paired Accuracy | 46.00 | 45.33 | - | - | - |
| Mix model | POPE | Accuracy (random/popular/adversarial) | 88.00 / 86.63 / 85.67 | 89.37 / 88.40 / 87.47 | - | - | - |
| Mix model | GQA | Accuracy (test) | 65.20 | 65.47 | - | - | - |
| Single task (Captioning) | COCO captions (train+restval) | CIDEr (val) | - | - | 141.92 | 144.60 | - |
| Single task (Captioning) | NoCaps (Eval of COCO captions transfer) | CIDEr (val) | - | - | 121.72 | 123.58 | - |
| ... | ... | ... | ... | ... | ... | ... | ... |
🔧 Technical Details
PaliGemma's architecture combines a Transformer decoder and a Vision Transformer image encoder. The pre-training on diverse datasets and the application of data filtering techniques contribute to its performance and reliability. The use of JAX, Flax, TFDS, and big_vision in the training process enables efficient utilization of TPU hardware.
📄 License
PaliGemma is released under the Gemma license; use of the model is subject to the Gemma Terms of Use.
- Terms of Use: Gemma Terms of Use
- Model page: PaliGemma