🚀 PaliGemma Model Card
PaliGemma is a versatile and lightweight vision-language model. It takes both image and text as input and generates text output, supporting multiple languages. It is fine-tuned on the OCR-VQA dataset and available in various formats for research purposes.
🚀 Quick Start
PaliGemma is a single-turn vision-language model and is not meant for conversational use. It works best when fine-tuned to a specific use case. You select the task by starting the prompt with a task prefix. For interactive testing, use the "mix" family of models. Refer to the usage and limitations section or the blog post for details.
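As a rough illustration of prefix-based prompting, the sketch below builds prompt strings from a task name. Only the "caption es" form appears in this card; the other prefix spellings ("ocr", "answer", "detect", "segment") and the helper name `build_prompt` are assumptions made for the example, so check them against the model documentation before relying on them.

```python
def build_prompt(task: str, *, lang: str = "en", text: str = "") -> str:
    """Build a single-turn prompt string that starts with a task prefix.

    Hypothetical helper. Apart from "caption <lang>", which matches the
    "caption es" prompt used in this card, the templates are assumptions.
    """
    templates = {
        "caption": "caption {lang}",
        "ocr": "ocr",
        "answer": "answer {lang} {text}",
        "detect": "detect {text}",
        "segment": "segment {text}",
    }
    if task not in templates:
        raise ValueError(f"unknown task: {task!r}")
    if task in ("answer", "detect", "segment") and not text:
        raise ValueError(f"task {task!r} needs a non-empty text argument")
    return templates[task].format(lang=lang, text=text)


print(build_prompt("caption", lang="es"))   # caption es
print(build_prompt("detect", text="car"))   # detect car
```

The resulting string is what you would pass as `text=` to the processor, one prompt per call, since the model is single-turn.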
✨ Features
- Versatile Input and Output: Accepts both image and text as input and generates text output, supporting multiple languages.
- Rich Capabilities: Capable of question answering, captioning, segmentation, etc., when fine-tuned.
- Multiple Formats: Available in float32, bfloat16 and float16 formats for research.
📦 Installation
To use the model in 4-bit or 8-bit precision, install bitsandbytes and accelerate:
pip install bitsandbytes accelerate
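To get a feel for why lower precision helps, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes the roughly 3 billion parameters stated in this card and an idealized bytes-per-parameter figure for each format. Real usage will be higher, because activations, buffers, and any layers left unquantized are not counted.

```python
# Approximate parameter count ("3 billion params" in this card).
PARAMS = 3_000_000_000

# Idealized storage cost per parameter; an assumption for this estimate.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_gb(precision: str, params: int = PARAMS) -> float:
    """Return the weight-only footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for name in BYTES_PER_PARAM:
    print(f"{name:>8}: ~{weight_gb(name):.1f} GB")
```

Under these assumptions, float32 weights need about 12 GB, the 16-bit formats about 6 GB, 8-bit about 3 GB, and 4-bit about 1.5 GB.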
💻 Usage Examples
Basic Usage
Running the default precision (float32) on CPU
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is decoded.
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
Output: Un auto azul estacionado frente a un edificio.
Advanced Usage
Running other precisions on CUDA
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is decoded.
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
Loading in 4-bit / 8-bit
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
# BitsAndBytesConfig is provided by transformers, not by bitsandbytes.nn.
from transformers import BitsAndBytesConfig
model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is decoded.
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
📚 Documentation
Model information
Model summary
Description
PaliGemma is a versatile and lightweight vision-language model (VLM) inspired by PaLI-3 and based on open components such as the SigLIP vision model and the Gemma language model. It supports multiple languages and is designed for class-leading fine-tune performance on various vision-language tasks.
Model architecture
PaliGemma is composed of a Transformer decoder and a Vision Transformer image encoder, with a total of 3 billion parameters. The text decoder is initialized from Gemma-2B, and the image encoder is initialized from SigLIP-So400m/14. It is trained following the PaLI-3 recipes.
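The "/14" in the encoder name refers to the patch size, which allows a quick estimate of how many image tokens the encoder produces at each input resolution. The sketch below assumes square inputs and one token per non-overlapping 14x14 patch with no pooling. The resolutions are taken from the checkpoint names in this card (224, 448, 896), and the function name is illustrative.

```python
PATCH_SIZE = 14  # from the "/14" in SigLIP-So400m/14


def num_image_tokens(resolution: int, patch: int = PATCH_SIZE) -> int:
    """Number of patches for a square image (assumes no pooling)."""
    if resolution % patch != 0:
        raise ValueError(f"{resolution} is not a multiple of {patch}")
    side = resolution // patch
    return side * side


for res in (224, 448, 896):
    print(res, num_image_tokens(res))
# 224 256
# 448 1024
# 896 4096
```

Under these assumptions, each doubling of the resolution quadruples the number of image tokens the decoder has to attend over, which is worth keeping in mind when choosing a checkpoint.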
Inputs and outputs
- Input: Image and text string, such as a prompt to caption the image or a question.
- Output: Generated text in response to the input, such as a caption of the image, an answer to a question, a list of object bounding box coordinates, or segmentation codewords.
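When the output is a list of bounding boxes, the generated text has to be parsed back into coordinates. This card does not specify the encoding, so the sketch below rests on assumptions: each box is four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max, the values are quantized into 1024 bins, each box is followed by its label, and boxes are separated by ";". Treat it as an illustration of the parsing step, not as the reference decoder.

```python
import re

LOC_BINS = 1024  # assumption: coordinates are quantized into 1024 bins

# Four consecutive <locNNNN> tokens followed by a label (assumed layout).
_BOX = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_boxes(text: str, width: int, height: int) -> list[dict]:
    """Convert assumed '<loc....>' output into pixel-space boxes.

    Returns dicts with 'label' and 'box' = (x_min, y_min, x_max, y_max).
    """
    boxes = []
    for match in _BOX.finditer(text):
        values = re.findall(r"<loc(\d{4})>", match.group(1))
        # Assumed token order: y_min, x_min, y_max, x_max.
        y0, x0, y1, x1 = (int(v) / LOC_BINS for v in values)
        boxes.append({
            "label": match.group(2).strip(),
            "box": (x0 * width, y0 * height, x1 * width, y1 * height),
        })
    return boxes


sample = "<loc0256><loc0128><loc0768><loc0896> car"
print(parse_boxes(sample, width=1024, height=1024))
# [{'label': 'car', 'box': (128.0, 256.0, 896.0, 768.0)}]
```

Pass the width and height of the original image so the normalized coordinates are scaled back to pixels.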
Model data
Pre-train datasets
PaliGemma is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L/VQG-CC3M-35L, OpenImages, and WIT.
Data responsibility filtering
The following filters are applied to WebLI so that PaliGemma is trained on clean data: pornographic image filtering, text safety filtering, text toxicity filtering, text personal information filtering, and additional methods based on content quality and safety.
Implementation information
Hardware
PaliGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e).
Software
Training was done using JAX, Flax, TFDS, and big_vision.
Evaluation information
Benchmark results
Mix model (fine-tune on mixture of transfer tasks)
| Benchmark | Metric (split) | mix-224 | mix-448 |
|---|---|---|---|
| MMVP | Paired Accuracy | 46.00 | 45.33 |
| POPE | Accuracy (random/popular/adversarial) | 88.00 86.63 85.67 | 89.37 88.40 87.47 |
| GQA | Accuracy (test) | 65.20 | 65.47 |
Single task (fine-tune on single task)
| Benchmark (train split) | Metric (split) | pt-224 | pt-448 | pt-896 |
|---|---|---|---|---|
| Captioning | COCO cap | | | |
🔧 Technical Details
Model page: PaliGemma
Resources and technical documentation
Terms of Use: Terms
Authors: Google
Extra Gated Information
- Access PaliGemma on Hugging Face: To access PaliGemma on Hugging Face, you are required to review and agree to Google's usage license. To do this, make sure you are logged in to Hugging Face and click below. Requests are processed immediately.
- Button Content: Acknowledge license
📄 License
The license for this model is gemma.