# 🚀 PaliGemma Model Card
PaliGemma is a versatile and lightweight vision-language model. It takes both images and text as input and generates text output, supporting multiple languages. It is suitable for a wide range of vision-language tasks and is intended mainly for research purposes.
## 🚀 Quick Start
To access PaliGemma on Hugging Face, you must review and agree to Google's usage license. Make sure you are logged in to Hugging Face and acknowledge the license on the model page; requests are processed immediately.
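If you then load the gated weights from a script, you may also need to authenticate with an access token. A minimal sketch using `huggingface_hub` (the assumption here is that you have already created a read token in your account settings):

```python
from huggingface_hub import login

# Prompts for a Hugging Face access token; needed because the
# PaliGemma weights are gated behind the license acknowledgement.
login()
```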
## ✨ Features

- Versatile input and output: accepts both image and text as input and generates text output, supporting multiple languages.
- Fine-tuned on DocVQA: fine-tuned with 896×896 input images on the DocVQA dataset.
- Multiple formats available: provided in float32, bfloat16, and float16 formats for research purposes.
## 📦 Installation

To run inference in 8-bit or 4-bit precision, you need to install `bitsandbytes` and `accelerate`:

```bash
pip install bitsandbytes accelerate
```
## 💻 Usage Examples

### Basic Usage

Running the default precision (float32) on CPU:
```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"

# Fetch an example image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# The prompt doubles as a task prefix; "caption es" requests a Spanish caption.
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # Strip the prompt tokens so only the newly generated text is decoded.
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
```
### Advanced Usage

Running bfloat16 on an NVIDIA CUDA card:
```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",  # repository branch with weights already stored in bfloat16
).eval()
processor = AutoProcessor.from_pretrained(model_id)

prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
```
Loading in 4-bit or 8-bit:
```python
from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Quantize the weights to 8-bit with bitsandbytes at load time.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)

prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
```
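The snippet above loads the weights in 8-bit. For 4-bit, only the quantization config changes; a minimal sketch (the `bnb_4bit_compute_dtype` setting is an assumption for faster compute, not part of the original example):

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit form
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: run matmuls in bfloat16
)
# Pass this config to from_pretrained() exactly as in the 8-bit example above.
```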
## 📚 Documentation

### Model information

#### Model summary
PaliGemma is a vision-language model inspired by PaLI-3 and built on open components such as the SigLIP vision model and the Gemma language model. It has 3 billion parameters in total, composed of a Transformer decoder and a Vision Transformer image encoder; a quick inspection sketch follows the table below.
| Property | Details |
|----------|---------|
| Model Type | Vision-language model |
| Training Data | WebLI, CC3M-35L, VQ²A-CC3M-35L / VQG-CC3M-35L, OpenImages, WIT |
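As a sanity check of this composition, you can inspect the loaded model's config and parameter count through Transformers. A minimal sketch (printed values are illustrative; the attribute names follow the Transformers PaliGemma config, which exposes `vision_config` and `text_config`):

```python
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-mix-224")
print(model.config.vision_config.model_type)  # SigLIP vision encoder
print(model.config.text_config.model_type)    # Gemma text decoder

# Total parameter count, roughly 3 billion.
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e9:.2f}B parameters")
```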
#### Model data

PaliGemma is pre-trained on a mixture of datasets. Data responsibility filtering is applied to keep the training data clean, including pornographic image filtering, text safety filtering, text toxicity filtering, personal information filtering, and additional methods.
#### How to Use

PaliGemma is a single-turn vision-language model and is not intended for conversational use. It works best when fine-tuned for a specific use case. You configure the task with a task prefix in the prompt, as in the sketch below. For interactive testing, use the "mix" family of models.
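A minimal sketch of swapping task prefixes, reusing `model`, `processor`, and `image` from the usage examples above (the prefix strings such as `answer en` follow PaliGemma's published prefix convention, but treat any prefix not shown elsewhere in this card as an assumption):

```python
import torch

# Each prefix selects a task: English captioning, then English VQA.
for prompt in ["caption en", "answer en What color is the car?"]:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(prompt, "->", processor.decode(out[0][input_len:], skip_special_tokens=True))
```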
### Implementation information

#### Hardware

PaliGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e).

#### Software

Training was done using JAX, Flax, TFDS, and [big_vision](https://github.com/google-research/big_vision).
### Evaluation information

#### Benchmark results

The transferability of PaliGemma is verified by fine-tuning it on a variety of academic tasks; results are reported at different resolutions.

Mix model (fine-tuned on a mixture of transfer tasks):
| Benchmark | Metric (split) | mix-224 | mix-448 |
|-----------|----------------|---------|---------|
| MMVP | Paired Accuracy | 46.00 | 45.33 |
| POPE | Accuracy (random/popular/adversarial) | 88.00 / 86.63 / 85.67 | 89.37 / 88.40 / 87.47 |
| GQA | Accuracy (test) | 65.20 | 65.47 |
Single task (fine-tuned on a single task):

The table shows the results of fine-tuning on single tasks at different resolutions.
## 🔧 Technical Details

PaliGemma's fine-tune config is available at [big_vision](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/transfers/docvqa.py). The model is trained following the PaLI-3 recipes.
## 📄 License

The model is released under the Gemma license.