🚀 PaliGemma model card
PaliGemma is a versatile and lightweight vision-language model (VLM). It takes both image and text as input and generates text as output, supporting multiple languages. The model is intended mainly for research purposes, and its weights are available in float32, bfloat16, and float16 formats.
Model page: PaliGemma
The Transformers PaliGemma 3B weights are fine-tuned with 224×224 input images on the RSVQA-LR dataset. The fine-tune config is available at big_vision.
Resources and technical documentation:
Terms of Use: Terms
Authors: Google
✨ Features
Model information
Model summary
PaliGemma is inspired by PaLI-3 and based on open components such as the SigLIP vision model and the Gemma language model.
| Property | Details |
| --- | --- |
| Model Type | A composition of a Transformer decoder and a Vision Transformer image encoder, with a total of 3 billion parameters. The text decoder is initialized from Gemma-2B, and the image encoder is initialized from SigLIP-So400m/14. |
| Input | Image and text string, such as a prompt to caption the image, or a question. |
| Output | Generated text in response to the input, such as a caption of the image, an answer to a question, a list of object bounding box coordinates, or segmentation codewords. |
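As a quick sanity check of this composition, you can inspect the checkpoint's configuration without downloading any weights (a minimal sketch; the model_type values mentioned in the comments are an assumption about what current Transformers releases report):
from transformers import AutoConfig

# Fetches only the config JSON from the Hub, not the weights
config = AutoConfig.from_pretrained("google/paligemma-3b-mix-224")
print(config.vision_config.model_type)  # the SigLIP vision tower
print(config.text_config.model_type)    # the Gemma text decoder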
Model data
PaliGemma is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L/VQG-CC3M-35L, OpenImages, and WIT. Several data responsibility filters are applied to WebLI to ensure clean training data.
How to Use
PaliGemma is a single-turn vision-language model not suitable for conversational use, and it works best when fine-tuned to a specific use case. You can configure the task it solves by using task prefixes, as illustrated below. For interactive testing, you can use the "mix" family of models.
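For illustration, a prompt consists of a task prefix followed by any free-text arguments. A few commonly documented prefixes are sketched below; the exact set a checkpoint accepts depends on how it was trained, so treat these as examples rather than an exhaustive list:
# Example task-prefix prompts (the "mix" checkpoints accept these directly;
# fine-tuned checkpoints may only support the task they were trained on):
prompts = [
    "caption en",                        # image captioning in English
    "ocr",                               # read text appearing in the image
    "answer en what color is the car?",  # visual question answering
    "detect car",                        # bounding-box tokens for "car"
    "segment car",                       # segmentation codewords for "car"
]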
Use in Transformers
You can use the following code snippets to run PaliGemma in different scenarios, such as running on CPU with the default precision, running other precisions on CUDA, and loading in 4-bit/8-bit.
Implementation information
PaliGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e) and software including JAX, Flax, TFDS, and big_vision.
Evaluation information
The transferability of PaliGemma to a wide variety of academic tasks is verified by fine-tuning on each task and by training a mix model. Benchmark results are reported at different resolutions for both the mix model and the single-task fine-tuned models.
📦 Installation
To run inference in 8-bit or 4-bit precision, you need to install bitsandbytes and accelerate:
pip install bitsandbytes accelerate
💻 Usage Examples
Basic Usage
Running the default precision (float32) on CPU:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"

# Load an example image from the Hub
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Default precision (float32) on CPU
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to caption the image in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]

# Greedy decoding; strip the prompt tokens from the output before decoding
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
Advanced Usage
Running other precisions on CUDA:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

# Load an example image from the Hub
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Load the bfloat16 revision of the weights directly onto the GPU
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to caption the image in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
Loading in 4-bit/8-bit:
from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"

# Load an example image from the Hub
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Quantize the weights to 8-bit on load (requires bitsandbytes and a CUDA GPU)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to caption the image in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
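The snippet above loads the model in 8-bit. For 4-bit, swap in a 4-bit quantization config; a minimal sketch (the bfloat16 compute dtype here is an assumption, not prescribed by this card):
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: compute in bfloat16
)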
🔧 Technical Details
Benchmark results
Mix model (fine-tune on mixture of transfer tasks)
| Benchmark | Metric (split) | mix-224 | mix-448 |
| --- | --- | --- | --- |
| MMVP | Paired Accuracy | 46.00 | 45.33 |
| POPE | Accuracy (random/popular/adversarial) | 88.00 / 86.63 / 85.67 | 89.37 / 88.40 / 87.47 |
| GQA | Accuracy (test) | 65.20 | 65.47 |
Single task (fine-tune on single task)
Benchmark results for single-task fine-tuned models cover captioning, question answering, segmentation, and video tasks at different resolutions (pt-224, pt-448, pt-896).
📄 License
The license for this model is gemma.
⚠️ Important Note
To access PaliGemma on Hugging Face, you are required to review and agree to Google's usage license. To do this, make sure you are logged in to Hugging Face and accept the license on the model page. Requests are processed immediately.
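Once access is granted, authenticate your local environment so that from_pretrained can download the gated weights; a minimal sketch using huggingface_hub:
from huggingface_hub import login

# Prompts for a User Access Token with read permission
login()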
💡 Usage Tip
The model in the repo you are browsing may have been trained for other tasks. Please make sure you use appropriate inputs for the task at hand.