# 🚀 PaliGemma Model Card
PaliGemma is a versatile and lightweight vision-language model. It takes both images and text as input and generates text output, supporting multiple languages. It is suitable for a wide range of vision-language tasks and is intended mainly for research purposes.
## 🚀 Quick Start
To access PaliGemma on Hugging Face, you must review and agree to Google's usage license. Make sure you are logged in to Hugging Face and acknowledge the license on the model page; requests are processed immediately.
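If you then load the gated weights from a script, you may also need to authenticate with an access token. A minimal sketch using `huggingface_hub` (the assumption here is that you have already created a read token in your account settings):

```python
from huggingface_hub import login

# Prompts for a Hugging Face access token; needed because the
# PaliGemma weights are gated behind the license acknowledgement.
login()
```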
## ✨ Features

- Versatile input and output: accepts both image and text as input and generates text output, supporting multiple languages.
- Fine-tuned on DocVQA: fine-tuned with 896×896 input images on the DocVQA dataset.
- Multiple formats available: provided in float32, bfloat16, and float16 formats for research purposes.
## 📦 Installation

To run inference in 8-bit or 4-bit precision, you need to install `bitsandbytes` and `accelerate`:

```bash
pip install bitsandbytes accelerate
```
## 💻 Usage Examples

### Basic Usage

Running the default precision (float32) on CPU:
```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"

# Fetch an example image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# The prompt doubles as a task prefix; "caption es" requests a Spanish caption.
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # Strip the prompt tokens so only the newly generated text is decoded.
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
```
### Advanced Usage

Running bfloat16 on an NVIDIA CUDA card:
```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",  # repository branch with weights already stored in bfloat16
).eval()
processor = AutoProcessor.from_pretrained(model_id)

prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
```
Loading in 4-bit or 8-bit:
```python
from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Quantize the weights to 8-bit with bitsandbytes at load time.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)

prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
```
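The snippet above loads the weights in 8-bit. For 4-bit, only the quantization config changes; a minimal sketch (the `bnb_4bit_compute_dtype` setting is an assumption for faster compute, not part of the original example):

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit form
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: run matmuls in bfloat16
)
# Pass this config to from_pretrained() exactly as in the 8-bit example above.
```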
## 📚 Documentation

### Model information

#### Model summary
PaliGemma is a vision-language model inspired by PaLI-3 and built on open components such as the SigLIP vision model and the Gemma language model. It has 3 billion parameters in total, composed of a Transformer decoder and a Vision Transformer image encoder; a quick inspection sketch follows the table below.
| Property | Details |
|----------|---------|
| Model Type | Vision-language model |
| Training Data | WebLI, CC3M-35L, VQ²A-CC3M-35L / VQG-CC3M-35L, OpenImages, WIT |
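As a sanity check of this composition, you can inspect the loaded model's config and parameter count through Transformers. A minimal sketch (printed values are illustrative; the attribute names follow the Transformers PaliGemma config, which exposes `vision_config` and `text_config`):

```python
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-mix-224")
print(model.config.vision_config.model_type)  # SigLIP vision encoder
print(model.config.text_config.model_type)    # Gemma text decoder

# Total parameter count, roughly 3 billion.
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e9:.2f}B parameters")
```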
#### Model data

PaliGemma is pre-trained on a mixture of datasets. Data responsibility filtering is applied to keep the training data clean, including pornographic image filtering, text safety filtering, text toxicity filtering, personal information filtering, and additional methods.
#### How to Use

PaliGemma is a single-turn vision-language model and is not intended for conversational use. It works best when fine-tuned for a specific use case. You configure the task with a task prefix in the prompt, as in the sketch below. For interactive testing, use the "mix" family of models.
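A minimal sketch of swapping task prefixes, reusing `model`, `processor`, and `image` from the usage examples above (the prefix strings such as `answer en` follow PaliGemma's published prefix convention, but treat any prefix not shown elsewhere in this card as an assumption):

```python
import torch

# Each prefix selects a task: English captioning, then English VQA.
for prompt in ["caption en", "answer en What color is the car?"]:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(prompt, "->", processor.decode(out[0][input_len:], skip_special_tokens=True))
```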
### Implementation information

#### Hardware

PaliGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e).

#### Software

Training was done using JAX, Flax, TFDS, and [big_vision](https://github.com/google-research/big_vision).
### Evaluation information

#### Benchmark results

The transferability of PaliGemma is verified by fine-tuning it on a variety of academic tasks; results are reported at different resolutions.

Mix model (fine-tuned on a mixture of transfer tasks):
| Benchmark | Metric (split) | mix-224 | mix-448 |
|-----------|----------------|---------|---------|
| MMVP | Paired Accuracy | 46.00 | 45.33 |
| POPE | Accuracy (random/popular/adversarial) | 88.00 / 86.63 / 85.67 | 89.37 / 88.40 / 87.47 |
| GQA | Accuracy (test) | 65.20 | 65.47 |
Single task (fine-tuned on a single task):

The table shows the results of fine-tuning on single tasks at different resolutions.
## 🔧 Technical Details

PaliGemma's fine-tune config is available at [big_vision](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/transfers/docvqa.py). The model is trained following the PaLI-3 recipes.
## 📄 License

The model is released under the Gemma license.