PaliGemma-3B Open-source Multifunctional Vision-Language Model - Supports Image and Text Input and Outputs Text Results

paligemma-3b-ft-ocrvqa-448

Developed by Google

PaliGemma is a versatile lightweight vision-language model (VLM) developed by Google, built on the SigLIP vision model and Gemma language model, supporting both image and text inputs with text outputs.

Image-to-Text

Transformers

#Multimodal Visual Question Answering #High-Resolution Image Understanding #Multilingual Text Generation

Downloads 365

Release Time: 5/12/2024

Model Overview

A 3B-parameter model fine-tuned on the OCR-VQA dataset with 448×448 input images, designed for vision-language tasks such as image captioning, visual question answering, and text reading.

Model Features

Lightweight and Versatile

Only 3 billion parameters yet capable of handling multiple vision-language tasks.

Multi-Resolution Support

The PaliGemma family supports input resolutions of 224, 448, and 896 pixels to suit different task requirements.

Task Prefix Configuration

Tasks are selected through prefixes in the text prompt (e.g., 'detect' or 'segment').

Responsible Data Filtering

Training data undergoes strict content safety and personal information filtering.

Model Capabilities

Image Captioning

Visual Question Answering

Text Reading

Object Detection

Image Segmentation

Multilingual Processing

Use Cases

Document Processing

OCR-VQA

Answer questions based on text content within images

Test accuracy 74.93% (896 resolution)

DocVQA

Document image question answering

ANLS 84.77 (896 resolution)

General Visual Understanding

Image Captioning

Generate multilingual descriptions for images

COCO dataset CIDEr 144.60 (448 resolution)

Visual Question Answering

Answer questions about image content

VQAv2 test accuracy 85.64%

Specialized Domains

Scientific Chart Understanding

Parse content from scientific charts

SciCap test CIDEr 181.49

Remote Sensing Image Analysis

Answer questions about remote sensing images

RSVQA-HR test accuracy 92.79%
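The DocVQA entry above is reported in ANLS (Average Normalized Levenshtein Similarity). The following is a minimal sketch of how that metric is commonly computed, assuming the usual 0.5 threshold and case-insensitive comparison; the function names are illustrative and this is not the evaluation code used to produce the figures on this page.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best similarity against any reference answer.

    Assumption: a normalized distance at or above `threshold` scores 0.
    """
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


# Made-up example data, only to show the call shape.
print(anls(["kitten"], [["sitten", "mitten"]]))
```

The function returns a value between 0 and 1; a figure such as 84.77 corresponds to that value multiplied by 100.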

🚀 PaliGemma Model Card

PaliGemma is a versatile and lightweight vision-language model. It takes both image and text as input and generates text output, supporting multiple languages. This checkpoint is fine-tuned on the OCR-VQA dataset and is available in several formats for research purposes.

🚀 Quick Start

PaliGemma is a single-turn vision-language model and is not meant for conversational use. It works best when fine-tuned to a specific use case. Tasks are configured with task prefixes in the prompt. For interactive testing, use the "mix" family of models. Refer to the usage and limitations section or the blog post for details.
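As an illustration of task prefixes, here is a minimal sketch of a prompt builder. The helper `build_prompt` is hypothetical and not part of any library; the prefixes shown ("caption", "detect", "segment") are the ones mentioned on this page, and the exact prompt grammar should be checked against the official documentation.

```python
# Hypothetical helper: joins a task prefix and its argument into the
# single-turn text prompt that is passed to the processor.
def build_prompt(task: str, argument: str = "") -> str:
    task = task.strip().lower()
    argument = argument.strip()
    if not task:
        raise ValueError("task prefix must not be empty")
    return f"{task} {argument}".strip()


prompts = [
    build_prompt("caption", "es"),   # caption in Spanish, as in the usage examples
    build_prompt("detect", "car"),   # assumed form: prefix followed by object name
    build_prompt("segment", "car"),  # assumed form: prefix followed by object name
]
print(prompts)
```

Each resulting string would be passed as the `text` argument of the processor call shown in the usage examples.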

✨ Features

Versatile Input and Output: Accepts both image and text as input and generates text output, supporting multiple languages.
Rich Capabilities: Capable of question answering, captioning, segmentation, etc., when fine-tuned.
Multiple Formats: Available in float32, bfloat16 and float16 formats for research.
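To give a rough sense of what the choice of format means for memory, the sketch below estimates the size of the weights alone at each precision. It assumes a rounded count of 3 billion parameters and ignores activations, the KV cache, and framework overhead, so the numbers are only a lower bound.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumption: "3B" rounded to exactly 3e9 parameters.
PARAM_COUNT = 3_000_000_000


def weight_gib(dtype: str, params: int = PARAM_COUNT) -> float:
    """Approximate size of the model weights in GiB for a given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 2**30


for name in BYTES_PER_PARAM:
    print(f"{name}: ~{weight_gib(name):.1f} GiB")
```

Under these assumptions the half-precision formats need about half the memory of float32, which is one reason to prefer them on GPU.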

📦 Installation

To use the model in 4-bit or 8-bit precision, install bitsandbytes and accelerate:

pip install bitsandbytes accelerate

💻 Usage Examples

Basic Usage

Running the default precision (`float32`) on CPU

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

Output: Un auto azul estacionado frente a un edificio.

Advanced Usage

Running other precisions on CUDA

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

Loading in 4-bit / 8-bit

from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# 8-bit quantization; pass load_in_4bit=True instead for 4-bit loading
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

📚 Documentation

Model information

Model summary

Description

PaliGemma is a versatile and lightweight vision-language model (VLM) inspired by PaLI-3 and based on open components such as the SigLIP vision model and the Gemma language model. It supports multiple languages and is designed for class-leading fine-tuning performance on a wide range of vision-language tasks.

Model architecture

PaliGemma is composed of a Transformer decoder and a Vision Transformer image encoder, with a total of 3 billion parameters. The text decoder is initialized from Gemma-2B, and the image encoder is initialized from SigLIP-So400m/14. It is trained following the PaLI-3 recipe.
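The "/14" in the encoder name refers to a patch size of 14 pixels. As a rough sketch, assuming square inputs and one image token per patch with no pooling, the number of image tokens at each supported resolution can be computed as follows.

```python
PATCH_SIZE = 14  # taken from the encoder name "SigLIP-So400m/14"


def image_token_count(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patches for a square image, assuming one token per patch."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side * side


for res in (224, 448, 896):
    print(res, image_token_count(res))
```

The token count grows with the square of the resolution, so higher-resolution checkpoints process noticeably longer sequences.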

Inputs and outputs

Input: Image and text string, such as a prompt to caption the image or a question.
Output: Generated text in response to the input, such as a caption of the image, an answer to a question, a list of object bounding box coordinates, or segmentation codewords.
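The output description above mentions bounding box coordinates returned as text. The sketch below shows one way to parse such output, under the assumption (not stated on this page) that each box is encoded as four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max, with values normalized to a 1024-step grid and followed by the label. The function name and the sample string are illustrative only.

```python
import re

# Assumed format: four <locNNNN> tokens followed by a label, with
# multiple detections separated by ";".
BOX_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_detections(text: str, width: int, height: int):
    """Convert assumed <loc> tokens into pixel boxes (x0, y0, x1, y1)."""
    results = []
    for coords, label in BOX_PATTERN.findall(text):
        y0, x0, y1, x1 = (int(v) / 1024 for v in re.findall(r"<loc(\d{4})>", coords))
        results.append({
            "label": label.strip(),
            "box": (x0 * width, y0 * height, x1 * width, y1 * height),
        })
    return results


# Made-up model output, only to show the call shape.
sample = "<loc0256><loc0128><loc0768><loc0896> car"
print(parse_detections(sample, width=1024, height=1024))
```

If the actual token format differs, only the regular expression and the coordinate order need to change.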

Model data

Pre-training datasets

PaliGemma is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L/VQG-CC3M-35L, OpenImages, and WIT.

Data responsibility filtering

Filters are applied to WebLI to train PaliGemma on clean data, including pornographic image filtering, text safety filtering, text toxicity filtering, text personal information filtering, and additional methods based on content quality and safety.

Implementation information

Hardware

PaliGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e).

Software

Training was done using JAX, Flax, TFDS and big_vision.

Evaluation information

Benchmark results

Mix model (fine-tuned on a mixture of transfer tasks)

Benchmark	Metric (split)	mix-224	mix-448
MMVP	Paired Accuracy	46.00	45.33
POPE	Accuracy (random/popular/adversarial)	88.00 / 86.63 / 85.67	89.37 / 88.40 / 87.47
GQA	Accuracy (test)	65.20	65.47

Single task (fine-tuned on a single task)

Benchmark (train split)	Metric (split)	pt-224	pt-448	pt-896
Captioning	COCO cap

🔧 Technical Details

Model page: PaliGemma

Resources and technical documentation

Terms of Use: Terms

Authors: Google

Extra Gated Information

Access PaliGemma on Hugging Face: To access PaliGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately.
Button Content: Acknowledge license

📄 License

This model is released under the Gemma license.
