🚀 PaliGemma Model Card
PaliGemma is a vision-language model that takes both image and text as input and generates text output. It is fine-tuned on the SciCap dataset and offers weights in multiple formats for research purposes.
🚀 Quick Start
To access PaliGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click the "Acknowledge license" button below. Requests are processed immediately.
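Once access has been granted, you also need to authenticate your local environment before the weights can be downloaded. A minimal sketch using the huggingface_hub library (any method of supplying a read-access token works; the interactive prompt shown here is just one option):

from huggingface_hub import login

# Prompts for a Hugging Face access token so gated repositories
# such as google/paligemma-3b-mix-224 can be downloaded.
login()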
✨ Features
- Versatile Input: Accepts both image and text as input.
- Multiple Formats: Available in float32, bfloat16, and float16 formats.
- Rich Capabilities: Capable of tasks like image captioning, question answering, object detection, and segmentation.
📦 Installation
To automatically run inference using 8-bit or 4-bit precision, you need to install bitsandbytes and accelerate:
pip install bitsandbytes accelerate
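A quick way to confirm both packages are available in the current environment (a trivial check, not something the model card requires):

from importlib.metadata import version

# Print the installed versions of the optional quantization dependencies.
print("bitsandbytes", version("bitsandbytes"))
print("accelerate", version("accelerate"))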
💻 Usage Examples
Basic Usage
Running the default precision (float32) on CPU:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
Advanced Usage
Running other precisions on CUDA:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
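The example above pulls weights from the "bfloat16" revision of the repository. If a "float16" revision is published as well (the feature list above mentions float32, bfloat16, and float16 weights), the same call can be pointed at it; a sketch under that assumption:

# Hypothetical variant of the call above: load half-precision float16 weights
# from a "float16" revision, assuming the repository publishes one.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=device,
    revision="float16",
).eval()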
Loading in 4-bit / 8-bit:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
from transformers import BitsAndBytesConfig  # BitsAndBytesConfig is provided by transformers, not bitsandbytes.nn
model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
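The snippet above exercises the 8-bit path; 4-bit loading follows the same pattern with a different BitsAndBytesConfig. A minimal sketch (the nf4 quantization type and bfloat16 compute dtype are common bitsandbytes settings, not something mandated by this model card):

# 4-bit quantization sketch: NF4 weights with bfloat16 compute.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()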
📚 Documentation
Model information
Model summary
PaliGemma is a versatile and lightweight vision-language model (VLM) inspired by PaLI-3 and based on open components such as the SigLIP vision model and the Gemma language model. It takes both image and text as input and generates text as output, supporting multiple languages.
- Model architecture: It is the composition of a Transformer decoder and a Vision Transformer image encoder, with a total of 3 billion parameters. The text decoder is initialized from Gemma-2B, and the image encoder is initialized from SigLIP-So400m/14 (see the short inspection sketch after this list).
- Inputs and outputs:
- Input: Image and text string, such as a prompt to caption the image, or a question.
- Output: Generated text in response to the input, such as a caption of the image, an answer to a question, a list of object bounding box coordinates, or segmentation codewords.
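As a quick sanity check of this composition, the loaded transformers model exposes both sub-configurations and its parameters can be counted directly; a small sketch, assuming the model from the usage examples above is already loaded:

# The config is split into a vision part (SigLIP encoder) and a text part (Gemma decoder).
print(model.config.vision_config.model_type)
print(model.config.text_config.model_type)
# The total parameter count should come out at roughly 3 billion.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")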
Model data
- Pre-training datasets: PaliGemma is pre-trained on a mixture of datasets including WebLI, CC3M-35L, VQ²A-CC3M-35L/VQG-CC3M-35L, OpenImages, and WIT.
- Data responsibility filtering: Filters are applied to WebLI to ensure clean training data, including pornographic image filtering, text safety filtering, text toxicity filtering, text personal information filtering, and additional content-based filtering.
How to Use
PaliGemma is a single-turn vision-language model not meant for conversational use, and it works best when fine-tuned to a specific use case. You can configure which task the model will solve by conditioning it with task prefixes, such as “detect” or “segment”.
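For example, the same image can be routed to different tasks simply by changing the prefix; a sketch reusing the processor, model, and image from the examples above (the exact list of supported prefixes is documented with the checkpoints, and the ones below are typical):

# Each prefix conditions the model on a different task.
prompts = [
    "caption en",                          # English image captioning
    "answer en What color is the car?",    # visual question answering
    "detect car",                          # detection: output contains location tokens
    "segment car",                         # segmentation: output contains codeword tokens
]
for prompt in prompts:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    print(prompt, "->", processor.decode(output[0][input_len:], skip_special_tokens=True))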
Implementation information
- Hardware: PaliGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e).
- Software: Training was done using JAX, Flax, TFDS, and big_vision.
Evaluation information
Benchmark results
- Mix model (fine-tuned on a mixture of transfer tasks): Results are reported on benchmarks such as MMVP, POPE, and GQA at different resolutions (mix-224 and mix-448).
- Single task (fine-tuned on a single task): Results are presented for various tasks including captioning, question answering, segmentation, and video tasks at different resolutions (pt-224, pt-448, pt-896).
| Benchmark Type | Benchmark | Metric (split) | mix-224 | mix-448 | pt-224 | pt-448 | pt-896 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mix model | MMVP | Paired Accuracy | 46.00 | 45.33 | - | - | - |
| Mix model | POPE | Accuracy (random/popular/adversarial) | 88.00 / 86.63 / 85.67 | 89.37 / 88.40 / 87.47 | - | - | - |
| Mix model | GQA | Accuracy (test) | 65.20 | 65.47 | - | - | - |
| Single task (Captioning) | COCO captions (train+restval) | CIDEr (val) | - | - | 141.92 | 144.60 | - |
| Single task (Captioning) | NoCaps (Eval of COCO captions transfer) | CIDEr (val) | - | - | 121.72 | 123.58 | - |
| ... | ... | ... | ... | ... | ... | ... | ... |
🔧 Technical Details
PaliGemma's architecture combines a Transformer decoder and a Vision Transformer image encoder. The pre-training on diverse datasets and the application of data filtering techniques contribute to its performance and reliability. The use of JAX, Flax, TFDS, and big_vision in the training process enables efficient utilization of TPU hardware.
📄 License
PaliGemma is released under the Gemma license; use of the model is subject to the Gemma Terms of Use.
- Terms of Use: Gemma Terms of Use
- Model page: PaliGemma