PaliGemma 3B PT 896

Developed by Google
PaliGemma is a versatile lightweight vision-language model (VLM) that supports image and text inputs and generates text outputs. It has multilingual capabilities.
Downloads 1,788
Release Time: 5/13/2024

Model Overview

PaliGemma is designed for a wide range of vision-language tasks, such as image captioning, visual question answering, text reading, object detection, and segmentation, aiming to achieve state-of-the-art fine-tuning performance.
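Each of these tasks is typically selected through a short text prefix in the prompt. The sketch below shows one way to build such prompts. The prefix strings are an assumption taken from the upstream PaliGemma documentation rather than from this page, and `build_prompt` and `TASK_TEMPLATES` are hypothetical names used only for illustration, so check the prefixes against the checkpoint you actually use.

```python
# Task prefixes are an ASSUMPTION based on the upstream PaliGemma documentation,
# not on this page; verify them for the checkpoint you use.
TASK_TEMPLATES = {
    "caption": "caption {lang}",      # image captioning
    "ocr": "ocr",                     # text reading
    "vqa": "answer {lang} {text}",    # visual question answering
    "detect": "detect {text}",        # object detection
    "segment": "segment {text}",      # object segmentation
}


def build_prompt(task, text="", lang="en"):
    """Return the text prompt for a task (hypothetical helper)."""
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    template = TASK_TEMPLATES[task]
    if "{text}" in template and not text:
        raise ValueError(f"task {task!r} needs a text argument")
    # str.format ignores keyword arguments the template does not use.
    return template.format(lang=lang, text=text).strip()
```

For example, `build_prompt("vqa", "What color is the car?")` yields a question-answering prompt in English, while `build_prompt("caption", lang="es")` requests a Spanish caption. The resulting string would be passed to the model together with the image.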

Model Features

Versatility
Supports image and text inputs and can handle various vision-language tasks
Multilingual Support
Can handle inputs and outputs in multiple languages
Lightweight Design
With about 3 billion parameters (the "3b" in the model name), the model is small enough to deploy in a range of settings
High-Performance Fine-Tuning
Designed to achieve state-of-the-art fine-tuning performance in vision-language tasks

Model Capabilities

Image Caption Generation
Visual Question Answering
Text Reading
Object Detection
Object Segmentation
Multilingual Processing
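Because the model only generates text, detection results come back as a string that has to be decoded into boxes. The sketch below assumes the format described in the upstream PaliGemma documentation (not stated on this page): four `<locNNNN>` tokens per box in y_min, x_min, y_max, x_max order, each an index into 1024 bins, followed by a label, with boxes separated by `;`. `parse_detections` is a hypothetical helper name.

```python
import re

# ASSUMED output format: "<locYYYY><locXXXX><locYYYY><locXXXX> label ; ..."
# with coordinates quantized into 1024 bins. Verify against real model output.
_BOX = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)")


def parse_detections(text, width, height):
    """Decode detection text into pixel boxes (x_min, y_min, x_max, y_max)."""
    boxes = []
    for match in _BOX.finditer(text):
        # Normalize each bin index to the [0, 1) range.
        y0, x0, y1, x1 = (int(g) / 1024 for g in match.groups()[:4])
        boxes.append({
            "label": match.group(5).strip(),
            "box": (x0 * width, y0 * height, x1 * width, y1 * height),
        })
    return boxes
```

A call such as `parse_detections(output_text, 896, 896)` scales the boxes to an 896 x 896 image, which is the input resolution suggested by the "896" in the model name. Pass the original image size instead if the boxes should be drawn on the unresized image.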

Use Cases

Image Understanding
Image Caption Generation
Generate descriptive text for images
The CIDEr score is 144.60 on the COCO Captions dataset
Visual Question Answering
Answer questions about the content of images
The accuracy is 85.64% on the VQAv2 dataset
Document Processing
Document Question Answering
Answer questions about the content of documents
The ANLS score is 84.77 on the DocVQA dataset
Text Reading
Recognize and understand text in images
The accuracy is 76.48% on the TextVQA dataset
Computer Vision
Object Detection
Identify and locate objects in images
Object Segmentation
Identify objects in images and perform pixel-level segmentation
The maximum mIoU is 76.94 on the RefCOCO dataset
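For readers unfamiliar with the segmentation metric, mIoU (mean intersection over union) can be sketched as follows. This is a minimal illustration that averages per-example IoU over binary masks. The exact RefCOCO evaluation protocol behind the reported number is not given on this page and may differ, and the function names are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(predictions, references):
    """Average the per-example IoU over paired predicted and reference masks."""
    scores = [iou(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)
```

A predicted mask that covers twice the area of the reference, while fully containing it, scores an IoU of 0.5, and an mIoU of 76.94 corresponds to an average of about 0.77 on this scale.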