
PaliGemma 3B FT SciCap 448

Developed by Google
PaliGemma is a versatile, lightweight vision-language model that takes image and text inputs and generates text outputs, with support for multiple languages.
Downloads: 123
Released: 5/13/2024

Model Overview

A vision-language model built on open components, suitable for various tasks such as image captioning, visual question answering, text reading, object detection, and segmentation.
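For reference, the checkpoint can be loaded with the Hugging Face transformers library. The sketch below shows minimal caption generation; the checkpoint id "google/paligemma-3b-ft-scicap-448", the example image path, and the "caption en" prompt prefix are assumptions based on the standard PaliGemma usage pattern, so check the official model card before relying on them.

```python
# Minimal captioning sketch (assumed checkpoint id and prompt prefix; adjust as needed).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-ft-scicap-448"  # assumed Hugging Face checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("figure.png").convert("RGB")  # hypothetical local image file
prompt = "caption en"                            # PaliGemma-style captioning prefix

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
caption = processor.decode(output[0][prompt_len:], skip_special_tokens=True)
print(caption)
```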

Model Features

Versatility
Supports multiple vision-language tasks, including visual question answering, caption generation, object detection, and segmentation.
Multilingual Support
Can handle inputs and outputs in multiple languages, covering 35 languages.
Lightweight Design
Designed to be fine-tuned for specific scenarios with relatively low resource requirements (see the adapter sketch after this list).
Built on Open Components
Built on open components such as the SigLIP vision model and the Gemma language model.
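Because the card highlights low-resource fine-tuning, the following is a minimal sketch of attaching low-rank (LoRA) adapters with the peft library so that only a small fraction of the weights is trained. The checkpoint id, target module names, and hyperparameters are illustrative assumptions, not a prescribed recipe.

```python
# Sketch: wrap the model with LoRA adapters for low-resource fine-tuning
# (assumes the `peft` library; module names follow the Gemma attention projections).
import torch
from transformers import PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "google/paligemma-3b-ft-scicap-448"  # assumed checkpoint name
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,                       # adapter rank (illustrative)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```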

Model Capabilities

Image Caption Generation
Visual Question Answering
Text Reading
Object Detection
Object Segmentation
Multilingual Processing
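These capabilities are typically selected at inference time through a short task prefix in the text prompt. The prefixes below follow commonly documented PaliGemma conventions; treat them as assumptions for this particular fine-tuned checkpoint and verify against the official model card.

```python
# Sketch: PaliGemma selects a task via a short text prefix in the prompt.
# Prefix strings are assumptions based on published PaliGemma usage examples.
prompts = {
    "captioning":   "caption en",                              # swap the language code for other languages
    "vqa":          "answer en What is shown in the figure?",  # question follows the prefix
    "text_reading": "ocr",
    "detection":    "detect car",                              # output uses <loc....> bounding-box tokens
    "segmentation": "segment car",                             # output adds <seg....> mask tokens
}
```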

Use Cases

Image Understanding
Image Caption Generation
Generate descriptive captions for images, supporting multiple languages.
CIDEr score of 144.60 on the COCO captions validation set (448 resolution)
Visual Question Answering
Answer natural language questions about image content.
Accuracy of 85.64% on the VQAv2 test set
Document Analysis
Document Question Answering
Extract information from document images and answer questions.
ANLS score of 84.77 on the DocVQA test set (896 resolution)
Text Recognition
Recognize the text content in images.
Accuracy of 76.48% on the TextVQA test set
Object Detection and Segmentation
Object Segmentation
Segment objects in images referred to by a natural-language description.
mIoU of 76.94 on the RefCOCO validation set (896 resolution)
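For the detection side of this use case category, PaliGemma encodes each bounding box as four location tokens followed by a label. The parser below is a sketch that assumes the commonly documented format (a 0-1024 coordinate grid in y_min, x_min, y_max, x_max order); verify against the official documentation before use.

```python
# Sketch: decode PaliGemma detection output such as
# "<loc0100><loc0200><loc0800><loc0900> car" into pixel-space boxes.
# The 0-1024 grid and the (y_min, x_min, y_max, x_max) token order are assumptions
# based on the published PaliGemma detection format.
import re

def parse_detections(text: str, image_width: int, image_height: int):
    boxes = []
    pattern = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)")
    for y0, x0, y1, x1, label in pattern.findall(text):
        # Location tokens are normalised to a 0-1024 grid; rescale to pixel coordinates.
        boxes.append({
            "label": label.strip(),
            "y_min": int(y0) / 1024 * image_height,
            "x_min": int(x0) / 1024 * image_width,
            "y_max": int(y1) / 1024 * image_height,
            "x_max": int(x1) / 1024 * image_width,
        })
    return boxes

print(parse_detections("<loc0100><loc0200><loc0800><loc0900> car", 448, 448))
```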