
PaliGemma 2 3B PT 896

Developed by Google
PaliGemma 2 is a multimodal vision-language model that combines image and text inputs to generate text outputs. It supports multiple languages and is suitable for various vision-language tasks.
Downloads: 2,536
Release Date: 11/21/2024

Model Overview

PaliGemma 2 is a vision-language model built on the Gemma 2 language models and the SigLIP vision encoder. It accepts image and text inputs, generates text outputs, and is suited to tasks such as image captioning, visual question answering, and text reading. This checkpoint is the 3B pretrained (pt) variant with 896×896 input resolution.
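
The sketch below shows one way to run this checkpoint for captioning. It is a minimal example, assuming the Hugging Face transformers library, the google/paligemma2-3b-pt-896 checkpoint, and a placeholder local image file; exact prompt formatting and dtype handling may vary between transformers releases.

```python
# Minimal inference sketch, assuming the Hugging Face transformers library and
# the google/paligemma2-3b-pt-896 checkpoint; adjust dtype and device handling
# for your hardware. Newer transformers releases may expect an explicit
# "<image>" prefix in the prompt.
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-pt-896"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")   # placeholder image path
prompt = "caption en"               # task-prefix prompt for a short English caption

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(torch.bfloat16).to(model.device)
prompt_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(processor.decode(generation[0][prompt_len:], skip_special_tokens=True))
```

Pretrained (pt) checkpoints such as this one are steered with short task prefixes and are intended primarily as a base for fine-tuning rather than for free-form chat.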

Model Features

Multimodal Input and Output
Accepts images and text as inputs, generates text outputs, and supports multiple languages.
Extensive Task Support
Suitable for a range of vision-language tasks such as image and short-video captioning, visual question answering, text reading, object detection, and object segmentation (see the prompt sketch after this list).
High-Performance Fine-Tuning
Delivers leading fine-tuning performance across a wide range of vision-language tasks.
Responsible Data Filtering
The pre-training data was passed through multiple safety filters, including filtering of pornographic content, toxic language, and personal information, to support safe and responsible use.
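
As referenced in the task-support item above, PaliGemma checkpoints are steered with short task-prefix prompts. The sketch below lists commonly documented prefixes; the exact strings and supported languages are assumptions based on Google's PaliGemma documentation and may differ between releases.

```python
# Hedged sketch of PaliGemma-style task-prefix prompts; exact prefixes follow
# Google's PaliGemma documentation and may vary between model releases.
TASK_PROMPTS = {
    "caption":  "caption en",                       # short English caption
    "describe": "describe en",                      # longer description
    "vqa":      "answer en What is in the image?",  # visual question answering
    "ocr":      "ocr",                              # read text in the image
    "detect":   "detect cat ; dog",                 # returns <locXXXX> box tokens
    "segment":  "segment cat",                      # returns segmentation tokens
}

def build_inputs(processor, image, task):
    """Pair an image with the prompt string for the requested task."""
    return processor(text=TASK_PROMPTS[task], images=image, return_tensors="pt")
```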

Model Capabilities

Image Caption Generation
Visual Question Answering
Object Detection (see the parsing sketch after this list)
Object Segmentation
Multilingual Text Generation
Image Understanding
Text Reading
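
For the Object Detection capability, the generated text encodes bounding boxes as location tokens. The helper below is a hedged sketch that assumes the documented PaliGemma output format of four <locXXXX> tokens (y_min, x_min, y_max, x_max, binned into 1024 steps) followed by the object label.

```python
import re

# Hedged parser for PaliGemma-style detection output, assuming text such as
# "<loc0046><loc0133><loc0898><loc0840> cat ; <loc...> dog" with coordinates
# ordered y_min, x_min, y_max, x_max and normalized to 1024 bins.
LOC_PATTERN = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)")

def parse_detections(text, width, height):
    """Convert generated location tokens into pixel-space bounding boxes."""
    boxes = []
    for y0, x0, y1, x1, label in LOC_PATTERN.findall(text):
        boxes.append({
            "label": label.strip(),
            "box": (
                int(x0) / 1024 * width,   # x_min in pixels
                int(y0) / 1024 * height,  # y_min in pixels
                int(x1) / 1024 * width,   # x_max in pixels
                int(y1) / 1024 * height,  # y_max in pixels
            ),
        })
    return boxes
```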

Use Cases

Image and Video Understanding
Image Caption Generation
Generates descriptive captions for images.
English caption score of 142.4 on the COCO-35L benchmark (3B model)
Visual Question Answering
Answers questions about image content; a short example follows this section.
Achieves 85.8% accuracy on the VQAv2 dataset (28B model)
Education
Visual Learning Assistance
Helps students understand the information presented in images.
Achieves 98.6% accuracy on the ScienceQA dataset (28B model)
Document Processing
Table Understanding
Parses and understands tabular content in documents.
Achieves a TEDS score of 98.94 on the FinTabNet dataset (3B model)
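
A minimal visual question answering example, continuing from the inference sketch in Model Overview (the model, processor, and imports defined there are reused; the question text and image path are placeholders):

```python
# Continues the Model Overview sketch: model, processor, torch, and Image are
# assumed to be defined there. Question and image path are placeholders.
question = "answer en How many students are in the picture?"
vqa_inputs = processor(text=question, images=Image.open("classroom.jpg"),
                       return_tensors="pt").to(torch.bfloat16).to(model.device)
prompt_len = vqa_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    answer_ids = model.generate(**vqa_inputs, max_new_tokens=20, do_sample=False)

print(processor.decode(answer_ids[0][prompt_len:], skip_special_tokens=True))
```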