PaliGemma 2 3B PT 224

Developed by Google
PaliGemma 2 is a vision-language model (VLM) developed by Google, combining the capabilities of the Gemma 2 language model and SigLIP vision model, supporting multilingual vision-language tasks.
Downloads 30.51k
Release Time: 11/21/2024

Model Overview

PaliGemma 2 is a vision-language model built on Gemma 2 and SigLIP. It accepts image and text inputs and generates text outputs, which makes it suitable for tasks such as image captioning and visual question answering.
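
The image-plus-text in, text out flow described above can be sketched with the Hugging Face Transformers library. This is a minimal sketch, not an official recipe: it assumes the checkpoint is available on the Hugging Face Hub under the ID `google/paligemma2-3b-pt-224` (access may require accepting the model license), and the helper name `run_paligemma` and its parameters are hypothetical. Heavy imports are kept inside the function so the module can be imported without downloading anything.

```python
# Assumed Hub ID for this checkpoint; adjust if you host the weights elsewhere.
MODEL_ID = "google/paligemma2-3b-pt-224"


def run_paligemma(image_path, prompt, max_new_tokens=50):
    """Hypothetical helper: run one image + text prompt through the model.

    Returns only the newly generated text (the prompt tokens are sliced off).
    """
    # Imported lazily because torch/transformers are large dependencies.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID).eval()

    # The processor resizes the image to the checkpoint's input resolution
    # and tokenizes the text prompt.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Decode only the tokens generated after the prompt.
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

A call might look like `run_paligemma("photo.jpg", "caption en")`. Because this is a pretrained ("pt") checkpoint, output quality on a specific task will likely improve after fine-tuning.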

Model Features

Multimodal Processing Capability: Processes both image and text inputs to generate text outputs
Multilingual Support: Supports vision-language tasks in multiple languages
High-Resolution Adaptation: The PaliGemma 2 family is available at input resolutions of 224x224, 448x448, and 896x896; this checkpoint uses 224x224
Responsible AI: Training data is strictly filtered to remove unsafe content

Model Capabilities

Image Captioning
Visual Question Answering
Text Reading
Object Detection
Image Segmentation
Multilingual Processing
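
The capabilities listed above are typically selected through a short task prefix in the text prompt, and detection results come back as text containing location tokens. The sketch below illustrates both ideas in plain Python. The prefix strings and the `<locNNNN>` box format (four tokens in the order y_min, x_min, y_max, x_max, each on a 0 to 1023 grid) follow the conventions in the public PaliGemma documentation as understood here; treat them as assumptions and verify against the official docs. The function names `build_prompt` and `parse_detections` are hypothetical.

```python
import re


def build_prompt(task, lang="en", text=""):
    """Build a task prompt using assumed PaliGemma-style prefixes."""
    templates = {
        "caption": f"caption {lang}",
        "vqa": f"answer {lang} {text}",
        "ocr": "ocr",
        "detect": f"detect {text}",
        "segment": f"segment {text}",
    }
    if task not in templates:
        raise ValueError(f"unknown task: {task!r}")
    return templates[task].strip()


# Four consecutive location tokens followed by a label, e.g.
# "<loc0256><loc0128><loc0768><loc0896> cat"
_BOX = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(output, width, height):
    """Convert detection text into pixel boxes (x_min, y_min, x_max, y_max).

    Assumes token order y_min, x_min, y_max, x_max, normalized by 1024.
    """
    results = []
    for match in _BOX.finditer(output):
        y0, x0, y1, x1 = (int(match.group(i)) / 1024 for i in range(1, 5))
        results.append(
            {
                "label": match.group(5).strip(),
                "box": (x0 * width, y0 * height, x1 * width, y1 * height),
            }
        )
    return results
```

For example, `build_prompt("detect", text="cat")` yields a prompt that can be passed to the model together with an image, and `parse_detections` can then map the returned text back onto the original image size.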

Use Cases

Content Understanding
Image Caption Generation: Generates detailed descriptions for input images. Achieves a CIDEr score of 142.4 for English captions on the COCO-35L dataset.
Visual Question Answering: Answers questions about image content. Reaches 70.2% accuracy on the AOKVQA-DA validation set.

Document Processing
Document Question Answering: Extracts information from document images to answer questions. Reaches 76.1% accuracy on the DocVQA validation set.