
PaliGemma 2 10B PT 224

Developed by Google
PaliGemma 2 is a vision-language model (VLM) that builds on the capabilities of the Gemma 2 language models. It processes image and text inputs simultaneously and generates text outputs, with support for multiple languages. It is suited to a wide range of vision-language tasks, including image and short-video captioning, visual question answering, text reading, object detection, and object segmentation.
Downloads: 3,362
Release Time: 11/21/2024

Model Overview

PaliGemma 2 is an updated version of the PaliGemma vision-language model that incorporates the capabilities of the Gemma 2 models. It is built from open components, the SigLIP vision encoder and the Gemma 2 language model, and is designed to deliver leading fine-tuning performance across a wide range of vision-language tasks.
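Because the weights are openly released, the model can be loaded through the Hugging Face transformers library. The snippet below is a minimal captioning sketch, not an official recipe: the repository id google/paligemma2-10b-pt-224 and the use of PaliGemmaForConditionalGeneration with AutoProcessor are assumptions based on the PaliGemma integration in transformers and should be checked against the official model card.

```python
# Minimal inference sketch (assumptions: the Hub id below matches this
# checkpoint, and a transformers version with PaliGemma 2 support is installed).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-10b-pt-224"  # assumed Hugging Face Hub id
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

image = Image.open("example.jpg")  # any local RGB image
prompt = "caption en"              # PaliGemma-style task prefix (assumed convention)

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(torch.bfloat16).to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=40, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```

As the "pt" suffix indicates, this is a pretrained checkpoint: it is intended primarily as a base for fine-tuning on a target task rather than for zero-shot use.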

Model Features

Multimodal Processing
Capable of processing both image and text inputs simultaneously and generating text outputs.
Multilingual Support
Supports multiple languages, suitable for users in different regions.
High-performance Fine-tuning
Designed to achieve leading fine-tuning performance on various vision-language tasks.
Built on Open Components
Built on the open SigLIP vision encoder and the Gemma 2 language model, making the architecture flexible and scalable.

Model Capabilities

Image Caption Generation
Visual Question Answering
Text Reading
Object Detection
Object Segmentation
Multilingual Processing
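Each of these capabilities corresponds to a short task prefix in the text prompt. The helper below sketches those prefixes; the exact strings ("caption", "answer", "ocr", "detect", "segment") follow PaliGemma-family conventions and are assumptions to verify against the official documentation for this checkpoint.

```python
# Sketch of PaliGemma-family task prefixes mapped to the capabilities above.
# The prefix strings are assumptions based on the PaliGemma prompt conventions.

def build_prompt(task: str, language: str = "en", target: str = "", question: str = "") -> str:
    """Return a text prompt for one of the model's vision-language tasks."""
    prompts = {
        "caption": f"caption {language}",        # image / short-video captioning
        "vqa": f"answer {language} {question}",  # visual question answering
        "ocr": "ocr",                            # text reading
        "detect": f"detect {target}",            # object detection -> <loc####> tokens
        "segment": f"segment {target}",          # object segmentation -> location + mask tokens
    }
    return prompts[task]

print(build_prompt("vqa", question="How many people are in the picture?"))
print(build_prompt("detect", target="cat"))
```

For detection and segmentation, the generated text contains location (and mask) tokens rather than plain words; a parsing sketch follows the Use Cases section below.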

Use Cases

Image and Video Understanding
Image Caption Generation
Generate descriptive captions for images.
On the COCO-35L dataset, the English caption score is 142.4 (10B model).
Short Video Captioning
Generate descriptive captions for short videos.
On the ActivityNet-CAP dataset, the score is 35.9 (10B model).
Visual Question Answering
Open Knowledge Visual Question Answering
Answer visual questions that require external knowledge.
On the AOKVQA-DA validation set, the score is 68.9 (10B model).
Science Question Answering
Answer science-related visual questions.
On the ScienceQA dataset, the accuracy reaches 98.2% (10B model).
Document Processing
Document Question Answering
Answer questions based on document images.
On the DocVQA validation set, the score is 43.9 (10B model with 224 resolution).
Table Understanding
Parse and understand table images.
On the FinTabNet dataset, the TEDS score is 98.94 (3B model).
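The "detect" and "segment" prompts mentioned earlier return structured tokens rather than free-form text. The sketch below shows one way to post-process a detection response, assuming the PaliGemma-family format of four <loc####> tokens per box (y_min, x_min, y_max, x_max on a 0-1023 grid) followed by the class label; verify this format against the official documentation before relying on it.

```python
# Hedged post-processing sketch for detection output, assuming the
# PaliGemma-family <loc####> box format described in the lead-in above.
import re

def parse_detections(text: str, image_width: int, image_height: int):
    """Convert '<loc####>...' output into pixel-space bounding boxes."""
    boxes = []
    pattern = r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;]*)"
    for y0, x0, y1, x1, label in re.findall(pattern, text):
        boxes.append({
            "label": label.strip(),
            "x_min": int(x0) / 1024 * image_width,
            "y_min": int(y0) / 1024 * image_height,
            "x_max": int(x1) / 1024 * image_width,
            "y_max": int(y1) / 1024 * image_height,
        })
    return boxes

# Example response for a single detected object (illustrative values only).
example = "<loc0256><loc0128><loc0768><loc0896> cat"
print(parse_detections(example, image_width=640, image_height=480))
```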