Paligemma2 10b Pt 448

Developed by Google
PaliGemma 2 is Google's upgraded vision-language model (VLM). It builds on the Gemma 2 language model and takes image and text input to generate text output.
Downloads 282
Release Date: 11/21/2024

Model Overview

A multimodal model built on the SigLIP vision model and the Gemma 2 language model. It is optimized for vision-language tasks, supports multiple languages, and can be fine-tuned for a variety of downstream tasks.

Model Features

Multimodal Understanding
Processes both image and text inputs for cross-modal understanding and generation.
Multi-Task Adaptation
Supports various tasks including image captioning, visual question answering, text reading, object detection, and segmentation.
High-Resolution Processing
Supports 448×448 pixel image input for enhanced fine-grained visual understanding.
Responsible AI
Training data undergoes strict safety filtering to remove inappropriate content and private information.
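
To illustrate how detection output relates to the 448×448 input, here is a minimal sketch of decoding boxes from generated text. It assumes the location-token convention described in Google's PaliGemma documentation, which this page does not state: each box is four `<locNNNN>` tokens in y_min, x_min, y_max, x_max order, each a bin index from 0 to 1023, followed by a label, with multiple objects separated by `;`. The function name `parse_detections` is hypothetical.

```python
import re

# Assumed format (not stated on this page): four <locNNNN> tokens per box,
# ordered y_min, x_min, y_max, x_max, followed by the object label.
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(output, width=448, height=448):
    """Convert generated detection text into pixel-space boxes.

    Returns a list of dicts with a "label" and a "box" given as
    (x_min, y_min, x_max, y_max) in pixels.
    """
    detections = []
    for match in _DETECTION.finditer(output):
        # Each token is a bin index in [0, 1023]; divide by 1024 to normalize.
        y0, x0, y1, x1 = (int(v) / 1024 for v in match.groups()[:4])
        detections.append(
            {
                "label": match.group(5).strip(),
                "box": (x0 * width, y0 * height, x1 * width, y1 * height),
            }
        )
    return detections


if __name__ == "__main__":
    sample = "<loc0256><loc0128><loc0768><loc0896> cat"
    print(parse_detections(sample))
```

Passing the original image width and height instead of the defaults scales the boxes back to the source image rather than to the 448×448 model input.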

Model Capabilities

Image Caption Generation
Visual Question Answering
Multilingual Text Generation
Object Detection
Image Segmentation
Short Video Understanding
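
Several of the capabilities above are typically selected through a short task prefix in the text prompt. The sketch below assumes the prefix convention described in Google's PaliGemma documentation (`caption {lang}`, `answer {lang} {question}`, `ocr`, `detect {object}`, `segment {object}`), which this page does not spell out. The helper `build_prompt` and its task keys are hypothetical names used only for illustration.

```python
# Assumed task prefixes; verify them against the official model card
# before relying on them for a specific checkpoint.
_TEMPLATES = {
    "caption": "caption {lang}",
    "vqa": "answer {lang} {text}",
    "ocr": "ocr",
    "detect": "detect {text}",
    "segment": "segment {text}",
}

# Tasks that need extra text (a question or an object name).
_NEEDS_TEXT = {"vqa", "detect", "segment"}


def build_prompt(task, text="", lang="en"):
    """Build the text prompt that accompanies an image for a given task."""
    if task not in _TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    text = text.strip()
    if task in _NEEDS_TEXT and not text:
        raise ValueError(f"task {task!r} requires a question or object name")
    return _TEMPLATES[task].format(lang=lang, text=text)


if __name__ == "__main__":
    print(build_prompt("caption"))
    print(build_prompt("vqa", "What is the person holding?"))
    print(build_prompt("detect", "car"))
```

Changing `lang` (for example to `"de"`) requests output in another language for the caption and question-answering prompts, in line with the multilingual text generation capability listed above.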

Use Cases

Content Understanding
Automatic Image Tagging
Generates descriptive text labels for images.
Achieves a CIDEr score of 142.4 on the COCO-35L English test set.
Visual Question Answering System
Answers natural language questions about image content.
Achieves 70.8% accuracy on the AOKVQA validation set.
Document Processing
Document Visual Question Answering
Understands text and layout in scanned documents.
Achieves 76.6% accuracy on the DocVQA validation set.