
PaliGemma 2 10B PT 896

Developed by Google
PaliGemma 2 is a vision-language model (VLM) from Google that builds on Gemma 2; it accepts image and text input and generates text output.
Downloads: 233
Release date: 11/21/2024

Model Overview

A multimodal model built on the SigLIP vision encoder and the Gemma 2 language model, excelling in vision-language tasks such as image captioning, visual question answering, text reading, object detection, and segmentation

Model Features

Multimodal Understanding
Processes both image and text inputs simultaneously to achieve cross-modal understanding and generation
High-Resolution Support
Supports high-resolution image input (896×896), enhancing detail understanding capabilities
Multi-Task Adaptation
Can be fine-tuned for various vision-language tasks, including detection, segmentation, and question answering
Responsible AI
Training data undergoes strict safety filtering to remove inappropriate content and sensitive personal information
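The 896×896 input resolution matters because the SigLIP vision encoder patchifies the image, so resolution directly determines how many image tokens the language model attends to. A minimal sketch of that arithmetic, assuming a 14-pixel patch size (as in the SigLIP-So400m/14 encoder; verify against your checkpoint's configuration):

```python
PATCH = 14  # assumed SigLIP patch edge in pixels (an assumption to verify)

def image_tokens(resolution: int, patch: int = PATCH) -> int:
    """Number of vision tokens for a square input of the given resolution."""
    side = resolution // patch  # patches per image edge
    return side * side

# Token counts for the three common PaliGemma input resolutions.
for res in (224, 448, 896):
    print(res, image_tokens(res))  # 224 -> 256, 448 -> 1024, 896 -> 4096
```

The quadratic growth (4096 tokens at 896×896 versus 256 at 224×224) is the cost paid for the finer detail understanding described above.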

Model Capabilities

Image Caption Generation
Visual Question Answering
Multilingual Text Generation
Object Detection
Image Segmentation
Text Reading
Short Video Understanding
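For the detection capability, PaliGemma-family models emit boxes as text: four `<locYYYY>` tokens (y_min, x_min, y_max, x_max, each a bin in 0..1023) followed by a label. A hedged sketch of decoding that output into pixel boxes, assuming this published token convention holds for your checkpoint (the exact regex and bin count are assumptions to verify against real model output):

```python
import re

# Matches four 4-digit <locYYYY> tokens followed by a label.
_DET = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)")

def parse_detections(text: str, width: int, height: int):
    """Parse '<loc...><loc...><loc...><loc...> label' spans into boxes.

    Returns a list of (label, (x_min, y_min, x_max, y_max)) in pixels.
    Bin order is assumed to be y_min, x_min, y_max, x_max on a 1024-step grid.
    """
    boxes = []
    for ymin, xmin, ymax, xmax, label in _DET.findall(text):
        x0 = int(xmin) / 1024 * width
        y0 = int(ymin) / 1024 * height
        x1 = int(xmax) / 1024 * width
        y1 = int(ymax) / 1024 * height
        boxes.append((label.strip(), (x0, y0, x1, y1)))
    return boxes

out = parse_detections("<loc0256><loc0128><loc0768><loc0896> cat",
                       width=896, height=896)
print(out)  # [('cat', (112.0, 224.0, 784.0, 672.0))]
```

Segmentation output adds `<segXXX>` tokens after the box; those would need a separate decoder.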

Use Cases

Content Understanding
Automatic Image Annotation
Generates descriptive text for images
Achieves a CIDEr score of 142.4 for English captions on the COCO-35L dataset
Document Parsing
Extracts and interprets text from scanned documents
Achieves 76.6% accuracy on the DocVQA validation set
Intelligent Interaction
Visual Question Answering System
Answers complex questions about image content
Achieves 87% accuracy on the A-OKVQA multiple-choice task
Chart Understanding
Parses and interprets chart data
Achieves 66.4% accuracy on human-annotated ChartQA data