
PaliGemma 3B Mix 224

Developed by Google
PaliGemma is a versatile, lightweight vision-language model (VLM) built on the SigLIP vision model and the Gemma language model. It takes image and text inputs and produces text outputs.
Downloads: 143.03k
Release Time: 5/12/2024

Model Overview

PaliGemma accepts images and text as input, generates text as output, and supports multiple languages. It is designed for a wide range of vision-language tasks, such as image captioning, visual question answering, text reading, object detection, and segmentation.
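As a rough illustration of how the image side of the input is sized, the sketch below counts the patch tokens a SigLIP-style encoder would produce for a square image. It assumes a 14-pixel patch (as in SigLIP-So400m/14) and one token per patch, and it takes the "224" in the model name to be the input resolution. These are assumptions for illustration, not values read from the checkpoint.

```python
# Sketch: number of image tokens a ViT-style encoder emits for a square input.
# Assumption: 14x14-pixel patches (SigLIP-So400m/14), one token per patch.
PATCH_SIZE = 14  # assumed patch size in pixels


def image_token_count(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Return how many patch tokens a square image of `resolution` pixels yields."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


for res in (224, 448, 896):
    print(res, image_token_count(res))
```

Under these assumptions a 224-pixel image becomes a 16 x 16 grid, i.e. 256 tokens, and doubling the resolution quadruples the token count, which is why higher-resolution variants are more expensive to run.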

Model Features

Multi-task Support
A task prefix in the prompt selects the vision-language task, such as detection, segmentation, or question answering
Lightweight Design
Compact model with only about 3 billion parameters, suitable for a wide range of applications
Multilingual Capability
Supports text generation and understanding in multiple languages
Responsible AI
Training data strictly filtered to remove unsafe, toxic, and sensitive content
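The task-prefix mechanism can be sketched as a small prompt builder. The prefixes below (`caption {lang}`, `answer {lang} {question}`, `ocr`, `detect {object} ; {object}`, `segment {object}`) follow the convention Google documents for the mix checkpoints, but the helper `build_prompt` itself is hypothetical; verify the exact prefix strings against the checkpoint you load.

```python
from typing import Optional, Sequence


def build_prompt(task: str, lang: str = "en", text: str = "",
                 objects: Optional[Sequence[str]] = None) -> str:
    """Return a task-prefixed prompt string (hypothetical helper, assumed prefixes)."""
    if task == "caption":
        return f"caption {lang}"  # caption in the given language code
    if task == "answer":
        return f"answer {lang} {text}"  # visual question answering
    if task == "ocr":
        return "ocr"  # read the text that appears in the image
    if task == "detect":
        if not objects:
            raise ValueError("detect needs at least one object name")
        return "detect " + " ; ".join(objects)  # several objects separated by ' ; '
    if task == "segment":
        if not text:
            raise ValueError("segment needs an object name")
        return f"segment {text}"
    raise ValueError(f"unknown task: {task}")


print(build_prompt("caption", lang="es"))
print(build_prompt("detect", objects=["cat", "dog"]))
```

The language code in the caption and answer prefixes is how the multilingual capability is exercised: the same image with `caption es` instead of `caption en` requests a Spanish caption.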

Model Capabilities

Image caption generation
Visual question answering
Text reading
Object detection
Object segmentation
Multilingual text generation

Use Cases

Content Understanding
Image caption generation
Generate descriptive text for input images
CIDEr score of 144.60 on the COCO Captions validation set (448×448 input resolution)
Intelligent Interaction
Visual question answering
Answer questions about image content
Computer Vision
Object detection
Detect objects in images and output bounding box coordinates
Image segmentation
Segment objects in images
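Detection and segmentation results come back as text, so they have to be parsed. The sketch below assumes the output format described for PaliGemma: each object is four `<locNNNN>` tokens (y_min, x_min, y_max, x_max, quantized into 1024 bins), optionally followed by `<segNNN>` codeword tokens for a mask, then the label, with objects separated by ` ; `. The function `parse_output` is hypothetical, and turning the `<seg>` codes into a pixel mask needs a separate decoder that is not shown here.

```python
import re
from typing import Dict, List

LOC_BINS = 1024  # assumed number of quantization bins per coordinate

_LOC = re.compile(r"<loc(\d{4})>")
_SEG = re.compile(r"<seg(\d{3})>")
_ANY_TOKEN = re.compile(r"<(?:loc\d{4}|seg\d{3})>")


def parse_output(text: str, width: int, height: int) -> List[Dict]:
    """Parse '<loc..>x4 [<seg..>...] label ; ...' into pixel boxes (hypothetical helper)."""
    results = []
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        locs = [int(v) for v in _LOC.findall(chunk)]
        if len(locs) != 4:
            continue  # skip anything that is not a well-formed detection
        segs = [int(v) for v in _SEG.findall(chunk)]
        label = _ANY_TOKEN.sub("", chunk).strip()
        y0, x0, y1, x1 = locs  # assumed token order: y_min, x_min, y_max, x_max
        box = (x0 / LOC_BINS * width, y0 / LOC_BINS * height,
               x1 / LOC_BINS * width, y1 / LOC_BINS * height)
        results.append({"label": label, "box": box, "seg_codes": segs})
    return results


out = "<loc0256><loc0128><loc0768><loc0512> cat ; <loc0000><loc0000><loc1023><loc1023> dog"
for det in parse_output(out, width=448, height=224):
    print(det["label"], det["box"])
```

Boxes are returned as (x_min, y_min, x_max, y_max) in pixels of the original image, which is the order most drawing libraries expect, so the parser swaps the assumed y-first token order.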