
PaliGemma 3B Mix 448

Developed by Google
PaliGemma is a versatile, lightweight vision-language model (VLM) built on the SigLIP vision model and the Gemma language model; it takes image and text inputs and generates text outputs.
Downloads: 5,488
Release Date: 5/13/2024

Model Overview

PaliGemma is a 3-billion-parameter vision-language model that accepts 448×448 input images and 512-token text sequences. This mix checkpoint is fine-tuned on a mixture of downstream academic datasets and supports tasks including image captioning, visual question answering, text reading, object detection, and segmentation.
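The card itself includes no usage code. As a minimal sketch of how the checkpoint is typically loaded and prompted, assuming the Hugging Face Transformers PaliGemma integration, the google/paligemma-3b-mix-448 model id, and a placeholder image URL:

```python
# Minimal captioning sketch; the model id and example URL are assumptions,
# not taken from this card.
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; the processor resizes it to 448x448.
url = "https://example.com/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The mix checkpoints expect a task prefix; "caption en" requests an English caption.
inputs = processor(text="caption en", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)

# Strip the prompt tokens and decode only the newly generated caption.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```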

Model Features

Versatile vision-language capabilities
Supports various vision-language tasks including image captioning, visual question answering, text reading, object detection, and segmentation
Multilingual support
Capable of processing text inputs and outputs in multiple languages
Lightweight design
At only 3 billion parameters, it is lighter and more efficient than comparable vision-language models
High-quality pretraining data
Pretrained using rigorously filtered datasets like WebLI to ensure data quality and safety

Model Capabilities

Image caption generation
Visual question answering
Text reading
Object detection
Object segmentation
Multilingual text generation
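The mix checkpoints select among these capabilities through a short task prefix in the text prompt. The prefixes below follow Google's published PaliGemma conventions rather than this card, so treat the exact strings as assumptions:

```python
# Task prefixes for the PaliGemma mix checkpoints (sketch; prompt strings
# follow published PaliGemma usage conventions, not this card).
prompts = {
    "captioning_en": "caption en",                        # English image caption
    "captioning_es": "caption es",                        # Spanish image caption
    "vqa":           "answer en what color is the car?",  # visual question answering
    "text_reading":  "ocr",                               # read text in the image
    "detection":     "detect car",                        # boxes returned as <loc####> tokens
    "segmentation":  "segment car",                       # masks returned as <seg###> tokens
}

# Each prompt is passed to the processor exactly like "caption en" above, e.g.:
# inputs = processor(text=prompts["vqa"], images=image, return_tensors="pt")
```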

Use Cases

Content generation
Multilingual image captioning: generate descriptive text for images in different languages
Example output: 'Un auto azul estacionado frente a un edificio.' (Spanish for 'A blue car parked in front of a building.')
Visual understanding
Visual question answering: answer natural-language questions about image content
Computer vision
Object detection: identify objects in an image and output their bounding-box coordinates (see the parsing sketch after this list)
Image segmentation: segment the named objects within an image
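For the object detection use case, the model emits bounding boxes as special location tokens rather than free text. A minimal parsing sketch, assuming the documented mix-checkpoint format in which each detection is four <locNNNN> tokens encoding y_min, x_min, y_max, x_max on a 0-1023 grid, followed by the object label:

```python
import re

def parse_detection(output_text: str, image_width: int, image_height: int):
    """Parse '<loc####><loc####><loc####><loc####> label' spans into pixel boxes.

    Assumes the PaliGemma convention of four location tokens per detection,
    ordered y_min, x_min, y_max, x_max and normalized to a 1024-bin grid.
    """
    boxes = []
    pattern = r"((?:<loc\d{4}>){4})\s*([^;<]+)"
    for loc_group, label in re.findall(pattern, output_text):
        y_min, x_min, y_max, x_max = (int(v) for v in re.findall(r"<loc(\d{4})>", loc_group))
        boxes.append({
            "label": label.strip(),
            "box": (
                x_min / 1024 * image_width,   # left
                y_min / 1024 * image_height,  # top
                x_max / 1024 * image_width,   # right
                y_max / 1024 * image_height,  # bottom
            ),
        })
    return boxes

# Example (hypothetical model output for a 448x448 image):
# parse_detection("<loc0123><loc0250><loc0800><loc0900> car", 448, 448)
```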