
Paligemma 3b Ft Refcoco Seg 896

Developed by Google
PaliGemma is a lightweight vision-language model developed by Google, built upon the SigLIP vision model and Gemma language model, supporting multilingual text generation and visual understanding tasks.
Downloads 20
Release Time: 5/12/2024

Model Overview

A versatile vision-language model that accepts image and text inputs to generate text outputs, supporting tasks such as image captioning, visual question answering, object detection, and segmentation.

Model Features

Lightweight Design
With only 3 billion parameters, it is suitable for deployment on various hardware platforms.
Multi-task Support
Supports various vision-language tasks such as Q&A, captioning, and segmentation through task prefix configuration.
Multilingual Capability
Supports text generation and understanding in multiple languages.
High-Resolution Processing
Supports input image resolutions up to 896×896 pixels.
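The task-prefix mechanism mentioned above can be sketched as a small prompt builder. This is only an illustration: `build_prompt` is a hypothetical helper, and the prefix strings (`caption`, `answer`, `detect`, `segment`) are assumed from the conventions commonly documented for PaliGemma rather than taken from this page. A checkpoint fine-tuned on RefCOCO segmentation would most likely expect the `segment` form only.

```python
from typing import Optional


def build_prompt(task: str, text: Optional[str] = None, lang: str = "en") -> str:
    """Build a task-prefixed prompt string (hypothetical helper).

    The prefix strings are assumptions based on commonly documented
    PaliGemma conventions; check the model card before relying on them.
    """
    if task == "caption":
        # Captioning needs only the prefix and a language code.
        return f"caption {lang}"
    if task == "vqa":
        if not text:
            raise ValueError("vqa requires a question")
        return f"answer {lang} {text}"
    if task in ("detect", "segment"):
        if not text:
            raise ValueError(f"{task} requires an object description")
        # Detection and segmentation take the target description directly.
        return f"{task} {text}"
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    print(build_prompt("segment", "the cat on the left"))
    print(build_prompt("vqa", "What color is the car?"))
```

The resulting string would be passed to the model together with the image; switching tasks then means changing only the prefix, not the model.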

Model Capabilities

Image caption generation
Visual question answering
Object detection
Image segmentation
Multilingual text generation
Text reading comprehension
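Because the model emits text only, detection and segmentation results arrive as special tokens inside the generated string. The sketch below shows one way to recover bounding boxes from such output. It assumes the commonly described PaliGemma encoding: four `<locNNNN>` tokens per object, in the order y_min, x_min, y_max, x_max, each a bin index from 0 to 1023 over the image size. `parse_boxes` is a hypothetical helper, and decoding the `<segNNN>` mask tokens is out of scope here because it requires a separate learned decoder.

```python
import re
from typing import List, Tuple

# Matches location tokens such as <loc0256>; the 4-digit format is an assumption.
LOC_PATTERN = re.compile(r"<loc(\d{4})>")


def parse_boxes(output: str, width: int, height: int) -> List[Tuple[float, float, float, float]]:
    """Return (x_min, y_min, x_max, y_max) pixel boxes found in generated text."""
    values = [int(v) for v in LOC_PATTERN.findall(output)]
    boxes = []
    # Consume location tokens four at a time; leftovers are ignored.
    for i in range(0, len(values) - 3, 4):
        y_min, x_min, y_max, x_max = values[i:i + 4]
        boxes.append((
            x_min / 1024 * width,
            y_min / 1024 * height,
            x_max / 1024 * width,
            y_max / 1024 * height,
        ))
    return boxes


if __name__ == "__main__":
    sample = "<loc0256><loc0128><loc0768><loc0896> cat"
    print(parse_boxes(sample, width=896, height=896))
```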

Use Cases

Computer Vision
Image Captioning
Generates multilingual descriptions for input images.
Achieves a CIDEr score of 144.60 on the COCO caption validation set.
Visual Question Answering
Answers natural language questions about image content.
Achieves an accuracy of 85.64% on the VQAv2 test set.
Document Processing
Document Question Answering
Understands document image content and answers questions.
Achieves an ANLS score of 84.77 on the DocVQA test set.
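For readers unfamiliar with the DocVQA metric, ANLS (Average Normalized Levenshtein Similarity) can be sketched as follows. The page does not say how its score was computed, so this uses the commonly cited definition as an assumption: per question, take the best similarity over all reference answers, zero it out when the normalized edit distance reaches 0.5, then average. A reported value such as 84.77 would correspond to this average multiplied by 100.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: List[str], references: List[List[str]], threshold: float = 0.5) -> float:
    """Average best-match similarity; the 0.5 threshold is the usual default."""
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Similarities at or beyond the threshold distance count as zero.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(anls(["invoice", "abcd"], [["Invoice"], ["abce"]]))
```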