PaliGemma 3B ft RSVQA-LR 224

Developed by Google
PaliGemma is a versatile, lightweight vision-language model (VLM) that takes image and text inputs, generates text outputs, and supports multiple languages.
Downloads 223
Release Date: 5/12/2024

Model Overview

PaliGemma is built on open components and is suitable for various vision-language tasks, such as image and short video captioning, visual question answering, text reading, object detection, and object segmentation.

Model Features

Multimodal Input: Processes image and text inputs together to generate text outputs.
Multi-Task Support: Supports various vision-language tasks, including caption generation, visual question answering, object detection, and segmentation.
Multilingual Capability: Supports multi-language processing, suitable for international application scenarios.
Lightweight Design: A lightweight model with 3 billion parameters, suitable for various deployment scenarios.

Model Capabilities

Image Caption Generation
Visual Question Answering
Object Detection
Object Segmentation
Multilingual Processing
Text Reading
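
The capabilities listed above are usually selected through a short task prefix in the text prompt. The following is a minimal sketch, not an official recipe. The model id is inferred from the page title. The prefix strings ("caption", "answer", "detect", "segment", "ocr") follow conventions described in public PaliGemma documentation and are treated here as assumptions. The helper names build_prompt and generate are hypothetical. Since this checkpoint is fine-tuned for visual question answering, prefixes other than "answer" may not give useful results.

```python
# Assumed Hugging Face model id, inferred from the page title; verify before use.
MODEL_ID = "google/paligemma-3b-ft-rsvqa-lr-224"


def build_prompt(task: str, text: str = "", lang: str = "en") -> str:
    """Build a PaliGemma-style task prompt (prefix conventions are assumptions)."""
    if task == "caption":
        return f"caption {lang}"
    if task == "answer":
        return f"answer {lang} {text}"
    if task in ("detect", "segment"):
        return f"{task} {text}"
    if task == "ocr":
        return "ocr"
    raise ValueError(f"unknown task: {task}")


def generate(image_path: str, prompt: str, max_new_tokens: int = 20) -> str:
    """Run one image + prompt through the model and return the decoded text."""
    # Heavy imports stay local so build_prompt works without them installed.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens so only the newly generated answer is decoded.
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

A call might look like generate("scene.png", build_prompt("answer", "Is there a road?")), where scene.png is a placeholder path. Downloading the weights may require accepting the model license on the hosting site.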

Use Cases

Content Generation
Multilingual Image Captioning: Generate descriptive captions for images in multiple languages. Reported CIDEr score: 141.2 (English) on the COCO-35L dataset.

Visual Question Answering
Complex Visual Question Answering: Answer complex questions about image content. Reported accuracy: 85.64% on the VQAv2 test set.

Document Analysis
Document Visual Question Answering: Extract information from document images and answer questions. Reported ANLS: 84.77 on the DocVQA test set.
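
Two of the metrics quoted above can be illustrated in a few lines. The sketch below assumes the common definitions: ANLS (Average Normalized Levenshtein Similarity) with a 0.5 threshold, and the simplified VQA accuracy min(1, matches / 3). The official evaluation scripts apply extra answer normalization and may differ in detail. The function names are hypothetical, and scores on this page appear to be scaled by 100.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Similarities at or beyond the threshold distance count as zero.
            sim = 1.0 - nl if nl < threshold else 0.0
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


def vqa_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA accuracy: full credit once three annotators agree."""
    p = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == p)
    return min(1.0, matches / 3.0)
```

Under these definitions, a prediction that differs from the reference by one character in five scores 0.8 on that question, while a prediction more than halfway wrong scores 0.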