
PaliGemma 3B FT DocVQA 896

Developed by Google
PaliGemma is a lightweight vision-language model developed by Google, built on the SigLIP vision model and the Gemma language model, supporting multilingual image-text understanding and generation.
Downloads: 519
Release Date: 5/12/2024

Model Overview

A versatile vision-language model that takes image and text inputs and generates text outputs, supporting tasks such as image captioning, visual question answering, text reading, object detection, and segmentation.

Model Features

Lightweight and Efficient
At only 3 billion parameters, it keeps computational resource requirements low while maintaining strong performance.
Multi-task Support
Supports a range of vision-language tasks, including question answering, captioning, detection, and segmentation, each selected by a task prefix in the prompt (see the sketch after this list).
Multilingual Capability
The pre-training data covers 35 languages, supporting cross-lingual image understanding and generation.
Responsible AI
The training data has undergone strict content-safety filtering and ethics review.
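
To make the prefix mechanism concrete, below is a minimal sketch using the Hugging Face transformers API. The task prefixes follow the conventions published for PaliGemma's mixed-task checkpoints; the google/paligemma-3b-mix-448 checkpoint ID, the image path, and the question text are illustrative assumptions, and the exact prompt format for any given checkpoint should be confirmed against its model card.

```python
# A minimal sketch of prefix-driven multi-task inference with the Hugging Face
# transformers API. Checkpoint ID, image path, and prompts are illustrative
# assumptions; confirm exact prompt formats against the model card.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-448"  # a mixed-task checkpoint (assumed here)
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder input image

# Each task is selected purely by the text prefix of the prompt.
prompts = [
    "caption en",                             # English captioning
    "caption es",                             # Spanish captioning
    "ocr",                                    # text reading
    "answer en What is shown in the image?",  # visual question answering
    "detect car",                             # detection (emits location tokens)
    "segment car",                            # segmentation (emits mask tokens)
]

for prompt in prompts:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    # Match the pixel values to the model's bfloat16 weights.
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    generated = output[0][inputs["input_ids"].shape[-1]:]
    print(f"{prompt!r} -> {processor.decode(generated, skip_special_tokens=True)}")
```

For mixed-task prompting like this, a mix checkpoint is the natural choice; the DocVQA fine-tune described on this page is specialized for document question answering.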

Model Capabilities

Image Caption Generation
Visual Question Answering
Document Understanding
Object Detection
Image Segmentation
Multilingual Text Generation

Use Cases

Document Processing
DocVQA Document Question Answering
Extract information from scanned documents or document images and answer questions about them. This checkpoint is fine-tuned specifically on the DocVQA dataset; a minimal inference sketch follows below.
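
Below is a minimal document question-answering sketch with this page's checkpoint, again assuming the Hugging Face transformers API; the scan path and question are placeholders, and whether this fine-tune expects a bare question or an "answer"-prefixed prompt should be verified against the model card.

```python
# A minimal document-QA sketch with this page's checkpoint. The scan path and
# question are placeholders; the prompt format (bare question vs. an "answer"
# prefix) should be verified against the model card.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-ft-docvqa-896"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

# The processor resizes the page to the checkpoint's 896x896 input resolution,
# which helps with the small text typical of scanned documents.
document = Image.open("scanned_invoice.png").convert("RGB")  # placeholder scan
question = "What is the due date?"                           # placeholder question

inputs = processor(text=question, images=document, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Keep only the generated answer tokens, dropping the echoed prompt.
answer_ids = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(answer_ids, skip_special_tokens=True))
```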
Content Moderation
Image Security Detection
Identify sensitive or inappropriate content in images.
Toxicity detection is implemented through the Perspective API.
Multilingual Applications
Cross-lingual Image Captioning
Generate image captions in different languages.
Published examples demonstrate captioning in Spanish.