
PaliGemma 3B FT OCR-VQA 448

Developed by Google
PaliGemma is a versatile, lightweight vision-language model (VLM) developed by Google. It combines the SigLIP vision model with the Gemma language model, accepting both image and text inputs and producing text outputs.
Downloads: 365
Release date: 5/12/2024

Model Overview

A 3-billion-parameter model fine-tuned on the OCR-VQA dataset with 448×448 input images, designed for vision-language tasks such as image captioning, visual question answering, and text reading.
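As a sketch of how such a checkpoint might be used, the snippet below loads it through Hugging Face transformers. The repository id `google/paligemma-3b-ft-ocrvqa-448` and the `answer en` question prefix are assumptions based on the published PaliGemma conventions; verify both against the actual model card before relying on them. Running the inference itself requires downloading the weights and accepting the model license.

```python
MODEL_ID = "google/paligemma-3b-ft-ocrvqa-448"  # assumed Hugging Face repo id


def run_ocr_vqa(image_path: str, question: str) -> str:
    """Answer a question about text in an image (sketch, not a tested recipe)."""
    # Imports are deferred so the helper can be defined without the
    # heavyweight dependencies installed.
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    prompt = f"answer en {question}"  # assumed OCR-VQA style task prefix
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=20)

    # Decode only the newly generated tokens, not the echoed prompt.
    generated = output[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True).strip()
```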

Model Features

Lightweight and Versatile
With only 3 billion parameters, the model handles multiple vision-language tasks.
Multi-Resolution Support
Checkpoints are available at 224, 448, and 896 input resolutions to match different task requirements.
Task Prefix Configuration
Tasks are selected by prepending a task prefix to the prompt (e.g., 'detect' or 'segment').
Responsible Data Filtering
Training data was filtered for content safety and personal information.
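The task-prefix mechanism above can be sketched as a small prompt builder. The prefix strings (`caption en`, `ocr`, `answer en`, `detect`, `segment`) follow the conventions described in the PaliGemma documentation, but treat the exact wording as an assumption rather than a guaranteed API:

```python
def build_prompt(task: str, text: str = "", lang: str = "en") -> str:
    """Assemble a task-prefixed prompt string for a PaliGemma-style model.

    The prefix templates below are assumed from published PaliGemma
    conventions; check the official documentation for the exact strings.
    """
    templates = {
        "caption": f"caption {lang}",        # image captioning
        "ocr": "ocr",                        # read text in the image
        "vqa": f"answer {lang} {text}",      # visual question answering
        "detect": f"detect {text}",          # object detection
        "segment": f"segment {text}",        # image segmentation
    }
    if task not in templates:
        raise ValueError(f"unknown task: {task}")
    return templates[task].strip()
```

For example, `build_prompt("vqa", "What does the sign say?")` yields the prompt `answer en What does the sign say?`, which the processor would pair with the input image.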

Model Capabilities

Image Captioning
Visual Question Answering
Text Reading
Object Detection
Image Segmentation
Multilingual Processing

Use Cases

Document Processing
OCR-VQA: answering questions based on text content within images; test accuracy 74.93% (896 resolution)
DocVQA: document image question answering; ANLS 84.77 (896 resolution)
General Visual Understanding
Image Captioning: generating multilingual descriptions for images; COCO CIDEr 144.60 (448 resolution)
Visual Question Answering: answering questions about image content; VQAv2 test accuracy 85.64%
Specialized Domains
Scientific Chart Understanding: parsing content from scientific charts; SciCap test CIDEr 181.49
Remote Sensing Image Analysis: answering questions about remote sensing images; RSVQA-HR test accuracy 92.79%
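The DocVQA score above uses ANLS (Average Normalized Levenshtein Similarity), which gives partial credit for near-match answers instead of requiring an exact string match. A minimal sketch of the standard per-question score, with the usual threshold τ = 0.5 and case-insensitive comparison assumed:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic single-row dynamic-programming edit distance."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                     # deletion
                        dp[j - 1] + 1,                 # insertion
                        prev + (a[i - 1] != b[j - 1])) # substitution
            prev = cur
    return dp[n]


def anls_score(pred: str, golds: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best similarity to any gold answer.

    Similarity is 1 - NL (normalized Levenshtein distance) when
    NL < tau, else 0; the dataset score averages this over questions.
    """
    p = pred.strip().lower()
    best = 0.0
    for gold in golds:
        g = gold.strip().lower()
        if not p and not g:
            s = 1.0
        else:
            nl = levenshtein(p, g) / max(len(p), len(g))
            s = 1.0 - nl if nl < tau else 0.0
        best = max(best, s)
    return best
```

For instance, `anls_score("pariz", ["paris"])` returns 0.8 (one edit over five characters), while answers more than half-wrong score 0.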