
PaliGemma 3B FT VQAv2 448

Developed by Google
PaliGemma is a lightweight vision-language model developed by Google, combining image understanding and text generation capabilities, supporting multilingual tasks.
Downloads: 121
Released: 5/12/2024

Model Overview

A 3B-parameter vision-language model fine-tuned on the VQAv2 dataset with 448×448 input images. It accepts an image plus text as input and generates text as output, making it suitable for visual question answering, image captioning, and similar tasks.

Model Features

Multimodal Understanding
Processes both image and text inputs simultaneously for cross-modal semantic understanding
Lightweight Architecture
Compact design with only 3 billion parameters, making it practical to deploy in research settings
Task Prefix Configuration
Switches flexibly between vision-language tasks via prompt prefixes such as 'caption' and 'detect'
Multi-Precision Support
Offers float32, bfloat16, and float16 checkpoints, plus 4-bit and 8-bit quantized versions
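The task-prefix convention above can be sketched as a small helper. The prefixes ('caption', 'answer', 'detect', 'segment') follow common PaliGemma usage, but the build_prompt function itself is illustrative, not part of any library; verify the exact prefix strings against the model's documentation.

```python
# Hedged sketch: composing PaliGemma text prompts from task prefixes.
# The prefix strings follow common PaliGemma conventions and are assumptions
# here, not guaranteed by this model card.

def build_prompt(task, text="", lang="en"):
    """Compose a text prompt for a given PaliGemma task prefix."""
    if task == "caption":
        return f"caption {lang}"        # image captioning in a target language
    if task == "vqa":
        return f"answer {lang} {text}"  # visual question answering
    if task == "detect":
        return f"detect {text}"         # object detection for the named class
    if task == "segment":
        return f"segment {text}"        # referring segmentation of the named class
    raise ValueError(f"unknown task: {task}")
```

For example, build_prompt("caption", lang="es") requests a Spanish caption, while build_prompt("vqa", "What color is the car?") poses a question about the image.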

Model Capabilities

Visual Question Answering
Multilingual Image Captioning
Object Detection
Image Segmentation
Cross-Modal Reasoning

Use Cases

Visual Understanding
Multilingual Image Captioning
Generates image captions in languages such as Spanish
Example output: 'Un auto azul estacionado frente a un edificio.' ('A blue car parked in front of a building.')
Visual Question Answering
Answers natural language questions about image content
Fine-tuned on the VQAv2 dataset
Industrial Applications
Object Detection
Identifies object locations in images using the 'detect' prefix
Outputs a list of bounding box coordinates
Document Analysis
Understands content in images containing text
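For the object-detection use case above, PaliGemma-style models emit location tokens of the form <locNNNN> (NNNN in 0000-1023), four per box, followed by the object label. The parser below is a minimal sketch under two assumptions: coordinates arrive in the order y_min, x_min, y_max, x_max, and are denormalized by dividing by 1024 and scaling to the image size. Both conventions match common PaliGemma usage but should be verified against your model version.

```python
import re

# Matches four consecutive <locNNNN> tokens followed by a label; multiple
# detections are typically separated by ';' in the model output.
LOC_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([^<;]+)")


def parse_detections(text, width, height):
    """Return a list of (label, (x_min, y_min, x_max, y_max)) in pixels.

    Assumes loc values are ordered y_min, x_min, y_max, x_max and
    normalized to [0, 1024) -- an assumption, not confirmed by this card.
    """
    boxes = []
    for loc_group, label in LOC_PATTERN.findall(text):
        vals = [int(v) for v in re.findall(r"<loc(\d{4})>", loc_group)]
        y_min, x_min, y_max, x_max = (v / 1024 for v in vals)
        boxes.append((label.strip(),
                      (round(x_min * width), round(y_min * height),
                       round(x_max * width), round(y_max * height))))
    return boxes
```

For instance, on a 448×448 image, the output "<loc0256><loc0128><loc0768><loc0896> car" decodes to the label "car" with a pixel box of (56, 112, 392, 336) under these assumptions.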