PaliGemma 2 10B FT DOCCI 448

Developed by Google
PaliGemma 2 is a multi-purpose vision-language model (VLM) released by Google that combines image and text processing and supports multilingual, multi-task use.
Downloads 2,207
Release Time: 11/21/2024

Model Overview

PaliGemma 2 is a vision-language model based on the Gemma 2 architecture. It takes image and text inputs together and generates text output. The model performs well on a range of vision-language tasks, such as image captioning, visual question answering, and text reading; benchmark scores are listed under Use Cases.
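
The image-plus-text-in, text-out flow described above could look roughly like the following sketch with the Hugging Face `transformers` library. Several things here are assumptions rather than statements from this page: the repository id `google/paligemma2-10b-ft-docci-448` is inferred from the page title, the `caption en` prompt follows the usual PaliGemma task-prefix convention, and `build_prompt` and `describe_image` are hypothetical helper names. Downloading the weights may require accepting the model license, and `device_map="auto"` needs the `accelerate` package.

```python
def build_prompt(task: str = "caption", lang: str = "en") -> str:
    """Build a PaliGemma-style task prompt such as 'caption en' (assumed convention)."""
    return f"{task} {lang}"


def describe_image(image_path: str,
                   model_id: str = "google/paligemma2-10b-ft-docci-448",  # assumed id
                   max_new_tokens: int = 128) -> str:
    """Generate a text description for one image (sketch, not an official recipe)."""
    # Heavy imports stay local so build_prompt can be used without them installed.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()

    image = Image.open(image_path).convert("RGB")
    # The processor expects an explicit <image> placeholder before the text prompt.
    prompt = "<image>" + build_prompt()
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(torch.bfloat16).to(model.device)
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens so only the newly generated text is decoded.
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

Calling `describe_image("photo.jpg")` with a local image path (hypothetical file name) would return the generated caption as a string. Greedy decoding is used here only to keep the output deterministic.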

Model Features

Multimodal Processing
Capable of simultaneously processing image and text inputs and generating text outputs
Multilingual Support
Supports multiple languages, suitable for users in different regions
High-Performance Fine-Tuning
Fine-tunes well across a range of vision-language tasks
High-Resolution Support
Supports 448×448 high-resolution image input
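
To give a sense of what the 448×448 input size means for the model, the short calculation below estimates how many image tokens the vision encoder produces. The patch size of 14 pixels is an assumption based on the SigLIP-So400m/14 encoder used by the PaliGemma family; it is not stated on this page, and `num_image_tokens` is a hypothetical helper name.

```python
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches in a square image.

    patch_size=14 is an assumption (SigLIP-So400m/14), not taken from this page.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


print(num_image_tokens(448))  # 32 patches per side -> 1024 tokens
print(num_image_tokens(224))  # 16 patches per side -> 256 tokens
```

Under that assumption, a 448×448 image costs four times as many image tokens as a 224×224 one, which is the usual trade-off between detail and compute at higher resolution.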

Model Capabilities

Image Caption Generation
Visual Question Answering
Object Detection
Object Segmentation
Text Reading
Multilingual Processing

Use Cases

Image Understanding
Image Caption Generation
Generate detailed text descriptions for input images
English captioning score of 142.4 on the COCO-35L dataset
Visual Question Answering
Answer natural language questions about image content
Accuracy of 85.8% on the VQAv2 dataset
Document Processing
Document Question Answering
Extract information from document images to answer questions
Accuracy of 76.6% on the DocVQA dataset
Table Understanding
Parse and understand table content
TEDS score of 98.94 on the FinTabNet dataset
Medical Imaging
Medical Imaging Report Generation
Generate diagnostic reports based on medical images
ROUGE-L score of 32.41% on the MIMIC-CXR dataset
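
For readers unfamiliar with the ROUGE-L metric quoted for report generation, the sketch below shows the basic idea: it scores a generated text against a reference by the length of their longest common subsequence (LCS) of words. This is a simplified illustration with whitespace tokenization and a balanced F1; the evaluation behind the figure above may use different tokenization or weighting, and the function names are hypothetical.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Three of four words match in order on each side, so precision = recall = 0.75.
print(rouge_l_f1("the lungs are clear", "lungs are clear bilaterally"))  # 0.75
```

Because the score rewards word overlap in order rather than clinical correctness, it should be read as a rough text-similarity measure.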