distilvit is an image-to-text model that pairs a ViT image encoder with a distilled GPT-2 text decoder to generate textual descriptions of images.
Downloads: 17
Release date: June 21, 2024
Model Overview
This model is primarily used for image captioning: it converts an input image into a corresponding textual description. Built on a ViT encoder and a distilled GPT-2 decoder, it has been fine-tuned on image-caption datasets such as Flickr30k and COCO.
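As a sketch, captions can be generated with the Hugging Face transformers image-to-text pipeline. The checkpoint identifier "mozilla/distilvit" is an assumption here; substitute the actual Hub identifier of the checkpoint you use.

```python
def caption_image(image_path: str, model_id: str = "mozilla/distilvit") -> str:
    """Return a generated caption for the image at image_path.

    model_id is assumed to be a vision-encoder-decoder checkpoint on the
    Hugging Face Hub; adjust it to the real distilvit identifier.
    """
    # Import lazily so the helper can be defined without transformers installed.
    from transformers import pipeline

    captioner = pipeline("image-to-text", model=model_id)
    # The pipeline returns a list of dicts like {"generated_text": "..."}.
    result = captioner(image_path)
    return result[0]["generated_text"]

if __name__ == "__main__":
    # Requires a local image file and network access to download the model.
    print(caption_image("example.jpg"))
```

Instantiating the pipeline once and reusing it is preferable when captioning many images, since model loading dominates the cost of a single call.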
Model Features
Efficient Architecture
Uses a distilled GPT-2 as the text decoder, reducing model size and inference cost while maintaining caption quality.
Multi-Dataset Training
Trained and fine-tuned on multiple image caption datasets, including Flickr30k and COCO.
Debiasing
Trained on a debiased version of the Flickr30k dataset to reduce model biases.
Model Capabilities
Image Caption Generation
Image-to-Text
Vision-Language Understanding
Use Cases
Image Understanding
Automatic Image Tagging
Automatically generates descriptive text for images (ROUGE-1 score: 43.006).
Assistance for Visually Impaired Users
Generates image descriptions that can be read aloud, for example by a screen reader.
Content Management
Image Search Engine Optimization
Automatically generates metadata for images.