distilvit

Developed by tarekziade
distilvit is an image-to-text model that pairs a ViT image encoder with a distilled GPT-2 text decoder to generate textual descriptions of images.
Released: 6/21/2024

Model Overview

This model targets image caption generation: it converts an input image into a corresponding textual description. It combines a ViT encoder with a distilled GPT-2 decoder and has been fine-tuned on image caption datasets such as Flickr30k and COCO.
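For illustration, captioning with this model can be sketched using the Hugging Face `transformers` image-to-text pipeline. The model ID `tarekziade/distilvit` is an assumption based on the author name above; check the Hub for the exact repository.

```python
def generate_caption(image_path: str) -> str:
    """Caption an image with distilvit via the Hugging Face pipeline.

    Note: the model ID "tarekziade/distilvit" is an assumption;
    verify the exact repository name on the Hugging Face Hub.
    """
    # Lazy import so the heavy dependency loads only when actually used.
    from transformers import pipeline

    captioner = pipeline("image-to-text", model="tarekziade/distilvit")
    # The image-to-text pipeline returns a list of dicts,
    # each with a "generated_text" field.
    return captioner(image_path)[0]["generated_text"]
```

Calling `generate_caption("photo.jpg")` downloads the model weights on first use and returns a single caption string.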

Model Features

Efficient Architecture
Uses a distilled GPT-2 as the text decoder, reducing model complexity while maintaining performance.
Multi-Dataset Training
Trained and fine-tuned on multiple image caption datasets, including Flickr30k and COCO.
Debiasing
Trained on a debiased version of the Flickr30k dataset to reduce model biases.

Model Capabilities

Image Caption Generation
Image-to-Text
Vision-Language Understanding

Use Cases

Image Understanding
Automatic Image Tagging
Automatically generates descriptive text for images.
ROUGE-1 score: 43.006
Assistance for Visually Impaired
Converts image content into spoken descriptions.
Content Management
Image Search Engine Optimization
Automatically generates metadata for images.
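The ROUGE-1 score quoted above measures unigram overlap between a generated caption and a reference caption (reported on a 0-100 scale). A minimal sketch of ROUGE-1 F1 on a 0-1 scale, using simple whitespace tokenization rather than the stemming and tokenization of the official scorer:

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("a dog runs in the park", "a dog is running in the park")` shares five unigrams, giving precision 5/6, recall 5/7, and F1 of 10/13.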