Distilvit
Developed by Mozilla
A vision-language model that pairs a ViT image encoder with a distilled GPT-2 text decoder for image captioning
Downloads 290
Release Time: 3/18/2024
Model Overview
This vision-language model, still under active development, converts images into descriptive text. It combines a visual encoder with a text decoder and is suited to scenarios such as automatic image annotation and assistive technologies.
Model Features
Distilled Model Architecture
Uses distilled GPT-2 as the text decoder, reducing model complexity while maintaining performance
Debiased Training Data
Utilizes processed COCO and Flickr30k datasets to reduce bias in model outputs
Multi-Source Data Training
Combines multiple high-quality datasets for training to enhance model generalization
Model Capabilities
Image-to-Text
Automatic Image Caption Generation
Visual Content Understanding
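The image-to-text capability listed above could be exercised roughly as follows. This is a minimal sketch, not official usage: it assumes the checkpoint is published on the Hugging Face Hub under the ID `Mozilla/distilvit` and that the `transformers` library's `image-to-text` pipeline task is available in the installed version. The helper names `load_captioner` and `caption` are hypothetical.

```python
from typing import Any, Callable, Dict, List

# Assumption: Hub ID of the checkpoint; this page does not state it.
MODEL_ID = "Mozilla/distilvit"


def load_captioner(model_id: str = MODEL_ID) -> Callable[[Any], List[Dict[str, str]]]:
    """Build an image-captioning pipeline (downloads weights on first use)."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("image-to-text", model=model_id)


def caption(captioner: Callable[[Any], List[Dict[str, str]]], image: Any) -> str:
    """Return the first generated caption for an image path, URL, or PIL image."""
    outputs = captioner(image)
    # The image-to-text pipeline returns a list of dicts keyed by "generated_text".
    return outputs[0]["generated_text"].strip()
```

Under those assumptions, `caption(load_captioner(), "photo.jpg")` would return a one-sentence description of the local file `photo.jpg` (a placeholder path).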
Use Cases
Assistive Technology
Visual Impairment Assistance
Generates image descriptions for visually impaired users
Content Management
Automatic Image Tagging
Generates descriptive tags for image libraries or social media pictures
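For the tagging use case, one simple option is to derive tags from a generated caption by dropping common function words. The sketch below is illustrative only and is not part of the model: the `caption_to_tags` helper and its small stopword list are hypothetical, and a production system would likely use a fuller stopword list or a part-of-speech tagger.

```python
import re
from typing import List

# Hypothetical, deliberately small stopword list for illustration.
STOPWORDS = frozenset({
    "a", "an", "the", "of", "on", "in", "at", "with", "and",
    "is", "are", "to", "for", "by", "next", "near",
})


def caption_to_tags(caption: str, max_tags: int = 5) -> List[str]:
    """Turn a caption into a short, de-duplicated list of lowercase tags."""
    words = re.findall(r"[a-z]+", caption.lower())
    tags: List[str] = []
    for word in words:
        # Keep content words only, preserving their first-seen order.
        if word not in STOPWORDS and word not in tags:
            tags.append(word)
    return tags[:max_tags]


print(caption_to_tags("A dog running on the beach with a ball"))
# → ['dog', 'running', 'beach', 'ball']
```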