Distilvit

Developed by Mozilla
A vision-language model that pairs a ViT image encoder with a distilled GPT-2 text decoder to generate image captions
Downloads 290
Release Time: 3/18/2024

Model Overview

This vision-language model, still under active development, converts images into descriptive text. It combines a visual encoder with a text decoder and is suited to scenarios such as automatic image annotation and assistive technologies.

Model Features

Distilled Model Architecture
Uses a distilled GPT-2 as the text decoder, which reduces model size while aiming to preserve caption quality
Debiased Training Data
Utilizes processed COCO and Flickr30k datasets to reduce bias in model outputs
Multi-Source Data Training
Combines multiple high-quality datasets for training to enhance model generalization

Model Capabilities

Image-to-Text
Automatic Image Caption Generation
Visual Content Understanding
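
As a rough illustration of the image-to-text capability, the sketch below wraps a Hugging Face transformers "image-to-text" pipeline. The Hub ID "Mozilla/distilvit" and the helper names load_captioner and caption_image are assumptions made for this example and are not taken from the description above; pipeline task names can also differ between transformers versions.

```python
from typing import Any, Callable, List

# Assumption: the Hugging Face Hub ID for this model; it is not stated in the text above.
MODEL_ID = "Mozilla/distilvit"


def load_captioner(model_id: str = MODEL_ID) -> Callable[[Any], List[dict]]:
    """Build an image-to-text pipeline (downloads the weights on first use)."""
    # Imported lazily so caption_image can be used without transformers installed.
    from transformers import pipeline

    return pipeline("image-to-text", model=model_id)


def caption_image(captioner: Callable[[Any], List[dict]], image: Any) -> str:
    """Return the first generated caption for an image path, URL, or PIL image."""
    outputs = captioner(image)
    # Image-to-text pipelines return a list of dicts with a "generated_text" key.
    return outputs[0]["generated_text"].strip()
```

A typical call would be caption_image(load_captioner(), "photo.jpg"), where "photo.jpg" is a placeholder path. Passing the captioner in as an argument keeps the helper easy to exercise with a stand-in callable.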

Use Cases

Assistive Technology
Visual Impairment Assistance
Generates image descriptions for visually impaired users
Content Management
Automatic Image Tagging
Generates descriptive tags for image libraries or social media pictures
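
The model itself produces a caption, so tagging needs a post-processing step. One hypothetical way to turn a caption into tags is sketched below; the caption_to_tags helper and its small stop-word list are assumptions for illustration only, not part of the model or its tooling.

```python
import re
from typing import List

# Assumption: a tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "at", "with", "and", "is", "are", "to"}


def caption_to_tags(caption: str, max_tags: int = 5) -> List[str]:
    """Derive up to max_tags unique lowercase tags from a caption, in order of appearance."""
    words = re.findall(r"[a-z]+", caption.lower())
    tags: List[str] = []
    for word in words:
        # Skip filler words and anything already collected.
        if word not in STOP_WORDS and word not in tags:
            tags.append(word)
    return tags[:max_tags]
```

For the caption "A dog running on the beach with a ball", this yields the tags dog, running, beach, and ball.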