ViT Swin Base 224 GPT2 Image Captioning

Developed by Abdou
An image captioning model built on the VisionEncoderDecoder architecture, with Swin Transformer as the visual encoder and GPT-2 as the text decoder, fine-tuned on the COCO 2014 dataset
Downloads 321
Release Time: 2/5/2023

Model Overview

This model automatically generates English descriptions of images by pairing a visual encoder with a text generation decoder
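A minimal sketch of how such a model might be loaded and run with the Hugging Face transformers library. The Hub ID in MODEL_ID is an assumption inferred from the model name and author on this page, and the helper names load_captioner and caption_image are hypothetical; the calls into transformers, PIL, and torch are standard, but this is not necessarily the author's recommended usage.

```python
# Assumption: the checkpoint is published on the Hugging Face Hub under this ID.
MODEL_ID = "Abdou/vit-swin-base-224-gpt2-image-captioning"


def load_captioner(model_id: str = MODEL_ID):
    """Load the encoder-decoder model, its tokenizer, and its image processor."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    processor = AutoImageProcessor.from_pretrained(model_id)
    return model, tokenizer, processor


def caption_image(path: str, model, tokenizer, processor, max_length: int = 32) -> str:
    """Generate one English caption for the image stored at `path`."""
    import torch
    from PIL import Image

    image = Image.open(path).convert("RGB")
    # The processor resizes and normalizes the image into the encoder's input tensor.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
```

Usage would look like `model, tok, proc = load_captioner()` followed by `caption_image("photo.jpg", model, tok, proc)`, where photo.jpg is a placeholder path. The first call downloads the weights, so it needs network access.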

Model Features

Hybrid Architecture
Combines the visual encoding capability of Swin Transformer with the text generation capability of GPT-2
Efficient Training
Fine-tuned on 60% of the COCO dataset in 5 hours on an A100 GPU
Multi-metric Optimization
Simultaneously optimizes multiple text generation metrics such as ROUGE and BLEU
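To make the metrics named above concrete, here is a simplified, dependency-free sketch of unigram BLEU and ROUGE-1. The page does not say which BLEU or ROUGE variants were used, so the function names and the choice of unigram-only, whitespace-tokenized, single-reference scoring are assumptions made for illustration; standard evaluation toolkits use higher-order n-grams and more careful tokenization.

```python
import math
from collections import Counter


def _overlap(candidate: list[str], reference: list[str]) -> int:
    """Count unigram matches, clipping each word to its count in the reference."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(count, ref[word]) for word, count in cand.items())


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision multiplied by the BLEU brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    precision = _overlap(cand, ref) / len(cand)
    # Candidates shorter than the reference are penalized.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return precision * brevity


def rouge1_f(candidate: str, reference: str) -> float:
    """F1 score over unigram precision and recall (ROUGE-1)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    matches = _overlap(cand, ref)
    if matches == 0:
        return 0.0
    precision, recall = matches / len(cand), matches / len(ref)
    return 2 * precision * recall / (precision + recall)
```

BLEU is precision-oriented (how much of the generated caption appears in the reference), while ROUGE-1 F1 also accounts for recall (how much of the reference the caption covers), which is why captioning work commonly reports both.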

Model Capabilities

Image Understanding
English Description Generation
Natural Language Generation

Use Cases

Assistive Technology
Assistance for the Visually Impaired
Automatically generates image descriptions for visually impaired users
Content Management
Automatic Image Tagging
Automatically generates descriptive tags for image libraries
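The tagging use case could be sketched as a post-processing step on a generated caption. The function caption_to_tags and the small stopword list below are hypothetical and purely illustrative; the page does not describe how tags would actually be derived, and a production system would likely use a proper part-of-speech tagger or keyword extractor.

```python
# Illustrative stopword list (an assumption, not part of the model).
STOPWORDS = {
    "a", "an", "the", "of", "on", "in", "at", "with", "and", "is", "are", "to",
}


def caption_to_tags(caption: str, max_tags: int = 5) -> list[str]:
    """Turn a caption into a short list of unique, lowercase keyword tags."""
    tags: list[str] = []
    for raw in caption.lower().split():
        word = raw.strip(".,!?;:\"'")  # drop surrounding punctuation
        if word and word not in STOPWORDS and word not in tags:
            tags.append(word)
    return tags[:max_tags]
```

For example, the caption "A dog is running on the grass." yields the tags dog, running, and grass.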