Vit Roberta Fa Image Captioning Flickr30k
Developed by hezarai
A Persian image captioning model built on a ViT+RoBERTa architecture that generates Persian text descriptions of images
Downloads 85
Release Time: 9/29/2023
Model Overview
This model combines a Vision Transformer (ViT) with a Persian RoBERTa model to interpret image content and generate Persian descriptions. It is primarily intended for image understanding and caption generation in Persian-language applications.
Model Features
Persian-Specific
Built specifically for Persian image captioning, an area where few vision-language models are available
Dual-Modal Architecture
Pairs a Vision Transformer (ViT) image encoder with a RoBERTa text decoder to turn images into text
Fine-Tuned Pretrained Models
The encoder and decoder are fine-tuned from pretrained ViT and RoBERTa models rather than trained from scratch
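The encoder-decoder design described above can be illustrated with a minimal sketch of how a decoder turns encoded image features into a caption, one token at a time. Greedy decoding is an assumption made here for simplicity; the page does not say which decoding strategy the model uses. The names `greedy_caption` and `toy_step`, the token ids, and the feature representation are all hypothetical stand-ins, not part of the model's API.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for encoder output; a real ViT encoder
# would produce a sequence of patch embeddings instead.
Features = List[float]
# A step function returns one score per vocabulary id, given the
# image features and the token ids generated so far.
StepFn = Callable[[Features, Sequence[int]], Sequence[float]]


def greedy_caption(
    features: Features,
    step: StepFn,
    bos_id: int,
    eos_id: int,
    max_len: int = 20,
) -> List[int]:
    """Generate token ids one at a time, always taking the top-scoring id."""
    ids = [bos_id]
    for _ in range(max_len):
        scores = step(features, ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break  # the decoder signalled the end of the caption
        ids.append(next_id)
    return ids[1:]  # drop the start-of-sequence token


def toy_step(features: Features, ids: Sequence[int]) -> Sequence[float]:
    """Toy decoder (assumption): emits ids 2, 3, 4 and then EOS (id 1)."""
    script = [2, 3, 4, 1]
    target = script[min(len(ids) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(5)]


caption_ids = greedy_caption([0.0], toy_step, bos_id=0, eos_id=1)
print(caption_ids)
```

In a real pipeline the step function would be the RoBERTa decoder attending to the ViT features, and the resulting ids would be mapped back to Persian text by the tokenizer.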
Model Capabilities
Image content understanding
Persian text generation
Image-to-text conversion
Use Cases
Assistive Technology
Visual Assistance
Generates Persian descriptions of image content that a screen reader or text-to-speech system can read aloud
Helps visually impaired users understand image content
Content Creation
Social Media Automation
Automatically generates Persian captions for social media images
Simplifies content publishing workflow
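The batch captioning workflow implied by the use cases above could look roughly like the sketch below. The `predict` callable is a hypothetical stand-in for whatever inference call the model exposes, and its return shape (a list of `{"text": ...}` dictionaries) is an assumption, not something this page documents. `fake_predict` exists only so the example runs without the model.

```python
from typing import Callable, Dict, Iterable, List

# Assumed prediction shape: a list of {"text": caption} dictionaries.
Prediction = List[Dict[str, str]]


def caption_images(
    paths: Iterable[str],
    predict: Callable[[str], Prediction],
) -> Dict[str, str]:
    """Map each image path to its caption, skipping files that fail."""
    captions: Dict[str, str] = {}
    for path in paths:
        try:
            result = predict(path)
        except OSError:
            continue  # unreadable or non-image file: skip it
        if result and result[0].get("text"):
            captions[path] = result[0]["text"].strip()
    return captions


def fake_predict(path: str) -> Prediction:
    """Stand-in for the real model call (assumption, for illustration)."""
    if path.endswith(".txt"):
        raise OSError("not an image")
    return [{"text": f" caption for {path} "}]


captions = caption_images(["a.jpg", "notes.txt", "b.png"], fake_predict)
print(captions)
```

Passing the inference call in as a parameter keeps the batching logic independent of the model library, so the same helper works for a publishing tool or an accessibility service.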