Vit Roberta Fa Image Captioning Flickr30k

Developed by hezarai
A Persian image captioning model based on a ViT+RoBERTa architecture, designed to generate Persian text descriptions of images.
Downloads 85
Release Time: 9/29/2023

Model Overview

This model pairs a Vision Transformer (ViT) image encoder with a Persian RoBERTa text decoder to generate Persian descriptions of image content. It is intended primarily for image understanding and caption generation in Persian-language applications.
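
A minimal usage sketch follows. It assumes the model is published on the Hugging Face Hub as `hezarai/vit-roberta-fa-image-captioning-flickr30k` (an ID inferred from the title and developer name above, not stated on this page) and that it is loaded through the `hezar` package's `Model.load` and `predict` calls. The helper names `load_captioner` and `caption_images` and the file name `example.jpg` are hypothetical.

```python
# Assumed repo ID, inferred from the model title and developer; verify before use.
MODEL_ID = "hezarai/vit-roberta-fa-image-captioning-flickr30k"


def load_captioner(model_id=MODEL_ID):
    """Load the captioning model through the hezar package (assumed to be installed)."""
    # Imported lazily so that caption_images() below can be used and tested without hezar.
    from hezar.models import Model

    return Model.load(model_id)


def caption_images(model, image_paths):
    """Map each image path to whatever the model's predict() returns for it.

    `model` is any object with a predict(path) method. The exact structure of
    the prediction depends on the library version, so it is passed through unchanged.
    """
    return {path: model.predict(path) for path in image_paths}


# Usage (downloads the model weights on first run):
# captioner = load_captioner()
# print(caption_images(captioner, ["example.jpg"]))
```

Keeping the model object as a parameter means the same helper works with any captioner that exposes `predict`, which also makes it easy to test without downloading weights.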

Model Features

Persian-Specific
Built specifically for Persian image captioning, an area with few available vision-language models
Dual-Modal Architecture
Pairs a Vision Transformer (ViT) image encoder with a RoBERTa text decoder for image-to-text generation
Fine-Tuned Pretrained Models
The encoder and decoder start from pretrained ViT and RoBERTa models and are then fine-tuned for captioning
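
To illustrate the encoder side of this architecture: a ViT does not process pixels individually. It cuts the image into fixed-size square patches and treats each patch as one input token. The sketch below shows that splitting step in plain Python on a tiny grayscale grid. The function name and the patch size of 2 are illustrative only; this page does not state the model's actual patch size or input resolution.

```python
def split_into_patches(image, patch_size):
    """Split an H x W grid of pixel values into flattened square patches.

    Patches are returned in row-major order (left to right, top to bottom),
    which is the order in which a ViT-style encoder receives them as tokens.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy "image" whose pixel value encodes its position (row * 4 + col).
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(toy_image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]
```

In a real ViT each flattened patch is then linearly projected to an embedding and combined with a position embedding before entering the transformer layers.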

Model Capabilities

Image content understanding
Persian text generation
Image-to-text conversion
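
Image-to-text conversion in an encoder-decoder model is autoregressive: the decoder emits one token at a time, each conditioned on the image features and the tokens produced so far. The following is a minimal sketch of greedy decoding, with a toy scoring function standing in for the real RoBERTa decoder. All names here are hypothetical, and the actual model may use a different strategy such as beam search.

```python
def greedy_decode(score_next, image_features, bos, eos, max_length=20):
    """Generate tokens by repeatedly picking the highest-scoring next token.

    score_next(image_features, tokens) must return a dict {token: score}.
    Decoding stops when `eos` is chosen or after `max_length` generated tokens.
    """
    tokens = [bos]
    for _ in range(max_length):
        scores = score_next(image_features, tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start-of-sequence token


def toy_scorer(image_features, tokens):
    """Toy stand-in for a decoder: walks through a fixed caption, ignoring the features."""
    caption = ["سگی", "در", "پارک", "</s>"]
    target = caption[min(len(tokens) - 1, len(caption) - 1)]
    return {word: (1.0 if word == target else 0.0) for word in caption}


print(greedy_decode(toy_scorer, image_features=None, bos="<s>", eos="</s>"))
```

A real decoder would compute the scores from the encoder's patch representations through cross-attention; the loop structure stays the same.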

Use Cases

Assistive Technology
Visual Assistance
Generates Persian descriptions of image content that a screen reader or text-to-speech system can read aloud to visually impaired individuals
Helps visually impaired users understand image content
Content Creation
Social Media Automation
Automatically generates Persian captions for social media images
Simplifies content publishing workflow