Swin-DistilBERTimbau Open-Source Model - Free Generation of Image Descriptions in Brazilian Portuguese

Swin Distilbertimbau

Developed by laicsiifes

Brazilian Portuguese image captioning model based on Swin Transformer and DistilBERTimbau

Image-to-Text

Transformers

Open Source License: MIT
Tags: #Portuguese image captioning #Visual encoder-decoder #Swin-Transformer

Release Time : 9/1/2024

Model Overview

This model is a visual encoder-decoder specifically designed for generating Brazilian Portuguese image captions. It combines Swin Transformer as the visual encoder and DistilBERTimbau as the text decoder.

Model Features

Efficient dual-model architecture

Combines Swin Transformer's visual encoding capabilities with DistilBERTimbau's text generation capabilities

Portuguese language support

Specially optimized for Brazilian Portuguese image caption generation

High performance

Achieves the best scores among the models evaluated on the Flickr30K Portuguese test split (CIDEr-D 66.73, BLEU@4 24.65)

Model Capabilities

Image understanding

Portuguese text generation

Image-to-text conversion

Use Cases

Content generation

Social media image captioning

Automatically generates Portuguese captions for images on social media platforms

Produces natural and fluent Portuguese image captions

Assistive technology

Provides text descriptions of images for visually impaired users

Helps visually impaired users understand image content

Multilingual applications

Portuguese content creation

Automatically generates image-related content for Portuguese-speaking markets

Improves efficiency in Portuguese content creation

🚀 Swin-DistilBERTimbau for Brazilian Portuguese Image Captioning

This project presents the Swin-DistilBERTimbau model, trained for image captioning on the Flickr30K Portuguese dataset (a version of Flickr30K translated with the Google Translator API). It operates at an image resolution of 224x224 with a maximum sequence length of 512 tokens and generates captions in Brazilian Portuguese.

✨ Features

Model Architecture: Swin-DistilBERTimbau is a Vision Encoder Decoder model. It uses Swin Transformer checkpoints as the encoder and DistilBERTimbau checkpoints as the decoder. The encoder checkpoints come from the Swin Transformer version pre-trained on ImageNet-1k at a resolution of 224x224.
Training and Evaluation Code: The code used for training and evaluation is publicly available at: https://github.com/laicsiifes/ved-transformer-caption-ptbr. In this work, Swin-DistilBERTimbau was trained alongside its companion model Swin-GPorTuguese-2.
Performance Comparison: The other evaluated models (DeiT-BERTimbau, DeiT-DistilBERTimbau, DeiT-GPorTuguese-2, Swin-BERTimbau, ViT-BERTimbau, ViT-DistilBERTimbau, and ViT-GPorTuguese-2) did not perform as well as Swin-DistilBERTimbau and Swin-GPorTuguese-2.

📦 Installation

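The original README gives no installation steps. The usage examples below import requests, Pillow (PIL), transformers, and matplotlib, and request PyTorch tensors (return_tensors="pt"), so those packages need to be available in your environment.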

💻 Usage Examples

Basic Usage

import requests
from PIL import Image

from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel

# load a fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("laicsiifes/swin-distilbertimbau")
tokenizer = AutoTokenizer.from_pretrained("laicsiifes/swin-distilbertimbau")
image_processor = AutoImageProcessor.from_pretrained("laicsiifes/swin-distilbertimbau")

# download an example image and preprocess it (force RGB so grayscale inputs also work)
url = "http://images.cocodataset.org/val2014/COCO_val2014_000000458153.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = image_processor(image, return_tensors="pt").pixel_values

# generate the caption token ids and decode them into text
generated_ids = model.generate(pixel_values)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
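
The same three calls extend to several images at once. The helper below is a minimal sketch, not part of the original README: the name caption_images is hypothetical, and the defaults max_new_tokens=32 and num_beams=4 are illustrative values, not the settings the authors used for evaluation. It assumes the model, tokenizer, and image_processor objects loaded above.

```python
def caption_images(model, tokenizer, image_processor, images,
                   max_new_tokens=32, num_beams=4):
    """Caption a list of PIL images in one batch (hypothetical helper).

    max_new_tokens and num_beams are illustrative defaults, not the
    authors' evaluation settings.
    """
    # The image processor accepts a list of images and returns one batched tensor.
    pixel_values = image_processor(images, return_tensors="pt").pixel_values
    # Beam search with an explicit cap on the number of generated tokens.
    generated_ids = model.generate(
        pixel_values,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    # One decoded string per input image, with special tokens removed.
    texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return [text.strip() for text in texts]

# Usage with the objects loaded above:
# captions = caption_images(model, tokenizer, image_processor, [image])
```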

Advanced Usage

import matplotlib.pyplot as plt

# plot image with caption
plt.imshow(image)
plt.axis("off")
plt.title(generated_text)
plt.show()

📚 Documentation

Model Information

Property	Details
Library Name	transformers
Datasets	laicsiifes/flickr30k-pt-br
Language	Portuguese
Metrics	bleu, rouge, meteor, bertscore
Base Model	adalbertojunior/distilbert-portuguese-cased
Pipeline Tag	image-to-text
Model Name	Swin-DistilBERTimbau

Evaluation Results

The evaluation metrics CIDEr-D, BLEU@4, ROUGE-L, METEOR, and BERTScore (using BERTimbau) are abbreviated as C, B@4, RL, M, and BS, respectively.

Model	Dataset	Eval. Split	C	B@4	RL	M	BS
Swin-DistilBERTimbau	Flickr30K Portuguese	test	66.73	24.65	39.98	44.71	72.30
Swin-GPorTuguese-2	Flickr30K Portuguese	test	64.71	23.15	39.39	44.36	71.70
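
To give a feel for what the RL column measures, here is a simplified sketch of ROUGE-L for one caption pair: the F1 score built from the longest common subsequence (LCS) of candidate and reference tokens. It is not the evaluation code from the authors' repository. Whitespace tokenization, lowercasing, the single reference, and the balanced F1 are all assumptions made for illustration, so its output will not reproduce the table values exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 with whitespace tokens (an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up caption pair: the LCS has 4 tokens, so F1 = 8/11.
score = rouge_l_f1("um cachorro corre na praia",
                   "um cachorro marrom corre pela praia")
print(round(100 * score, 2))  # 72.73
```

The table appears to report scores on a 0 to 100 scale, which is why the example multiplies by 100 before printing.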

🔧 Technical Details

The Swin-DistilBERTimbau model combines the Swin Transformer encoder and the DistilBERTimbau decoder. The encoder is pre-trained on ImageNet-1k at a resolution of 224x224, which helps it extract meaningful visual features from images. The decoder, based on DistilBERTimbau, then generates captions in Brazilian Portuguese. The model was trained on the Flickr30K Portuguese dataset, a translated version of the original Flickr30K dataset.
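
As a rough illustration of what the 224x224 input means for the encoder, the sketch below computes the token grid at each stage of a standard Swin layout. The patch size of 4 and the four stages with 2x downsampling between them are assumptions based on the common Swin configurations. This page does not state which Swin variant was used, and the function name swin_token_grids is hypothetical.

```python
def swin_token_grids(image_size=224, patch_size=4, num_stages=4):
    """Token grid (height, width) at each stage of a standard Swin layout.

    patch_size=4 and num_stages=4 are assumptions; the model card does
    not name the exact Swin variant.
    """
    side = image_size // patch_size  # patch embedding: 224 / 4 = 56
    grids = []
    for _ in range(num_stages):
        grids.append((side, side))
        side //= 2  # patch merging halves each spatial dimension
    return grids


grids = swin_token_grids()
print(grids)                         # [(56, 56), (28, 28), (14, 14), (7, 7)]
print(grids[-1][0] * grids[-1][1])   # 49 visual tokens reach the decoder
```

Under these assumptions, the decoder cross-attends over 49 visual tokens per image while generating the caption.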

📄 License

This project is licensed under the MIT license.

📋 BibTeX entry and citation info

@inproceedings{bromonschenkel2024comparative,
  title={A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning},
  author={Bromonschenkel, Gabriel and Oliveira, Hil{\'a}rio and Paix{\~a}o, Thiago M},
  booktitle={2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}
