Swin-DistilBERTimbau for Brazilian Portuguese Image Captioning
This project presents Swin-DistilBERTimbau, a model trained for image captioning on the Flickr30K Portuguese dataset (a version of Flickr30K translated with the Google Translator API). It operates on 224x224 images with a maximum sequence length of 512 tokens and generates captions in Brazilian Portuguese.
Features
- Model Architecture: Swin-DistilBERTimbau is a Vision Encoder Decoder model that uses Swin Transformer checkpoints as the encoder and DistilBERTimbau checkpoints as the decoder. The encoder checkpoints come from the Swin Transformer pre-trained on ImageNet-1k at a resolution of 224x224.
- Training and Evaluation Code: The code used for training and evaluation is publicly available at https://github.com/laicsiifes/ved-transformer-caption-ptbr. In this work, Swin-DistilBERTimbau was trained alongside its companion model, Swin-GPorTuguese-2.
- Performance Comparison: The other evaluated models (DeiT-BERTimbau, DeiT-DistilBERTimbau, DeiT-GPorTuguese-2, Swin-BERTimbau, ViT-BERTimbau, ViT-DistilBERTimbau, and ViT-GPorTuguese-2) did not perform as well as Swin-DistilBERTimbau and Swin-GPorTuguese-2.
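As a rough illustration of how such an encoder-decoder pairing can be assembled with the transformers library, the sketch below uses VisionEncoderDecoderModel.from_encoder_decoder_pretrained. The helper name build_swin_bert_captioner and the Swin checkpoint ID are assumptions (the exact Swin variant is not stated here); the decoder ID is the base model listed in this README. It is not the authors' training code, which lives in the repository linked above.

```python
def build_swin_bert_captioner(
    encoder_id="microsoft/swin-base-patch4-window7-224",  # assumed Swin variant
    decoder_id="adalbertojunior/distilbert-portuguese-cased",
):
    """Pair a pretrained vision encoder with a pretrained text decoder.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    # Loads both checkpoints. The decoder's cross-attention layers are newly
    # initialized, so the pair must be fine-tuned before captions are meaningful.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )
    tokenizer = AutoTokenizer.from_pretrained(decoder_id)

    # Assumes a BERT-style tokenizer: there is no dedicated BOS token,
    # so [CLS] is used to start generation.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling the helper downloads both checkpoints, so it needs network access and the transformers and torch packages.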
Installation
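This README does not provide specific installation steps. The usage examples below rely on the transformers, Pillow (PIL), requests, and matplotlib Python packages.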
Usage Examples
Basic Usage
import requests
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel

# Load the captioning model, its tokenizer, and its image processor from the Hub
model = VisionEncoderDecoderModel.from_pretrained("laicsiifes/swin-distilbertimbau")
tokenizer = AutoTokenizer.from_pretrained("laicsiifes/swin-distilbertimbau")
image_processor = AutoImageProcessor.from_pretrained("laicsiifes/swin-distilbertimbau")

# Download a sample image and make sure it has three color channels
url = "http://images.cocodataset.org/val2014/COCO_val2014_000000458153.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Resize and normalize the image into the tensor format the encoder expects
pixel_values = image_processor(image, return_tensors="pt").pixel_values

# Generate caption token IDs and decode them into text
generated_ids = model.generate(pixel_values)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
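The generate call above relies on the model's default decoding settings. A minimal sketch of a reusable helper that captions several images at once with beam search follows; the function name caption_images and the default values for num_beams and max_new_tokens are illustrative assumptions, not settings documented for this model.

```python
def caption_images(images, model, tokenizer, image_processor,
                   num_beams=3, max_new_tokens=30):
    """Return one caption per image (hypothetical helper, not part of the model's API)."""
    # The image processor accepts a list of PIL images and batches them.
    pixel_values = image_processor(images, return_tensors="pt").pixel_values
    # num_beams and max_new_tokens are standard generate() arguments;
    # the values used here are assumptions chosen for illustration.
    generated_ids = model.generate(
        pixel_values, num_beams=num_beams, max_new_tokens=max_new_tokens
    )
    captions = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    # Strip stray whitespace left over after special tokens are removed.
    return [caption.strip() for caption in captions]
```

With the objects loaded in Basic Usage, caption_images([image], model, tokenizer, image_processor)[0] would return a caption for the sample image.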
Advanced Usage
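import matplotlib.pyplot as plt

# Show the image with the generated caption as its title
# (reuses image and generated_text from Basic Usage)
plt.imshow(image)
plt.axis("off")
plt.title(generated_text)
plt.show()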

Documentation
Model Information
| Property | Details |
|---|---|
| Library Name | transformers |
| Datasets | laicsiifes/flickr30k-pt-br |
| Language | Portuguese |
| Metrics | bleu, rouge, meteor, bertscore |
| Base Model | adalbertojunior/distilbert-portuguese-cased |
| Pipeline Tag | image-to-text |
| Model Name | Swin-DistilBERTimbau |
Evaluation Results
The evaluation metrics CIDEr-D, BLEU@4, ROUGE-L, METEOR, and BERTScore (using BERTimbau) are abbreviated as C, B@4, RL, M, and BS, respectively.
| Model | Dataset | Eval. Split | C | B@4 | RL | M | BS |
|---|---|---|---|---|---|---|---|
| Swin-DistilBERTimbau | Flickr30K Portuguese | test | 66.73 | 24.65 | 39.98 | 44.71 | 72.30 |
| Swin-GPorTuguese-2 | Flickr30K Portuguese | test | 64.71 | 23.15 | 39.39 | 44.36 | 71.70 |
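For readers unfamiliar with the metrics, the following is a minimal sketch of sentence-level BLEU@4: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and a brevity penalty. It is an unsmoothed illustration with a hypothetical function name and pre-tokenized inputs; it is not the evaluation code behind the scores reported above, which is in the linked repository.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu4(candidate, references):
    """Unsmoothed sentence-level BLEU@4 for one tokenized candidate caption."""
    log_precisions = []
    for n in range(1, 5):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length
    # (ties go to the shorter reference).
    c_len = len(candidate)
    r_len = min((abs(len(ref) - c_len), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return brevity_penalty * math.exp(sum(log_precisions) / 4)


reference = "um cachorro corre na praia".split()
candidate = "um cachorro corre na praia hoje".split()
score = sentence_bleu4(candidate, [reference])
```

In this toy example the four precisions are 5/6, 4/5, 3/4, and 2/3, and the candidate is longer than the reference, so no brevity penalty applies.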
Technical Details
Swin-DistilBERTimbau combines a Swin Transformer encoder with a DistilBERTimbau decoder. The encoder is pre-trained on ImageNet-1k at a resolution of 224x224, which helps it extract meaningful visual features from images. The decoder, based on DistilBERTimbau, generates captions in Brazilian Portuguese. The model was trained on the Flickr30K Portuguese dataset, a translated version of the original Flickr30K dataset.
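To make the 224x224 input resolution concrete, the sketch below works out how many visual tokens the encoder would hand to the decoder. It assumes the standard Swin layout (a 4x4 patch embedding followed by four stages, with 2x downsampling between stages); these values and the function name are assumptions, since the exact Swin configuration is not stated in this README.

```python
def swin_token_grids(image_size=224, patch_size=4, num_stages=4):
    """Return the (height, width) token grid at each Swin stage.

    The defaults describe the standard Swin layout and are assumptions here.
    """
    side = image_size // patch_size  # grid side after patch embedding
    grids = []
    for stage in range(num_stages):
        grids.append((side, side))
        if stage < num_stages - 1:
            side //= 2  # patch merging halves each spatial dimension
    return grids


grids = swin_token_grids()
final_height, final_width = grids[-1]
num_visual_tokens = final_height * final_width
```

Under these assumptions the grids are 56x56, 28x28, 14x14, and 7x7, so the decoder would cross-attend over 49 visual tokens per image.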
License
This project is licensed under the MIT license.
BibTeX entry and citation info
@inproceedings{bromonschenkel2024comparative,
title={A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning},
author={Bromonschenkel, Gabriel and Oliveira, Hil{\'a}rio and Paix{\~a}o, Thiago M},
booktitle={2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)},
pages={1--6},
year={2024},
organization={IEEE}
}