🚀 vit-swin-base-224-gpt2-image-captioning
This model is a fine-tuned VisionEncoderDecoder model, trained on 60% of the COCO2014 dataset, that can be used for image captioning.
✨ Features
- Trained on the large-scale COCO2014 image dataset.
- Achieves good scores on test-set metrics such as ROUGE and BLEU (see the training results below).
- Can be used either through the simple `pipeline` API or, for more flexibility, by initializing the components yourself.
📦 Installation
No specific installation steps are given in the original README. The usage examples below assume `transformers`, `torch`, `Pillow`, and `requests` are installed (e.g. via `pip install transformers torch Pillow requests`).
💻 Usage Examples
Basic Usage
You can use the simple `pipeline` API:

```python
from transformers import pipeline

image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")
caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]["generated_text"]
print(f"caption: {caption}")
```
Advanced Usage
Or initialize everything for more flexibility:
```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
import torch
import os
import urllib.parse as parse
from PIL import Image
import requests

def is_url(string):
    """Return True if `string` parses as a complete URL."""
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except Exception:
        return False

def load_image(image_path):
    """Load an image from a URL or a local file path."""
    if is_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)
    raise ValueError(f"Cannot load image from: {image_path}")

def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # Preprocess the image and move the pixel values to the model's device
    inputs = image_processor(image, return_tensors="pt").to(device)
    # Generate token IDs, then decode them into a caption string
    output = model.generate(**inputs)
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned model and its preprocessing components
model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")

url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg"
caption = get_caption(model, image_processor, tokenizer, url)
print(f"caption: {caption}")
```
Output:

```
Two cows laying in a field with a sky background.
```
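With the components loaded above, several images can be captioned in one `generate` call by batching them through the image processor. A minimal sketch, reusing `model`, `image_processor`, `tokenizer`, and `load_image` from the example above (the second URL is an illustrative placeholder):

```python
# Minimal batched-captioning sketch; the second URL is a placeholder.
urls = [
    "http://images.cocodataset.org/test-stuff2017/000000000019.jpg",
    "http://images.cocodataset.org/test-stuff2017/000000000128.jpg",  # hypothetical example image
]
images = [load_image(u) for u in urls]
pixel_values = image_processor(images, return_tensors="pt").pixel_values.to(device)
outputs = model.generate(pixel_values=pixel_values)
for u, caption in zip(urls, tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"{u} -> {caption}")
```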
📚 Documentation
Model description
The model was initialized from microsoft/swin-base-patch4-window7-224-in22k as the vision encoder and gpt2 as the text decoder.
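For reference, such an encoder-decoder pairing can be assembled with the standard `VisionEncoderDecoderModel.from_encoder_decoder_pretrained` helper. This is a minimal initialization sketch, not the full fine-tuning recipe; the pad/start-token choices below are common conventions, assumed here rather than taken from the model card:

```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast

# Pair a Swin vision encoder with a GPT-2 text decoder; the decoder's
# cross-attention weights are freshly initialized and require fine-tuning.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224-in22k", "gpt2"
)

# GPT-2 has no pad token; reusing EOS is a common (assumed) convention.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
```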
Intended uses & limitations
You can use this model for image captioning only.
Training procedure
You can check this guide to learn how this model was fine-tuned.
Training hyperparameters
The following hyperparameters were used during training (a sketch mapping them onto `Seq2SeqTrainingArguments` follows the list):
- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
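Expressed with the `transformers` trainer API, these settings correspond roughly to the sketch below. Only the hyperparameters listed above come from the model card; `output_dir` and `predict_with_generate` are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Only the values mirrored from the list above come from the model card.
training_args = Seq2SeqTrainingArguments(
    output_dir="image-captioning",      # assumed name
    learning_rate=5e-05,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    num_train_epochs=2,
    lr_scheduler_type="linear",
    predict_with_generate=True,         # assumed: needed to compute ROUGE/BLEU during eval
)
# The listed Adam betas/epsilon match the Trainer's default optimizer settings.
```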
Training results
| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Bleu | Gen Len |
|---------------|-------|-------|-----------------|---------|---------|---------|-----------|---------|---------|
| 1.0018 | 0.38 | 2000 | 0.8860 | 38.6537 | 13.8145 | 35.3932 | 35.3935 | 8.2448 | 11.2946 |
| 0.8827 | 0.75 | 4000 | 0.8395 | 40.0458 | 14.8829 | 36.5321 | 36.5366 | 9.1169 | 11.2946 |
| 0.8378 | 1.13 | 6000 | 0.8140 | 41.2736 | 15.9576 | 37.5504 | 37.5512 | 9.871 | 11.2946 |
| 0.7913 | 1.51 | 8000 | 0.8012 | 41.6642 | 16.1987 | 37.8786 | 37.8891 | 10.0786 | 11.2946 |
| 0.7794 | 1.89 | 10000 | 0.7933 | 41.9119 | 16.3738 | 38.1062 | 38.1292 | 10.288 | 11.2946 |
Total training time: ~5 hours on an NVIDIA A100 GPU.
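ROUGE and BLEU figures like those in the table can be computed for your own predictions with the 🤗 `evaluate` library. A generic sketch, not the exact evaluation script used for this model (the prediction/reference strings are placeholders):

```python
import evaluate

# Generic metric computation; the strings below are placeholders.
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["two cows laying in a field with a sky background"]
references = ["two cows lying down in a grassy field under a blue sky"]

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```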
Framework versions
- Transformers 4.26.0
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2
📄 License
This project is licensed under the MIT license.