BLIP-base fine-tuned for Image Captioning on High-Level descriptions of Actions
This project presents a BLIP base model fine-tuned on the HL dataset for action captioning. Given an image, it generates a high-level description of the action taking place in it.
Features
- Fine-tuned on the HL dataset: Trained specifically to generate action descriptions for images.
- Optimized training: Trained for 6 epochs with the Adam optimizer, half-precision (fp16), and a learning rate of 5e-5.
- Multiple evaluation metrics: Evaluated with CIDEr, SacreBLEU, and ROUGE-L.
Installation
The model is loaded through the transformers library, which you can install with the following command:
pip install transformers
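The usage example below also imports requests and PIL, and asking the processor for PyTorch tensors (return_tensors="pt") requires torch. As a minimal sketch, the following helper checks which of these packages are missing from the current environment. The helper name missing_packages and the REQUIRED mapping are illustrative and not part of this project.

```python
import importlib.util

# Import name -> pip package name (hypothetical mapping for this example)
REQUIRED = {
    "transformers": "transformers",
    "torch": "torch",
    "PIL": "Pillow",
    "requests": "requests",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of the packages that cannot be imported."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]
```

Calling missing_packages() returns an empty list when everything is installed. Otherwise, the returned names can be passed to pip install.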
Usage Examples
Basic Usage
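import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Use a GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained("michelecafagna26/blip-base-captioning-ft-hl-actions")
model = BlipForConditionalGeneration.from_pretrained("michelecafagna26/blip-base-captioning-ft-hl-actions").to(device)

# Download a sample image from the HL dataset and convert it to RGB
img_url = 'https://datasets-server.huggingface.co/assets/michelecafagna26/hl/--/default/train/0/image/image.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image into model-ready tensors
inputs = processor(raw_image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

# Sample one caption with top-k / top-p (nucleus) sampling
generated_ids = model.generate(
    pixel_values=pixel_values,
    max_length=50,
    do_sample=True,
    top_k=120,
    top_p=0.9,
    early_stopping=True,  # only relevant to beam search; kept from the original example
    num_return_sequences=1,
)

# batch_decode returns a list with one string per generated sequence
processor.batch_decode(generated_ids, skip_special_tokens=True)
# e.g. ["she is holding an umbrella."] (sampling is random, so the caption can differ between runs)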
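Because the example samples its output, it can be useful to draw several candidate captions for one image and keep only the distinct ones. The sketch below wraps the same sampling settings in a helper. The function name generate_captions and its parameters are illustrative and not part of this project. It expects an already loaded model and processor, such as the ones created above.

```python
def generate_captions(model, processor, image, device="cpu",
                      num_captions=3, max_length=50):
    """Sample several action captions for one PIL image.

    `model` and `processor` are assumed to be an already loaded
    BlipForConditionalGeneration and BlipProcessor.
    """
    # Preprocess the image and move the tensors to the target device
    inputs = processor(image, return_tensors="pt").to(device)

    # Same top-k / top-p sampling settings as the basic usage example,
    # but returning several sequences instead of one
    generated_ids = model.generate(
        pixel_values=inputs.pixel_values,
        max_length=max_length,
        do_sample=True,
        top_k=120,
        top_p=0.9,
        num_return_sequences=num_captions,
    )
    captions = processor.batch_decode(generated_ids, skip_special_tokens=True)

    # Strip whitespace and drop duplicates while preserving order
    return list(dict.fromkeys(caption.strip() for caption in captions))
```

For example, generate_captions(model, processor, raw_image, device="cuda") returns up to three distinct captions. The list can be shorter when sampling produces the same caption more than once.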
Technical Details
Model fine-tuning
- Epochs: The model was trained for 6 epochs.
- Learning rate: A learning rate of 5e-5 was used.
- Optimizer: The Adam optimizer was employed.
- Precision: Half-precision (fp16) was used during training.
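The settings listed above can be collected in a small configuration object. This is only a sketch: the training script is not part of this card, so the class name FineTuneConfig, its field names, and the build_optimizer helper are hypothetical. Only the values themselves (6 epochs, 5e-5, Adam, fp16) come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in this card (field names are hypothetical)."""
    epochs: int = 6
    learning_rate: float = 5e-5
    optimizer: str = "adam"
    fp16: bool = True


def build_optimizer(parameters, config):
    """Create an Adam optimizer with the reported learning rate."""
    # Imported lazily so the config can be inspected without PyTorch installed
    import torch

    return torch.optim.Adam(parameters, lr=config.learning_rate)
```

With a loaded model, build_optimizer(model.parameters(), FineTuneConfig()) would give an optimizer matching the reported setup. How fp16 was applied during training (for example through automatic mixed precision) is not stated here, so the sketch leaves it as a flag.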
Test set metrics
| Metric | Score |
| --- | --- |
| CIDEr | 123.07 |
| SacreBLEU | 17.16 |
| ROUGE-L | 32.16 |
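To give an intuition for what ROUGE-L measures, here is a simplified sketch of a sentence-level ROUGE-L F1 score based on the longest common subsequence (LCS) of words between a generated caption and a reference. It is an illustration only and assumes lowercase whitespace tokenization. The evaluation tooling behind the scores above is not specified in this card and may tokenize and aggregate differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L F1 between two captions, in the range [0, 1]."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of candidate words in the LCS
    recall = lcs / len(ref_tokens)      # share of reference words in the LCS
    return 2 * precision * recall / (precision + recall)
```

For instance, comparing "she holds an umbrella" with the reference "she is holding an umbrella" gives an LCS of three words (she, an, umbrella), a precision of 3/4, a recall of 3/5, and an F1 of about 0.67.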
License
This project is licensed under the Apache 2.0 license.
Documentation
BibTeX and citation info
@inproceedings{cafagna2023hl,
title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and {R}ationales},
author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
address = {Prague, Czech Republic},
year={2023}
}