blip-base-captioning-ft-hl-actions is an open-source image captioning model that generates high-level descriptions of the actions shown in images.

Blip Base Captioning Ft Hl Actions

Developed by michelecafagna26

This model is a fine-tuned image-to-text generation model based on the BLIP architecture, specifically designed to generate captions describing high-level actions in images.

Image-to-Text

Transformers

English

Open Source License: Apache-2.0

#Image Action Description #High-level Semantic Generation #Multimodal Understanding

Downloads 16

Release Time: 7/22/2023

Model Overview

The model was fine-tuned on the HL dataset, focusing on generating natural language text that describes actions from images.

Model Features

High-level Action Description

Specifically generates descriptive text for high-level actions in images.

Fine-tuning Optimization

Fine-tuned for 6 epochs on the HL dataset to enhance action description capabilities.

Half-precision Training

Trained using fp16 half-precision to improve training efficiency.

Model Capabilities

Image Understanding

Action Description Generation

Natural Language Generation

Use Cases

Image Captioning

Action Scene Description

Generates descriptive text for images containing human actions.

Produces natural language descriptions such as 'She is holding an umbrella.'

🚀 BLIP-base fine-tuned for Image Captioning on High-Level descriptions of Actions

This project presents a BLIP base model fine-tuned on the HL dataset to generate high-level descriptions of the actions shown in images.

✨ Features

Fine-tuned on the HL dataset: Trained specifically to generate action descriptions for images using the HL dataset.
Optimized training: Trained for 6 epochs with the Adam optimizer, half-precision (fp16), and a learning rate of 5e-5.
Multiple evaluation metrics: Evaluated using CIDEr, SacreBLEU, and ROUGE-L.

📦 Installation

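The model is loaded through the transformers library. The usage example also relies on PyTorch, Pillow (PIL), and requests, so install them together if they are missing:

pip install transformers torch pillow requests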

💻 Usage Examples

Basic Usage

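import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "michelecafagna26/blip-base-captioning-ft-hl-actions"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)

# Download a sample image from the HL dataset and convert it to RGB.
img_url = 'https://datasets-server.huggingface.co/assets/michelecafagna26/hl/--/default/train/0/image/image.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image into the pixel tensor the model expects.
inputs = processor(raw_image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

# Sample one caption with top-k and nucleus (top-p) sampling.
# early_stopping only affects beam search, so it is omitted here.
generated_ids = model.generate(pixel_values=pixel_values, max_length=50,
            do_sample=True,
            top_k=120,
            top_p=0.9,
            num_return_sequences=1)

# batch_decode returns a list with one string per generated sequence.
processor.batch_decode(generated_ids, skip_special_tokens=True)

Example output (sampling is stochastic, so the caption may differ between runs):

>>> ['she is holding an umbrella.']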
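The top_k and top_p arguments in the generate call restrict which tokens can be sampled at each decoding step. The following is a simplified pure-Python sketch of that filtering on a toy distribution, not the transformers implementation. The function name filter_top_k_top_p, the toy vocabulary, and its probabilities are hypothetical and exist only for illustration.

```python
import random

def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose renormalized cumulative probability reaches top_p."""
    # Rank tokens from most to least likely and keep at most top_k of them.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / mass
        # Stop once the kept tokens cover at least top_p of the mass.
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy next-token distribution (hypothetical values).
toy_probs = {"holding": 0.5, "carrying": 0.25, "eating": 0.125, "riding": 0.125}
filtered = filter_top_k_top_p(toy_probs, top_k=3, top_p=0.7)

# Sample one token from the filtered, renormalized distribution.
rng = random.Random(0)
token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
```

With top_k=120 and top_p=0.9, as in the usage example, far more candidates survive than in this toy case, which is why repeated runs can produce different captions.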

🔧 Technical Details

Model fine-tuning 🏋️

Epochs: The model was trained for 6 epochs.
Learning rate: A learning rate of 5e-5 was used.
Optimizer: The Adam optimizer was employed.
Precision: Half-precision (fp16) was used during training.
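The reported settings can be collected into a small configuration object. This is a sketch only: the class FineTuneConfig, the helper total_optimizer_steps, and the dataset and batch sizes in the usage lines are hypothetical and are not taken from the author's training code, which is not shown here.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported for this fine-tuning run."""
    epochs: int = 6
    learning_rate: float = 5e-5
    optimizer: str = "adam"
    fp16: bool = True

def total_optimizer_steps(config, num_examples, batch_size):
    """Number of optimizer updates, assuming one update per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * config.epochs

config = FineTuneConfig()
# Hypothetical sizes, chosen only to make the arithmetic concrete.
steps = total_optimizer_steps(config, num_examples=1000, batch_size=32)
```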

Test set metrics 🧾

Metric	Score
CIDEr	123.07
SacreBLEU	17.16
ROUGE-L	32.16
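For orientation, ROUGE-L scores a caption by the longest common subsequence (LCS) it shares with a reference caption. The following is a simplified sketch that assumes whitespace tokenization and the balanced F1 form. It is not the evaluation script behind the reported scores, whose tokenization and weighting may differ, and the function names and example captions are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Balanced F1 over LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# The two captions share the subsequence "she is an umbrella" (4 of 5 tokens).
score = rouge_l_f1("she is holding an umbrella", "she is carrying an umbrella")
```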

📄 License

This project is licensed under the Apache 2.0 license.

📚 Documentation

BibTeX and citation info

@inproceedings{cafagna2023hl,
  title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and {R}ationales},
  author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
  booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
  address={Prague, Czech Republic},
  year={2023}
}
