FuseCap Open-Source Image Description Framework - Use large models for free to generate semantically rich image descriptions

FuseCap Image Captioning

Developed by noamrot

FuseCap is a framework specifically designed for generating semantically rich image captions, leveraging large language models to produce fused image descriptions.

Image-to-Text

Transformers

Open Source License: MIT #Image Caption Generation #Semantically Rich Descriptions #LLM-Enhanced

Downloads 2,771

Release Time: 5/31/2023

Model Overview

FuseCap is an image-to-text model aimed at generating semantically rich image descriptions. By integrating the capabilities of large language models, it provides more detailed and accurate image captions.

Model Features

Semantically Rich Image Descriptions

Leverages large language models to generate more detailed and accurate image descriptions.

Fused Descriptions

Generates more comprehensive image descriptions by fusing multiple description sources.

BLIP-Based Architecture

Uses the BLIP architecture for both training and inference.

Model Capabilities

Image Caption Generation

Semantically Rich Text Output

Multimodal Fusion

Use Cases

Image Understanding

Automatic Image Tagging

Generates detailed descriptions for images, used for automatic tagging and classification.

Produces semantically rich descriptions, improving tagging quality.
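As a rough illustration of the tagging use case, a caption can be reduced to keyword tags. This is a minimal sketch under stated assumptions, not part of FuseCap: the function name `caption_to_tags`, the small stopword list, and the sample caption are all hypothetical, and a production system would likely use a fuller stopword list or a part-of-speech tagger.

```python
import re

# Small illustrative stopword list (an assumption, not from FuseCap).
STOPWORDS = {
    "a", "an", "the", "of", "on", "in", "with", "and", "at", "to",
    "is", "are", "by", "for", "picture",
}


def caption_to_tags(caption, max_tags=10):
    """Turn a caption into lowercase keyword tags, keeping first-seen order."""
    words = re.findall(r"[a-z]+", caption.lower())
    tags = []
    for word in words:
        # Skip filler words and avoid duplicate tags.
        if word not in STOPWORDS and word not in tags:
            tags.append(word)
    return tags[:max_tags]


# Made-up caption for illustration, not actual model output.
print(caption_to_tags("a picture of a red bike leaning on a brick wall"))
# -> ['red', 'bike', 'leaning', 'brick', 'wall']
```

Because richer captions mention more objects and attributes, they yield more candidate tags from the same image.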

Assisting Visually Impaired Individuals

Provides detailed image descriptions for visually impaired individuals to help them understand image content.

Delivers more accurate and detailed image descriptions, enhancing user experience.

🚀 FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

FuseCap is a framework crafted to generate semantically rich image captions, offering a novel approach to image captioning.

🚀 Quick Start

Our BLIP-based model can be run using the following code:

import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Run on the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = BlipProcessor.from_pretrained("noamrot/FuseCap")
model = BlipForConditionalGeneration.from_pretrained("noamrot/FuseCap").to(device)

# Download the sample image and convert it to RGB.
img_url = 'https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# The text prompt is the prefix that the generated caption continues from.
text = "a picture of "
inputs = processor(raw_image, text, return_tensors="pt").to(device)

# Generate with beam search (3 beams), then decode the token IDs into text.
out = model.generate(**inputs, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))
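The same calls extend to several images at once. The following is a minimal sketch rather than code from the FuseCap release: the helper names `caption_images` and `load_fusecap` are hypothetical, and the sketch assumes every image shares one prompt, so the tokenized prompts have equal length and need no padding. The heavy imports are kept inside `load_fusecap` so that nothing is downloaded until it is called.

```python
def load_fusecap(model_id="noamrot/FuseCap"):
    """Load the processor and model as in the Quick Start (hypothetical helper)."""
    import torch
    from transformers import BlipProcessor, BlipForConditionalGeneration

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)
    return processor, model, device


def caption_images(images, processor, model, device="cpu",
                   prompt="a picture of ", num_beams=3):
    """Caption a list of RGB PIL images in a single batch (hypothetical helper)."""
    if not images:
        return []
    # One copy of the prompt per image; identical prompts tokenize to equal length.
    inputs = processor(images=images, text=[prompt] * len(images),
                       return_tensors="pt").to(device)
    out = model.generate(**inputs, num_beams=num_beams)
    # batch_decode converts each generated token sequence back to a string.
    captions = processor.batch_decode(out, skip_special_tokens=True)
    return [caption.strip() for caption in captions]
```

To use it, call `processor, model, device = load_fusecap()` once, open each image with `Image.open(path).convert('RGB')`, and pass the list to `caption_images(images, processor, model, device)`. Batching mainly helps throughput on a GPU; on a CPU the gain is usually small.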

✨ Features

Semantically Rich Captions: Generate high-quality, semantically rich image captions.
BLIP-Based Model: Utilize a powerful BLIP-based model for caption generation.

📚 Documentation

Resources

💻 Project Page: For more details, visit the official project page.
📝 Read the Paper: You can find the paper here.
🚀 Demo: Try out the demo of our BLIP-based model trained using FuseCap.
📂 Code Repository: The code for FuseCap can be found in the GitHub repository.
🗃️ Datasets: The fused captions datasets can be accessed from here.

Upcoming Updates

The official codebase, datasets and trained models for this project will be released soon.

📄 License

This project is licensed under the MIT license.

📚 BibTeX

@inproceedings{rotstein2024fusecap,
  title={Fusecap: Leveraging large language models for enriched fused image captions},
  author={Rotstein, Noam and Bensa{\"\i}d, David and Brody, Shaked and Ganz, Roy and Kimmel, Ron},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={5689--5700},
  year={2024}
}
