vit-gpt2-image-captioning Open-source Model - Freely Deployed to Generate Natural Language Descriptions for Images

ViT GPT-2 Image Captioning

Developed by baseplate

This is an image captioning model based on the Vision Encoder-Decoder architecture, capable of generating natural language descriptions for input images.

Image-to-Text

Transformers

Open Source License: Apache-2.0

#Image-to-Text #Vision-Language Model #Automatic Image Captioning

Downloads 55

Release Time: 4/5/2023

Model Overview

The model uses ViT (Vision Transformer) as the image encoder and GPT-2 as the text decoder: the encoder turns an image into a sequence of feature vectors, and the decoder generates a natural language description conditioned on them. It is primarily used to generate captions or short descriptions for images automatically.
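To make "a sequence of visual features" concrete, the sketch below counts how many tokens a ViT encoder hands to the decoder for one image. The 224x224 input size and 16x16 patch size are assumptions based on the common ViT-Base configuration; this page does not state the values used by this checkpoint, and the function name is hypothetical.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Tokens produced by a ViT encoder: one per image patch plus one [CLS] token.

    The defaults (224, 16) are assumed ViT-Base values, not taken from this page.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


print(vit_sequence_length())  # 197
```

Under these assumptions the decoder attends over 197 encoder positions per image, regardless of what the image shows.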

Model Features

Vision-Language Joint Model

Combines the capabilities of Vision Transformer and language models to achieve cross-modal understanding and generation.

End-to-End Training

The entire model can be trained end-to-end, optimizing the image-to-text conversion process.

Transformer-Based Architecture

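Uses Transformer attention throughout: self-attention inside the ViT encoder and the GPT-2 decoder, plus cross-attention from the decoder to the encoder's image features, which is what relates image content to the generated text.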

Model Capabilities

Image Understanding

Natural Language Generation

Cross-Modal Conversion

Use Cases

Content Generation

Automatic Social Media Image Tagging

Automatically generates descriptive captions for images on social media platforms.

Improves content accessibility and searchability.

Assistive Technology

Generates text descriptions of image content that screen readers or text-to-speech tools can read aloud for visually impaired users.

Enhances accessibility of digital content.
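One way a caption can reach a screen reader is as the `alt` text of an HTML image. The following is a minimal sketch of that idea, not part of this project; the helper name `img_tag` is hypothetical, and the caption is assumed to come from the model.

```python
from html import escape


def img_tag(src: str, caption: str) -> str:
    """Build an <img> element whose alt text is a generated caption.

    Both values are HTML-escaped so quotes in a caption cannot break the attribute.
    """
    return f'<img src="{escape(src, quote=True)}" alt="{escape(caption, quote=True)}">'


print(img_tag("match.png", "a soccer game with a player jumping to catch the ball"))
```

Generated captions can be wrong or repetitive, so a human review step is worth considering before publishing them as alt text.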

Digital Asset Management

Automatic Image Library Tagging

Automatically generates metadata descriptions for large image libraries.

Improves image retrieval efficiency and management capabilities.
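Tagging a library amounts to walking a folder, captioning images in batches, and storing the result per file. The sketch below shows one possible shape for that loop. The names `find_images` and `caption_library`, the suffix list, and the default batch size are all assumptions for illustration; the captioning function is passed in, so any function that maps a list of image paths to a list of captions can be used.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Assumed set of file types to caption; adjust for the library at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_images(root: Path) -> List[Path]:
    """Return all image files under root, in sorted order."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def caption_library(
    root,
    caption_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> Dict[str, str]:
    """Map each image path (relative to root) to a caption from caption_fn."""
    root = Path(root)
    paths = find_images(root)
    metadata: Dict[str, str] = {}
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        captions = caption_fn([str(p) for p in batch])
        if len(captions) != len(batch):
            raise ValueError("caption_fn must return one caption per image")
        for path, caption in zip(batch, captions):
            metadata[path.relative_to(root).as_posix()] = caption
    return metadata
```

With the `predict_step` function from the Basic Usage section, a call such as `caption_library("photos", predict_step)` would return a dictionary that can be written out with `json.dump` or loaded into a search index. The folder name `photos` is a placeholder.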

🚀 nlpconnect/vit-gpt2-image-captioning

This project provides an image captioning model that generates a short text description for an input image. Turning visual content into text is useful in scenarios such as image search and accessibility for visually impaired users.

🚀 Quick Start

This is an image captioning model trained by @ydshieh in Flax. This checkpoint is the PyTorch version of that Flax model.

📚 Documentation

The Illustrated Image Captioning using transformers

https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

💻 Usage Examples

Basic Usage

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import torch
from PIL import Image

# Load the captioning model, its image processor, and its tokenizer from the same checkpoint.
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# Run on the GPU when one is available, otherwise on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Generation settings: cap the caption length (in tokens) and use beam search with 4 beams.
max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}


def predict_step(image_paths):
    """Return one caption per image path."""
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        # The image processor expects 3-channel RGB input.
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")
        images.append(i_image)

    # Preprocess the images into a single batch of pixel tensors.
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    output_ids = model.generate(pixel_values, **gen_kwargs)

    # Convert token IDs back to text and trim surrounding whitespace.
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]
    return preds


predict_step(['doctor.e16ba4e4.jpg'])  # ['a woman in a hospital bed with a woman in a hospital bed']
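The `num_beams` setting above selects beam search, which keeps several partial captions alive at each step instead of committing to the single most likely next word. The toy sketch below illustrates the idea only. It is not the `transformers` implementation: the probability table `TOY_NEXT` is invented, the "model" looks only at the previous token, and `max_length` here simply caps the number of decoding steps.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_NEXT = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.55, "cat": 0.45},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def beam_search(next_probs, num_beams, max_length, bos="<s>", eos="</s>"):
    """Return the highest-scoring token sequence found with num_beams beams."""
    beams = [([bos], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                # Finished sequences are carried forward unchanged.
                candidates.append((tokens, score))
                continue
            for token, prob in next_probs[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams best candidates for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams[0][0]


print(beam_search(TOY_NEXT, num_beams=1, max_length=16))  # ['<s>', 'a', 'dog', '</s>']
print(beam_search(TOY_NEXT, num_beams=2, max_length=16))  # ['<s>', 'the', 'dog', '</s>']
```

With one beam the search is greedy and settles on "a dog" (probability 0.33). With two beams it also explores "the", and finds "the dog" (probability 0.36), which greedy decoding misses.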

Advanced Usage

from transformers import pipeline

image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png")

# [{'generated_text': 'a soccer game with a player jumping to catch the ball '}]
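The sample output above is a list of dictionaries, and the caption carries a trailing space. A small helper can reduce that to clean strings. This is a sketch with a hypothetical name, `extract_captions`; it assumes that passing several images yields one inner list of dictionaries per image, which is worth checking against the installed `transformers` version.

```python
from typing import Any, List


def extract_captions(results: Any) -> List[str]:
    """Flatten image-to-text pipeline output into a list of stripped captions."""
    captions: List[str] = []
    for item in results:
        if isinstance(item, dict):
            captions.append(item["generated_text"].strip())
        else:
            # Assumed nested case: one list of dicts per input image.
            captions.extend(extract_captions(item))
    return captions


sample = [{"generated_text": "a soccer game with a player jumping to catch the ball "}]
print(extract_captions(sample))
# ['a soccer game with a player jumping to catch the ball']
```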

📄 License

This project is licensed under the Apache-2.0 license.

📞 Contact

For any help, you can reach out through the following channels:

https://huggingface.co/ankur310794
https://twitter.com/ankur310794
http://github.com/ankur3107
https://www.linkedin.com/in/ankur310794
