vit-gpt2-image-captioning: Open-Source Image Captioning Model for Generating Natural Language Descriptions of Images

ViT GPT2 Image Captioning

Developed by aryan083

This is an image captioning model based on ViT and GPT2 architectures, capable of generating natural language descriptions for input images.

Image-to-Text

PyTorch

Open Source License: Apache-2.0

#Image to Text #Visual Encoding Decoding #Multimodal Generation

Downloads 31

Release Time: 3/20/2025

Model Overview

The model combines a visual encoder (ViT) with a text decoder (GPT2) to convert image content into natural language descriptions. It is primarily used to generate textual descriptions of images automatically.
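To make the encoder-decoder split more concrete: a ViT encoder cuts the image into fixed-size patches, embeds each patch as one token, and prepends a classification token; the GPT2 decoder then attends to that token sequence while it generates the caption. The sketch below only counts those encoder tokens. The 224-pixel input and 16-pixel patch size are assumptions (typical ViT-Base values), not figures stated on this page, and `vit_sequence_length` is a hypothetical helper name.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Count encoder tokens for a square image: one per patch plus one [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


# Assumed ViT-Base settings: 224x224 input, 16x16 patches.
print(vit_sequence_length(224, 16))  # 197
```

Under these assumed settings the decoder attends to 197 encoder states per image, regardless of how long the generated caption is.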

Model Features

Vision-Language Joint Modeling

Combines a Vision Transformer (ViT) encoder with a GPT2 text decoder to convert images to text.

End-to-End Training

The entire model is trained end-to-end, optimizing the joint task of image understanding and text generation.

Multi-Scenario Applicability

Capable of processing images from various scenarios, including natural scenes and human activities.

Model Capabilities

Image Understanding

Natural Language Generation

Image to Text

Automatic Image Tagging

Use Cases

Content Generation

Automatic Social Media Image Tagging

Automatically generates descriptive text for images uploaded to social media.

Produces natural language descriptions that match the image content.

Accessibility Technology Support

Provides audio descriptions of image content for visually impaired individuals.

Converts visual information into audible text descriptions.
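One possible way to surface a caption to assistive technology is to use it as the alt text of an image, which screen readers can speak aloud. The sketch below is only an illustration under that assumption: `img_tag_with_alt` is a hypothetical helper, and the caption is assumed to come from the model as a plain string.

```python
from html import escape


def img_tag_with_alt(src: str, caption: str) -> str:
    """Build an <img> tag whose alt text is the generated caption."""
    alt = caption.strip()
    if alt:
        # Capitalize the first letter so the alt text reads as a sentence.
        alt = alt[0].upper() + alt[1:]
    # Escape both attributes so quotes in the caption cannot break the markup.
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'


print(img_tag_with_alt("cat.jpg", " a cat sitting on a couch "))
# <img src="cat.jpg" alt="A cat sitting on a couch">
```

Generated captions can be wrong, so for accessibility use it may be worth letting a person review or override the text.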

Digital Asset Management

Automatic Image Library Tagging

Automatically generates search tags and descriptions for large image libraries.

Improves image retrieval efficiency and accuracy.
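As a rough illustration of the tagging use case, a caption can be reduced to search tags by dropping common function words. This is a minimal sketch, not the model's own behavior: `caption_to_tags` and the short `STOP_WORDS` list are hypothetical, and a production system would likely use a fuller stop-word list or a proper keyword extractor.

```python
import re

# Small illustrative stop-word list (an assumption, not part of the model).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "with", "to", "and", "is", "are"}


def caption_to_tags(caption: str) -> list[str]:
    """Lowercase the caption, drop stop words, and keep each remaining word once, in order."""
    words = re.findall(r"[a-z0-9]+", caption.lower())
    tags = []
    for word in words:
        if word not in STOP_WORDS and word not in tags:
            tags.append(word)
    return tags


print(caption_to_tags("a soccer game with a player jumping to catch the ball"))
# ['soccer', 'game', 'player', 'jumping', 'catch', 'ball']
```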

🚀 nlpconnect/vit-gpt2-image-captioning

This is an image captioning model that generates text descriptions for images. It is built on the Transformer architecture and provides high-quality image-to-text conversion.

🚀 Quick Start

This image captioning model was trained by @ydshieh in Flax. This is the PyTorch version of that Flax model.

📚 Documentation

The Illustrated Image Captioning using transformers

https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

💻 Usage Examples

Basic Usage

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import torch
from PIL import Image

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

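# Generation settings: cap the generated sequence at 16 tokens and decode with 4-beam search.
max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}


def predict_step(image_paths):
    """Return one caption per image path."""
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        # Convert non-RGB inputs (grayscale, RGBA, ...) to 3-channel RGB.
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")
        images.append(i_image)

    # Preprocess the images into a single batch of pixel tensors.
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # Generate caption token IDs for the whole batch.
    output_ids = model.generate(pixel_values, **gen_kwargs)

    # Decode token IDs to text and trim surrounding whitespace.
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]
    return preds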

predict_step(['doctor.e16ba4e4.jpg']) # ['a woman in a hospital bed with a woman in a hospital bed']
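Because predict_step loads every image and sends them to the model as one batch, a long list of paths can exhaust memory. A minimal sketch of one way around that is to feed it fixed-size chunks. The helper names `batched` and `caption_in_batches` and the default `batch_size=8` are assumptions for illustration; `predict_fn` stands in for any function with the same signature as predict_step.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def caption_in_batches(
    image_paths: Sequence[str],
    predict_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Call predict_fn on each chunk and return all captions in input order."""
    captions: List[str] = []
    for batch in batched(image_paths, batch_size):
        captions.extend(predict_fn(batch))
    return captions


# Usage with the function defined above (not executed here):
# captions = caption_in_batches(list_of_paths, predict_step, batch_size=8)
```

The right batch size depends on the available memory and image count, so the value above is only a starting point.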

Advanced Usage

from transformers import pipeline

image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png")

# [{'generated_text': 'a soccer game with a player jumping to catch the ball '}]
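The pipeline result shown above is a list of dictionaries, and the caption carries a trailing space. A small helper can flatten that into clean strings. This is a sketch that assumes the single-image output shape shown above; `extract_captions` is a hypothetical name, and other input shapes (for example, a list of images) may return a differently nested structure.

```python
def extract_captions(pipeline_output):
    """Pull the caption text out of each result dict and trim whitespace."""
    return [item["generated_text"].strip() for item in pipeline_output]


# Sample copied from the output shown above.
sample = [{"generated_text": "a soccer game with a player jumping to catch the ball "}]
print(extract_captions(sample))
# ['a soccer game with a player jumping to catch the ball']
```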

📄 License

This project is licensed under the Apache 2.0 license.

📞 Contact

For any help, you can reach out via the following channels:

https://huggingface.co/ankur310794
https://twitter.com/ankur310794
http://github.com/ankur3107
https://www.linkedin.com/in/ankur310794
