vit-gpt2-image-captioning Open-source Image Captioning Model - Generate Natural Language Descriptions for Images for Free

Vit Gpt2 Image Captioning

Developed by nlpconnect

This is an image captioning model based on ViT and GPT2 architectures, capable of generating natural language descriptions for input images.

Image-to-Text

Transformers

Open Source License: Apache-2.0

#Image to Text #Visual Encoding Decoding #Multi-scene Description

Downloads 939.88k

Release Time: 3/2/2022

Model Overview

The model pairs a visual encoder (ViT) with a text decoder (GPT2): the encoder turns the image into a sequence of feature vectors, and the decoder generates a natural language description conditioned on them. It is suitable for automatic image annotation, assisting visually impaired individuals, and similar scenarios.
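To make the encoder-decoder wiring concrete, the following is a minimal sketch that builds a tiny, randomly initialized ViT + GPT2 model with the same `VisionEncoderDecoderModel` class used in the usage examples. All sizes here (32x32 images, 2 layers, a 100-token vocabulary) are toy assumptions chosen so that nothing is downloaded; they are not the configuration of the published checkpoint, and a random model will not produce meaningful captions.

```python
import torch
from transformers import (
    GPT2Config,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Toy encoder: 32x32 images cut into 16x16 patches (assumed sizes, not the real checkpoint's).
encoder_config = ViTConfig(
    image_size=32,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

# Toy decoder: a very small GPT2 with a 100-token vocabulary (assumed sizes).
decoder_config = GPT2Config(vocab_size=100, n_positions=32, n_embd=32, n_layer=2, n_head=2)

# Combining the configs marks the decoder as a decoder with cross-attention enabled,
# so it can attend to the encoder's image features.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = VisionEncoderDecoderModel(config=config)
model.eval()

pixel_values = torch.randn(1, 3, 32, 32)             # one random "image"
decoder_input_ids = torch.tensor([[1, 2, 3, 4, 5]])  # five arbitrary token ids

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# One score per vocabulary entry for each decoder position.
logits_shape = tuple(outputs.logits.shape)
# 4 image patches plus one [CLS] token give 5 encoder positions.
encoder_shape = tuple(outputs.encoder_last_hidden_state.shape)
print(logits_shape, encoder_shape)
```

During captioning, `generate` repeats this forward pass, feeding the tokens produced so far back into the decoder until the caption ends or the length limit is reached.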

Model Features

Vision-Language Joint Model

Combines a visual Transformer encoder and GPT2 text decoder to achieve image-to-text conversion.

Multi-scene Applicability

Capable of generating descriptions for various common scene images.

Pre-trained Model

Pre-trained on large-scale datasets and ready for direct inference.

Model Capabilities

Image Content Understanding

Natural Language Generation

Automatic Image Annotation

Use Cases

Assistive Technology

Visual Impairment Assistance

Describing image content for visually impaired individuals

Generates short descriptions that help users understand what an image shows.

Content Management

Automatic Image Tagging

Automatically generating descriptive labels for large volumes of images

Improves image retrieval and management efficiency.
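As a rough illustration of the tagging use case, the sketch below turns captions into keyword tags and builds a small lookup index from tag to file. The stopword list and the helper names `caption_to_tags` and `build_index` are hypothetical and not part of the model or of any library; the two captions are the example outputs shown in the usage section of this page.

```python
from collections import defaultdict

# Small illustrative stopword list (an assumption, not part of the model).
STOPWORDS = {"a", "an", "the", "in", "on", "with", "to", "of", "and", "is"}


def caption_to_tags(caption):
    """Lowercase the caption, drop stopwords, and return unique words in order."""
    tags = []
    for word in caption.lower().split():
        word = word.strip(".,!?")
        if word and word not in STOPWORDS and word not in tags:
            tags.append(word)
    return tags


def build_index(captions_by_file):
    """Map each tag to the set of files whose caption contains it."""
    index = defaultdict(set)
    for filename, caption in captions_by_file.items():
        for tag in caption_to_tags(caption):
            index[tag].add(filename)
    return index


# Captions copied from the example outputs on this page.
captions = {
    "doctor.e16ba4e4.jpg": "a woman in a hospital bed with a woman in a hospital bed",
    "image-captioning-example.png": "a soccer game with a player jumping to catch the ball",
}

index = build_index(captions)
print(caption_to_tags(captions["doctor.e16ba4e4.jpg"]))
print(sorted(index["ball"]))
```

A production system would more likely use a proper tokenizer and a search engine, but the same idea applies: the caption text becomes searchable metadata for the image.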

🚀 nlpconnect/vit-gpt2-image-captioning

This is an image captioning model that can convert images into text descriptions, providing a convenient way to understand image content.

🚀 Quick Start

This image captioning model was trained by @ydshieh in Flax. This checkpoint is the PyTorch version of that model.

📚 Documentation

The Illustrated Image Captioning using transformers

https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

💻 Usage Examples

Basic Usage

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import torch
from PIL import Image

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Generation settings: cap each generated sequence at 16 tokens and use beam search with 4 beams.
max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

def predict_step(image_paths):
  images = []
  for image_path in image_paths:
    i_image = Image.open(image_path)
    if i_image.mode != "RGB":
      i_image = i_image.convert(mode="RGB")

    images.append(i_image)

  pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
  pixel_values = pixel_values.to(device)

  output_ids = model.generate(pixel_values, **gen_kwargs)

  preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
  preds = [pred.strip() for pred in preds]
  return preds

predict_step(['doctor.e16ba4e4.jpg']) # ['a woman in a hospital bed with a woman in a hospital bed']
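The `num_beams` setting above turns on beam search during generation. The toy sketch below shows the general idea in plain Python: keep the `num_beams` highest-scoring partial sequences at each step instead of only the single best one. The probability table `NEXT_TOKEN_PROBS` and the function `beam_search` are hypothetical stand-ins for illustration; this is not the implementation used by `model.generate`, which also conditions on the image and applies further rules such as length penalties.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the previous token.
# A real captioning decoder conditions on the image and on the full prefix.
NEXT_TOKEN_PROBS = {
    "<bos>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.4, "<eos>": 0.1},
    "the": {"dog": 0.9, "<eos>": 0.1},
    "dog": {"<eos>": 1.0},
    "cat": {"<eos>": 1.0},
}


def beam_search(num_beams, max_length):
    """Return the highest-scoring token sequence found with a simple beam search."""
    beams = [(0.0, ["<bos>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Extend every live beam by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the best num_beams candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == "<eos>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[0])
    return [t for t in best[1] if t not in ("<bos>", "<eos>")]


print(beam_search(num_beams=1, max_length=16))  # greedy decoding
print(beam_search(num_beams=4, max_length=16))  # same beam width as gen_kwargs above
```

In this toy table, greedy decoding commits to "a" because it is the most likely first token, while a wider beam keeps "the" alive and finds the sequence with the higher overall probability. Wider beams cost more computation per generated token.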

Advanced Usage

from transformers import pipeline

image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png")

# [{'generated_text': 'a soccer game with a player jumping to catch the ball '}]
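For a single image, the pipeline returns a list of dictionaries, as in the output above, and the generated text can carry trailing whitespace. A small helper along these lines pulls out clean caption strings; `extract_captions` is a hypothetical name, and `sample_output` is copied from the example output above rather than produced by running the model.

```python
def extract_captions(pipeline_output):
    """Return the stripped 'generated_text' of each result dictionary."""
    return [item["generated_text"].strip() for item in pipeline_output]


# Copied from the example output above.
sample_output = [{"generated_text": "a soccer game with a player jumping to catch the ball "}]

clean_captions = extract_captions(sample_output)
print(clean_captions)
```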

📄 License

This project is licensed under the Apache 2.0 license.

🔗 Contact for any help

https://huggingface.co/ankur310794
https://twitter.com/ankur310794
http://github.com/ankur3107
https://www.linkedin.com/in/ankur310794
