An open-source image captioning model built on ViT and GPT-2

Image Caption Using ViT GPT2

Developed by Ayansk11

This is an image description model based on the Vision Transformer (ViT) and GPT-2 architectures. Given an input image, it generates a natural-language description.

Image-to-text

Transformers

License: Apache-2.0 #image-to-text #vision-encoder-decoder #multimodal-generation

Downloads: 15

Released: 10/20/2023

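Model overview

The model pairs a vision encoder with a text decoder to convert images into text. It suits scenarios such as automatic image tagging and assistance for visually impaired users.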
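
To make the encoder side concrete: a ViT encoder cuts the image into fixed-size square patches and produces one embedding per patch, plus one class token, which the text decoder then attends to. The short sketch below only does that arithmetic. The 224-pixel input and 16-pixel patch are the common ViT-base defaults and are an assumption here, since the page does not state this checkpoint's configuration; `vit_sequence_length` is a hypothetical helper name.

```python
def vit_sequence_length(image_size=224, patch_size=16, add_cls_token=True):
    """Number of tokens a ViT encoder produces for one square image.

    image_size and patch_size are assumed defaults, not values read
    from the Ayansk11 checkpoint.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    # Most ViT variants prepend a single learned class token.
    return num_patches + (1 if add_cls_token else 0)


print(vit_sequence_length())  # 14 * 14 patches + 1 class token = 197
```

Under these assumptions the decoder would cross-attend over 197 encoder states per image while it generates the caption.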

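Model features

Joint vision-language modeling

Combines a Vision Transformer with a language model for cross-modal understanding and generation

End-to-end training

The whole model can be trained end to end, optimizing image-to-text quality directly

Handles varied scenes

Can generate descriptions for images from a wide range of scenes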

Model capabilities

Image understanding

Natural-language generation

Cross-modal conversion

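Use cases

Assistive technology

Support for visually impaired users

Describes the surrounding environment for visually impaired users

Aims to produce accurate descriptions of the environment

Content management

Automatic image tagging

Automatically generates descriptive tags for an image library

Improves image search efficiency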
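
The tagging use case can be sketched in a few lines: captions become tags, and the tags feed an inverted index for search. This is an illustration only, not part of the model or its repository. The function names, the stopword list, and the image file names are all hypothetical; the two captions are the sample outputs shown on this page.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately small stopword list for this sketch.
STOPWORDS = {"a", "an", "the", "in", "on", "with", "of", "to", "and", "is", "at"}


def caption_to_tags(caption):
    """Lowercase a caption and keep unique non-stopword words, in order."""
    tags = []
    for word in re.findall(r"[a-z0-9]+", caption.lower()):
        if word not in STOPWORDS and word not in tags:
            tags.append(word)
    return tags


def build_tag_index(captions):
    """Map each tag to the set of image ids whose caption contains it."""
    index = defaultdict(set)
    for image_id, caption in captions.items():
        for tag in caption_to_tags(caption):
            index[tag].add(image_id)
    return index


def search(index, query):
    """Return the image ids whose tags contain every word of the query."""
    tags = caption_to_tags(query)
    if not tags:
        return set()
    result = set(index.get(tags[0], set()))
    for tag in tags[1:]:
        result &= index.get(tag, set())
    return result


# Captions as the model might return them; the file names are made up.
captions = {
    "img1.jpg": "a soccer game with a player jumping to catch the ball",
    "img2.jpg": "a woman in a hospital bed",
}
index = build_tag_index(captions)
print(search(index, "soccer ball"))  # {'img1.jpg'}
```

A real library would likely want stemming and a proper search engine; the point here is only that caption text turns an unlabeled image folder into something keyword-searchable.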

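🚀 Image captioning with Transformers

This project implements image captioning with Transformer models. Given an image, it automatically generates a caption that describes the image's content.

🚀 Quick start

Use the sample code below to try image captioning.

💻 Usage examples

Basic usage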

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import torch
from PIL import Image

# Load the captioning model, its image processor, and its tokenizer from the same checkpoint.
model = VisionEncoderDecoderModel.from_pretrained("Ayansk11/Image_Caption_using_ViT_GPT2")
feature_extractor = ViTImageProcessor.from_pretrained("Ayansk11/Image_Caption_using_ViT_GPT2")
tokenizer = AutoTokenizer.from_pretrained("Ayansk11/Image_Caption_using_ViT_GPT2")

# Run on the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Generation settings: captions of at most 16 tokens, decoded with 4-beam search.
max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}


def predict_step(image_paths):
    """Return one caption per image path."""
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        # The image processor expects 3-channel RGB input.
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")
        images.append(i_image)

    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    output_ids = model.generate(pixel_values, **gen_kwargs)

    # Decode token ids to text, dropping special tokens and stray whitespace.
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]
    return preds


predict_step(['doctor.e16ba4e4.jpg'])  # ['a woman in a hospital bed with a woman in a hospital bed']
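
The `num_beams` setting above selects beam search, which keeps several partial captions alive at each step instead of committing to the single most likely next token. The toy sketch below shows the idea on an invented probability table. It is a simplification, not the `generate` implementation (which also applies length penalties and other stopping rules), and `beam_search`, `toy_log_probs`, and `TOY_PROBS` are hypothetical names.

```python
import math

# Invented next-token probabilities, keyed by the prefix generated so far.
TOY_PROBS = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.55, "cat": 0.45},
    ("the",): {"dog": 0.9, "cat": 0.1},
}


def toy_log_probs(prefix):
    """Stand-in for a decoder step: log-probabilities of the next token."""
    probs = TOY_PROBS.get(prefix, {"<eos>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


def beam_search(step_log_probs, num_beams, max_length):
    """Return the highest-scoring token sequence found with num_beams beams."""
    beams = [((), 0.0)]  # (prefix, summed log-probability)
    for _ in range(max_length):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, score))  # finished beams carry over
                continue
            for token, log_p in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the num_beams best partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
        if all(prefix[-1] == "<eos>" for prefix, _ in beams):
            break
    return beams[0][0]


print(beam_search(toy_log_probs, num_beams=1, max_length=4))  # ('a', 'dog', '<eos>')
print(beam_search(toy_log_probs, num_beams=2, max_length=4))  # ('the', 'dog', '<eos>')
```

With one beam the search is greedy and settles for "a dog" (probability 0.33); with two beams it keeps "the" alive long enough to find "the dog" (probability 0.36). Larger beam counts trade extra compute for a wider search in the same way.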

Advanced usage

from transformers import pipeline

image_to_text = pipeline("image-to-text", model="Ayansk11/Image_Caption_using_ViT_GPT2")

image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png")

# [{'generated_text': 'a soccer game with a player jumping to catch the ball '}]
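
The pipeline result shown above is a list of dictionaries, and the sample caption carries a trailing space. A small post-processing helper, sketched here under the assumption that the output has the single-image shape shown above, pulls out clean caption strings. `extract_captions` is a hypothetical name, not part of the Transformers API.

```python
def extract_captions(pipeline_output):
    """Collect stripped caption strings from an image-to-text result.

    Assumes the single-image output shape shown in the example above:
    a list of dicts that each hold a 'generated_text' string.
    """
    return [item["generated_text"].strip() for item in pipeline_output]


# The sample output from the example above.
sample = [{"generated_text": "a soccer game with a player jumping to catch the ball "}]
print(extract_captions(sample))  # ['a soccer game with a player jumping to catch the ball']
```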