vit-swin-base-224-gpt2: an open-source image captioning model that writes accurate, vivid captions for images

Vit Swin Base 224 Gpt2 Image Captioning

Developed by Abdou

An image captioning model based on the VisionEncoderDecoder architecture. It uses a Swin Transformer as the vision encoder and GPT-2 as the decoder, and was fine-tuned on the COCO2014 dataset.

Image-to-Text

Transformers

English · Open-source license: MIT · #ImageCaptioning #Swin-GPT2Architecture #COCOFineTuning

Downloads: 321

Released: 2/5/2023

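Model overview

This model automatically generates English descriptions of images by combining visual encoding with text generation.

Model features

Hybrid architecture

Combines the visual encoding of a Swin Transformer with the text generation of GPT-2.

Efficient training

Fine-tuned on 60% of the COCO dataset in only 5 hours on an A100 GPU.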

Multi-metric evaluation

Training progress is tracked with several text-generation metrics, including ROUGE and BLEU.

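Model capabilities

Image understanding

English description generation

Natural language generation

Use cases

Assistive technology

Support for visually impaired users

Automatically generates image descriptions for users with visual impairments.

Content management

Automatic image tagging

Automatically generates descriptive tags for image libraries.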

🚀 vit-swin-base-224-gpt2-image-captioning

This model is a VisionEncoderDecoder model fine-tuned on 60% of the COCO2014 dataset. It achieves the following results on the test set:

Loss: 0.7989
Rouge1: 53.1153
Rouge2: 24.2307
RougeL: 51.5002
RougeLsum: 51.4983
Bleu: 17.7765
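
For readers unfamiliar with these metrics, the sketch below shows what a ROUGE-1 F-score measures: the unigram overlap between a generated caption and a reference caption. It is a simplified illustration that assumes lowercase whitespace tokenization and no stemming, so it is not expected to reproduce the figures above. The function name `rouge1_f1` and the two example captions are made up for illustration. The scores listed above appear to be reported on a 0-100 scale, which is why the result is multiplied by 100 when printed.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# hypothetical candidate/reference pair, not taken from the evaluation data
score = rouge1_f1("two cows in a field", "two cows laying in a field")
print(f"ROUGE-1 F1: {100 * score:.2f}")
```

ROUGE-2 applies the same idea to bigrams, while RougeL is based on the longest common subsequence rather than fixed-size n-grams.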

📚 Documentation

Model description

The model was initialized with microsoft/swin-base-patch4-window7-224-in22k as the vision encoder and gpt2 as the decoder.
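
As a rough sketch of what this initialization can look like, the helper below pairs the two checkpoints with `VisionEncoderDecoderModel.from_encoder_decoder_pretrained` from Transformers. The helper name `build_untuned_captioner` and the special-token setup are assumptions made for illustration, not a record of the exact steps used for this model.

```python
ENCODER_ID = "microsoft/swin-base-patch4-window7-224-in22k"
DECODER_ID = "gpt2"


def build_untuned_captioner(encoder_id: str = ENCODER_ID, decoder_id: str = DECODER_ID):
    """Pair a pretrained vision encoder with a pretrained text decoder."""
    # imported lazily so that defining this helper does not load the heavy libraries
    from transformers import GPT2TokenizerFast, VisionEncoderDecoderModel

    # the cross-attention layers linking decoder to encoder are newly initialized
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_id, decoder_id)
    tokenizer = GPT2TokenizerFast.from_pretrained(decoder_id)

    # assumption: GPT-2 has no pad token, so the EOS token is reused for padding
    tokenizer.pad_token = tokenizer.eos_token
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    return model, tokenizer
```

Calling `build_untuned_captioner()` downloads both checkpoints. The resulting model would still need fine-tuning on image-caption pairs before it produces useful captions.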

Intended uses and limitations

This model is intended for image captioning only.

💻 Usage examples

Basic usage

The simplest option is the pipeline API.

from transformers import pipeline

image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")
# infer the caption
caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]['generated_text']
print(f"caption: {caption}")

Advanced usage

For more flexibility, you can load the model, tokenizer, and image processor yourself.

from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
import torch
import os
import urllib.parse as parse
from PIL import Image
import requests

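# a function to determine whether a string is a URL or not
def is_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except ValueError:
        return False

# a function to load an image from a URL or a local file path
def load_image(image_path):
    if is_url(image_path):
        # convert to RGB so grayscale or RGBA images match the expected input
        return Image.open(requests.get(image_path, stream=True).raw).convert("RGB")
    elif os.path.exists(image_path):
        return Image.open(image_path).convert("RGB")
    # fail loudly instead of silently returning None
    raise FileNotFoundError(f"Could not find an image at: {image_path}")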

# a function to perform inference
def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # preprocess the image
    img = image_processor(image, return_tensors="pt").to(device)
    # generate the caption (using greedy decoding by default)
    output = model.generate(**img)
    # decode the output
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption

device = "cuda" if torch.cuda.is_available() else "cpu"
# load the fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")

# target image
url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg"
# get the caption
caption = get_caption(model, image_processor, tokenizer, url)
print(f"caption: {caption}")

Output:

Two cows laying in a field with a sky background.
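
The comment in `get_caption` notes that `generate` decodes greedily by default. As a hedged variation, the sketch below passes `num_beams` and `max_new_tokens`, both standard `generate` arguments, to try beam search instead. The helper name `get_caption_beam` and the chosen values are illustrative assumptions rather than settings from the model card, and the resulting caption may differ from the output shown above.

```python
def get_caption_beam(model, image_processor, tokenizer, image, device="cpu",
                     num_beams=4, max_new_tokens=32):
    """Caption an already-loaded PIL image using beam search instead of greedy decoding."""
    # preprocess the image into model inputs on the target device
    inputs = image_processor(image, return_tensors="pt").to(device)
    # keep the num_beams highest-scoring partial captions at each decoding step
    output_ids = model.generate(**inputs, num_beams=num_beams, max_new_tokens=max_new_tokens)
    # decode the best sequence back to text
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```

With the objects from the advanced example, it could be called as `get_caption_beam(model, image_processor, tokenizer, load_image(url), device=device)`. Beam search is slower than greedy decoding, so it is worth comparing captions on a few images before adopting it.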

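🔧 Technical details

Training procedure

To learn how this model was fine-tuned, see this guide.

Training hyperparameters

The following hyperparameters were used during training.

Learning rate: 5e-05
Training batch size: 64
Evaluation batch size: 64
Seed: 42
Optimizer: Adam (betas=(0.9,0.999), epsilon=1e-08)
Learning rate scheduler type: linear
Number of epochs: 2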
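
To make the linear scheduler concrete, the sketch below computes the learning rate at a given step. It assumes no warmup (none is listed above) and a decay to zero at the final step. The function name `linear_lr` and the total of 10,000 steps are illustrative assumptions; the actual total step count is not stated here.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-05,
              warmup_steps: int = 0) -> float:
    """Learning rate with optional linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 10_000  # hypothetical value, for illustration only
for step in (0, 2_500, 5_000, 10_000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate starts at the listed 5e-05, reaches half that value midway through training, and is zero at the last step.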

Training results

Training Loss	Epoch	Step	Validation Loss	Rouge1	Rouge2	RougeL	RougeLsum	Bleu	Gen Len
1.0018	0.38	2000	0.8860	38.6537	13.8145	35.3932	35.3935	8.2448	11.2946
0.8827	0.75	4000	0.8395	40.0458	14.8829	36.5321	36.5366	9.1169	11.2946
0.8378	1.13	6000	0.8140	41.2736	15.9576	37.5504	37.5512	9.871	11.2946
0.7913	1.51	8000	0.8012	41.6642	16.1987	37.8786	37.8891	10.0786	11.2946
0.7794	1.89	10000	0.7933	41.9119	16.3738	38.1062	38.1292	10.288	11.2946
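
As a quick way to read the table, the sketch below takes the step, validation loss, and Rouge1 columns from the five checkpoints and reports how they changed between the first and last evaluation. The values are copied from the table above; the class and variable names are illustrative.

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    step: int
    val_loss: float
    rouge1: float


# step, validation loss, and Rouge1 columns copied from the table above
checkpoints = [
    Checkpoint(2000, 0.8860, 38.6537),
    Checkpoint(4000, 0.8395, 40.0458),
    Checkpoint(6000, 0.8140, 41.2736),
    Checkpoint(8000, 0.8012, 41.6642),
    Checkpoint(10000, 0.7933, 41.9119),
]

first, last = checkpoints[0], checkpoints[-1]
loss_drop = first.val_loss - last.val_loss
rouge1_gain = last.rouge1 - first.rouge1
# True if validation loss fell at every consecutive checkpoint
loss_always_fell = all(a.val_loss > b.val_loss for a, b in zip(checkpoints, checkpoints[1:]))

print(f"validation loss drop: {loss_drop:.4f}")
print(f"Rouge1 gain: {rouge1_gain:.4f}")
print(f"loss fell at every checkpoint: {loss_always_fell}")
```

The improvement between consecutive checkpoints shrinks toward the end of the run, which is visible in the last two rows of the table.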