🚀 ViT-Swin Base 224 GPT-2 Image Captioning Model
This model is based on the VisionEncoderDecoder architecture and was fine-tuned on 60% of the COCO2014 dataset. It achieves the following results on the test set (see the metric-computation sketch after this list):
- Loss: 0.7989
- Rouge1: 53.1153
- Rouge2: 24.2307
- Rougel: 51.5002
- Rougelsum: 51.4983
- Bleu: 17.7765
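The card does not include the evaluation script; the snippet below is a minimal sketch of how ROUGE and BLEU scores like these can be computed with the Hugging Face `evaluate` library. It is an assumption, and the predictions/references are made up for illustration only.
```python
import evaluate

# Illustrative predictions/references only; in practice these would come from
# running model.generate over the COCO2014 test split.
predictions = ["two cows laying in a field with a sky background"]
references = [["two cows lying down in a grassy field under a blue sky"]]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
```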
🚀 Quick Start
This model can be used for image captioning tasks.
💻 Usage Examples
Basic usage
You can use the simple pipeline API:
```python
from transformers import pipeline

# Load the image-to-text pipeline with this checkpoint
image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")

# Caption an image from a URL (a local path works as well)
caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]['generated_text']
print(f"caption: {caption}")
```
Advanced usage
Or, for more flexibility, you can initialize all the components manually:
```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
import torch
import os
import urllib.parse as parse
from PIL import Image
import requests

def is_url(string):
    """Return True if the string looks like a full URL."""
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except ValueError:
        return False

def load_image(image_path):
    """Load an image from a URL or a local file path."""
    if is_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)
    raise ValueError(f"Cannot load image from: {image_path}")

def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # Preprocess the image and move the pixel values to the model's device
    img = image_processor(image, return_tensors="pt").to(device)
    # Generate caption token ids and decode them to text
    output = model.generate(**img)
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned encoder-decoder model, tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")

url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg"
caption = get_caption(model, image_processor, tokenizer, url)
print(f"caption: {caption}")
```
Example output:
Two cows laying in a field with a sky background.
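Building on the objects created in the advanced example above, decoding can be tuned by passing standard generation arguments through to `model.generate`; the values below are illustrative and are not the settings used for the reported scores.
```python
# Beam search with a caption length cap (illustrative values, continuing the example above)
img = image_processor(load_image(url), return_tensors="pt").to(device)
output = model.generate(**img, num_beams=4, max_length=32)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```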
📚 Documentation
You can check out this guide to learn how this model was fine-tuned.
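For reference, a VisionEncoderDecoder model of this kind is typically assembled from a Swin encoder and a GPT-2 decoder before fine-tuning. The sketch below is an assumption (the exact encoder checkpoint used for this model is not stated here), not the author's training script:
```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor

# Assumed checkpoints: a Swin base 224 encoder and the standard GPT-2 decoder
encoder_checkpoint = "microsoft/swin-base-patch4-window7-224-in22k"
decoder_checkpoint = "gpt2"

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_checkpoint, decoder_checkpoint
)
tokenizer = GPT2TokenizerFast.from_pretrained(decoder_checkpoint)
image_processor = ViTImageProcessor.from_pretrained(encoder_checkpoint)

# GPT-2 has no pad token, so reuse EOS for padding and set the special
# token ids that the encoder-decoder generation loop expects
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```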
Training hyperparameters
The following hyperparameters were used during training (a hedged configuration sketch follows the list):
- learning rate: 5e-05
- train batch size: 64
- eval batch size: 64
- seed: 42
- optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- learning rate scheduler type: linear
- number of epochs: 2
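These values map onto a `Seq2SeqTrainingArguments` configuration roughly as follows; this is a hedged reconstruction rather than the exact training setup, and `output_dir` plus the evaluation cadence are assumptions.
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="vit-swin-base-224-gpt2-image-captioning",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    num_train_epochs=2,
    lr_scheduler_type="linear",   # Adam defaults (0.9, 0.999, 1e-08) are left as-is
    predict_with_generate=True,   # needed to compute ROUGE/BLEU during evaluation
    evaluation_strategy="steps",  # assumption: the results table logs every 2000 steps
    eval_steps=2000,
)
```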
Training results

| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Bleu | Gen Len |
|:-------------:|:-----:|:-----:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|:-------:|
| 1.0018 | 0.38 | 2000 | 0.8860 | 38.6537 | 13.8145 | 35.3932 | 35.3935 | 8.2448 | 11.2946 |
| 0.8827 | 0.75 | 4000 | 0.8395 | 40.0458 | 14.8829 | 36.5321 | 36.5366 | 9.1169 | 11.2946 |
| 0.8378 | 1.13 | 6000 | 0.8140 | 41.2736 | 15.9576 | 37.5504 | 37.5512 | 9.871 | 11.2946 |
| 0.7913 | 1.51 | 8000 | 0.8012 | 41.6642 | 16.1987 | 37.8786 | 37.8891 | 10.0786 | 11.2946 |
| 0.7794 | 1.89 | 10000 | 0.7933 | 41.9119 | 16.3738 | 38.1062 | 38.1292 | 10.288 | 11.2946 |
Total training time: about 5 hours on an NVIDIA A100 GPU.
Framework versions
- Transformers 4.26.0
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2
📄 License
This model is released under the MIT license.