🚀 ViT-Swin Base 224 GPT-2 Image Captioning Model
This model is based on the VisionEncoderDecoder architecture and was fine-tuned on 60% of the COCO2014 dataset. It achieves the following results on the test set (see the metric-computation sketch after this list):
- Loss: 0.7989
- Rouge1: 53.1153
- Rouge2: 24.2307
- Rougel: 51.5002
- Rougelsum: 51.4983
- Bleu: 17.7765
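The card does not include the evaluation script; the snippet below is a minimal sketch of how ROUGE and BLEU scores like these can be computed with the Hugging Face `evaluate` library. It is an assumption, and the predictions/references are made up for illustration only.
```python
import evaluate

# Illustrative predictions/references only; in practice these would come from
# running model.generate over the COCO2014 test split.
predictions = ["two cows laying in a field with a sky background"]
references = [["two cows lying down in a grassy field under a blue sky"]]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
```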
🚀 Quick Start
This model can be used for image captioning tasks.
💻 Usage Examples
Basic usage
You can use the simple pipeline API:
```python
from transformers import pipeline

# Load the image-to-text pipeline with this checkpoint
image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")

# Caption an image from a URL (a local path works as well)
caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]['generated_text']
print(f"caption: {caption}")
```
Advanced usage
Or, for more flexibility, you can initialize all the components manually:
```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
import torch
import os
import urllib.parse as parse
from PIL import Image
import requests

def is_url(string):
    """Return True if the string looks like a full URL."""
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except ValueError:
        return False

def load_image(image_path):
    """Load an image from a URL or a local file path."""
    if is_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)
    raise ValueError(f"Cannot load image from: {image_path}")

def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # Preprocess the image and move the pixel values to the model's device
    img = image_processor(image, return_tensors="pt").to(device)
    # Generate caption token ids and decode them to text
    output = model.generate(**img)
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned encoder-decoder model, tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")

url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg"
caption = get_caption(model, image_processor, tokenizer, url)
print(f"caption: {caption}")
```
Example output:
Two cows laying in a field with a sky background.
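Building on the objects created in the advanced example above, decoding can be tuned by passing standard generation arguments through to `model.generate`; the values below are illustrative and are not the settings used for the reported scores.
```python
# Beam search with a caption length cap (illustrative values, continuing the example above)
img = image_processor(load_image(url), return_tensors="pt").to(device)
output = model.generate(**img, num_beams=4, max_length=32)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```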
📚 Documentation
You can check out this guide to learn how this model was fine-tuned.
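For reference, a VisionEncoderDecoder model of this kind is typically assembled from a Swin encoder and a GPT-2 decoder before fine-tuning. The sketch below is an assumption (the exact encoder checkpoint used for this model is not stated here), not the author's training script:
```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor

# Assumed checkpoints: a Swin base 224 encoder and the standard GPT-2 decoder
encoder_checkpoint = "microsoft/swin-base-patch4-window7-224-in22k"
decoder_checkpoint = "gpt2"

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_checkpoint, decoder_checkpoint
)
tokenizer = GPT2TokenizerFast.from_pretrained(decoder_checkpoint)
image_processor = ViTImageProcessor.from_pretrained(encoder_checkpoint)

# GPT-2 has no pad token, so reuse EOS for padding and set the special
# token ids that the encoder-decoder generation loop expects
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```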
Training hyperparameters
The following hyperparameters were used during training (a hedged configuration sketch follows the list):
- learning rate: 5e-05
- train batch size: 64
- eval batch size: 64
- seed: 42
- optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- learning rate scheduler type: linear
- number of epochs: 2
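These values map onto a `Seq2SeqTrainingArguments` configuration roughly as follows; this is a hedged reconstruction rather than the exact training setup, and `output_dir` plus the evaluation cadence are assumptions.
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="vit-swin-base-224-gpt2-image-captioning",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    num_train_epochs=2,
    lr_scheduler_type="linear",   # Adam defaults (0.9, 0.999, 1e-08) are left as-is
    predict_with_generate=True,   # needed to compute ROUGE/BLEU during evaluation
    evaluation_strategy="steps",  # assumption: the results table logs every 2000 steps
    eval_steps=2000,
)
```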
Training results

| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Bleu | Gen Len |
|:-------------:|:-----:|:-----:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|:-------:|
| 1.0018 | 0.38 | 2000 | 0.8860 | 38.6537 | 13.8145 | 35.3932 | 35.3935 | 8.2448 | 11.2946 |
| 0.8827 | 0.75 | 4000 | 0.8395 | 40.0458 | 14.8829 | 36.5321 | 36.5366 | 9.1169 | 11.2946 |
| 0.8378 | 1.13 | 6000 | 0.8140 | 41.2736 | 15.9576 | 37.5504 | 37.5512 | 9.871 | 11.2946 |
| 0.7913 | 1.51 | 8000 | 0.8012 | 41.6642 | 16.1987 | 37.8786 | 37.8891 | 10.0786 | 11.2946 |
| 0.7794 | 1.89 | 10000 | 0.7933 | 41.9119 | 16.3738 | 38.1062 | 38.1292 | 10.288 | 11.2946 |
Total training time: about 5 hours on an NVIDIA A100 GPU.
Framework versions
- Transformers 4.26.0
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2
📄 License
This model is released under the MIT license.