vit_large_patch16_224.mae: Open-Source Image Feature Extraction Model

Vit Large Patch16 224.mae

Developed by timm

A large Vision Transformer (ViT) image feature extraction model, pretrained on the ImageNet-1k dataset with the self-supervised masked autoencoder (MAE) method.

Image classification

Transformers

#self-supervised visual features #high-parameter ViT #image semantic encoding

Downloads: 960

Release date: 5/9/2023

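Model overview

This model is a large image feature extraction model built on the Vision Transformer architecture, used mainly for image classification and feature extraction. It was pretrained on the ImageNet-1k dataset with the self-supervised masked autoencoder (MAE) method.

Model features

Self-supervised pretraining

Pretrained with the masked autoencoder (MAE) method, so it learns effective feature representations without requiring large amounts of annotated data.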
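As a rough illustration of the masking step behind MAE pretraining, the sketch below randomly hides most of the patches of a 224 x 224 image split into 16 x 16 patches. The 75% mask ratio is the default reported in the MAE paper, not something stated in this card, and `random_patch_mask` is a hypothetical helper that is not part of timm.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) lists by uniform random sampling.

    mask_ratio=0.75 is an assumption taken from the MAE paper's default setting.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_visible = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_visible])
    masked = sorted(indices[num_visible:])
    return visible, masked


# A 224 x 224 image with 16 x 16 patches yields a 14 x 14 grid of patches.
num_patches = (224 // 16) ** 2
visible, masked = random_patch_mask(num_patches)
print(len(visible), len(masked))  # 49 147
```

In MAE, the encoder processes only the visible patches and a lightweight decoder reconstructs the pixels of the masked ones, which is why no labels are needed.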

Large-scale Vision Transformer

Based on the ViT-Large architecture with 303.3M parameters, giving it the capacity to capture rich visual features.
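The parameter figure can be checked with simple arithmetic. The sketch below counts the learnable weights of a plain ViT encoder, assuming the ViT-Large hyperparameters from the ViT paper (hidden size 1024, 24 layers, MLP size 4096), bias terms on every linear layer, and no classifier head. None of these values are stated in this card, and `vit_param_count` is an illustrative helper rather than a timm function.

```python
def vit_param_count(img_size=224, patch=16, in_chans=3, dim=1024, depth=24, mlp_dim=4096):
    """Count learnable parameters of a plain ViT encoder without a classifier head."""
    num_patches = (img_size // patch) ** 2
    patch_embed = in_chans * patch * patch * dim + dim          # conv weights + bias
    cls_token = dim
    pos_embed = (num_patches + 1) * dim                         # patches + class token
    layer_norm = 2 * dim                                        # scale + shift
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)   # qkv + output projection
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)     # two linear layers
    block = 2 * layer_norm + attention + mlp
    # one final LayerNorm after the last block
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm


total = vit_param_count()
print(f"{total:,} parameters = {total / 1e6:.1f}M")  # 303,301,632 parameters = 303.3M
```

Under these assumptions the count lands on the 303.3M figure quoted above. A 1000-class linear head would add roughly 1.0M more parameters, so the quoted figure is consistent with a backbone that carries no classifier.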

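Efficient feature extraction

Supports extracting both global image features and local patch features, and can be applied to a variety of downstream vision tasks.

Model capabilities

Image classification

Image feature extraction

Visual representation learning

Use cases

Computer vision

Image classification

Can serve as the backbone for image classification, for example when fine-tuned on the 1000-class ImageNet classification task.

Feature extraction

Can be used as a feature extractor for downstream vision tasks such as object detection and image segmentation.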

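🚀 Model card for vit_large_patch16_224.mae

A Vision Transformer (ViT) image feature extraction model, pretrained on the ImageNet-1k dataset with the self-supervised masked autoencoder (MAE) method.

🚀 Quick start

This model can be used for image classification and image embedding tasks. Usage examples follow.

✨ Key features

Applicable to image classification tasks
Generates image embeddings

📦 Installation

This model uses the timm library and becomes available once timm is installed.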

pip install timm

💻 Usage examples

Basic usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # required for torch.topk and torch.no_grad below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_large_patch16_224.mae', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

with torch.no_grad():  # inference only, so gradients are not needed
    output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

# top-5 scores (in percent) and their indices; these are class predictions only
# if the loaded checkpoint includes a trained classifier head
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
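To make the final line of the example concrete, the following pure-Python sketch mirrors what `output.softmax(dim=1) * 100` followed by `torch.topk(..., k=5)` computes for a single row of scores. The helper name `top_k_percent` and the toy scores are illustrative only and do not come from the model.

```python
import math


def top_k_percent(scores, k=5):
    """Return the k largest softmax probabilities (in percent) with their indices."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    percents = [100.0 * e / total for e in exps]
    order = sorted(range(len(scores)), key=lambda i: percents[i], reverse=True)[:k]
    return [percents[i] for i in order], order


toy_scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.0]    # made-up scores for six classes
top_percents, top_indices = top_k_percent(toy_scores, k=3)
print(top_indices)  # [3, 1, 5]
```

The percentages over all classes sum to 100, and the returned indices identify which entries of the score vector ranked highest.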

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_large_patch16_224.mae',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 197, 1024) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
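The `(1, 197, 1024)` shape follows from the patch grid: 14 x 14 patches plus one class token. The sketch below reproduces that arithmetic and maps a token index back to its patch position. It assumes the class token sits at index 0 and the patches follow in row-major order, which matches the usual ViT layout but is not spelled out in this card; `token_layout` and `patch_position` are hypothetical helper names.

```python
def token_layout(img_size=224, patch=16, num_prefix_tokens=1):
    """Return (grid_side, num_patches, sequence_length) for a square ViT input."""
    grid = img_size // patch
    num_patches = grid * grid
    return grid, num_patches, num_patches + num_prefix_tokens


def patch_position(token_index, grid, num_prefix_tokens=1):
    """Map a sequence index to the (row, col) of its patch in the image grid."""
    if token_index < num_prefix_tokens:
        raise ValueError("prefix (class) tokens do not correspond to an image patch")
    return divmod(token_index - num_prefix_tokens, grid)


grid, num_patches, seq_len = token_layout()
print(grid, num_patches, seq_len)   # 14 196 197
print(patch_position(1, grid))      # (0, 0): first patch, top-left corner
print(patch_position(196, grid))    # (13, 13): last patch, bottom-right corner
```

Under the same assumption, slicing the unpooled tensor from `forward_features` at index 0 along the token dimension would give the class token, and the remaining 196 entries would be the per-patch features.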

📚 Documentation

Model details

Attribute	Details
Model type	Image classification / feature extraction backbone
Parameters (M)	303.3
GMACs	61.6
Activations (M)	63.5
Image size	224 x 224
Papers	Masked Autoencoders Are Scalable Vision Learners: https://arxiv.org/abs/2111.06377; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Pretraining dataset	ImageNet-1k
Original repository	https://github.com/facebookresearch/mae
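The GMACs entry can be sanity-checked with a back-of-the-envelope count of the matrix multiplications in the encoder. This is only an estimate under stated assumptions: ViT-Large hyperparameters (hidden size 1024, 24 layers, MLP size 4096), 197 tokens, and no cost attributed to LayerNorm, softmax, activations, or bias additions. How the table value was actually measured is not stated in this card, and `vit_gmacs` is an illustrative helper.

```python
def vit_gmacs(img_size=224, patch=16, in_chans=3, dim=1024, depth=24, mlp_dim=4096):
    """Estimate multiply-accumulates (in billions) for one forward pass of a ViT encoder."""
    num_patches = (img_size // patch) ** 2
    tokens = num_patches + 1                          # patches + class token
    patch_embed = num_patches * (in_chans * patch * patch) * dim
    qkv = tokens * dim * 3 * dim                      # query, key, value projections
    attention = 2 * tokens * tokens * dim             # Q @ K^T and attn @ V, summed over heads
    proj = tokens * dim * dim                         # attention output projection
    mlp = 2 * tokens * dim * mlp_dim                  # two linear layers
    block = qkv + attention + proj + mlp
    return (patch_embed + depth * block) / 1e9


gmacs = vit_gmacs()
print(f"{gmacs:.1f} GMACs")  # 61.6 GMACs
```

The estimate agrees with the 61.6 GMACs listed in the table, and it shows that the MLP layers account for most of the compute at this sequence length.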

Model comparison

Explore the dataset and runtime metrics of this model in the timm model results.

📄 License

This model is released under the CC BY-NC 4.0 license.

📚 Citation

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}