vit_small_patch14_reg4_dinov2.lvd142m: Open-Source Image Feature Model

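Developed by timm

A Vision Transformer (ViT) image feature model with registers, pretrained on the LVD-142M dataset using the self-supervised DINOv2 method.

Image Classification

Transformers

Open-source license: Apache-2.0 #Self-supervised visual features #Register-augmented ViT #Large 518 input

Downloads: 15.98k

Release date: 10/30/2023

Model Overview

This model is mainly used for image classification and feature extraction. It uses a Vision Transformer architecture combined with a register mechanism to improve performance.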

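Model Features

Register Mechanism

Adds register tokens to the Vision Transformer to improve performance. According to the cited paper "Vision Transformers Need Registers", these extra input tokens address artifacts that appear in the feature maps of standard ViT models.

Self-Supervised Pretraining

Pretrained on the LVD-142M dataset with the DINOv2 self-supervised learning method, which requires no human annotation.

Efficient Feature Extraction

Although the model is relatively small (22.1M parameters), it extracts image features efficiently and can be applied to a range of downstream tasks.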

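Model Capabilities

Image classification

Image feature extraction

Visual representation learning

Use Cases

Computer Vision

Image Classification

Can be used for general image classification tasks such as recognizing objects and scenes.

Feature Extraction

Extracts image features for use in downstream tasks such as object detection and image retrieval.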
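
As a rough illustration of the image retrieval use case, the sketch below ranks a small gallery by cosine similarity to a query embedding. The vectors are made-up 4-dimensional stand-ins (the real model returns 384-dimensional embeddings), and `cosine_similarity` and `rank_gallery` are hypothetical helper names, not part of timm.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings standing in for model outputs (illustrative values only).
query = [1.0, 0.0, 1.0, 0.0]
gallery = {
    "img_a": [0.9, 0.1, 1.1, 0.0],  # nearly parallel to the query
    "img_b": [0.0, 1.0, 0.0, 1.0],  # orthogonal to the query
    "img_c": [1.0, 1.0, 1.0, 1.0],  # partially aligned with the query
}

ranking = rank_gallery(query, gallery)
print(ranking)
```

In practice the query and gallery vectors would be the pooled outputs of the model created with `num_classes=0`, as in the image embedding example further down.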

🚀 vit_small_patch14_reg4_dinov2.lvd142m

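A Vision Transformer (ViT) image feature model with registers, pretrained on the LVD-142M dataset with the self-supervised DINOv2 method.

🚀 Quick Start

This model can be used for image classification and image embedding tasks. The sections below describe how to use it.

✨ Key Features

Vision Transformer architecture with registers.
Pretrained on the LVD-142M dataset with the self-supervised DINOv2 method.
Applicable to image classification and image embedding tasks.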

📚 Documentation

Model Details

Attribute	Details
Model type	Image classification / feature backbone
Params (M)	22.1
GMACs	29.6
Activations (M)	57.5
Image size	518 x 518
Related papers	Vision Transformers Need Registers; DINOv2: Learning Robust Visual Features without Supervision; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Original repository	https://github.com/facebookresearch/dinov2
Pretraining dataset	LVD-142M
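
The 22.1M parameter figure can be roughly reproduced by hand. The sketch below assumes a standard ViT-Small layout (12 blocks, MLP ratio 4), one LayerScale vector per residual branch, and a position embedding that covers only the patch grid. Only the embedding width (384), the image and patch sizes, and the register count come from this card; the rest are assumptions about the architecture, and `vit_param_count` is a hypothetical helper.

```python
def vit_param_count(dim=384, depth=12, mlp_ratio=4, patch=14,
                    img=518, in_chans=3, num_reg=4):
    """Hand count of learnable parameters for an assumed ViT-Small layout."""
    hidden = dim * mlp_ratio
    attn = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)  # qkv + output projection
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)   # two linear layers
    norms = 2 * (2 * dim)                                  # two LayerNorms (weight + bias)
    layer_scale = 2 * dim                                  # assumed: one scale vector per branch
    block = attn + mlp + norms + layer_scale

    patch_embed = in_chans * patch * patch * dim + dim     # patch projection + bias
    pos_embed = (img // patch) ** 2 * dim                  # assumed: patch positions only
    prefix = (1 + num_reg) * dim                           # class token + register tokens
    final_norm = 2 * dim                                   # final LayerNorm (weight + bias)

    return depth * block + patch_embed + pos_embed + prefix + final_norm


total = vit_param_count()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the count comes to 22,057,344, which rounds to the 22.1M listed in the table. Nearly all of it sits in the 12 transformer blocks; the register tokens themselves add only 4 x 384 = 1,536 parameters.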

💻 Usage Examples

Basic Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_small_patch14_reg4_dinov2.lvd142m', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_small_patch14_reg4_dinov2.lvd142m',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1374, 384) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
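
The 1374 tokens in the unpooled output follow from the input and patch sizes: a 518 x 518 image cut into 14 x 14 patches gives a 37 x 37 grid, plus one class token and four register tokens. The sketch below works through that arithmetic and shows how the sequence could be split. It assumes the prefix tokens come first, in the order class, registers, patches; `token_layout` is a hypothetical helper, not a timm function.

```python
def token_layout(img_size=518, patch_size=14, num_reg=4, num_cls=1):
    """Compute token counts and index ranges for the unpooled sequence."""
    grid = img_size // patch_size   # patches per side
    num_patches = grid * grid
    num_prefix = num_cls + num_reg  # tokens that are not image patches
    return {
        "grid": grid,
        "num_patches": num_patches,
        "num_tokens": num_prefix + num_patches,
        # Assumed ordering: [class | registers | patches]
        "cls": slice(0, num_cls),
        "registers": slice(num_cls, num_prefix),
        "patches": slice(num_prefix, num_prefix + num_patches),
    }


layout = token_layout()
print(layout["grid"], layout["num_patches"], layout["num_tokens"])  # 37 1369 1374
```

With a tensor `output` of shape (1, 1374, 384) from `forward_features`, `output[:, layout["patches"]]` would then select the 1369 patch tokens, which can be reshaped into a 37 x 37 feature map for dense tasks.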

🔧 Technical Details

For dataset and runtime metrics of this model, see the timm model results.

📄 License

This model is released under the Apache-2.0 license.

Citation

@article{darcet2023vision,
  title={Vision Transformers Need Registers},
  author={Darcet, Timoth{\'e}e and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
  journal={arXiv preprint arXiv:2309.16588},
  year={2023}
}

@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}