vit_large_patch14_reg4_dinov2.lvd142m: an open-source image feature model

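Developed by timm

A Vision Transformer (ViT) image feature model with registers, pretrained on the LVD-142M dataset with the self-supervised DINOv2 method.

Image classification

Transformers

Open-source license: Apache-2.0 #self-supervised visual features #register-enhanced ViT #large-scale image processing

Downloads: 119.48k

Released: 10/30/2023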

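Model overview

This is an image feature extraction model built on the Vision Transformer (ViT) architecture, used mainly for image classification and feature extraction. It is pretrained on a large dataset with self-supervised learning and produces high-quality image features.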

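Model features

Register-enhanced

The model uses a register mechanism, which improves Vision Transformer performance, particularly when handling image background and irrelevant information.

Self-supervised pretraining

The model is pretrained on the LVD-142M dataset with the DINOv2 self-supervised method, so it learns strong visual features without human annotation.

Large input support

The model accepts large 518x518-pixel image inputs, which lets it capture richer visual detail.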
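The input size also determines how many tokens the transformer processes. The following is a minimal sketch of that arithmetic, assuming the patch size of 14 and the 4 register tokens suggested by the model name, plus one class token. `token_layout` is a hypothetical helper written for illustration and is not part of timm.

```python
def token_layout(image_size=518, patch_size=14, num_registers=4):
    """Count the tokens a ViT sees for a square input (assumed layout)."""
    grid = image_size // patch_size      # patches per side
    num_patches = grid * grid            # one token per patch
    num_prefix = 1 + num_registers       # class token + register tokens
    return {
        "grid": grid,
        "num_patches": num_patches,
        "num_prefix": num_prefix,
        "total": num_prefix + num_patches,
    }


layout = token_layout()
print(layout)
```

For the default 518x518 input this gives a 37x37 patch grid and 1374 tokens in total, which agrees with the unpooled (1, 1374, 1024) output shape noted in the usage examples.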

Model capabilities

Image feature extraction

Image classification

Visual representation learning

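Use cases

Computer vision

Image classification

Usable for general image classification tasks such as object recognition and scene classification.

Feature extraction

Usable as a backbone network for other vision tasks, providing high-quality image feature representations.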

🚀 vit_large_patch14_reg4_dinov2.lvd142m

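This model is a Vision Transformer (ViT) based image feature extraction model, pretrained on the LVD-142M dataset with the self-supervised DINOv2 method.

🚀 Quick start

The model can be used for image classification and image embedding tasks. The sections below show how to use it.

✨ Key features

Suited to image classification and image feature extraction.
Pretrained with the self-supervised DINOv2 method, so the extracted features are general-purpose.

📦 Installation

Using this model requires the timm library, which can be installed with the following command.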

pip install timm

💻 Usage examples

Basic usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_large_patch14_reg4_dinov2.lvd142m', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

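# top-5 scores (in percent) and their indices along the output dimension;
# these read as class probabilities only if the model has a classification head
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)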
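The last line of the example turns raw scores into percentages with a softmax and then keeps the five largest. The same two steps can be sketched in plain Python, which may make the computation easier to follow. The logits below are hypothetical values chosen for illustration, and `softmax_percent` and `top_k` are illustrative helpers, not timm or PyTorch functions.

```python
import math


def softmax_percent(logits):
    """Numerically stable softmax, scaled so the entries sum to 100."""
    peak = max(logits)                           # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]


def top_k(values, k):
    """Return (value, index) pairs for the k largest entries, largest first."""
    ranked = sorted(enumerate(values), key=lambda pair: pair[1], reverse=True)
    return [(value, index) for index, value in ranked[:k]]


# Hypothetical raw scores for a 6-way output, for illustration only.
logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.0]
probs = softmax_percent(logits)
top5 = top_k(probs, k=5)
print(top5)
```

In the PyTorch version, `output.softmax(dim=1)` plays the role of `softmax_percent` (before the multiplication by 100) and `torch.topk` plays the role of `top_k`, returning values and indices as two separate tensors.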

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_large_patch14_reg4_dinov2.lvd142m',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1374, 1024) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
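A common next step with pooled embeddings is to compare two images by the cosine similarity of their vectors. The sketch below shows that comparison in plain Python. The short vectors are hypothetical stand-ins; real embeddings from this model would have `num_features` entries each (for example, obtained with `output[0].tolist()`), and `cosine_similarity` is an illustrative helper rather than a timm function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional stand-ins for image embeddings.
emb_a = [0.2, 0.9, -0.4, 0.1]
emb_b = [0.1, 0.8, -0.5, 0.0]   # points in a similar direction to emb_a
emb_c = [-0.7, 0.1, 0.6, 0.3]   # points in a different direction

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(sim_ab, sim_ac)
```

Values closer to 1 indicate more similar directions. Whether a given threshold corresponds to "the same object" depends on the data and has to be checked empirically.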

📚 Documentation

Model details

Attribute	Details
Model type	Image classification / feature extraction backbone
Parameters (M)	304.4
GMACs	416.1
Activations (M)	305.3
Image size	518 x 518
Papers	Vision Transformers Need Registers; DINOv2: Learning Robust Visual Features without Supervision; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Original repository	https://github.com/facebookresearch/dinov2
Pretraining dataset	LVD-142M
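The parameter count in the table can be roughly cross-checked by hand. The sketch below is a back-of-the-envelope estimate, not a reading of the checkpoint. It assumes the usual ViT-Large hyperparameters (24 blocks, width 1024, MLP ratio 4), a LayerScale vector on each residual branch, positional embeddings for patch tokens only, and no classification head. All of these are assumptions, and `estimate_vit_params` is a hypothetical helper.

```python
def estimate_vit_params(dim=1024, depth=24, mlp_ratio=4, patch_size=14,
                        image_size=518, num_registers=4, in_chans=3):
    """Rough parameter count for a ViT under the assumptions stated above."""
    hidden = dim * mlp_ratio
    # Attention: fused qkv projection plus output projection (weights + biases).
    attn = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)
    # MLP: two linear layers (weights + biases).
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    norms = 2 * (2 * dim)        # two LayerNorms per block, weight + bias
    layer_scale = 2 * dim        # assumed: one scale vector per residual branch
    block = attn + mlp + norms + layer_scale

    patch_embed = in_chans * patch_size * patch_size * dim + dim
    num_patches = (image_size // patch_size) ** 2
    pos_embed = num_patches * dim            # assumed: patch tokens only
    prefix = (1 + num_registers) * dim       # class token + register tokens
    final_norm = 2 * dim
    return depth * block + patch_embed + pos_embed + prefix + final_norm


total = estimate_vit_params()
print(f"{total / 1e6:.1f}M parameters (estimate)")
```

Under these assumptions the estimate lands close to the 304.4M listed in the table. An exact figure can be obtained from a loaded model with `sum(p.numel() for p in model.parameters())`.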

Model comparison

For dataset and runtime metrics of this model, see the timm model results.

📄 License

This model is provided under the Apache-2.0 license.

Citation

@article{darcet2023vision,
  title={Vision Transformers Need Registers},
  author={Darcet, Timoth{\'e}e and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
  journal={arXiv preprint arXiv:2309.16588},
  year={2023}
}

@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}