vit_base_patch14_dinov2.lvd142m: an open-source image feature model

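Developed by timm

A Vision Transformer (ViT) image feature model, pretrained on the LVD-142M dataset with the self-supervised DINOv2 method.

Image classification

Transformers

Open-source license: Apache-2.0 #self-supervised visual features #large-image processing #DINOv2 pretraining

Downloads: 50.71k

Release date: 5/9/2023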

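Model overview

This model is a backbone network for image classification and feature extraction. It uses the Vision Transformer architecture and is pretrained with self-supervised learning on a large-scale dataset, so it can extract high-quality image feature representations.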

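Model features

Self-supervised pretraining

Pretrained on the LVD-142M dataset with the DINOv2 self-supervised method, so no human-annotated labels are required.

Large-image processing

Accepts large 518 × 518 pixel inputs, which lets it capture richer visual detail.

Efficient feature extraction

A forward pass at the 518 × 518 input size costs 151.7 GMACs, and the model is suited to use as a feature-extraction backbone.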

Model capabilities

Image feature extraction

Image classification

Visual representation learning

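Use cases

Computer vision

Image classification

Applicable to a range of image classification tasks such as object recognition and scene classification.

Feature extraction

Usable as a backbone network for other vision tasks, extracting high-quality image feature representations.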
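
As a rough illustration of using extracted features downstream, two embeddings can be compared with cosine similarity. The sketch below is a toy under stated assumptions: the short vectors stand in for the model's real feature vectors, and `cosine_similarity` is a hypothetical helper written for this example, not part of timm.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for image embeddings (real ones would be much longer).
query = [0.2, 0.1, 0.7]
near = [0.4, 0.2, 1.4]    # same direction as `query`, different scale
far = [0.7, -0.2, -0.1]   # points in a clearly different direction

print(cosine_similarity(query, near))
print(cosine_similarity(query, far))
```

Because cosine similarity ignores vector length, `near` scores as a near-perfect match for `query` even though its values are twice as large.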

🚀 vit_base_patch14_dinov2.lvd142m

An image feature extraction model based on the Vision Transformer (ViT). It is pretrained on the LVD-142M dataset with the self-supervised DINOv2 method.

🚀 Quick start

This model can be used for image classification and image embedding tasks. The sections below show how.

✨ Key features

Applicable to image classification tasks
Can generate image embeddings

📦 Installation

To use this model, install the timm library.

pip install timm
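
To confirm the install worked before running the examples, the package metadata can be queried from Python. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name chosen for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("timm")
if version is None:
    print("timm is not installed; run `pip install timm` first")
else:
    print(f"timm {version} is installed")
```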

💻 Usage examples

Basic usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_base_patch14_dinov2.lvd142m', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

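# NOTE: DINOv2 weights are self-supervised. If the loaded model has no classifier head
# (model.num_classes == 0), `output` holds pooled features rather than class logits.
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)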
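
The last line converts raw scores to percentages with softmax and keeps the five largest. The pure-Python sketch below mirrors that computation on a made-up score vector so the individual steps are visible. `softmax` and `top_k` here are illustrative helpers written for this example, not the torch functions themselves, and the scores are not real model output.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values: Sequence[float], k: int) -> Tuple[List[float], List[int]]:
    """Return the k largest values and their indices, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


# Made-up scores for six classes (not real model output).
scores = [1.0, 3.0, 0.5, 2.0, -1.0, 0.0]
percentages = [p * 100 for p in softmax(scores)]
top_values, top_indices = top_k(percentages, k=5)

print(top_indices)                        # indices ordered by descending score
print([round(v, 2) for v in top_values])  # matching percentages
```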

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch14_dinov2.lvd142m',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1370, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
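
The 1370 in the unpooled shape can be reproduced by hand. The sketch below assumes the patch size of 14 implied by `patch14` in the model name, the 518 x 518 input size, and one extra class token. The class token is an assumption about the token layout, not something stated in the example above.

```python
def vit_token_count(image_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Number of tokens a ViT sees for a square image.

    `extra_tokens` covers prefix tokens such as a class token; the default
    of 1 is an assumption for this model.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


tokens = vit_token_count(image_size=518, patch_size=14)
print(tokens)  # 37 * 37 patches + 1 class token = 1370
```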

📚 Documentation

Model details

Property	Details
Model type	Image classification / feature extraction backbone
Parameters (M)	86.6
GMACs	151.7
Activations (M)	397.6
Image size	518 x 518
Papers	DINOv2: Learning Robust Visual Features without Supervision: https://arxiv.org/abs/2304.07193; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Original repository	https://github.com/facebookresearch/dinov2
Pretraining dataset	LVD-142M
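
The parameter count in the table can be sanity-checked with a little arithmetic. The sketch below assumes the usual ViT-Base layout (12 blocks, width 768, MLP ratio 4, biases on every linear layer), a learnable LayerScale vector after each attention and MLP branch as in DINOv2, a position embedding covering 1370 tokens, and no classifier head. None of these hyperparameters appear in the table, so treat the result as an estimate rather than an official breakdown.

```python
def vit_param_estimate(
    dim: int = 768,            # assumed embedding width (ViT-Base)
    depth: int = 12,           # assumed number of transformer blocks
    mlp_ratio: int = 4,        # assumed MLP expansion factor
    patch_size: int = 14,      # from "patch14" in the model name
    in_chans: int = 3,         # RGB input
    num_tokens: int = 1370,    # 37 * 37 patches + 1 class token at 518 x 518
    layer_scale: bool = True,  # assumed per-branch LayerScale vectors
) -> int:
    """Estimate the number of learnable parameters in a headless ViT."""
    hidden = dim * mlp_ratio
    qkv = dim * 3 * dim + 3 * dim            # fused query/key/value projection
    proj = dim * dim + dim                   # attention output projection
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    norms = 2 * (2 * dim)                    # two LayerNorms, weight + bias each
    scale = 2 * dim if layer_scale else 0    # one vector per residual branch
    block = qkv + proj + mlp + norms + scale

    patch_embed = in_chans * patch_size * patch_size * dim + dim
    pos_embed = num_tokens * dim
    cls_token = dim
    final_norm = 2 * dim
    return depth * block + patch_embed + pos_embed + cls_token + final_norm


total = vit_param_estimate()
print(f"{total / 1e6:.1f} M parameters")
```

Under these assumptions the estimate rounds to the 86.6 M listed in the table; almost all of it sits in the transformer blocks, with the patch and position embeddings contributing roughly 1.5 M.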

Model comparison

Dataset and runtime metrics for this model can be found in the timm model results.

📄 License

This model is provided under the Apache-2.0 license.

🔖 Citation

@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}