vit_base_patch14_reg4_dinov2.lvd142m: open-source image feature model with freely available pretrained weights for high-accuracy image feature extraction

Vit Base Patch14 Reg4 Dinov2.lvd142m

Developed by timm

A Vision Transformer (ViT) image feature model with registers, pretrained on the LVD-142M dataset with the self-supervised DINOv2 method.

Image classification

Transformers

Open-source license: Apache-2.0 #self-supervised-visual-features #register-enhanced-ViT #large-scale-image-processing

Downloads: 40.95k

Release date: 10/30/2023

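Model overview

This model is an image feature extraction backbone based on the Vision Transformer (ViT) architecture, extended with register tokens. It is mainly used for image classification and feature extraction.

Model features

Register tokens

The model adds register tokens (four, per the "reg4" in its name): extra learnable tokens in the input sequence that, according to "Vision Transformers Need Registers", are intended to reduce artifacts in ViT feature maps.

Self-supervised pretraining

Pretrained on the LVD-142M dataset with the DINOv2 self-supervised learning method.

Large input size

Accepts 518×518-pixel input images.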
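The 518×518 input size and the 14-pixel patch size in the model name together determine how many tokens the transformer processes. A minimal sketch of that arithmetic follows, assuming one class token and four register tokens (read from "reg4" in the name). `vit_token_count` is a hypothetical helper written for illustration, not a timm function.

```python
def vit_token_count(img_size: int, patch_size: int,
                    num_reg_tokens: int = 0, class_token: bool = True) -> int:
    """Count the tokens a ViT sees for a square image split into square patches."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be a multiple of patch_size")
    grid = img_size // patch_size          # patches per side
    num_patches = grid * grid              # patch tokens
    return num_patches + int(class_token) + num_reg_tokens


# 518 / 14 = 37 patches per side -> 37 * 37 = 1369 patch tokens,
# plus 1 class token and 4 register tokens.
total_tokens = vit_token_count(518, 14, num_reg_tokens=4)
print(total_tokens)  # 1374
```

The result agrees with the (1, 1374, 768) shape that this card reports for the unpooled `forward_features` output.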

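Model capabilities

Image feature extraction

Image classification

Image embedding generation

Use cases

Computer vision

Image classification

Can be used for general image classification tasks, typically with a classifier trained on top of its features.

Feature extraction

Can be used as a backbone that supplies feature representations to downstream vision tasks.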
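One common downstream use of backbone features is similarity search over image embeddings. The sketch below illustrates the idea with cosine similarity in PyTorch. The embeddings are random stand-ins (an assumption made so the example runs without downloading weights); in practice they would be the 768-dimensional vectors produced by the model with `num_classes=0`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in embeddings for a small gallery of 5 images (random, for illustration only).
gallery = torch.randn(5, 768)

# Stand-in query: a slightly perturbed copy of gallery item 2.
query = gallery[2] + 0.01 * torch.randn(768)

# Cosine similarity between the query and every gallery embedding -> shape (5,).
scores = F.cosine_similarity(query.unsqueeze(0), gallery, dim=1)

# Index of the most similar gallery item.
best_match = int(scores.argmax())
print(best_match, scores)
```

Cosine similarity ignores vector length, so it compares embeddings by direction only; whether that is the right metric for a given task is a design choice to validate on real data.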

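🚀 Model card for vit_base_patch14_reg4_dinov2.lvd142m

A Vision Transformer (ViT) image feature model with registers, pretrained on LVD-142M with the self-supervised DINOv2 method.

🚀 Quick start

This model can be used for image classification and feature extraction. The sections below describe how to use it.

✨ Key features

A backbone model suited to image classification and feature extraction.
Pretrained on LVD-142M with the self-supervised DINOv2 method.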

📦 Installation

To use this model, install the timm library.

pip install timm
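The usage examples in this card import `timm`, `torch`, and `PIL` (the import name of the Pillow package). A small, optional check that these modules can be found in the current environment might look like the following; `is_installed` is a hypothetical helper built on the standard library, not part of timm.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Import names (not pip package names) used by the examples in this card.
required = ["timm", "torch", "PIL"]
status = {name: is_installed(name) for name in required}

for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```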

💻 Usage examples

Basic usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_base_patch14_reg4_dinov2.lvd142m', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

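output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

# NOTE: this checkpoint is published as a feature backbone. If it is loaded without a
# classifier head, `output` holds pooled features rather than class logits, and the
# top-5 below is only meaningful once a classification head has been trained.
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)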
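Because the checkpoint is a self-supervised backbone, one way to obtain meaningful class predictions is to train a small classifier on top of frozen features (a linear probe). The following is a minimal sketch under stated assumptions: the features and labels are random stand-ins, `num_classes = 10` is a hypothetical label count, and the 768-wide feature size is taken from the tensor shapes quoted in this card. It is not the training recipe used by the DINOv2 authors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_features = 768  # embedding width, per the (1, 1374, 768) shape in this card
num_classes = 10    # hypothetical number of target classes

# Random stand-ins for frozen backbone embeddings and their labels.
features = torch.randn(32, num_features)
labels = torch.randint(0, num_classes, (32,))

# The linear probe: a single linear layer trained on top of fixed features.
probe = nn.Linear(num_features, num_classes)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Top-5 class probabilities (in percent) from the trained probe.
with torch.no_grad():
    probs = probe(features).softmax(dim=1) * 100
top5_probabilities, top5_class_indices = torch.topk(probs, k=5)
```

With real data, the stand-in `features` would be replaced by embeddings computed once with the backbone in `eval()` mode, so only the probe's parameters are updated.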

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch14_reg4_dinov2.lvd142m',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1374, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
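For dense tasks it can help to separate the unpooled (1, 1374, 768) output into its parts. The sketch below uses a random tensor as a stand-in for the `forward_features` output and assumes the token order is class token first, then the four register tokens, then the 37×37 patch tokens. That ordering is an assumption about the timm implementation and is worth confirming against the installed version before relying on it.

```python
import torch

img_size, patch_size = 518, 14
num_reg_tokens = 4                    # "reg4" in the model name
grid = img_size // patch_size         # 37 patches per side
num_prefix = 1 + num_reg_tokens       # class token + registers (assumed order)

# Stand-in for model.forward_features(...) output: (batch, tokens, channels).
tokens = torch.randn(1, 1374, 768)

cls_token = tokens[:, 0]                  # (1, 768)
reg_tokens = tokens[:, 1:num_prefix]      # (1, 4, 768)
patch_tokens = tokens[:, num_prefix:]     # (1, 1369, 768)

# Rearrange patch tokens into a spatial map: (B, N, C) -> (B, C, H, W).
feature_map = patch_tokens.reshape(1, grid, grid, -1).permute(0, 3, 1, 2)
print(feature_map.shape)
```

The resulting (1, 768, 37, 37) map can be fed to a decoder head for tasks such as segmentation or depth estimation.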

📚 Documentation

Model details

Attribute	Details
Model type	Image classification / feature backbone
Parameters (M)	86.6
GMACs	117.5
Activations (M)	115.0
Image size	518 x 518
Papers	Vision Transformers Need Registers: https://arxiv.org/abs/2309.16588; DINOv2: Learning Robust Visual Features without Supervision: https://arxiv.org/abs/2304.07193; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Original	https://github.com/facebookresearch/dinov2
Pretraining dataset	LVD-142M

Model comparison

Explore the dataset and runtime metrics for this model in the timm model results.

📄 License

This model is provided under the Apache-2.0 license.

Citation

@article{darcet2023vision,
  title={Vision Transformers Need Registers},
  author={Darcet, Timoth{\'e}e and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
  journal={arXiv preprint arXiv:2309.16588},
  year={2023}
}

@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}