vit_small_patch16_224.dino: open-source image feature model for image classification and feature extraction

Vit Small Patch16 224.dino

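Developed by timm

A Vision Transformer (ViT) image feature model trained with the self-supervised DINO method, suited to image classification and feature extraction.

Image classification

Transformers

License: Apache-2.0 #self-supervised ViT #small ViT #image feature extraction

Downloads: 70.62k

Released: 12/22/2022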

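Model overview

This is a Vision Transformer (ViT) image feature model trained with the self-supervised DINO method. It is mainly used for image classification and as a feature backbone for a range of computer vision tasks.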

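Model features

Self-supervised learning

Trained with the self-supervised DINO method, so it learns useful visual representations without requiring large amounts of labeled data.

Efficient architecture

Built on the Vision Transformer architecture, with 21.7M parameters and 4.3 GMACs, which suits moderate compute budgets.

Multi-task support

Usable both for image classification and as a feature-extraction backbone for a variety of downstream computer vision tasks.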

Model capabilities

Image feature extraction

Image classification

Support for computer vision tasks

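Use cases

Computer vision

Image classification

Classifies an input image and outputs a probability distribution over classes.

Performs well on the ImageNet-1k dataset

Feature extraction

Extracts deep feature representations of an image for downstream tasks such as object detection and image retrieval.

Produces 384-dimensional feature vectors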
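
As a rough illustration of how such feature vectors could drive image retrieval, the sketch below ranks a small gallery by cosine similarity to a query. The short 4-dimensional vectors, the file names, and the helpers `cosine_similarity` and `rank_gallery` are hypothetical stand-ins; in practice each vector would be the model's 384-dimensional pooled output for one image.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy stand-ins for image embeddings (real ones would have 384 values each).
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0, 0.2],
    "dog.jpg": [0.1, 0.8, 0.3, 0.0],
    "car.jpg": [0.0, 0.2, 0.9, 0.4],
}
query = [0.8, 0.2, 0.1, 0.1]

ranking = rank_gallery(query, gallery)
print(ranking)
```

Cosine similarity ignores vector length, so it compares the direction of two embeddings rather than their magnitude, which is a common (though not the only) choice for retrieval.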

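🚀 vit_small_patch16_224.dino model card

A Vision Transformer (ViT) image feature model trained with the self-supervised DINO method, usable for tasks such as image feature extraction.

🚀 Quick start

The model is loaded through timm and can be used for image classification and feature extraction.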

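✨ Key features

Trained with the self-supervised DINO method, which learns image features effectively.
Usable for image classification and image embedding extraction.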

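📦 Installation

The source documentation does not give installation steps. timm is published on PyPI (pip install timm); see the official timm installation instructions for details.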
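
One way to confirm that the packages used in the examples are available is to query the installed distributions from Python. This is a minimal sketch: the helper name `installed_version` is illustrative, and the package list (timm, torch, Pillow) is inferred from what the usage examples call, not from any stated requirements.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names assumed from the libraries the usage examples rely on.
report = {name: installed_version(name) for name in ("timm", "torch", "Pillow")}
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```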

💻 Usage examples

Basic usage

from urllib.request import urlopen

import timm
import torch  # required for torch.no_grad and torch.topk below
from PIL import Image

# download a sample image
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_small_patch16_224.dino', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

with torch.no_grad():  # inference only, so skip gradient tracking
    output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

# top-5 scores (as percentages) and their indices along the output dimension;
# they only read as class probabilities if the checkpoint includes a classifier head
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
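
To make the final line concrete, the following dependency-free sketch reproduces what `output.softmax(dim=1) * 100` followed by a top-k selection computes for a single row of scores. The input values and the helper names `softmax_percent` and `top_k` are made up for illustration; the real call operates on the model's output tensor.

```python
import math


def softmax_percent(scores):
    """Softmax over a list of scores, scaled to percentages."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [100.0 * e / total for e in exps]


def top_k(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


scores = [2.0, 0.5, 3.0, -1.0, 1.0, 0.0]  # hypothetical raw outputs for one image
probs = softmax_percent(scores)
top_values, top_indices = top_k(probs, k=5)
print(top_indices)
print([round(v, 2) for v in top_values])
```

Because softmax preserves the ordering of the raw scores, the returned indices are the same ones a top-k over the unnormalized outputs would give; only the reported values change.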

Advanced usage

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_small_patch16_224.dino',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 197, 384) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
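
The (1, 197, 384) shape follows from the patch grid: a 224 x 224 input cut into 16 x 16 patches gives 14 x 14 = 196 patch tokens, plus one extra token. The sketch below reproduces that arithmetic. The function name `token_layout` is illustrative, and the single prepended class token is an assumption based on the standard ViT design, not something this card states.

```python
def token_layout(image_size, patch_size, num_prefix_tokens=1):
    """Return (grid side, patch tokens, total sequence length) for a square ViT input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size
    num_patches = grid * grid
    # num_prefix_tokens=1 assumes one class token is prepended to the patch tokens
    return grid, num_patches, num_patches + num_prefix_tokens


grid, num_patches, seq_len = token_layout(224, 16)
print(grid, num_patches, seq_len)
```

Under the same assumption, dropping the first token and reshaping the remaining 196 tokens to 14 x 14 would give a spatial feature map for dense tasks.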

📚 Documentation

Model details

Property	Details
Model type	Image classification / feature backbone
Model stats	Params (M): 21.7; GMACs: 4.3; Activations (M): 8.2; Image size: 224 x 224
Papers	Emerging Properties in Self-Supervised Vision Transformers: https://arxiv.org/abs/2104.14294; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Pretraining dataset	ImageNet-1k
Original code	https://github.com/facebookresearch/dino
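
The 21.7M parameter figure can be cross-checked by hand. The sketch below counts the parameters of a ViT encoder without a classifier head, taking the 384-dimensional embedding, 16 x 16 patches, and 224 x 224 input from this card. The depth of 12 blocks and the MLP ratio of 4 are assumptions based on the standard ViT-Small configuration, and `vit_param_count` is a hypothetical helper, not a timm function.

```python
def vit_param_count(image_size=224, patch_size=16, dim=384, depth=12, mlp_ratio=4, in_chans=3):
    """Count parameters of a plain ViT encoder with a class token and no classifier head."""
    num_patches = (image_size // patch_size) ** 2
    patch_embed = in_chans * patch_size * patch_size * dim + dim  # conv weight + bias
    cls_token = dim
    pos_embed = (num_patches + 1) * dim  # one position per patch plus the class token
    layer_norm = 2 * dim  # scale + shift

    hidden = mlp_ratio * dim
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)  # qkv + output projection
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)  # two linear layers with biases
    block = layer_norm + attention + layer_norm + mlp

    # depth (12) and mlp_ratio (4) are assumed, not taken from the card
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm  # + final norm


total = vit_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count rounds to the 21.7M listed in the table, with nearly all of the parameters sitting in the 12 transformer blocks.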

Model comparison

Dataset and runtime metrics for this model are available in the timm model results.

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

@inproceedings{caron2021emerging,
  title={Emerging properties in self-supervised vision transformers},
  author={Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J{\'e}gou, Herv{\'e} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
  pages={9650--9660},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}