vit_base_patch16_224.dino Open-Source Image Model - Free for Image Classification and Feature Extraction

Developed by timm

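A Vision Transformer (ViT) image feature model trained with the self-supervised DINO method, suitable for image classification and feature extraction.

Image Classification

Transformers

Open Source License: Apache-2.0 #SelfSupervisedLearning #ImageFeatureExtraction #VisionTransformer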

Downloads 33.45k

Release Time : 12/22/2022

Model Overview

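This model is a Vision Transformer trained with the DINO self-supervised learning method. It is mainly used for image classification and as a backbone network for feature extraction.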

Model Features

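Self-supervised learning

Trained with the DINO self-supervised method, so it learns effective visual representations without requiring large amounts of labeled data.

Vision Transformer architecture

Uses the standard ViT-B/16 architecture, which splits each image into 16x16-pixel patches for processing.

Efficient feature extraction

Serves as a feature-extraction backbone that outputs 768-dimensional feature vectors.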
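As a rough illustration of the patching arithmetic, the sketch below computes how many tokens a 224 x 224 input produces with 16 x 16 patches. The helper name `vit_token_count` is hypothetical, and the single extra class token is an assumption based on the standard ViT design.

```python
def vit_token_count(image_size: int = 224, patch_size: int = 16,
                    num_prefix_tokens: int = 1) -> int:
    """Return the token sequence length for a square image.

    num_prefix_tokens=1 assumes one class token is prepended (standard ViT).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size          # patches per side
    return grid * grid + num_prefix_tokens   # patch tokens + prefix tokens


patches_per_side = 224 // 16
num_tokens = vit_token_count()
print(f"{patches_per_side} x {patches_per_side} patches, {num_tokens} tokens")
```

Each of those tokens carries a 768-dimensional vector, which matches the width of the pooled feature output.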

Model Capabilities

Image classification

Image feature extraction

Visual representation learning

Use Cases

Computer vision

Image classification

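Classifies an image and outputs class probabilities over ImageNet-1k. Because DINO pretraining uses no labels, this presumably requires a classifier head trained on top of the backbone, for example a linear probe.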

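Feature extraction

Extracts high-level feature representations of an image for downstream tasks such as object detection and image retrieval.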
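For the image retrieval use case, one common approach is to rank gallery images by the cosine similarity of their embeddings to a query embedding. The sketch below is only an illustration: the function names, file names, and 4-dimensional toy vectors are made up, standing in for the 768-dimensional embeddings the model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in gallery.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings; real ones would come from the model.
gallery = {
    "beignets_2.jpg": [0.9, 0.1, 0.0, 0.2],
    "street.jpg": [0.0, 0.8, 0.6, 0.1],
    "donuts.jpg": [0.7, 0.3, 0.1, 0.1],
}
query = [1.0, 0.1, 0.0, 0.2]

ranking = rank_gallery(query, gallery)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With real embeddings the same ranking would usually be computed in batch on normalized tensors, but the scoring rule is the same.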

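🚀 vit_base_patch16_224.dino Model Card

An image feature extraction model based on the Vision Transformer (ViT), trained with the self-supervised DINO method.

🚀 Quick Start

The model can be used for image classification and for extracting image embeddings. Usage examples follow.

💻 Usage Examples

Basic Usage

Image Classification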

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # required for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_base_patch16_224.dino', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
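To make the last line of the example concrete, here is a dependency-free sketch of what `softmax(dim=1) * 100` followed by a top-k selection computes for a single row of scores. The helper names and the six example scores are made up for illustration. The real call operates on a batched tensor with one score per class.

```python
import math


def softmax_percent(logits):
    """Softmax over one row of scores, scaled to percentages."""
    shift = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]


def top_k(values, k):
    """Return the k largest values and their indices, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.0]      # hypothetical scores for 6 classes
probs = softmax_percent(logits)
top_probs, top_idx = top_k(probs, k=5)

for idx, prob in zip(top_idx, top_probs):
    print(f"class {idx}: {prob:.1f}%")
```

The returned indices are positions in the model's output vector. Mapping them to human-readable labels requires a class list that matches the classifier head in use.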

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch16_224.dino',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 197, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor

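📚 Detailed Documentation

Model Details

Property	Details
Model type	Image classification / feature backbone
Model stats	Params (M): 85.8; GMACs: 16.9; Activations (M): 16.5; Image size: 224 x 224
Related papers	Emerging Properties in Self-Supervised Vision Transformers; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Pretraining dataset	ImageNet-1k
Original codebase	https://github.com/facebookresearch/dino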
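The parameter count in the table can be roughly reproduced by hand. The sketch below assumes the usual ViT-B/16 Base configuration (12 blocks, width 768, MLP width 3072, one class token, learned position embeddings, biases on every linear layer) and no classifier head. These hyperparameters are assumptions, not values read from the checkpoint.

```python
def vit_param_count(img_size=224, patch=16, in_chans=3, dim=768,
                    depth=12, mlp_dim=3072):
    """Count parameters of a headless ViT under the assumptions stated above."""
    num_patches = (img_size // patch) ** 2

    patch_embed = patch * patch * in_chans * dim + dim   # conv projection + bias
    cls_token = dim
    pos_embed = (num_patches + 1) * dim                  # patches + class token

    layer_norm = 2 * dim                                 # scale + shift
    attn = (dim * 3 * dim + 3 * dim) + (dim * dim + dim) # qkv + output projection
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)
    block = 2 * layer_norm + attn + mlp

    final_norm = 2 * dim
    return patch_embed + cls_token + pos_embed + depth * block + final_norm


total = vit_param_count()
print(f"{total:,} parameters ({total / 1e6:.1f} M)")
```

Under these assumptions the total rounds to the 85.8 M listed in the table, which suggests the figure refers to the backbone without a 1000-class head.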

Model Comparison

Dataset and runtime metrics for this model are available in the timm model results.

Citation

@inproceedings{caron2021emerging,
  title={Emerging properties in self-supervised vision transformers},
  author={Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J{\'e}gou, Herv{\'e} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
  pages={9650--9660},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}