vit_base_patch8_224.dino: an open-source image feature model, free to use for image classification and feature extraction

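Developed by timm

A Vision Transformer (ViT) image feature model trained with the self-supervised DINO method, suitable for image classification and feature extraction.

Image classification

Transformers

License: Apache-2.0 #self-supervised vision Transformer #image feature extraction #high-accuracy classification

Downloads: 9,287

Released: 12/22/2022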

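Model overview

This model is a Vision Transformer (ViT) trained with the self-supervised DINO method. It is mainly used for image classification and as a feature backbone: it extracts image feature representations that can serve a range of computer vision tasks.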

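Model features

Self-supervised learning

Trained with the DINO self-supervised method, which learns image representations without using labels.

Efficient feature extraction

Extracts image feature representations suitable for downstream computer vision tasks.

ViT architecture

Built on the Vision Transformer architecture, whose self-attention lets every image patch attend to every other patch, giving a global receptive field.

Pretrained model

Pretrained on the ImageNet-1k dataset and usable directly for transfer learning.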
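
To make the DINO objective more concrete, here is a minimal pure-Python sketch of the self-distillation loss described in the DINO paper: a student is trained to match the output distribution of a teacher whose outputs are centered and sharpened with a lower temperature, and the teacher weights track an exponential moving average of the student weights. The function names, the toy logits, and the temperature and momentum values are illustrative assumptions, not code from the original repository.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher
    distribution and the student distribution (assumed temperatures)."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_probs = softmax(student_logits, student_temp)
    return -sum(p_t * math.log(p_s)
                for p_t, p_s in zip(teacher_probs, student_probs))

def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher weight a small step toward the student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: hypothetical outputs for two views of the same image.
teacher_out = [2.0, 0.5, -1.0]
center = [0.0, 0.0, 0.0]
agreeing_student = [2.0, 0.5, -1.0]
disagreeing_student = [-1.0, 0.5, 2.0]

loss_agree = dino_loss(agreeing_student, teacher_out, center)
loss_disagree = dino_loss(disagreeing_student, teacher_out, center)
print(loss_agree < loss_disagree)  # True: agreement gives the lower loss
```

In actual training the loss is averaged over several augmented views and the center is itself a running mean of teacher outputs; both are omitted here for brevity.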

Model capabilities

Image classification

Image feature extraction

Backbone for computer vision tasks

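Use cases

Computer vision

Image classification

Use the model to classify images.

Reported to perform well on benchmarks such as ImageNet-1k (the DINO paper evaluates the frozen features with a linear classifier or k-NN).

Feature extraction

Extract image features for downstream tasks.

Provides image representations that transfer to other tasks.

Transfer learning

Use the pretrained model as the starting point for fine-tuning on a domain-specific task.

Reduces the amount of training data needed and can improve model performance.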

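🚀 vit_base_patch8_224.dino model card

A Vision Transformer (ViT) image feature model trained with the self-supervised DINO method.

🚀 Quick start

The model can be used for image classification and for extracting image embeddings; usage examples follow.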

✨ Key features

Model type: image classification / feature backbone
Model statistics:
- Parameters (M): 85.8
- GMACs: 66.9
- Activations (M): 65.7
- Image size: 224 x 224
Related papers:
- Emerging Properties in Self-Supervised Vision Transformers: https://arxiv.org/abs/2104.14294
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Pretraining dataset: ImageNet-1k
Original code repository: https://github.com/facebookresearch/dino
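
The parameter figure above can be roughly cross-checked by hand. The sketch below counts tokens and parameters assuming the standard ViT-Base configuration from the ViT paper (12 blocks, embedding width 768, MLP width 3072, biases on all linear layers), a learned class token and position embedding, and no classifier head. These hyperparameters and the function names are assumptions for illustration, not values read from the checkpoint.

```python
def vit_token_count(image_size=224, patch_size=8):
    """Number of tokens: one per patch plus one class token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1

def vit_param_count(image_size=224, patch_size=8, in_chans=3,
                    dim=768, depth=12, mlp_dim=3072):
    """Parameter count for a headless ViT under the assumed layout."""
    tokens = vit_token_count(image_size, patch_size)
    # Patch embedding: convolution weight plus bias
    patch_embed = in_chans * patch_size * patch_size * dim + dim
    cls_token = dim
    pos_embed = tokens * dim
    layer_norm = 2 * dim  # scale and shift
    # Attention: fused qkv projection plus output projection
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)
    # MLP: two linear layers
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)
    block = 2 * layer_norm + attention + mlp
    # The trailing layer_norm is the final norm after the last block
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm

tokens = vit_token_count()
params = vit_param_count()
print(tokens)                  # 785
print(round(params / 1e6, 1))  # 85.8
```

Under these assumptions the count is 85,807,872, which rounds to the 85.8 M listed above, and the 785 tokens match the (1, 785, 768) shape of the unpooled features in the embedding example. A 1000-class linear head would add another 769,000 parameters, so the listed figure suggests the checkpoint is counted without a classifier.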

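📦 Installation

The original model card does not give installation steps. The examples here use the timm library, which can be installed with: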

pip install timm

💻 Usage examples

Basic usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_base_patch8_224.dino', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

# NOTE: DINO pretraining is self-supervised. If this checkpoint has no classifier
# head (model.num_classes == 0), `output` holds features rather than class logits.
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
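
The last line of the example turns raw model outputs into percentages and keeps the five largest. A dependency-free sketch of that same computation, using made-up scores rather than real model output, could look like this. The helper names are hypothetical.

```python
import math

def softmax_percent(scores):
    """Softmax over a list of scores, scaled to percentages."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [100.0 * e / total for e in exps]

def top_k(values, k=5):
    """Return (values, indices) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order

scores = [0.1, 2.3, -1.0, 4.2, 0.7, 3.1, 1.5]  # made-up scores, not model output
probabilities, indices = top_k(softmax_percent(scores), k=5)
print(indices)  # [3, 5, 1, 6, 4]
```

As with `torch.topk`, the result pairs each kept value with the index it came from, so the indices can be mapped back to class names when a label list is available.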

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch8_224.dino',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 785, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
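
One common way to use such embeddings downstream is to compare images by cosine similarity. The sketch below works on plain Python lists; the short vectors are made-up stand-ins for real embeddings (which could be obtained with something like `output[0].detach().tolist()`), and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, gallery):
    """Return the index of the gallery vector closest to the query,
    together with all similarity scores."""
    sims = [cosine_similarity(query, g) for g in gallery]
    best = max(range(len(sims)), key=lambda i: sims[i])
    return best, sims

# Made-up 3-dimensional "embeddings" for illustration only.
query = [1.0, 0.0, 1.0]
gallery = [
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [2.0, 0.1, 1.9],    # nearly the same direction as the query
    [-1.0, 0.0, -1.0],  # opposite direction
]

best_index, similarities = most_similar(query, gallery)
print(best_index)  # 1
```

Cosine similarity ignores vector length, so two embeddings that point in the same direction score close to 1 regardless of their magnitude.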

📚 Documentation

Dataset and runtime metrics for this model can be explored in the timm model results.

📄 License

This project is released under the Apache-2.0 license.

📖 Citation

@inproceedings{caron2021emerging,
  title={Emerging properties in self-supervised vision transformers},
  author={Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J{\'e}gou, Herv{\'e} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
  pages={9650--9660},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}