vit_base_patch8_224.dino: an open-source image feature model, free to use for image classification and feature extraction


Developed by timm

A Vision Transformer (ViT) image feature model trained with the self-supervised DINO method, suitable for image classification and feature extraction.

Image Classification

Transformers

Open Source License: Apache-2.0 #Self-supervised Vision Transformer #Image Feature Extraction #High-accuracy Classification

Downloads 9,287

Release Time : 12/22/2022

Model Overview

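This model is a Vision Transformer (ViT) trained with the self-supervised DINO method. It is intended for image classification and for use as a feature backbone: it extracts feature representations from images that can be reused across computer vision tasks.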

Model Features

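Self-supervised learning

Trained with the DINO self-supervised method, so it learns useful image representations without requiring large amounts of labeled data

Efficient feature extraction

Extracts high-quality image feature representations suitable for downstream computer vision tasks

ViT architecture

Built on the Vision Transformer architecture, whose self-attention gives every layer a global receptive field

Pretrained model

Pretrained on the ImageNet-1k dataset and ready to use for transfer learning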
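
The self-supervised training mentioned above can be illustrated with a dependency-free sketch of the DINO objective as described in the DINO paper: a student network is trained to match a centered, temperature-sharpened teacher distribution, while the teacher weights and the center are exponential moving averages. The function names, the toy 4-dimensional outputs, and the fixed temperatures (0.1 for the student and 0.04 for the teacher, which we take to be the paper's starting values) are assumptions for illustration, not code from the DINO repository. The real method also averages this loss over several crops of each image.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution for one pair of views (illustrative)."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_probs = softmax(student_logits, student_temp)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))

def ema_update(old, new, momentum):
    """Exponential moving average, the update rule used for the center
    (and, applied to weights, for the teacher network)."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]

# Toy head outputs (hypothetical values, not real model outputs).
center = [0.0, 0.0, 0.0, 0.0]
teacher_out = [2.0, 0.5, 0.1, -1.0]
agree = dino_loss([2.0, 0.5, 0.1, -1.0], teacher_out, center)
disagree = dino_loss([-1.0, 0.1, 0.5, 2.0], teacher_out, center)
center = ema_update(center, teacher_out, momentum=0.9)

print(f"loss when student agrees:    {agree:.4f}")
print(f"loss when student disagrees: {disagree:.4f}")
```

The loss is small when the student ranks the outputs the same way as the teacher and large when it does not, which is the signal that replaces labels during pretraining.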

Model Capabilities

Image classification

Image feature extraction

Backbone network for computer vision tasks

Use Cases

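Computer vision

Image classification

Use the model to classify images

Performs well on benchmark datasets such as ImageNet-1k

Feature extraction

Extract image features for downstream tasks

Provides high-quality image representations

Transfer learning

Use as a pretrained model for fine-tuning on domain-specific tasks

Reduces the amount of training data needed and improves model performance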
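
For the feature-extraction use case above, one common way to use embeddings downstream is nearest-neighbor lookup by cosine similarity. The sketch below is a minimal, dependency-free illustration: the 3-dimensional vectors and the labels are made-up stand-ins for the model's real embeddings, and `cosine_similarity` and `nearest` are hypothetical helper names, not part of timm.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, gallery):
    """Return (label, similarity) of the gallery embedding closest to query."""
    scored = ((label, cosine_similarity(query, emb)) for label, emb in gallery.items())
    return max(scored, key=lambda pair: pair[1])

# Toy embeddings (assumed values); real ones would come from the model.
gallery = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.2],
    "car": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

label, score = nearest(query, gallery)
print(label, round(score, 3))
```

In practice the gallery would hold embeddings of labeled reference images, and the query would be the embedding of a new image.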

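🚀 vit_base_patch8_224.dino model card

A Vision Transformer (ViT) image feature model trained with the self-supervised DINO method.

🚀 Quick start

This model can be used for image classification and for extracting image embeddings; usage examples follow.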

✨ Key features

Model type: image classification / feature backbone
Model stats:
- Params (M): 85.8
- GMACs: 66.9
- Activations (M): 65.7
- Image size: 224 x 224
Papers:
- Emerging Properties in Self-Supervised Vision Transformers: https://arxiv.org/abs/2104.14294
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Pretraining dataset: ImageNet-1k
Original code: https://github.com/facebookresearch/dino
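
The 85.8 M parameter figure above can be sanity-checked with simple arithmetic. The sketch below counts the weights of a ViT with no classification head, assuming the usual ViT-Base settings (depth 12, MLP ratio 4, biases on all linear layers), which this card does not state explicitly; the embedding width of 768 matches the output shapes quoted in the usage examples. `vit_param_count` is a hypothetical helper, not a timm function.

```python
def vit_param_count(img_size=224, patch_size=8, in_chans=3,
                    embed_dim=768, depth=12, mlp_ratio=4):
    """Count parameters of a headless ViT under the stated assumptions."""
    num_patches = (img_size // patch_size) ** 2
    hidden = embed_dim * mlp_ratio

    # Patch embedding: a conv with kernel = stride = patch_size, plus bias.
    patch_embed = embed_dim * in_chans * patch_size * patch_size + embed_dim
    cls_token = embed_dim
    pos_embed = (num_patches + 1) * embed_dim  # one position per token

    # One transformer block.
    qkv = embed_dim * 3 * embed_dim + 3 * embed_dim
    proj = embed_dim * embed_dim + embed_dim
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    norms = 2 * (2 * embed_dim)  # two LayerNorms, each with weight and bias
    block = qkv + proj + mlp + norms

    final_norm = 2 * embed_dim
    return patch_embed + cls_token + pos_embed + depth * block + final_norm

total = vit_param_count()
print(f"{total:,} parameters = {total / 1e6:.1f} M")
# prints "85,807,872 parameters = 85.8 M"
```

Under these assumptions the count agrees with the listed value, and almost all of it sits in the 12 transformer blocks rather than in the patch or position embeddings.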

📦 Installation

The source documentation does not list installation steps. To use the timm library, install it with:

pip install timm

💻 Usage examples

Basic usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_base_patch8_224.dino', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
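
The last line of the example converts raw scores to percentages with a softmax and keeps the five largest. A dependency-free sketch of the same computation on a toy score vector may make the step easier to follow; the scores are made up, and `top_k_percentages` is a hypothetical helper rather than a timm or PyTorch function.

```python
import math

def top_k_percentages(logits, k=5):
    """Softmax the scores, scale to percent, return the k largest as (index, pct)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    percentages = [100.0 * e / total for e in exps]
    ranked = sorted(range(len(logits)), key=lambda i: percentages[i], reverse=True)
    return [(i, percentages[i]) for i in ranked[:k]]

# Toy scores for 7 classes (assumed values).
toy_logits = [0.2, 3.1, -1.0, 1.5, 0.0, 2.4, -0.3]
top5 = top_k_percentages(toy_logits, k=5)

for index, pct in top5:
    print(f"class {index}: {pct:.1f}%")
```

With the real model, each returned index would be mapped to a label from the class list the model was trained with.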

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch8_224.dino',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 785, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
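
The 785 in the unpooled shape above follows from the patch size: a 224 x 224 image cut into 8 x 8 patches gives 28 x 28 = 784 patch tokens, plus one class token. A minimal sketch of that arithmetic, with `vit_token_count` as a hypothetical helper name:

```python
def vit_token_count(img_size=224, patch_size=8, num_prefix_tokens=1):
    """Sequence length seen by the transformer: patch tokens plus prefix tokens."""
    if img_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = img_size // patch_size
    return per_side * per_side + num_prefix_tokens

tokens_patch8 = vit_token_count(patch_size=8)
tokens_patch16 = vit_token_count(patch_size=16)
print(tokens_patch8)   # 785
print(tokens_patch16)  # 197
```

The patch-8 variant processes about four times as many tokens as a patch-16 model at the same resolution, which is consistent with its comparatively high GMACs figure.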

📚 Documentation

Explore the dataset and runtime metrics of this model in the timm model results.

📄 License

This project is licensed under Apache-2.0.

📖 Citation

@inproceedings{caron2021emerging,
  title={Emerging properties in self-supervised vision transformers},
  author={Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J{\'e}gou, Herv{\'e} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
  pages={9650--9660},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}