vit_base_patch16_224.mae: a free, open-source model for image feature extraction

Vit Base Patch16 224.mae

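Developed by timm

An image feature extraction model based on the Vision Transformer (ViT), pretrained on the ImageNet-1k dataset with the self-supervised masked autoencoder (MAE) method.

Image classification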

Transformers

Released: 5/9/2023

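Model Overview

An image feature extraction model built on the Vision Transformer architecture, intended for image classification and feature extraction. It is pretrained with the self-supervised masked autoencoder (MAE) method, which learns visual representations by reconstructing masked image patches rather than from labels.

Model Features

Self-supervised pretraining

Pretrained with the masked autoencoder (MAE) method, so pretraining does not depend on large amounts of labeled data.

Efficient feature extraction

Built on the Vision Transformer architecture; the encoder maps a 224 x 224 image to a 768-dimensional feature vector.

Medium-sized model

85.8 million parameters, balancing computational cost against performance.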
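To make the masked-autoencoder idea above more concrete, here is a minimal sketch of the random patch masking step. It is an illustration only, not the original training code: the 75% mask ratio is taken from the MAE paper rather than from this card, and `random_patch_mask` is a hypothetical helper name.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling.

    Hypothetical helper for illustration; mask_ratio=0.75 is an assumption
    based on the MAE paper, not a value stated in this model card.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    masked = sorted(shuffled[:num_masked])
    visible = sorted(shuffled[num_masked:])
    return visible, masked


# A 224 x 224 image cut into 16 x 16 patches gives a 14 x 14 grid of 196 patches.
num_patches = (224 // 16) ** 2
visible, masked = random_patch_mask(num_patches, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

During MAE pretraining only the visible patches are passed through the encoder, and a lightweight decoder is trained to reconstruct the pixels of the masked ones. The checkpoint described here corresponds to the encoder.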

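Model Capabilities

Image feature extraction

Image classification

Visual representation learning

Use Cases

Computer vision

Image classification

Classifies images, for example by recognizing the object category.

Feature extraction

Serves as a feature extractor for other vision tasks.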

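🚀 vit_base_patch16_224.mae Model Card

A Vision Transformer (ViT) image feature model, pretrained on the ImageNet-1k dataset with the self-supervised masked autoencoder (MAE) method. It can be used for image feature extraction and related tasks.

🚀 Quick Start

The model is loaded through the timm library. The usage examples below cover image classification and image embedding extraction.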

✨ Key Features

Model type: image classification / feature extraction backbone.
Model stats:
- Params (M): 85.8
- GMACs: 17.6
- Activations (M): 23.9
- Image size: 224 x 224
Papers:
- Masked Autoencoders Are Scalable Vision Learners
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Pretraining dataset: ImageNet-1k
Original repository: https://github.com/facebookresearch/mae
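The statistics above can be sanity-checked with a little arithmetic. The sketch below assumes the standard ViT-Base hyperparameters from the ViT paper (embedding width 768, 12 blocks, MLP ratio 4, biased linear layers, one class token, no classifier head); none of these are listed in this card, so treat the result as an estimate rather than an official breakdown.

```python
# Assumed ViT-Base hyperparameters (from the ViT paper, not stated in this card).
image_size, patch_size, in_chans = 224, 16, 3
dim, depth, mlp_ratio = 768, 12, 4

num_patches = (image_size // patch_size) ** 2  # 14 x 14 grid
num_tokens = num_patches + 1                   # patches plus one class token


def linear(n_in, n_out):
    """Parameter count of a linear layer with bias."""
    return n_in * n_out + n_out


patch_embed = linear(patch_size * patch_size * in_chans, dim)  # 16x16 conv, flattened
cls_token = dim
pos_embed = num_tokens * dim
layer_norm = 2 * dim  # scale and shift

block = (
    layer_norm
    + linear(dim, 3 * dim)          # fused query/key/value projection
    + linear(dim, dim)              # attention output projection
    + layer_norm
    + linear(dim, mlp_ratio * dim)  # MLP expansion
    + linear(mlp_ratio * dim, dim)  # MLP contraction
)

total = patch_embed + cls_token + pos_embed + depth * block + layer_norm
print(num_tokens)             # 197
print(round(total / 1e6, 1))  # 85.8
```

Under these assumptions the count agrees with the 85.8 M parameters listed above, and the 197 tokens match the sequence length of the unpooled `forward_features` output.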

💻 Usage Examples

Basic Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # required for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_base_patch16_224.mae', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

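output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

# NOTE: this step assumes the model has a trained classification head. The MAE
# checkpoint is pretrained without labels, so fine-tune a head before reading
# these values as class probabilities.
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)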
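For readers unfamiliar with the last line, the following dependency-free sketch spells out what `softmax` followed by `topk` computes. The logits are made-up values for illustration, and `softmax` and `top_k` here are hypothetical pure-Python stand-ins, not the PyTorch implementations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Return the k largest values and their indices, largest first."""
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in ranked], ranked


logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.0]  # hypothetical class scores
probs_percent = [p * 100 for p in softmax(logits)]
top_values, top_indices = top_k(probs_percent, k=3)
print(top_indices)  # [3, 0, 5]
```

The percentages sum to 100, and the returned indices identify the highest-scoring classes in descending order, which is how `top5_class_indices` would be read once a classification head is available.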

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch16_224.mae',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 197, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
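One common way to use such embeddings downstream is nearest-neighbour comparison by cosine similarity. The sketch below is a rough illustration under stated assumptions: the 4-dimensional vectors are toy stand-ins for the 768-dimensional model outputs (which could be converted with `output[0].tolist()`), and the image names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for model embeddings (hypothetical values).
query = [0.2, 0.9, 0.1, 0.4]
gallery = {
    "img_a": [0.1, 0.8, 0.2, 0.5],
    "img_b": [0.9, 0.1, 0.7, 0.0],
    "img_c": [-0.2, -0.9, -0.1, -0.4],
}

# Rank gallery images from most to least similar to the query.
ranked = sorted(
    gallery,
    key=lambda name: cosine_similarity(query, gallery[name]),
    reverse=True,
)
print(ranked)  # ['img_a', 'img_b', 'img_c']
```

Cosine similarity ignores vector length, which is why it is a frequent choice for comparing embeddings; whether it suits a particular task is something to validate on your own data.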

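📚 Documentation

Dataset and runtime metrics for this model are available in the timm model results.

📄 License

This model is released under the CC BY-NC 4.0 license.

📖 Citation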

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}