vit_large_patch16_224.mae: open-source image feature extraction model

Vit Large Patch16 224.mae

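Developed by timm

A large image feature extraction model based on the Vision Transformer (ViT), pretrained on the ImageNet-1k dataset with the self-supervised masked autoencoder (MAE) method.

Image classification

Transformers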

#self-supervised-visual-features #high-parameter-ViT #image-semantic-encoding

Downloads: 960

Released: 5/9/2023

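Model overview

A large Vision Transformer (ViT) image feature model, mainly intended for image classification and feature extraction. It was pretrained on the ImageNet-1k dataset with the self-supervised masked autoencoder (MAE) method.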

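Model features

Self-supervised pretraining

Pretrained with the masked autoencoder (MAE) method, which learns feature representations from unlabeled images by reconstructing masked patches, so no large labeled dataset is needed.

Large-scale Vision Transformer

Built on the ViT-Large architecture with 303.3M parameters, giving it the capacity to capture rich visual features.

Feature extraction

Supports extracting either a global image feature or per-patch local features, which suits a range of downstream vision tasks.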
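
The masked-autoencoder pretraining described above hides most image patches from the encoder and trains the model to reconstruct them. The following is a minimal standard-library sketch of that random patch masking step, not the code used to train this model. The helper name `random_masking` is hypothetical, and the 75% mask ratio is taken from the MAE paper rather than from this model card.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets, MAE-style.

    Hypothetical helper for illustration only; it is not part of timm
    or of the original MAE codebase.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)                   # random permutation of patch indices
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_keep])   # patches the encoder gets to see
    masked = sorted(indices[num_keep:])    # patches that must be reconstructed
    return visible, masked


# A 224 x 224 image cut into 16 x 16 patches yields 14 * 14 = 196 patches.
visible, masked = random_masking(196)
print(len(visible), len(masked))
```

With these assumed settings the encoder processes only a quarter of the patches, which is what makes this style of pretraining comparatively cheap per image.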

Model capabilities

Image classification

Image feature extraction

Visual representation learning

Use cases

Computer vision

Image classification

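Can classify images into the 1,000 ImageNet classes once a classification head has been fine-tuned. MAE pretraining is self-supervised, so the pretrained weights alone do not provide a classifier trained on labels.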

Feature extraction

Can serve as a feature extractor for downstream vision tasks such as object detection and image segmentation.

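🚀 Model card for vit_large_patch16_224.mae

A Vision Transformer (ViT) image feature model, pretrained on ImageNet-1k with the self-supervised masked autoencoder (MAE) method. It can be used for tasks such as image feature extraction.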

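🚀 Quick start

This model is an image feature model built on the Vision Transformer (ViT) architecture and pretrained on the ImageNet-1k dataset with the self-supervised Masked Autoencoder (MAE) method. The examples under "Usage examples" show how to use it for image classification and for extracting image embeddings.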

✨ Key features

Model type: image classification / feature backbone
Model stats:
- Params (M): 303.3
- GMACs: 61.6
- Activations (M): 63.5
- Image size: 224 x 224
Papers:
- Masked Autoencoders Are Scalable Vision Learners
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Pretraining dataset: ImageNet-1k
Original code: https://github.com/facebookresearch/mae
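
As a sanity check on the parameter count listed above, the figure can be reproduced by hand. This sketch assumes the standard ViT-Large hyperparameters from the ViT paper (24 blocks, width 1024, MLP width 4096), learned position embeddings with one class token, and no classifier head. None of these values are stated in this card, and the function name `vit_param_count` is hypothetical.

```python
def vit_param_count(img_size=224, patch=16, dim=1024, depth=24,
                    mlp_dim=4096, in_chans=3):
    """Count weights and biases of a plain ViT encoder without a classifier head."""
    tokens = (img_size // patch) ** 2 + 1                # patch tokens + class token
    patch_embed = patch * patch * in_chans * dim + dim   # linear patch projection
    cls_token = dim
    pos_embed = tokens * dim
    layer_norm = 2 * dim                                 # scale + shift
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)    # qkv + output proj
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)      # two linear layers
    block = layer_norm + attention + layer_norm + mlp
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm


total = vit_param_count()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the total rounds to the 303.3M reported in the stats, which is consistent with the checkpoint being a headless feature backbone.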

💻 Usage examples

Basic usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_large_patch16_224.mae', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
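
To make the last line of the snippet above concrete, here is a small standard-library sketch of what a softmax followed by a top-k selection computes. The logits are made-up values, not model output, and the helpers `softmax` and `topk` are hypothetical stand-ins for the PyTorch operations.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk(values, k):
    """Return the k largest values and their indices, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.0]         # made-up scores for six classes
percentages = [p * 100 for p in softmax(logits)]
top_values, top_indices = topk(percentages, k=5)
print(top_indices)
```

The returned indices identify the highest-scoring output positions; they map to class labels only when the model has a trained classification head.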

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_large_patch16_224.mae',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 197, 1024) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
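
The (1, 197, 1024) shape noted above follows from the patch grid: a 224 x 224 image split into 16 x 16 patches gives 196 patch tokens, plus one class token. The sketch below works through that arithmetic and shows two common ways of reducing the token sequence to a single vector, using a toy 4-dimensional sequence instead of real model output. It assumes the class token sits at index 0, and that pooling selects the class token; the pooling actually configured for this checkpoint is not stated in this card.

```python
def num_tokens(img_size=224, patch_size=16, class_token=True):
    """Sequence length a ViT produces for a square image."""
    grid = img_size // patch_size
    return grid * grid + (1 if class_token else 0)


# Toy stand-in for one image's unpooled features: 197 tokens of width 4
# (the real model uses width 1024).
dim = 4
tokens = [[float(t * 10 + d) for d in range(dim)] for t in range(num_tokens())]

cls_embedding = tokens[0]      # class-token pooling: keep the first token only
patch_tokens = tokens[1:]      # the 196 per-patch local features

# Alternative global feature: average over the patch tokens.
mean_patch = [sum(col) / len(patch_tokens) for col in zip(*patch_tokens)]

print(num_tokens(), len(patch_tokens), cls_embedding)
```

The per-patch tokens are what downstream dense tasks such as detection or segmentation typically consume, while the pooled vector serves as the global image embedding.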

📚 Detailed documentation

Explore this model's dataset and runtime metrics in the timm model results.

📄 License

This model is released under the CC-BY-NC-4.0 license.

📖 Citation

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}