vit_huge_patch14_224.mae: an open-source model for large-scale image feature extraction


Vit Huge Patch14 224.mae

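Developed by timm

A large Vision Transformer (ViT) model for image feature extraction, pretrained on the ImageNet-1k dataset with the self-supervised masked autoencoder (MAE) method.

Image classification

Transformers

#Self-supervised Vision Transformer #Large-scale image feature extraction #Masked autoencoder pretraining

Downloads: 104

Release date: 5/9/2023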

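Model overview

This is an image feature extraction model built on the Vision Transformer architecture, intended mainly for image classification and feature extraction. It is pretrained with the self-supervised masked autoencoder (MAE) method, in which the network learns to reconstruct image patches that were hidden from it, and so learns high-level image representations without labels.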
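To make the pretraining idea concrete, here is a minimal, dependency-free sketch of the random patch masking that MAE is built on. The 75% mask ratio is the default reported in the MAE paper rather than something stated on this page, and `random_patch_mask` is a hypothetical helper, not part of timm.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) lists, MAE-style.

    Assumption: mask_ratio=0.75 follows the default in the MAE paper.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # uniform random order over all patches
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_keep])  # patches the encoder sees
    masked = sorted(indices[num_keep:])   # patches the decoder must reconstruct
    return visible, masked


# 224 / 14 = 16 patches per side, so 256 patches per image
visible, masked = random_patch_mask(16 * 16)
print(len(visible), len(masked))
```

Only the visible patches pass through the large encoder during pretraining, which is a big part of why the method scales to models of this size.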

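Model features

Large-scale Vision Transformer

Uses the ViT-Huge architecture with about 630 million parameters, giving it the capacity to model complex visual features.

Self-supervised pretraining

Pretrained with the masked autoencoder (MAE) method, which learns from unlabelled images and so needs no annotated data for pretraining.

Input resolution

Takes 224×224-pixel input images, which it splits into 14×14-pixel patches.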
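The input geometry follows directly from the model name. As a quick worked check (the variable names are illustrative), a 224×224 image cut into 14×14 patches yields 256 patch tokens, and one extra class token gives the sequence length of 257 seen in the unpooled output shape of the embedding example.

```python
image_size = 224  # input height and width in pixels
patch_size = 14   # from "patch14" in the model name

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
num_tokens = num_patches + 1  # assumes a single class token is prepended

print(patches_per_side, num_patches, num_tokens)  # 16 256 257
```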

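Model capabilities

Image feature extraction

Image classification

Visual representation learning

Use cases

Computer vision

Image classification

Can classify image content such as objects and scenes, typically after a classification head has been fine-tuned on labelled data.

Feature extraction

Can serve as a feature extractor that supplies high-quality image representations to downstream vision tasks.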
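One common way to use such representations downstream is nearest-neighbour retrieval by cosine similarity. The sketch below is illustrative only: the three-dimensional vectors stand in for real 1280-dimensional embeddings, and `cosine_similarity` and `most_similar` are hypothetical helpers, not timm functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, gallery):
    """Return the gallery key whose embedding is closest to the query."""
    return max(gallery, key=lambda name: cosine_similarity(query, gallery[name]))


# toy embeddings, made up for illustration
gallery = {
    "beignets": [0.9, 0.1, 0.0],
    "bicycle": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
print(most_similar(query, gallery))  # beignets
```

In practice the gallery vectors would come from running the model over a set of reference images once and caching the results.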

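🚀 Model card for vit_huge_patch14_224.mae

A Vision Transformer (ViT) image feature model, pretrained on ImageNet-1k with the self-supervised masked autoencoder (MAE) method. It can be used for tasks such as image feature extraction.

🚀 Quick start

The examples below load the model through timm and run it on a single image, first for classification and then for embedding extraction.

💻 Usage examples

Basic usage

Image classification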

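from urllib.request import urlopen

import timm
import torch  # needed for torch.no_grad and torch.topk below
from PIL import Image

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_huge_patch14_224.mae', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

# inference only, so skip gradient tracking
with torch.no_grad():
    output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

# convert logits to percentages and keep the five highest-scoring classes
# note: MAE pretraining is self-supervised; if the checkpoint ships without a
# fine-tuned classifier head, these scores are not class probabilities
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)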
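
The last line of the example turns raw logits into percentages and keeps the five largest. A dependency-free sketch of that computation, with made-up logits and a hypothetical helper name `top_k_percentages`, looks like this:

```python
import math


def top_k_percentages(logits, k=5):
    """Softmax the logits, scale to percent, return the k largest as (index, percent)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    percents = [100.0 * e / total for e in exps]
    ranked = sorted(enumerate(percents), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.0]  # made-up scores for six classes
for index, percent in top_k_percentages(logits):
    print(index, round(percent, 2))
```

The indices returned correspond to positions in the logit vector, so mapping them to class names requires the label list of whatever dataset the classifier head was trained on.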

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_huge_patch14_224.mae',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 257, 1280) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
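
The step from the unpooled (1, 257, 1280) tensor to a (1, num_features) vector is a pooling over the token axis. Which pooling timm applies depends on the model's configuration, so the sketch below shows the two usual options, class-token selection and patch-token averaging, on a tiny made-up sequence. The helper names are hypothetical, and the token layout (class token first) is an assumption.

```python
def pool_class_token(tokens):
    """Use the first token (assumed to be the class token) as the embedding."""
    return list(tokens[0])


def pool_patch_average(tokens):
    """Average the remaining patch tokens feature by feature."""
    patches = tokens[1:]
    dim = len(patches[0])
    return [sum(tok[d] for tok in patches) / len(patches) for d in range(dim)]


# one image: 1 class token + 4 patch tokens, 3 features each (toy sizes)
tokens = [
    [1.0, 1.0, 1.0],  # class token
    [0.0, 2.0, 4.0],
    [2.0, 2.0, 0.0],
    [4.0, 2.0, 0.0],
    [2.0, 2.0, 4.0],
]
print(pool_class_token(tokens))    # [1.0, 1.0, 1.0]
print(pool_patch_average(tokens))  # [2.0, 2.0, 2.0]
```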

📚 Detailed documentation

Model details

Attribute	Details
Model type	Image classification / feature backbone
Model stats	Params (M): 630.8; GMACs: 167.4; Activations (M): 139.4; Image size: 224 x 224
Papers	Masked Autoencoders Are Scalable Vision Learners: https://arxiv.org/abs/2111.06377; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Pretraining dataset	ImageNet-1k
Original code	https://github.com/facebookresearch/mae
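
The parameter count in the table gives a rough lower bound on the memory needed just to hold the weights. This back-of-the-envelope calculation assumes 4 bytes per parameter for float32 and 2 for float16, and ignores activations and framework overhead.

```python
params_millions = 630.8  # from the model stats above


def weight_memory_gb(params_millions, bytes_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return params_millions * 1e6 * bytes_per_param / 1e9


print(round(weight_memory_gb(params_millions, 4), 2))  # float32: 2.52
print(round(weight_memory_gb(params_millions, 2), 2))  # float16: 1.26
```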

Model comparison

You can explore the dataset and runtime metrics of this model in the timm model results.

Citation

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}