vit_base_patch32_224.orig_in21k: Open-Source Image Classification Model

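Developed by timm

An image classification model based on the Vision Transformer (ViT), pretrained on ImageNet-21k and suited to feature extraction and fine-tuning.

Image classification

Transformers

License: Apache-2.0 #ViT backbone #ImageNet-21k pretraining #Headless feature extraction

Downloads: 438

Released: 11/17/2023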

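Model Overview

This model is a Vision Transformer image classification backbone. The paper's authors pretrained it on ImageNet-21k in JAX, and it was later ported to PyTorch. It ships without a classification head, which makes it suitable for feature extraction and for fine-tuning on downstream tasks.

Model Features

Transformer-based architecture

Uses the Vision Transformer architecture, which splits each image into 32x32-pixel patches for processing, and is suited to large-scale image recognition.

Pretrained weights

Pretrained on the large-scale ImageNet-21k dataset, giving it strong general-purpose feature representations.

Flexible feature extraction

The model has no classification head, so it can be used directly for feature extraction or fine-tuned for downstream tasks.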
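
The patch arithmetic behind this architecture can be sketched in a few lines. The helper name `vit_token_count` is hypothetical, and the single prefix token is an assumption based on the standard ViT class token; the result is consistent with the (1, 50, 768) unpooled shape reported in the usage example on this page.

```python
def vit_token_count(image_size: int = 224, patch_size: int = 32,
                    num_prefix_tokens: int = 1) -> int:
    """Return the sequence length a ViT sees for a square input image.

    num_prefix_tokens=1 assumes a single class token is prepended.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # patches along one edge
    return patches_per_side ** 2 + num_prefix_tokens


tokens = vit_token_count()    # this model: 224 x 224 input, 32 x 32 patches
patch_values = 3 * 32 * 32    # raw RGB values in one flattened patch
print(tokens, patch_values)
```

A larger patch size means fewer tokens per image, which is why a patch-32 model is cheaper to run than a patch-16 model at the same resolution.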

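Model Capabilities

Image feature extraction

Image classification

Transfer learning

Use Cases

Computer vision

Image classification

Fine-tune the pretrained model into a classifier for a specific domain. Because the checkpoint has no classification head, a head must be added and trained before the model can predict classes.

Feature extraction

Extract high-level feature representations of images for downstream tasks such as object detection and image retrieval.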
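
As an illustration of the image retrieval use case, here is a minimal sketch that ranks a small gallery by cosine similarity to a query embedding. The function names, the gallery labels, and the 4-dimensional vectors are all hypothetical stand-ins; real features from this model would be 768-dimensional and would come from the feature extraction call shown in the usage examples.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in gallery.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings (hypothetical values, not real model output)
gallery = {
    "beignets": [0.9, 0.1, 0.0, 0.2],
    "bicycle": [0.0, 0.8, 0.6, 0.1],
    "croissant": [0.7, 0.2, 0.1, 0.4],
}
query = [0.8, 0.1, 0.0, 0.3]

ranking = rank_gallery(query, gallery)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Cosine similarity ignores vector magnitude, which is a common choice for comparing embeddings; other distance measures may suit a given dataset better.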

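🚀 vit_base_patch32_224.orig_in21k Model

A Vision Transformer (ViT) image classification model. It was pretrained on ImageNet-21k in JAX by the paper's authors and ported to PyTorch by Ross Wightman. This model has no classification head and is useful only for feature extraction and fine-tuning.

🚀 Quick Start

The model can be used for image feature extraction and fine-tuning. Usage examples follow.

💻 Usage Examples

Basic usage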

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_base_patch32_224.orig_in21k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

# NOTE: this checkpoint is described as having no classification head, so the
# values below likely rank feature dimensions, not class labels, until a head is fine-tuned
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

Advanced usage

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch32_224.orig_in21k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 50, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
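
The examples above cover feature extraction only. For fine-tuning, a minimal sketch follows, assuming that passing a non-zero `num_classes` to `timm.create_model` attaches a freshly initialised linear head (the same argument is set to 0 above to remove it) and that the model exposes `get_classifier()`. The helper names `build_finetune_model` and `freeze_backbone`, and the class count of 10, are hypothetical.

```python
def build_finetune_model(num_classes: int, pretrained: bool = True):
    """Create the backbone with a new, untrained classification head."""
    import timm  # imported lazily so this helper can be defined without timm installed

    return timm.create_model(
        'vit_base_patch32_224.orig_in21k',
        pretrained=pretrained,
        num_classes=num_classes,  # size of the new head (assumed timm behaviour)
    )


def freeze_backbone(model):
    """Leave only the classifier head trainable (linear probing)."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model.get_classifier().parameters():
        param.requires_grad = True
    return model


# Example (downloads pretrained weights, so it is left commented out):
# model = freeze_backbone(build_finetune_model(num_classes=10))
```

Freezing the backbone trains only the head, which is cheap and a reasonable first baseline; unfreezing all parameters with a small learning rate is the usual next step when more accuracy is needed.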

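✨ Key Features

Based on the Vision Transformer (ViT) architecture for image classification tasks.
Pretrained on the ImageNet-21k dataset.
No classification head; usable for feature extraction and fine-tuning.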

📚 Documentation

Model Details

Property	Details
Model type	Image classification / feature backbone
Parameters (M)	87.5
GMACs	4.4
Activations (M)	4.2
Image size	224 x 224
Paper	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Training dataset	ImageNet-21k
Original repository	https://github.com/google-research/vision_transformer
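
The parameter count in the table can be roughly reproduced by hand. The sketch below assumes the standard ViT-Base configuration from the cited paper (12 blocks, width 768, MLP ratio 4), a single class token, biases on every linear layer, and no classification or pre-logits layer. Apart from the 768 width, these values are not stated on this page, and the function name `vit_param_count` is hypothetical.

```python
def vit_param_count(image_size=224, patch_size=32, in_chans=3,
                    dim=768, depth=12, mlp_ratio=4):
    """Count learnable parameters of a headless ViT with a class token."""
    num_tokens = (image_size // patch_size) ** 2 + 1  # patches + class token
    hidden = dim * mlp_ratio

    patch_embed = in_chans * patch_size * patch_size * dim + dim  # projection weight + bias
    cls_token = dim
    pos_embed = num_tokens * dim

    layer_norm = 2 * dim                                        # scale + shift
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)   # qkv + output projection
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)        # two linear layers
    block = 2 * layer_norm + attention + mlp

    final_norm = layer_norm
    return patch_embed + cls_token + pos_embed + depth * block + final_norm


total = vit_param_count()
print(f"{total:,} parameters, about {total / 1e6:.1f}M")
```

Under these assumptions the total rounds to the 87.5 million listed in the table, with almost all of it in the 12 transformer blocks.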

Model Comparison

Explore the dataset and runtime metrics of this model in the timm model results.

Citation

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}