
vit_large_patch32_224.orig_in21k

Developed by timm
An image classification model based on the Vision Transformer (ViT) architecture, pretrained on the ImageNet-21k dataset and suited to feature extraction and fine-tuning.
Downloads 771
Release Time: 12/22/2022

Model Overview

This is a ViT-Large model (32×32 patches, 224×224 input) developed by Google Research, used primarily for image classification and feature extraction. The pretrained weights do not include a classification head, so the model is best used as a backbone for fine-tuning or for extracting features.
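The snippet below is a minimal sketch of loading these weights through timm and inspecting the backbone output; the dummy input tensor is a placeholder, and the printed shape assumes the ViT-Large embedding size of 1024.

```python
# Minimal sketch: load the backbone via timm and inspect its output.
# The dummy input below is a placeholder, not real data.
import timm
import torch

# num_classes=0 keeps only the backbone, since these weights ship
# without a classification head.
model = timm.create_model(
    "vit_large_patch32_224.orig_in21k",
    pretrained=True,
    num_classes=0,
)
model.eval()

# 224x224 RGB input, as indicated by the model name.
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = model(dummy)

print(features.shape)  # expected: torch.Size([1, 1024]) for ViT-Large
```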

Model Features

Large-Scale Pretraining
Pretrained on ImageNet-21k (roughly 14 million images across about 21,000 classes), giving the backbone strong general-purpose feature extraction
Transformer Architecture
Processes images with a pure Transformer architecture operating on patch embeddings, with no convolutional backbone
High Compatibility
Weights ported from the original JAX implementation to PyTorch, so the model drops directly into the PyTorch ecosystem
Flexible Application
Can serve as a feature extractor or as a base model for fine-tuning; the classification head can be omitted entirely (see the fine-tuning sketch after this list)
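Below is a minimal fine-tuning sketch under assumed conditions: NUM_CLASSES, the random placeholder batch, and the choice to freeze the backbone are illustrative, not part of this card.

```python
# Minimal fine-tuning sketch; dataset loading is omitted and the batch
# below is a random placeholder.
import timm
import torch
import torch.nn.functional as F

NUM_CLASSES = 10  # hypothetical number of target classes

# Passing num_classes attaches a freshly initialised linear head on top
# of the pretrained ViT backbone.
model = timm.create_model(
    "vit_large_patch32_224.orig_in21k",
    pretrained=True,
    num_classes=NUM_CLASSES,
)

# Optionally freeze the backbone and train only the new head
# (timm's ViT names its classifier "head").
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

images = torch.randn(4, 3, 224, 224)          # placeholder batch
labels = torch.randint(0, NUM_CLASSES, (4,))  # placeholder labels

logits = model(images)
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

Freezing the backbone keeps training cheap for small datasets; with more data, unfreezing all parameters and using a lower learning rate is a common alternative.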

Model Capabilities

Image Feature Extraction
Image Classification
Transfer Learning
Computer Vision Tasks

Use Cases

Image Classification
General Image Classification
Classifies images across a broad range of general-purpose categories
ImageNet-21k pretraining gives broad category coverage; a classification head must be added and fine-tuned for a specific label set
Feature Extraction
Downstream Task Feature Extraction
Provides high-quality image features for other computer vision tasks
Generates 1024-dimensional feature vectors (the ViT-Large embedding size) suitable for various downstream tasks, as shown in the sketch below
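As a sketch of the downstream feature-extraction use case, the code below builds the matching preprocessing pipeline and extracts a 1024-dimensional vector; the image path is a placeholder, and resolve_model_data_config / create_transform assume a recent timm release.

```python
# Minimal feature-extraction sketch; "example.jpg" is a placeholder path.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_large_patch32_224.orig_in21k", pretrained=True, num_classes=0
)
model.eval()

# Build the preprocessing pipeline matching the pretrained weights
# (available in recent timm releases).
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

img = Image.open("example.jpg").convert("RGB")
x = transform(img).unsqueeze(0)       # shape: (1, 3, 224, 224)

with torch.no_grad():
    feats = model(x)                  # pooled backbone features

print(feats.shape)                    # torch.Size([1, 1024])
```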