ONNX conversion of ViT (base-sized model)
This is an ONNX conversion of ViT-base, which includes a classification head on top for image classification.
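For reference, a conversion like this one can be produced with Optimum's ONNX export. The sketch below is not the exact procedure used for this repository; the source checkpoint name, output directory, and the export=True argument (available in recent Optimum versions) are assumptions:

# Sketch: export a PyTorch ViT checkpoint to ONNX with Optimum (not the exact
# procedure used for this repository; the source checkpoint is an assumption).
from optimum.onnxruntime import ORTModelForImageClassification

ort_model = ORTModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", export=True  # convert the PyTorch weights to ONNX
)
ort_model.save_pretrained("vit-base-patch16-224-onnx")  # writes model.onnx and configs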
Quick Start
This Vision Transformer (ViT) model was pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224, and then fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at the same resolution. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. The weights were converted from the timm repository by Ross Wightman.
Disclaimer: The team releasing ViT did not write a model card for this model, so this model card is written by the Hugging Face team.
Features
Model description
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT) pretrained on a large image collection (ImageNet-21k) in a supervised way at a resolution of 224x224 pixels. Then, it was fine-tuned on ImageNet (ILSVRC2012), a dataset with 1 million images and 1,000 classes, also at 224x224 resolution.
Images are presented to the model as a sequence of fixed-size patches (16x16 resolution), which are linearly embedded. A [CLS] token is added at the start of the sequence for classification tasks. Absolute position embeddings are added before feeding the sequence to the Transformer encoder layers.
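For concreteness, the patch layout above gives the following sequence length for a 224x224 input (plain arithmetic, shown only as an illustration):

# Sequence length implied by 16x16 patches on a 224x224 image.
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
seq_len = num_patches + 1                      # plus the [CLS] token = 197 tokens
print(num_patches, seq_len)                    # 196 197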
Through pre-training, the model learns an internal representation of images, which can be used to extract features for downstream tasks. For example, if you have a labeled image dataset, you can train a standard classifier by adding a linear layer on top of the pre-trained encoder. Usually, a linear layer is placed on top of the [CLS] token, as its last hidden state can represent the whole image.
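As an illustration of feature extraction, the sketch below reads off the [CLS] embedding using the PyTorch checkpoint google/vit-base-patch16-224-in21k; this is a minimal example under that assumption, not part of this ONNX repository:

# Minimal sketch: use the pre-trained encoder as a feature extractor and take the
# last hidden state of the [CLS] token as an image-level representation.
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

outputs = model(**processor(images=image, return_tensors="pt"))
print(outputs.last_hidden_state.shape)           # (1, 197, 768): 196 patches + [CLS]
cls_embedding = outputs.last_hidden_state[:, 0]  # input to a downstream linear classifier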
Intended uses & limitations
You can use the raw model for image classification. Check the model hub for fine-tuned versions on tasks that interest you.
Usage Examples
Basic Usage
Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
from transformers import AutoFeatureExtractor
from optimum.onnxruntime import ORTModelForImageClassification
from optimum.pipelines import pipeline

# Load the preprocessor and the ONNX model, executed with ONNX Runtime
feature_extractor = AutoFeatureExtractor.from_pretrained("optimum/vit-base-patch16-224")
model = ORTModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224")

# Build an image-classification pipeline backed by the ONNX model
onnx_img_classif = pipeline(
    "image-classification", model=model, feature_extractor=feature_extractor
)

# Classify an image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
pred = onnx_img_classif(url)
print("Top-5 predicted classes:", pred)
Documentation
Training data
The ViT model was pretrained on ImageNet-21k, a dataset with 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset with 1 million images and 1k classes.
Training procedure
Preprocessing
The exact details of image preprocessing during training/validation can be found here.
Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
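The same transformation can be written out explicitly. The following is a rough torchvision equivalent of the preprocessing described above (a sketch, not the exact training pipeline):

# Approximate preprocessing: resize to 224x224, convert to a [0, 1] tensor, then
# normalize each RGB channel with mean 0.5 and std 0.5 (i.e. map values to [-1, 1]).
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])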
Pretraining
The model was trained on TPUv3 hardware (8 cores). All model variants were trained with a batch size of 4096 and a learning-rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at a global norm of 1. The training resolution is 224x224.
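As a rough PyTorch illustration only (the authors trained in JAX on TPUs), the fragment below shows what a 10k-step linear learning-rate warmup and gradient clipping at a global norm of 1 look like in code; the tiny placeholder model, optimizer, and learning rate are assumptions, not values from the paper:

# Schematic only: linear warmup over 10k steps and global-norm gradient clipping at 1.0.
import torch

model = torch.nn.Linear(768, 1000)                         # placeholder model, not ViT
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder learning rate
warmup_steps = 10_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

x, y = torch.randn(8, 768), torch.randint(0, 1000, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at global norm 1
optimizer.step()
scheduler.step()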
Evaluation results
For evaluation results on several image classification benchmarks, refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Of course, increasing the model size will lead to better performance.
License
The model is licensed under the Apache-2.0 license.
BibTeX entry and citation info
@misc{dosovitskiy2020image,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
  year={2020},
  eprint={2010.11929},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
@inproceedings{deng2009imagenet,
  title={ImageNet: A large-scale hierarchical image database},
  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
  booktitle={2009 IEEE Conference on Computer Vision and Pattern Recognition},
  pages={248--255},
  year={2009},
  organization={IEEE}
}
| Property | Details |
|----------|---------|
| Model Type | ONNX conversion of ViT (base-sized model) |
| Training Data | Pretrained on ImageNet-21k (14 million images, 21,843 classes), fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) |