Vision Transformer (base-sized model)
The Vision Transformer (ViT) is a model pre-trained on ImageNet-21k and fine-tuned on ImageNet 2012, and can be used for image classification tasks.
Quick Start
The Vision Transformer (ViT) model is pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224, and then fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at the same resolution. It was first introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and released in this repository. The weights were converted from the timm repository by Ross Wightman.
Disclaimer: The team releasing ViT did not write a model card for this model, so this model card is written by the Hugging Face team.
Features
- Powerful Representation: By pre-training on a large-scale image dataset (ImageNet-21k), the model learns a strong inner representation of images, which is useful for downstream image-related tasks.
- Fine-Tunable: The model can be fine-tuned on other datasets, such as ImageNet 2012, to adapt to specific tasks.
Installation
No model-specific installation steps are required. The usage example below only assumes the transformers library is installed, along with PyTorch as the backend and Pillow and requests for loading the example image.
Usage Examples
Basic Usage
Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests
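# download an example image from the COCO 2017 validation set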
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
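# load the image processor and the fine-tuned classification model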
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
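# preprocess the image and run it through the model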
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
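# pick the highest-scoring class among the 1,000 ImageNet classes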
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
For more code examples, refer to the documentation.
Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images (ImageNet-21k) in a supervised manner at a resolution of 224x224 pixels. Then, it is fine-tuned on ImageNet (ILSVRC2012), a dataset with 1 million images and 1,000 classes, also at 224x224 resolution.
Images are presented to the model as a sequence of fixed-size patches (16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for classification tasks, and absolute position embeddings are added before feeding the sequence to the Transformer encoder layers.
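As a quick sanity check (a sketch added here, not from the original card), the patch arithmetic for the default 224x224 input with 16x16 patches works out to 196 patch tokens plus the [CLS] token:

image_size = 224
patch_size = 16
num_patches = (image_size // patch_size) ** 2  # 14 x 14 = 196 patches
sequence_length = num_patches + 1              # plus the [CLS] token = 197
print(num_patches, sequence_length)            # 196 197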
Through pre - training, the model learns an inner representation of images, which can be used to extract features for downstream tasks. For example, a linear layer can be placed on top of the pre - trained encoder to train a standard classifier. Usually, a linear layer is placed on top of the [CLS] token, as its last hidden state can represent an entire image.
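As an illustration of feature extraction, the following is a minimal sketch that uses ViTModel (the encoder without the classification head) and takes the last hidden state of the [CLS] token as a pooled image representation; the ImageNet-21k-only checkpoint google/vit-base-patch16-224-in21k is assumed here, since it is the one typically used for feature extraction:

from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# reuse the same COCO example image as above
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# last_hidden_state has shape (batch, 197, 768): 196 patch tokens plus the [CLS] token
cls_features = outputs.last_hidden_state[:, 0]
print(cls_features.shape)  # torch.Size([1, 768])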
Intended uses & limitations
You can use the raw model for image classification. Check the model hub to find fine-tuned versions for tasks that interest you.
Training data
The ViT model was pretrained on ImageNet-21k, a dataset with 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset with 1 million images and 1k classes.
Training procedure
Preprocessing
The exact details of image preprocessing during training/validation can be found here. Images are resized/rescaled to 224x224 and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
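With both the mean and the standard deviation set to 0.5 per channel, normalization maps rescaled pixel values from [0, 1] to [-1, 1]. A minimal sketch, assuming the ViTImageProcessor defaults match the values above:

from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
print(processor.size)        # expected: {'height': 224, 'width': 224}
print(processor.image_mean)  # expected: [0.5, 0.5, 0.5]
print(processor.image_std)   # expected: [0.5, 0.5, 0.5]

# per-channel normalization: (pixel - mean) / std
print((0.0 - 0.5) / 0.5)  # -1.0, darkest pixel after rescaling to [0, 1]
print((1.0 - 0.5) / 0.5)  #  1.0, brightest pixel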
Pretraining
The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and a learning-rate warmup of 10k steps. For ImageNet, the authors found it beneficial to apply gradient clipping at a global norm of 1. The training resolution is 224.
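Summarized as a plain dictionary (names chosen here for illustration only, not an actual training script), the reported pre-training setup is:

# illustrative summary of the reported hyperparameters, not a runnable training setup
pretraining_setup = {
    "hardware": "TPUv3 (8 cores)",
    "batch_size": 4096,
    "warmup_steps": 10_000,
    "gradient_clipping_global_norm": 1.0,  # found beneficial for ImageNet
    "training_resolution": 224,
}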
Evaluation results
For evaluation results on several image classification benchmarks, refer to Tables 2 and 5 of the original paper. Note that for fine-tuning, better results are obtained with a higher resolution (384x384), and increasing the model size generally leads to better performance.
Technical Details
- Model Architecture: Transformer encoder model (BERT-like); see the configuration sketch after this list.
- Input Representation: Images are split into 16x16 patches and linearly embedded.
- Training Process: Pretrained on ImageNet-21k and fine-tuned on ImageNet.
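These details can be checked against the model configuration; a minimal sketch (the values in the comments are the ViT-Base defaults):

from transformers import ViTConfig

config = ViTConfig.from_pretrained('google/vit-base-patch16-224')
print(config.patch_size)           # 16
print(config.image_size)           # 224
print(config.hidden_size)          # 768
print(config.num_hidden_layers)    # 12
print(config.num_attention_heads)  # 12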
License
This model is licensed under the Apache-2.0 license.
BibTeX entry and citation info
@misc{dosovitskiy2020image,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
year={2020},
eprint={2010.11929},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{deng2009imagenet,
title={Imagenet: A large-scale hierarchical image database},
author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
booktitle={2009 IEEE conference on computer vision and pattern recognition},
pages={248--255},
year={2009},
organization={IEEE}
}