Vision Transformer (base-sized model)
The Vision Transformer (ViT) is a model pre-trained on ImageNet-21k and fine-tuned on ImageNet 2012, and can be used for image classification tasks.
Quick Start
The Vision Transformer (ViT) model is pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224, and then fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at the same resolution. It was first introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and released in this repository. The weights were converted from the timm repository by Ross Wightman.
Disclaimer: The team releasing ViT did not write a model card for this model, so this model card is written by the Hugging Face team.
Features
- Powerful Representation: By pre-training on a large-scale image dataset (ImageNet-21k), the model learns a strong inner representation of images, which is useful for downstream image-related tasks.
- Fine-Tunable: The model can be fine-tuned on other datasets, such as ImageNet 2012, to adapt to specific tasks.
Installation
No model-specific installation steps are required. The usage example below only assumes the transformers library is installed, along with PyTorch as the backend and Pillow and requests for loading the example image.
Usage Examples
Basic Usage
Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests
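# download an example image from the COCO 2017 validation set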
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
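# load the image processor and the fine-tuned classification model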
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
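# preprocess the image and run it through the model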
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
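# pick the highest-scoring class among the 1,000 ImageNet classes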
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
For more code examples, refer to the documentation.
Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images (ImageNet-21k) in a supervised manner at a resolution of 224x224 pixels. Then, it is fine-tuned on ImageNet (ILSVRC2012), a dataset with 1 million images and 1,000 classes, also at 224x224 resolution.
Images are presented to the model as a sequence of fixed-size patches (16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for classification tasks, and absolute position embeddings are added before feeding the sequence to the Transformer encoder layers.
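As a quick sanity check (a sketch added here, not from the original card), the patch arithmetic for the default 224x224 input with 16x16 patches works out to 196 patch tokens plus the [CLS] token:

image_size = 224
patch_size = 16
num_patches = (image_size // patch_size) ** 2  # 14 x 14 = 196 patches
sequence_length = num_patches + 1              # plus the [CLS] token = 197
print(num_patches, sequence_length)            # 196 197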
Through pre - training, the model learns an inner representation of images, which can be used to extract features for downstream tasks. For example, a linear layer can be placed on top of the pre - trained encoder to train a standard classifier. Usually, a linear layer is placed on top of the [CLS] token, as its last hidden state can represent an entire image.
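As an illustration of feature extraction, the following is a minimal sketch that uses ViTModel (the encoder without the classification head) and takes the last hidden state of the [CLS] token as a pooled image representation; the ImageNet-21k-only checkpoint google/vit-base-patch16-224-in21k is assumed here, since it is the one typically used for feature extraction:

from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# reuse the same COCO example image as above
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# last_hidden_state has shape (batch, 197, 768): 196 patch tokens plus the [CLS] token
cls_features = outputs.last_hidden_state[:, 0]
print(cls_features.shape)  # torch.Size([1, 768])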
Intended uses & limitations
You can use the raw model for image classification. Check the model hub to find fine-tuned versions for tasks that interest you.
Training data
The ViT model was pretrained on ImageNet-21k, a dataset with 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset with 1 million images and 1k classes.
Training procedure
Preprocessing
The exact details of image preprocessing during training/validation can be found here. Images are resized/rescaled to 224x224 and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
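With both the mean and the standard deviation set to 0.5 per channel, normalization maps rescaled pixel values from [0, 1] to [-1, 1]. A minimal sketch, assuming the ViTImageProcessor defaults match the values above:

from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
print(processor.size)        # expected: {'height': 224, 'width': 224}
print(processor.image_mean)  # expected: [0.5, 0.5, 0.5]
print(processor.image_std)   # expected: [0.5, 0.5, 0.5]

# per-channel normalization: (pixel - mean) / std
print((0.0 - 0.5) / 0.5)  # -1.0, darkest pixel after rescaling to [0, 1]
print((1.0 - 0.5) / 0.5)  #  1.0, brightest pixel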
Pretraining
The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and a learning-rate warmup of 10k steps. For ImageNet, the authors found it beneficial to apply gradient clipping at a global norm of 1. The training resolution is 224.
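Summarized as a plain dictionary (names chosen here for illustration only, not an actual training script), the reported pre-training setup is:

# illustrative summary of the reported hyperparameters, not a runnable training setup
pretraining_setup = {
    "hardware": "TPUv3 (8 cores)",
    "batch_size": 4096,
    "warmup_steps": 10_000,
    "gradient_clipping_global_norm": 1.0,  # found beneficial for ImageNet
    "training_resolution": 224,
}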
Evaluation results
For evaluation results on several image classification benchmarks, refer to Tables 2 and 5 of the original paper. Note that for fine-tuning, better results are obtained with a higher resolution (384x384), and increasing the model size generally leads to better performance.
Technical Details
- Model Architecture: Transformer encoder model (BERT-like); see the configuration sketch after this list.
- Input Representation: Images are split into 16x16 patches and linearly embedded.
- Training Process: Pretrained on ImageNet-21k and fine-tuned on ImageNet.
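These details can be checked against the model configuration; a minimal sketch (the values in the comments are the ViT-Base defaults):

from transformers import ViTConfig

config = ViTConfig.from_pretrained('google/vit-base-patch16-224')
print(config.patch_size)           # 16
print(config.image_size)           # 224
print(config.hidden_size)          # 768
print(config.num_hidden_layers)    # 12
print(config.num_attention_heads)  # 12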
License
This model is licensed under the Apache-2.0 license.
BibTeX entry and citation info
@misc{dosovitskiy2020image,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
year={2020},
eprint={2010.11929},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{deng2009imagenet,
title={Imagenet: A large-scale hierarchical image database},
author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
booktitle={2009 IEEE conference on computer vision and pattern recognition},
pages={248--255},
year={2009},
organization={IEEE}
}