ONNX conversion of ViT (base-sized model)
This is an ONNX conversion of ViT-base, which includes a classification head on top for image classification.
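For reference, a conversion like this one can be produced with Optimum's ONNX export. The sketch below is not the exact procedure used for this repository; the source checkpoint name, output directory, and the export=True argument (available in recent Optimum versions) are assumptions:

# Sketch: export a PyTorch ViT checkpoint to ONNX with Optimum (not the exact
# procedure used for this repository; the source checkpoint is an assumption).
from optimum.onnxruntime import ORTModelForImageClassification

ort_model = ORTModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", export=True  # convert the PyTorch weights to ONNX
)
ort_model.save_pretrained("vit-base-patch16-224-onnx")  # writes model.onnx and configs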
Quick Start
This Vision Transformer (ViT) model was pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224, and then fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at the same resolution. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. The weights were converted from the timm repository by Ross Wightman.
Disclaimer: The team releasing ViT did not write a model card for this model, so this model card is written by the Hugging Face team.
Features
Model description
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT) pretrained on a large image collection (ImageNet-21k) in a supervised way at a resolution of 224x224 pixels. Then, it was fine-tuned on ImageNet (ILSVRC2012), a dataset with 1 million images and 1,000 classes, also at 224x224 resolution.
Images are presented to the model as a sequence of fixed-size patches (16x16 resolution), which are linearly embedded. A [CLS] token is added at the start of the sequence for classification tasks. Absolute position embeddings are added before feeding the sequence to the Transformer encoder layers.
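For concreteness, the patch layout above gives the following sequence length for a 224x224 input (plain arithmetic, shown only as an illustration):

# Sequence length implied by 16x16 patches on a 224x224 image.
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
seq_len = num_patches + 1                      # plus the [CLS] token = 197 tokens
print(num_patches, seq_len)                    # 196 197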
Through pre-training, the model learns an internal representation of images, which can be used to extract features for downstream tasks. For example, if you have a labeled image dataset, you can train a standard classifier by adding a linear layer on top of the pre-trained encoder. Usually, a linear layer is placed on top of the [CLS] token, as its last hidden state can represent the whole image.
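As an illustration of feature extraction, the sketch below reads off the [CLS] embedding using the PyTorch checkpoint google/vit-base-patch16-224-in21k; this is a minimal example under that assumption, not part of this ONNX repository:

# Minimal sketch: use the pre-trained encoder as a feature extractor and take the
# last hidden state of the [CLS] token as an image-level representation.
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

outputs = model(**processor(images=image, return_tensors="pt"))
print(outputs.last_hidden_state.shape)           # (1, 197, 768): 196 patches + [CLS]
cls_embedding = outputs.last_hidden_state[:, 0]  # input to a downstream linear classifier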
Intended uses & limitations
You can use the raw model for image classification. Check the model hub for fine-tuned versions on tasks that interest you.
Usage Examples
Basic Usage
Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
from transformers import AutoFeatureExtractor
from optimum.onnxruntime import ORTModelForImageClassification
from optimum.pipelines import pipeline

# Load the preprocessor and the ONNX model, executed with ONNX Runtime
feature_extractor = AutoFeatureExtractor.from_pretrained("optimum/vit-base-patch16-224")
model = ORTModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224")

# Build an image-classification pipeline backed by the ONNX model
onnx_img_classif = pipeline(
    "image-classification", model=model, feature_extractor=feature_extractor
)

# Classify an image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
pred = onnx_img_classif(url)
print("Top-5 predicted classes:", pred)
Documentation
Training data
The ViT model was pretrained on ImageNet-21k, a dataset with 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset with 1 million images and 1k classes.
Training procedure
Preprocessing
The exact details of image preprocessing during training/validation can be found here.
Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
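The same transformation can be written out explicitly. The following is a rough torchvision equivalent of the preprocessing described above (a sketch, not the exact training pipeline):

# Approximate preprocessing: resize to 224x224, convert to a [0, 1] tensor, then
# normalize each RGB channel with mean 0.5 and std 0.5 (i.e. map values to [-1, 1]).
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])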
Pretraining
The model was trained on TPUv3 hardware (8 cores). All model variants were trained with a batch size of 4096 and a learning-rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at a global norm of 1. The training resolution is 224x224.
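As a rough PyTorch illustration only (the authors trained in JAX on TPUs), the fragment below shows what a 10k-step linear learning-rate warmup and gradient clipping at a global norm of 1 look like in code; the tiny placeholder model, optimizer, and learning rate are assumptions, not values from the paper:

# Schematic only: linear warmup over 10k steps and global-norm gradient clipping at 1.0.
import torch

model = torch.nn.Linear(768, 1000)                         # placeholder model, not ViT
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder learning rate
warmup_steps = 10_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

x, y = torch.randn(8, 768), torch.randint(0, 1000, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at global norm 1
optimizer.step()
scheduler.step()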
Evaluation results
For evaluation results on several image classification benchmarks, refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Of course, increasing the model size will lead to better performance.
License
The model is licensed under the Apache-2.0 license.
BibTeX entry and citation info
@misc{dosovitskiy2020image,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
  year={2020},
  eprint={2010.11929},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
@inproceedings{deng2009imagenet,
  title={ImageNet: A large-scale hierarchical image database},
  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
  booktitle={2009 IEEE Conference on Computer Vision and Pattern Recognition},
  pages={248--255},
  year={2009},
  organization={IEEE}
}
| Property | Details |
|----------|---------|
| Model Type | ONNX conversion of ViT (base-sized model) |
| Training Data | Pretrained on ImageNet-21k (14 million images, 21,843 classes), fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) |