Vision Transformer (base-sized model) - Hybrid
The hybrid Vision Transformer (ViT) is a model that combines a convolutional backbone with a Transformer encoder, achieving excellent results in image classification tasks with fewer computational resources.
Quick Start
The hybrid Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy et al. It's the first paper that successfully trains a Transformer encoder on ImageNet, outperforming familiar convolutional architectures. ViT hybrid is a variant of the plain Vision Transformer, using a convolutional backbone (BiT) to generate initial "tokens" for the Transformer.
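As a quick illustration of this hybrid design, the sketch below loads the checkpoint's configuration and shows that the Transformer encoder settings sit alongside a separate configuration for the convolutional (BiT) backbone. This assumes the ViTHybridConfig API available in recent versions of Transformers:
from transformers import ViTHybridConfig
# Load the configuration of the hybrid checkpoint (illustrative sketch)
config = ViTHybridConfig.from_pretrained('google/vit-hybrid-base-bit-384')
print(config.hidden_size, config.num_hidden_layers)  # Transformer encoder settings
print(type(config.backbone_config).__name__)         # configuration of the convolutional (BiT) backbone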
Disclaimer: The team releasing ViT did not write a model card for this model, so this model card is written by the Hugging Face team.
Features
- Breakthrough in Vision: Demonstrates that a pure Transformer on image patches can excel in image classification, reducing reliance on CNNs.
- Resource-Efficient: Achieves excellent results on multiple benchmarks with fewer computational resources for training.
Installation
No specific installation steps are provided in the original document.
Usage Examples
Basic Usage
Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
from transformers import ViTHybridImageProcessor, ViTHybridForImageClassification
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
feature_extractor = ViTHybridImageProcessor.from_pretrained('google/vit-hybrid-base-bit-384')
model = ViTHybridForImageClassification.from_pretrained('google/vit-hybrid-base-bit-384')
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
# Expected output: Predicted class: tabby, tabby cat
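As an illustrative extension of the snippet above (not part of the original card), the logits can also be converted to probabilities to inspect the top-5 predicted classes:
import torch

with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and report the five most likely ImageNet classes
probs = outputs.logits.softmax(dim=-1)[0]
top5 = probs.topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{model.config.id2label[idx.item()]}: {score.item():.3f}")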
For more code examples, refer to the documentation.
Documentation
Model description
While the Transformer architecture has become the standard in natural language processing, its use in computer vision has been more limited. This model shows that a pure Transformer applied to image patches can perform well in image classification. When pre-trained on large amounts of data and transferred to multiple benchmarks, it outperforms state-of-the-art CNNs while requiring fewer resources to train.
Intended uses & limitations
You can use the raw model for image classification. Check the model hub for fine-tuned versions for your specific task.
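If you want to fine-tune the model yourself, the sketch below shows one common pattern: reload the checkpoint with a freshly initialized classification head sized for your own labels. The label names here are placeholders and the approach is illustrative, not taken from the original card.
from transformers import ViTHybridForImageClassification

labels = ['cat', 'dog']  # placeholder labels for a hypothetical downstream task
model = ViTHybridForImageClassification.from_pretrained(
    'google/vit-hybrid-base-bit-384',
    num_labels=len(labels),
    id2label={i: label for i, label in enumerate(labels)},
    label2id={label: i for i, label in enumerate(labels)},
    ignore_mismatched_sizes=True,  # discard the 1,000-class ImageNet head and initialize a new one
)
# The model can then be trained with the Trainer API or a plain PyTorch loop
# on batches of (pixel_values, labels).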
Technical Details
Training data
The ViT-Hybrid model was pretrained on ImageNet-21k, a dataset with 14 million images and 21k classes, and fine-tuned on ImageNet, which has 1 million images and 1k classes.
Training procedure
Preprocessing
The exact preprocessing details used during training and validation can be found here. Images are resized/rescaled to 224x224 and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
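For illustration, the normalization described above corresponds roughly to the following torchvision pipeline; in practice ViTHybridImageProcessor applies these steps for you, and the exact resize behaviour may differ per checkpoint:
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                                      # resize to the training resolution
    transforms.ToTensor(),                                              # rescale pixel values to [0, 1]
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),    # per-channel normalization
])

pixel_values = preprocess(image).unsqueeze(0)  # 'image' is a PIL image; result shape: (1, 3, 224, 224)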
Pretraining
The model was trained on TPUv3 hardware (8 cores). All variants are trained with a batch size of 4096 and a 10k-step learning rate warmup. For ImageNet, gradient clipping at global norm 1 is beneficial. The training resolution is 224.
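Expressed as a PyTorch sketch, the reported settings look roughly like the following. Only the batch size, the 10k-step warmup, and the global-norm clipping come from the card; the optimizer choice and learning rate are assumptions for illustration.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)  # optimizer and learning rate are illustrative assumptions
warmup_steps = 10_000  # 10k-step learning rate warmup

def lr_lambda(step):
    # Linear warmup from 0 to the base learning rate over the first 10k steps, then constant
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at global norm 1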
Evaluation results
Refer to tables 2 and 5 of the original paper for evaluation results on several image classification benchmarks. Higher resolution (384x384) yields better fine-tuning results, and larger model sizes perform better.
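To check the resolution a given checkpoint expects at inference time, you can inspect its image processor (a small illustrative check; the exact contents of the size attribute may vary across Transformers versions):
from transformers import ViTHybridImageProcessor

processor = ViTHybridImageProcessor.from_pretrained('google/vit-hybrid-base-bit-384')
print(processor.size)  # for this checkpoint, a 384-pixel target resolution is expected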
BibTeX entry and citation info
@misc{wu2020visual,
  title={Visual Transformers: Token-based Image Representation and Processing for Computer Vision},
  author={Bichen Wu and Chenfeng Xu and Xiaoliang Dai and Alvin Wan and Peizhao Zhang and Zhicheng Yan and Masayoshi Tomizuka and Joseph Gonzalez and Kurt Keutzer and Peter Vajda},
  year={2020},
  eprint={2006.03677},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
@inproceedings{deng2009imagenet,
  title={Imagenet: A large-scale hierarchical image database},
  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
  booktitle={2009 IEEE conference on computer vision and pattern recognition},
  pages={248--255},
  year={2009},
  organization={IEEE}
}
License
This model is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Hybrid Vision Transformer |
| Training Data | Pretrained on ImageNet-21k, fine-tuned on ImageNet |