BEiT (large-sized model, fine-tuned on ImageNet-22k)
BEiT is a model pre-trained in a self-supervised manner on ImageNet-22k (also known as ImageNet-21k; 14 million images, 21,841 classes) at a resolution of 224x224, and then fine-tuned on the same dataset at the same resolution. The learned representation can be used directly for image classification and as a feature extractor for other downstream vision tasks.
Quick Start
The BEiT model can be used for image classification. Here is a simple example of using this model to classify an image from the COCO 2017 dataset into one of the 21,841 ImageNet-22k classes:
from transformers import BeitImageProcessor, BeitForImageClassification
from PIL import Image
import requests

# load an example image from the COCO 2017 validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# load the image processor and the fine-tuned classification model
processor = BeitImageProcessor.from_pretrained('microsoft/beit-large-patch16-224-pt22k-ft22k')
model = BeitForImageClassification.from_pretrained('microsoft/beit-large-patch16-224-pt22k-ft22k')

# preprocess the image and run it through the model
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# the class with the highest logit is the predicted ImageNet-22k class
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
Currently, both the image processor and the model support PyTorch.
Features
- Self-supervised Pre-training: BEiT is pre-trained on a large collection of images (ImageNet-21k) in a self-supervised fashion, which enables it to learn a rich inner representation of images.
- Fine-tuning on ImageNet-22k: After pre-training, the model is fine-tuned in a supervised manner on the same ImageNet-22k dataset, which improves its performance on classification tasks.
- Relative Position Embeddings: Unlike the original ViT models, BEiT models use relative position embeddings, which can better capture the spatial relationships between patches.
- Mean-pooling for Classification: BEiT performs image classification by mean-pooling the final hidden states of the patches, providing an alternative to the traditional approach of reading out the [CLS] token (see the short configuration check after this list).
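Both of the last two design choices can be inspected directly on the published checkpoint. The snippet below is an illustrative sketch; the attribute names (use_absolute_position_embeddings, use_relative_position_bias, use_shared_relative_position_bias, use_mean_pooling) follow the BeitConfig class in transformers and are assumptions insofar as they are not spelled out in this card.

# Illustrative check of the architectural flags on the fine-tuned checkpoint.
from transformers import BeitConfig

config = BeitConfig.from_pretrained('microsoft/beit-large-patch16-224-pt22k-ft22k')
print(config.use_absolute_position_embeddings)   # expected False (no absolute position embeddings)
print(config.use_relative_position_bias)         # relative position bias per layer ...
print(config.use_shared_relative_position_bias)  # ... or shared across layers, depending on the checkpoint
print(config.use_mean_pooling)                   # expected True (mean-pool patches for classification)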
Documentation
Model description
The BEiT model is a Vision Transformer (ViT), a transformer encoder model similar to BERT. Different from the original ViT model, BEiT is pretrained on a large collection of images (ImageNet-21k) in a self-supervised manner at a resolution of 224x224 pixels. The pre-training objective is to predict visual tokens from the encoder of OpenAI's DALL-E's VQ-VAE based on masked patches.
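For intuition, here is a hedged sketch of that masked-patch objective. It assumes the BeitForMaskedImageModeling class from transformers and the pre-training-only checkpoint microsoft/beit-large-patch16-224-pt22k (neither is described in this card); the random 40% mask is a placeholder, not the original training pipeline.

# Illustrative sketch of the masked-patch pre-training objective (not the original training code).
import torch
import requests
from PIL import Image
from transformers import BeitImageProcessor, BeitForMaskedImageModeling

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = BeitImageProcessor.from_pretrained('microsoft/beit-large-patch16-224-pt22k')
model = BeitForMaskedImageModeling.from_pretrained('microsoft/beit-large-patch16-224-pt22k')

pixel_values = processor(images=image, return_tensors="pt").pixel_values
num_patches = (model.config.image_size // model.config.patch_size) ** 2   # (224 / 16) ** 2 = 196 patches
bool_masked_pos = torch.rand(1, num_patches) < 0.4                        # randomly mask ~40% of the patches

outputs = model(pixel_values=pixel_values, bool_masked_pos=bool_masked_pos)
logits = outputs.logits  # visual-token logits per patch; the pre-training loss is taken on the masked positions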
Next, the model is fine-tuned in a supervised manner on the same ImageNet-22k dataset, at the same resolution of 224x224.
Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. BEiT models use relative position embeddings (similar to T5) instead of absolute position embeddings and classify images by mean-pooling the final hidden states of the patches, rather than placing a linear layer on top of the final hidden state of the [CLS] token.
By pre-training the model, it learns an inner representation of images that can be used to extract features for downstream tasks. For example, if you have a dataset of labeled images, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. You can either use the [CLS] token or mean-pool the final hidden states of the patch embeddings and then place a linear layer on top.
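As a concrete illustration of that last option, the following minimal sketch mean-pools the patch embeddings of a frozen BEiT encoder and places a linear layer on top. The two-class head and the random input tensor are placeholders for your own data, not part of this model.

# Minimal sketch: mean-pool the patch embeddings of the frozen BEiT encoder and
# train only a small linear head on top (the 2-class head is a placeholder).
import torch
from transformers import BeitModel

backbone = BeitModel.from_pretrained('microsoft/beit-large-patch16-224-pt22k-ft22k')
backbone.requires_grad_(False)                          # freeze the pre-trained encoder
head = torch.nn.Linear(backbone.config.hidden_size, 2)  # hypothetical 2-class downstream task

pixel_values = torch.randn(1, 3, 224, 224)              # stand-in for a preprocessed image
hidden = backbone(pixel_values=pixel_values).last_hidden_state
pooled = hidden[:, 1:, :].mean(dim=1)                   # mean-pool the patch tokens, skipping [CLS] at index 0
logits = head(pooled)                                   # feed to your loss function / optimizer of choice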
Intended uses & limitations
You can use the raw model for image classification. Check the model hub to find fine-tuned versions for tasks that interest you.
Training data
The BEiT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21,841 classes, and fine-tuned on the same dataset.
Training procedure
Preprocessing
The exact details of image preprocessing during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
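As a quick sanity check, these values can be read back from the published image processor. The attribute names below (size, image_mean, image_std) follow the transformers BeitImageProcessor, and the printed values are expectations based on the description above, not guarantees.

# Illustrative check that the published processor matches the preprocessing described above.
from transformers import BeitImageProcessor

processor = BeitImageProcessor.from_pretrained('microsoft/beit-large-patch16-224-pt22k-ft22k')
print(processor.size)        # expected to describe a 224x224 target resolution
print(processor.image_mean)  # expected [0.5, 0.5, 0.5]
print(processor.image_std)   # expected [0.5, 0.5, 0.5]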
Pretraining
For all pre-training related hyperparameters, refer to page 15 of the original paper.
Evaluation results
For evaluation results on several image classification benchmarks, refer to tables 1 and 2 of the original paper. Note that for fine-tuning, better results are obtained with a higher resolution. Of course, increasing the model size will also improve performance.
License
This model is released under the Apache-2.0 license.
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-2106-08254,
author = {Hangbo Bao and
Li Dong and
Furu Wei},
title = {BEiT: {BERT} Pre-Training of Image Transformers},
journal = {CoRR},
volume = {abs/2106.08254},
year = {2021},
url = {https://arxiv.org/abs/2106.08254},
archivePrefix = {arXiv},
eprint = {2106.08254},
timestamp = {Tue, 29 Jun 2021 16:55:04 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2106-08254.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{deng2009imagenet,
title={Imagenet: A large-scale hierarchical image database},
author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
booktitle={2009 IEEE conference on computer vision and pattern recognition},
pages={248--255},
year={2009},
organization={IEEE}
}