Vision Transformer (base-sized model, patch size 8) trained using DINO
This is a Vision Transformer (ViT) model trained with the DINO method. It offers an effective way to learn image representations for various vision tasks.
Quick Start
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). It is pretrained in a self-supervised manner on a large image collection, specifically ImageNet-1k, at a 224x224 pixel resolution.
You can use the raw model for image classification. Check the model hub to find fine-tuned versions for tasks that interest you.
Features
- Self-supervised Pretraining: The model is pretrained on ImageNet-1k in a self-supervised way, enabling it to learn general image representations.
- Patch-based Input: Images are presented as a sequence of fixed-size 8x8 patches, which are linearly embedded (see the shape sketch after this list).
- [CLS] Token for Classification: A [CLS] token is added at the start of the sequence for classification tasks.
- Absolute Position Embeddings: Absolute position embeddings are added before feeding the sequence to the Transformer encoder layers.
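As a shape sketch (an illustration, not part of the original card): a 224x224 image cut into 8x8 patches gives (224/8)^2 = 784 patches, and the [CLS] token brings the sequence length to 785; with the base-sized model's hidden size of 768, the encoder processes a 785x768 sequence of embeddings per image.

# Minimal shape sketch in plain Python; the numbers follow the base-sized
# DINO ViT with patch size 8 at 224x224 resolution (hidden size 768).
image_size, patch_size, hidden_size = 224, 8, 768
num_patches = (image_size // patch_size) ** 2  # 28 * 28 = 784 patches
seq_len = num_patches + 1                      # +1 for the [CLS] token -> 785
print(seq_len, hidden_size)                    # the encoder sees a (785, 768) sequence per image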
Installation
The original model card does not list installation steps. Running the usage example below requires the Transformers library plus PyTorch, Pillow, and Requests (the example imports them and returns PyTorch tensors), so installing those packages, for example with pip install transformers torch pillow requests, should be sufficient.
Usage Examples
Basic Usage
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Load an example image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the DINO-pretrained ViT-B/8 backbone
processor = ViTImageProcessor.from_pretrained('facebook/dino-vitb8')
model = ViTModel.from_pretrained('facebook/dino-vitb8')

# Preprocess the image and extract features with the encoder
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
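As a follow-up (an illustration, not part of the original card): the output sequence has shape (batch_size, num_patches + 1, hidden_size), i.e. (1, 785, 768) for a 224x224 image, and the [CLS] embedding at position 0 can serve as a whole-image representation.

# Continues the example above; the first token is the [CLS] embedding.
cls_embedding = last_hidden_states[:, 0]       # shape (1, 768)
print(last_hidden_states.shape, cls_embedding.shape)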
Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion, namely ImageNet-1k, at a resolution of 224x224 pixels.
Images are presented to the model as a sequence of fixed-size patches (resolution 8x8), which are linearly embedded. A [CLS] token is added to the beginning of the sequence to use it for classification tasks, and absolute position embeddings are added before feeding the sequence to the layers of the Transformer encoder.
Note that this model does not include any fine-tuned heads.
By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks. If you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places the linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image.
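As a minimal sketch of such a linear classifier (an illustration, not part of the original card; it assumes PyTorch and a hypothetical number of classes), one can keep the pretrained encoder frozen and train only a linear layer on the [CLS] hidden state:

import torch
from transformers import ViTModel

# Frozen DINO backbone plus a trainable linear head on the [CLS] token
backbone = ViTModel.from_pretrained('facebook/dino-vitb8')
backbone.eval()
num_classes = 10  # hypothetical number of target classes
classifier = torch.nn.Linear(backbone.config.hidden_size, num_classes)

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch
with torch.no_grad():
    cls_features = backbone(pixel_values=pixel_values).last_hidden_state[:, 0]  # (1, 768)
logits = classifier(cls_features)  # in practice, only this linear layer is trained on the labeled data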
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-2104-14294,
author = {Mathilde Caron and
Hugo Touvron and
Ishan Misra and
Herv{\'{e}} J{\'{e}}gou and
Julien Mairal and
Piotr Bojanowski and
Armand Joulin},
title = {Emerging Properties in Self-Supervised Vision Transformers},
journal = {CoRR},
volume = {abs/2104.14294},
year = {2021},
url = {https://arxiv.org/abs/2104.14294},
archivePrefix = {arXiv},
eprint = {2104.14294},
timestamp = {Tue, 04 May 2021 15:12:43 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2104-14294.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Technical Details
For architectural and training details, see the "Model description" section above and the DINO paper cited in the BibTeX entry.
License
The model is licensed under the Apache-2.0 license.