🚀 Vision Transformer (small-sized model, patch size 16) trained using DINO
A Vision Transformer (ViT) model trained with the DINO method, offering valuable image representation for various vision tasks.
🚀 Quick Start
The Vision Transformer (ViT) model presented here was trained using the DINO method. It was first introduced in the paper Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, and initially released in this repository.
Disclaimer: The team releasing DINO did not write a model card for this model, so this model card is created by the Hugging Face team.
✨ Features
- Self-supervised Pretraining: The model is pretrained on ImageNet-1k in a self-supervised manner, learning rich image representations.
- Patch-based Input: Images are processed as a sequence of fixed-size patches (16x16), enabling efficient encoding.
- Feature Extraction: Suitable for extracting features for downstream vision tasks.
📚 Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). It is pretrained on a large collection of images, namely ImageNet-1k, at a resolution of 224x224 pixels in a self-supervised fashion.
Images are fed to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added at the start of the sequence for classification tasks. Absolute position embeddings are also added before the sequence is fed into the Transformer encoder layers.
Note that this model does not include any fine-tuned heads.
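As a quick illustration (not part of the original model card), a 224x224 input split into 16x16 patches yields 14 x 14 = 196 patches; with the [CLS] token prepended, the encoder therefore processes a sequence of 197 tokens:
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
sequence_length = num_patches + 1              # +1 for the [CLS] token -> 197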
Through pre-training, the model learns an internal representation of images. These representations can be used to extract features for downstream tasks: for example, if you have a labeled image dataset, you can train a standard classifier by adding a linear layer on top of the pre-trained encoder. The linear layer is usually placed on top of the [CLS] token, as the last hidden state of this token can be regarded as a representation of the entire image.
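Below is a minimal sketch of such a linear probe in PyTorch; the number of classes, the dataset, and the training loop are assumptions for illustration and are not prescribed by this model card:
import torch
from torch import nn
from transformers import ViTModel

# Load the pretrained DINO backbone and freeze it
backbone = ViTModel.from_pretrained('facebook/dino-vits16')
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 10  # hypothetical number of labels in your dataset
classifier = nn.Linear(backbone.config.hidden_size, num_classes)  # hidden_size is 384 for ViT-S/16

def extract_logits(pixel_values):
    # Run the frozen encoder, take the [CLS] token, and classify it
    with torch.no_grad():
        hidden = backbone(pixel_values=pixel_values).last_hidden_state
    cls_token = hidden[:, 0]
    return classifier(cls_token)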
Intended uses & limitations
You can use the raw model for image classification. Check the model hub to find fine-tuned versions for tasks that interest you.
How to use
Here is how to use this model:
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Download an example image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the pretrained DINO ViT-S/16 backbone
processor = ViTImageProcessor.from_pretrained('facebook/dino-vits16')
model = ViTModel.from_pretrained('facebook/dino-vits16')

# Preprocess the image and extract the patch-level features
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
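Here, last_hidden_states has shape (batch_size, sequence_length, hidden_size), i.e. (1, 197, 384) for this checkpoint. If you need a single vector per image, one common choice (an assumption, not prescribed by this model card) is to take the [CLS] token:
cls_embedding = last_hidden_states[:, 0]  # shape (1, 384): image-level representation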
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-2104-14294,
author = {Mathilde Caron and
Hugo Touvron and
Ishan Misra and
Herv{\'{e}} J{\'{e}}gou and
Julien Mairal and
Piotr Bojanowski and
Armand Joulin},
title = {Emerging Properties in Self-Supervised Vision Transformers},
journal = {CoRR},
volume = {abs/2104.14294},
year = {2021},
url = {https://arxiv.org/abs/2104.14294},
archivePrefix = {arXiv},
eprint = {2104.14294},
timestamp = {Tue, 04 May 2021 15:12:43 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2104-14294.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
📄 License
This model is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (small-sized model, patch size 16) trained using DINO |
| Training Data | ImageNet-1k |