🚀 Vision Transformer (small-sized model, patch size 8) trained using DINO
A Vision Transformer (ViT) model trained with the self-supervised DINO method, producing image representations that can be reused for downstream tasks.
🚀 Quick Start
The Vision Transformer (ViT) model in this repository is trained using the DINO method. It was introduced in the paper Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, and first released in this repository.
Disclaimer: The team releasing DINO did not write a model card for this model, so this model card is created by the Hugging Face team.
✨ Features
- Self-Supervised Learning: Pretrained on ImageNet-1k in a self-supervised manner, learning useful image representations.
- Patch-Based Input: Processes images as a sequence of fixed-size 8x8 patches.
- Feature Extraction: Can be used to extract features for downstream tasks.
📚 Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). It is pretrained on a large collection of images (ImageNet-1k) at a resolution of 224x224 pixels in a self-supervised way.
Images are fed into the model as a sequence of fixed-size patches (resolution 8x8), which are linearly embedded. A [CLS] token is added at the start of the sequence for classification tasks. Absolute position embeddings are also added before the sequence is fed into the Transformer encoder layers.
Note that this model does not include any fine-tuned heads.
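To make the patch arithmetic concrete: at 224x224 resolution with 8x8 patches, the encoder sees 28 x 28 = 784 patch tokens plus the [CLS] token, giving a sequence length of 785. A minimal sketch of that calculation (assuming the processor's default 224x224 resize):

```python
# Sketch: expected sequence length for dino-vits8 (assumes the default
# 224x224 input resolution and the 8x8 patch size described above).
image_size, patch_size = 224, 8
num_patches = (image_size // patch_size) ** 2  # 28 * 28 = 784
seq_len = num_patches + 1                      # +1 for the [CLS] token
print(seq_len)  # 785, matching outputs.last_hidden_state.shape[1]
```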
Through pre-training, the model learns an internal representation of images. These representations can be used to extract features for downstream tasks. For example, if you have a labeled image dataset, you can train a standard classifier by adding a linear layer on top of the pre-trained encoder. Usually, a linear layer is placed on top of the [CLS] token, as the last hidden state of this token can be regarded as a representation of the entire image.
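As an illustration of that setup, here is a minimal, hypothetical linear-probe sketch in PyTorch: the DINO backbone is frozen and a single linear layer is trained on the [CLS] token. The label count and training loop are placeholders, not part of this card:

```python
import torch
from transformers import ViTModel

# Hypothetical linear probe: frozen DINO backbone + one linear layer
# on the [CLS] token. num_labels is a placeholder for your dataset.
backbone = ViTModel.from_pretrained('facebook/dino-vits8')
for p in backbone.parameters():
    p.requires_grad = False  # keep the pretrained encoder frozen

num_labels = 10  # placeholder: number of classes in your dataset
classifier = torch.nn.Linear(backbone.config.hidden_size, num_labels)

def forward(pixel_values):
    with torch.no_grad():
        hidden = backbone(pixel_values=pixel_values).last_hidden_state
    cls_token = hidden[:, 0]      # last hidden state of the [CLS] token
    return classifier(cls_token)  # logits for the linear classifier
```

Only `classifier` has trainable parameters, so a standard cross-entropy training loop over your labeled dataset completes the probe.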
Intended uses & limitations
You can use the raw model for image classification. Check the model hub to find fine-tuned versions for tasks that interest you.
How to use
Here is how to use this model:
```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Load an example image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained('facebook/dino-vits8')
model = ViTModel.from_pretrained('facebook/dino-vits8')

# Preprocess the image and extract the encoder's hidden states
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
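The pooled image representation is usually taken from the [CLS] token at position 0 of the sequence; for example:

```python
# The [CLS] token (first position) serves as a whole-image representation.
cls_embedding = last_hidden_states[:, 0]  # shape: (batch_size, hidden_size)
```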
BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-2104-14294,
  author        = {Mathilde Caron and Hugo Touvron and Ishan Misra and
                   Herv{\'{e}} J{\'{e}}gou and Julien Mairal and
                   Piotr Bojanowski and Armand Joulin},
  title         = {Emerging Properties in Self-Supervised Vision Transformers},
  journal       = {CoRR},
  volume        = {abs/2104.14294},
  year          = {2021},
  url           = {https://arxiv.org/abs/2104.14294},
  archivePrefix = {arXiv},
  eprint        = {2104.14294},
  timestamp     = {Tue, 04 May 2021 15:12:43 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-2104-14294.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```
📄 License
This model is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (small-sized model, patch size 8) trained using DINO |
| Training Data | ImageNet-1k |