Vision Transformer (base-sized model) pre-trained with MSN (patch size of 4)
A Vision Transformer (ViT) model pre-trained using the MSN method, which can learn inner representations of images for downstream tasks.
Quick Start
The Vision Transformer (ViT) presented here is pre-trained with the MSN (Masked Siamese Networks) method and can be used for downstream tasks such as image classification.
Features
- Joint-embedding architecture: MSN matches the prototypes of masked patches with those of unmasked patches, yielding excellent performance in low-shot and extreme low-shot regimes.
- Feature learning: Through pre-training, the model learns inner representations of images that can be used to extract features for downstream tasks.
Installation
No specific installation steps are required beyond the libraries used in the usage examples below (`transformers`, `torch`, `Pillow`, and `requests`).
Usage Examples
Basic Usage
Here is how to use this backbone encoder:
```python
from transformers import AutoFeatureExtractor, ViTMSNModel
import torch
from PIL import Image
import requests

# Load an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-base-4")
model = ViTMSNModel.from_pretrained("facebook/vit-msn-base-4")

# Preprocess the image and run a forward pass without tracking gradients
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
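To pool a single embedding per image from these hidden states, one common choice (an assumption here, not prescribed by the model card) is to take the `[CLS]` token at index 0:

```python
# last_hidden_state has shape (batch_size, sequence_length, hidden_size);
# the first token is the [CLS] token, often used as an image-level embedding.
image_embedding = last_hidden_states[:, 0]
print(image_embedding.shape)  # (1, hidden_size)
```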
Advanced Usage
For fine-tuning on image classification, use the ViTMSNForImageClassification class:
```python
from transformers import AutoFeatureExtractor, ViTMSNForImageClassification
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-base-4")
model = ViTMSNForImageClassification.from_pretrained("facebook/vit-msn-base-4")

...
```
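The trailing ellipsis is kept from the source. Purely as an illustrative sketch of a forward pass with this head (not the card's original continuation): note that the classification head of this pre-trained checkpoint has not been fine-tuned, so its predictions are not meaningful until you train it on labeled data.

```python
# Forward pass through the (not yet fine-tuned) classification head
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class index:", predicted_class_idx)
```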
Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (BERT-like). Images are presented to the model as a sequence of fixed-size patches.
MSN presents a joint-embedding architecture to match the prototypes of masked patches with those of the unmasked patches. With this setup, the method yields excellent performance in the low-shot and extreme low-shot regimes.
By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder, as sketched below.
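A minimal sketch of that linear-probe setup, assuming a hypothetical `num_classes` for your dataset and freezing the pre-trained encoder (both are illustrative choices, not part of the original card):

```python
from torch import nn
from transformers import ViTMSNModel

num_classes = 10  # hypothetical number of classes in your labeled dataset

encoder = ViTMSNModel.from_pretrained("facebook/vit-msn-base-4")
for param in encoder.parameters():
    param.requires_grad = False  # linear probing: keep the pre-trained encoder frozen

# Linear layer placed on top of the pre-trained encoder
classifier = nn.Linear(encoder.config.hidden_size, num_classes)

def classify(pixel_values):
    # pixel_values comes from the feature extractor shown in the usage examples
    hidden = encoder(pixel_values=pixel_values).last_hidden_state
    cls_token = hidden[:, 0]      # [CLS] token as the image-level representation
    return classifier(cls_token)  # logits over your classes
```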
Intended uses & limitations
You can use the raw model for downstream tasks like image classification. See the model hub for other versions of MSN pre-trained models that may interest you. The model is particularly beneficial when you have only a few labeled samples in your training set.
Technical Details
The Vision Transformer (ViT) is a BERT-like transformer encoder model. Images are divided into fixed-size patches and presented to the model as a sequence. The MSN method uses a joint-embedding architecture to match the prototypes of masked and unmasked patches, which is effective in low-shot and extreme low-shot scenarios. Pre-training allows the model to learn inner representations of images, which can be used for feature extraction in downstream tasks.
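To make the patch sequence concrete, the sequence length can be read off the checkpoint's configuration; the snippet below is a sketch, and the config values (not stated in this card) are the authoritative source for the actual image size and patch size:

```python
from transformers import ViTMSNConfig

config = ViTMSNConfig.from_pretrained("facebook/vit-msn-base-4")
patches_per_side = config.image_size // config.patch_size
num_patches = patches_per_side ** 2
print(config.image_size, config.patch_size, num_patches + 1)  # +1 for the [CLS] token
```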
License
This model is licensed under the Apache-2.0 license.
Citation
```bibtex
@article{assran2022masked,
  title={Masked Siamese Networks for Label-Efficient Learning},
  author={Assran, Mahmoud and Caron, Mathilde and Misra, Ishan and Bojanowski, Piotr and Bordes, Florian and Vincent, Pascal and Joulin, Armand and Rabbat, Michael and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2204.07141},
  year={2022}
}
```