Vision Transformer (base-sized model) pre-trained with MSN
A Vision Transformer (ViT) model pre-trained using the MSN method, which offers effective feature extraction for downstream vision tasks.
Quick Start
The Vision Transformer (ViT) pre-trained with MSN is a powerful model for various vision tasks. It can be easily integrated into your projects for feature extraction and fine-tuning.
Features
- Joint-embedding Architecture: MSN uses a joint-embedding architecture to match masked and unmasked patch prototypes, achieving excellent performance in low-shot and extreme low-shot scenarios.
- Feature Extraction: The pre-trained model can learn an inner representation of images, which is useful for extracting features for downstream tasks.
- Label-Efficient Learning: Particularly beneficial when you have a limited number of labeled samples in your training set.
Installation
The original model card provides no dedicated installation step. To use the model, install the transformers library:

pip install transformers

The usage examples below also require torch, Pillow, and requests.
Usage Examples
Basic Usage
from transformers import AutoFeatureExtractor, ViTMSNModel
import torch
from PIL import Image
import requests

# Load an example image from the COCO dataset
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-base")
model = ViTMSNModel.from_pretrained("facebook/vit-msn-base")

# Preprocess the image and run it through the frozen encoder
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
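For the base model (16x16 patches, hidden size 768), last_hidden_states should have shape (1, 197, 768) for a 224x224 input: one vector per patch plus the [CLS] token, whose embedding is a common choice of image-level feature.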
Advanced Usage
from transformers import AutoFeatureExtractor, ViTMSNForImageClassification
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-base")
# Loads the pre-trained encoder with a newly initialized classification head
model = ViTMSNForImageClassification.from_pretrained("facebook/vit-msn-base")
...
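The original card leaves the rest of this example elided. As a hedged sketch of how it might continue, the snippet below runs the classification head; note that for this checkpoint the head is randomly initialized, so the model should be fine-tuned on labeled data before its predictions are meaningful.

# Sketch of a forward pass through the (randomly initialized) head.
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(-1).item()  # meaningful only after fine-tuning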
Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (BERT-like). Images are presented to the model as a sequence of fixed-size patches.
MSN presents a joint-embedding architecture to match the prototypes of masked patches with those of the unmasked patches. With this setup, the method yields excellent performance in the low-shot and extreme low-shot regimes.
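To make the joint-embedding idea concrete, here is a minimal, self-contained sketch of an MSN-style objective: representations of a masked (anchor) view and an unmasked (target) view are soft-assigned to a set of learnable prototypes, and the anchor is trained to match the target's assignment. The prototype count, temperatures, and random feature vectors are illustrative assumptions, not the authors' implementation (which adds further components such as a mean-entropy regularizer).

# Illustrative MSN-style prototype matching (assumed hyperparameters).
import torch
import torch.nn.functional as F

num_prototypes, dim = 1024, 768
prototypes = torch.nn.Parameter(torch.randn(num_prototypes, dim))

def soft_assignment(features, temperature):
    # Cosine similarity to each prototype, turned into a soft assignment.
    feats = F.normalize(features, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    return F.softmax(feats @ protos.T / temperature, dim=-1)

# Stand-ins for encoder outputs: anchor = masked view, target = unmasked view.
anchor_rep = torch.randn(8, dim, requires_grad=True)
with torch.no_grad():
    target_rep = torch.randn(8, dim)

anchor_probs = soft_assignment(anchor_rep, temperature=0.1)
# A sharper (lower) temperature for the target encourages confident assignments.
target_probs = soft_assignment(target_rep, temperature=0.025)

# Cross-entropy between target and anchor assignments: the masked view learns
# to match the prototype assignment of the unmasked view.
loss = -(target_probs * torch.log(anchor_probs + 1e-8)).sum(dim=-1).mean()
loss.backward()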
By pre-training, the model learns an inner representation of images that can then be used to extract features for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder, as sketched below.
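A minimal linear-probe sketch of that recipe follows; it is illustrative, not an official recipe. Features are taken from the [CLS] token of the frozen encoder, and only the small linear head would be trained. The 10-class head and the training data are placeholder assumptions.

import torch
from transformers import AutoFeatureExtractor, ViTMSNModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-base")
encoder = ViTMSNModel.from_pretrained("facebook/vit-msn-base")
encoder.eval()  # keep the pre-trained encoder frozen

classifier = torch.nn.Linear(encoder.config.hidden_size, 10)  # e.g. 10 classes

def cls_features(images):
    # Preprocess a list of PIL images and return their [CLS] embeddings.
    inputs = feature_extractor(images=images, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]

# Training loop (schematic): logits = classifier(cls_features(batch_images)),
# then optimize `classifier` with cross-entropy against the batch labels.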
Intended uses & limitations
You can use the raw model for downstream tasks like image classification. See the model hub to look for different versions of MSN pre-trained models that interest you. The model is particularly beneficial when you have only a few labeled samples in your training set.
Citation
@article{assran2022masked,
  title={Masked Siamese Networks for Label-Efficient Learning},
  author={Assran, Mahmoud and Caron, Mathilde and Misra, Ishan and Bojanowski, Piotr and Bordes, Florian and Vincent, Pascal and Joulin, Armand and Rabbat, Michael and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2204.07141},
  year={2022}
}
License
This model is licensed under the Apache 2.0 license.
Information Table
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (base-sized model) pre-trained with MSN |
| Training Data | ImageNet-1K |
Important Note
The team releasing MSN did not write a model card for this model, so this model card has been written by the Hugging Face team.