Vision Transformer (large-sized model) pre-trained with MSN
A Vision Transformer (ViT) model pre-trained using the MSN method, which is useful for image-related downstream tasks.
🚀 Quick Start
The Vision Transformer (ViT) model here is pre-trained using the MSN method. It was introduced in the paper Masked Siamese Networks for Label-Efficient Learning by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, and Nicolas Ballas, and first released in this repository. Note that the team releasing MSN didn't write a model card for this model, so this one was written by the Hugging Face team.
⨠Features
- Joint-embedding Architecture: MSN uses a joint-embedding architecture to match the prototypes of masked patches with those of the unmasked patches, achieving excellent performance in low-shot and extreme low-shot regimes.
- Feature Extraction for Downstream Tasks: Through pre-training, the model learns an inner representation of images that can be used to extract features for downstream tasks. For example, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder, as sketched below.
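The following is a minimal linear-probe sketch of that idea: the encoder is kept frozen and a linear head is trained on mean-pooled hidden states. The pooling choice and `num_classes` are illustrative assumptions, not the recipe from the MSN paper.

```python
import torch
from transformers import AutoFeatureExtractor, ViTMSNModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-large")
backbone = ViTMSNModel.from_pretrained("facebook/vit-msn-large")
backbone.eval()  # keep the encoder frozen; only the linear head is trained

num_classes = 10  # hypothetical; set this to your dataset's label count
classifier = torch.nn.Linear(backbone.config.hidden_size, num_classes)

def extract_features(images):
    # Preprocess, run the frozen encoder, and mean-pool over the token dimension
    inputs = feature_extractor(images=images, return_tensors="pt")
    with torch.no_grad():
        hidden = backbone(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)
    return hidden.mean(dim=1)

# Train `classifier` on extract_features(batch_of_pil_images) against your labels.
```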
📚 Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). Images are fed into the model as a sequence of fixed-size patches. MSN's joint-embedding architecture matches the prototypes of masked patches to those of the unmasked patches. Through pre-training, the model learns an inner representation of images that is useful for extracting features for downstream tasks.
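As a point of reference, the sequence length the transformer sees follows directly from the input resolution and patch size. The values below are illustrative assumptions; the actual ones are stored in the checkpoint's config.

```python
# Back-of-the-envelope patch arithmetic (assumed values: 224x224 input,
# 16x16 patches; read the real ones from the model config).
image_size = 224
patch_size = 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196
sequence_length = num_patches + 1              # +1 for the [CLS] token
print(num_patches, sequence_length)            # 196 197
```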
Intended uses & limitations
You can use the raw model for downstream tasks such as image classification. Check the model hub for different versions of MSN pre-trained models. This model is especially useful when you have a small number of labeled samples in your training set.
💻 Usage Examples
Basic Usage
Here is how to use this backbone encoder:
```python
from transformers import AutoFeatureExtractor, ViTMSNModel
import torch
from PIL import Image
import requests

# Load an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-large")
model = ViTMSNModel.from_pretrained("facebook/vit-msn-large")

# Preprocess the image and run a forward pass without tracking gradients
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state  # (batch_size, sequence_length, hidden_size)
```
Advanced Usage
For fine-tuning on image classification, use the `ViTMSNForImageClassification` class:
```python
from transformers import AutoFeatureExtractor, ViTMSNForImageClassification
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-large")
model = ViTMSNForImageClassification.from_pretrained("facebook/vit-msn-large")

# Forward pass; the classification head is randomly initialized,
# so the logits are only meaningful after fine-tuning.
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
```
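Since the head is randomly initialized (see the comment above), for a real task you would attach a head sized to your dataset and fine-tune. A minimal setup sketch, where `num_labels=10` is a placeholder for your own class count:

```python
from transformers import ViTMSNForImageClassification

# Hypothetical fine-tuning setup: num_labels is a placeholder for your dataset.
model = ViTMSNForImageClassification.from_pretrained(
    "facebook/vit-msn-large",
    num_labels=10,
)
# Fine-tune with your preferred training loop or the transformers Trainer;
# MSN pre-training is designed to perform well with few labeled examples.
```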
Citation
```bibtex
@article{assran2022masked,
  title={Masked Siamese Networks for Label-Efficient Learning},
  author={Assran, Mahmoud and Caron, Mathilde and Misra, Ishan and Bojanowski, Piotr and Bordes, Florian and Vincent, Pascal and Joulin, Armand and Rabbat, Michael and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2204.07141},
  year={2022}
}
```
📄 License
This project is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (large-sized model) pre-trained with MSN |
| Training Data | ImageNet-1K |