Vision Transformer (large-sized model) pre-trained with MSN
A Vision Transformer (ViT) model pre-trained using the MSN method, which is useful for image-related downstream tasks.
🚀 Quick Start
The Vision Transformer (ViT) model here is pre-trained using the MSN method. It was introduced in the paper Masked Siamese Networks for Label-Efficient Learning by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, and Nicolas Ballas, and first released in this repository. Note that the team releasing MSN didn't write a model card for this model, so this one was written by the Hugging Face team.
⨠Features
- Joint-embedding Architecture: MSN uses a joint-embedding architecture to match the prototypes of masked patches with those of the unmasked patches, achieving excellent performance in low-shot and extreme low-shot regimes.
- Feature Extraction for Downstream Tasks: Through pre-training, the model learns an inner representation of images that can be used to extract features for downstream tasks. For example, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder, as sketched below.
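The following is a minimal linear-probe sketch of that idea: the encoder is kept frozen and a linear head is trained on mean-pooled hidden states. The pooling choice and `num_classes` are illustrative assumptions, not the recipe from the MSN paper.

```python
import torch
from transformers import AutoFeatureExtractor, ViTMSNModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-large")
backbone = ViTMSNModel.from_pretrained("facebook/vit-msn-large")
backbone.eval()  # keep the encoder frozen; only the linear head is trained

num_classes = 10  # hypothetical; set this to your dataset's label count
classifier = torch.nn.Linear(backbone.config.hidden_size, num_classes)

def extract_features(images):
    # Preprocess, run the frozen encoder, and mean-pool over the token dimension
    inputs = feature_extractor(images=images, return_tensors="pt")
    with torch.no_grad():
        hidden = backbone(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)
    return hidden.mean(dim=1)

# Train `classifier` on extract_features(batch_of_pil_images) against your labels.
```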
📚 Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). Images are fed into the model as a sequence of fixed-size patches. MSN's joint-embedding architecture matches the prototypes of masked patches to those of the unmasked patches. Through pre-training, the model learns an inner representation of images that is useful for extracting features for downstream tasks.
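As a point of reference, the sequence length the transformer sees follows directly from the input resolution and patch size. The values below are illustrative assumptions; the actual ones are stored in the checkpoint's config.

```python
# Back-of-the-envelope patch arithmetic (assumed values: 224x224 input,
# 16x16 patches; read the real ones from the model config).
image_size = 224
patch_size = 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196
sequence_length = num_patches + 1              # +1 for the [CLS] token
print(num_patches, sequence_length)            # 196 197
```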
Intended uses & limitations
You can use the raw model for downstream tasks such as image classification. Check the model hub for different versions of MSN pre-trained models. This model is especially useful when you have a small number of labeled samples in your training set.
💻 Usage Examples
Basic Usage
Here is how to use this backbone encoder:
```python
from transformers import AutoFeatureExtractor, ViTMSNModel
import torch
from PIL import Image
import requests

# Load an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-large")
model = ViTMSNModel.from_pretrained("facebook/vit-msn-large")

# Preprocess the image and run a forward pass without tracking gradients
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state  # (batch_size, sequence_length, hidden_size)
```
Advanced Usage
For fine-tuning on image classification, use the `ViTMSNForImageClassification` class:
```python
from transformers import AutoFeatureExtractor, ViTMSNForImageClassification
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-large")
model = ViTMSNForImageClassification.from_pretrained("facebook/vit-msn-large")

# Forward pass; the classification head is randomly initialized,
# so the logits are only meaningful after fine-tuning.
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
```
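Since the head is randomly initialized (see the comment above), for a real task you would attach a head sized to your dataset and fine-tune. A minimal setup sketch, where `num_labels=10` is a placeholder for your own class count:

```python
from transformers import ViTMSNForImageClassification

# Hypothetical fine-tuning setup: num_labels is a placeholder for your dataset.
model = ViTMSNForImageClassification.from_pretrained(
    "facebook/vit-msn-large",
    num_labels=10,
)
# Fine-tune with your preferred training loop or the transformers Trainer;
# MSN pre-training is designed to perform well with few labeled examples.
```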
Citation
```bibtex
@article{assran2022masked,
  title={Masked Siamese Networks for Label-Efficient Learning},
  author={Assran, Mahmoud and Caron, Mathilde and Misra, Ishan and Bojanowski, Piotr and Bordes, Florian and Vincent, Pascal and Joulin, Armand and Rabbat, Michael and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2204.07141},
  year={2022}
}
```
📄 License
This project is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (large-sized model) pre-trained with MSN |
| Training Data | ImageNet-1K |