# vit_l16_mim Model Card
A ViT-L16 image encoder pre-trained using Masked Image Modeling (MIM), suitable as a general-purpose feature extractor or backbone for downstream tasks.
## Quick Start
This vit_l16_mim model is a ViT-L16 image encoder pre-trained via Masked Image Modeling (MIM). It has not been fine-tuned for a specific classification task, so it is best used as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.
## Features
- General-purpose use: serves as a feature extractor or backbone for a wide range of downstream tasks.
- Diverse training data: trained on a large, diverse dataset of approximately 11M images.
## Installation

Installation steps are not documented in this card; the `birder` library used in the examples below is assumed to be available in your environment (e.g. via `pip install birder`).
## Usage Examples

### Basic Usage

```python
import torch
from PIL import Image

import birder

# Load the pre-trained encoder in inference mode
(net, model_info) = birder.load_pretrained_model("vit_l16_mim_400", inference=True)

# Build the preprocessing transform from the model's signature and RGB statistics
size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

# Preprocess an image and add a batch dimension
image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)

# Extract the image embedding
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
```
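Embeddings produced this way can be compared directly, e.g. for image retrieval or deduplication via cosine similarity. A minimal sketch, with NumPy arrays standing in for the embedding tensors returned by `net.embedding(...)` (the toy vectors below are illustrative, not real model outputs):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flat embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-ins for embeddings extracted from three images
emb_a = np.array([1.0, 0.0, 1.0])
emb_b = np.array([1.0, 0.0, 1.0])
emb_c = np.array([0.0, 1.0, 0.0])

print(cosine_similarity(emb_a, emb_b))  # identical direction -> 1.0
print(cosine_similarity(emb_a, emb_c))  # orthogonal -> 0.0
```

In practice you would call `embedding.squeeze(0).numpy()` on each model output and rank candidate images by similarity to a query embedding.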
## Documentation

### Model Details

| Property | Details |
| --- | --- |
| Model Type | Image encoder |
| Params (M) | 303.3 |
| Input image size | 224 x 224 |

**Training Data**: trained on a diverse dataset of approximately 11M images, including:

- iNaturalist 2021 (~3.3M)
- WebVision-2.0 (~1.5M random subset)
- imagenet-w21-webp-wds (~1M random subset)
- SA-1B (~220K random subset of 20 chunks)
- COCO (~120K)
- NABirds (~48K)
- Birdsnap v1.1 (~44K)
- CUB-200-2011 (~18K)
- The Birder dataset (~5M, private dataset)

**Papers**:

- [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)
- [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
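The MAE-style pre-training described in the second paper masks a large fraction of image patches and trains the model to reconstruct them, so the encoder only ever sees the visible subset. A rough sketch of just the patch-masking step, assuming this model's 224x224 input, 16x16 patches, and the paper's ~75% mask ratio (illustrative only, not the actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)

image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
mask_ratio = 0.75  # MAE masks roughly 75% of patches

# Randomly choose which patches the encoder actually sees
num_visible = int(num_patches * (1 - mask_ratio))
perm = rng.permutation(num_patches)
visible_idx = perm[:num_visible]
masked_idx = perm[num_visible:]

print(num_patches, num_visible, len(masked_idx))  # 196 49 147
```

During pre-training, only the `visible_idx` patches are encoded; a lightweight decoder then reconstructs the pixels at `masked_idx`. The decoder is discarded after pre-training, leaving the encoder distributed here.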
## License

This project is licensed under the Apache-2.0 license.
## Citation

```bibtex
@misc{dosovitskiy2021imageworth16x16words,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
    year={2021},
    eprint={2010.11929},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2010.11929},
}

@misc{he2021maskedautoencodersscalablevision,
    title={Masked Autoencoders Are Scalable Vision Learners},
    author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
    year={2021},
    eprint={2111.06377},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2111.06377},
}
```