# 🚀 Model Card for vit_reg4_b16_mim

A ViT reg4 image encoder pre-trained using Masked Image Modeling (MIM). It serves as a general-purpose feature extractor or backbone for downstream tasks and has not been fine-tuned for a specific classification task.
## 🚀 Quick Start

The `vit_reg4_b16_mim` model is a pre-trained image encoder using Masked Image Modeling (MIM). It can be used as a feature extractor or backbone for various downstream tasks.
## ✨ Features

- General-Purpose: This model has not been fine-tuned for a specific classification task and can be used as a general-purpose feature extractor or backbone for downstream tasks like object detection, segmentation, or custom classification.
- Diverse Training Data: Trained on a diverse dataset of approximately 11M images from multiple sources.
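
The "custom classification" use case above is typically a linear probe trained on frozen embeddings. Below is a minimal sketch of that idea using stand-in tensors instead of real encoder outputs; the 768-dimensional embedding width and the 10-class task are assumptions for illustration (ViT-B models commonly use 768, but check `model_info.signature` for the actual value).

```python
import torch
from torch import nn

EMBED_DIM = 768   # assumed ViT-B embedding width (verify against the model signature)
NUM_CLASSES = 10  # hypothetical downstream task

# Stand-ins for embeddings produced by net.embedding(...) and their labels
embeddings = torch.randn(32, EMBED_DIM)
labels = torch.randint(0, NUM_CLASSES, (32,))

# Linear probe: the encoder stays frozen, only this layer is trained
probe = nn.Linear(EMBED_DIM, NUM_CLASSES)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

logits = probe(embeddings)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

In practice you would replace the random tensors with embeddings extracted from your dataset (see the usage examples below for how embeddings are produced).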
## 📚 Documentation

### Model Details
| Property | Details |
|----------|---------|
| Model Type | Image encoder |
| Params (M) | 85.8 |
| Input image size | 224 x 224 |
| Dataset | Trained on a diverse dataset of approximately 11M images, including iNaturalist 2021 (~3.3M), WebVision-2.0 (~1.5M random subset), imagenet-w21-webp-wds (~1M random subset), SA-1B (~220K random subset of 20 chunks), COCO (~120K), NABirds (~48K), Birdsnap v1.1 (~44K), CUB-200 2011 (~18K), The Birder dataset (~5M, private dataset) |
| Papers | An Image is Worth 16x16 Words (arXiv:2010.11929), Vision Transformers Need Registers (arXiv:2309.16588), Masked Autoencoders Are Scalable Vision Learners (arXiv:2111.06377) |
## 💻 Usage Examples

### Basic Usage
```python
import torch
import birder
from PIL import Image

# Load the pre-trained encoder for inference
(net, model_info) = birder.load_pretrained_model("vit_reg4_b16_mim_300", inference=True)

# Derive the expected input size and preprocessing from the model signature
size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
```
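
Extracted embeddings can be compared directly, for example for image retrieval via cosine similarity. The sketch below uses stand-in random tensors in place of real `net.embedding(...)` outputs, and assumes a 768-dimensional embedding (the usual ViT-B width; check the model signature for the actual value).

```python
import torch
import torch.nn.functional as F

# Stand-ins for embeddings as net.embedding(...) would return them
query = torch.randn(1, 768)    # one query image
gallery = torch.randn(5, 768)  # five candidate images

# L2-normalize so the dot product equals cosine similarity
q = F.normalize(query, dim=-1)
g = F.normalize(gallery, dim=-1)

scores = q @ g.T               # cosine similarities, shape (1, 5)
best = scores.argmax(dim=-1)   # index of the most similar gallery image
```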
### Advanced Usage

```python
import torch
import birder
from PIL import Image

# Load the model from explicit config and weights files
(net, cfg) = birder.load_model_with_cfg("models/vit_reg4_b16_mim.json", "models/vit_reg4_b16_mim_300.pt")
net.eval()

size = birder.get_size_from_signature(cfg["signature"])
transform = birder.classification_transform(size, cfg["rgb_stats"])

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
```
## 📄 License

This model is licensed under the Apache-2.0 license.
## 📖 Citation

```bibtex
@misc{dosovitskiy2021imageworth16x16words,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
    year={2021},
    eprint={2010.11929},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2010.11929},
}

@misc{darcet2024visiontransformersneedregisters,
    title={Vision Transformers Need Registers},
    author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
    year={2024},
    eprint={2309.16588},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2309.16588},
}

@misc{he2021maskedautoencodersscalablevision,
    title={Masked Autoencoders Are Scalable Vision Learners},
    author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
    year={2021},
    eprint={2111.06377},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2111.06377},
}
```