hiera_abswin_base_mim Model Card
This is an image encoder named Hiera that uses an absolute window position embedding strategy and is pre-trained via Masked Image Modeling (MIM). The model has not been fine-tuned for a specific classification task; it is designed as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.
Quick Start
The model can be used for image embeddings and detection feature-map extraction. See the Usage Examples section below for detailed code examples.
Features
- General-Purpose Feature Extractor: Produces image embeddings usable for retrieval, similarity search, and custom classification heads.
- Backbone for Downstream Tasks: Exposes multi-scale feature maps suitable for detection, segmentation, and other dense-prediction tasks.
Installation
The original README does not provide installation steps. The usage examples below assume the `birder` Python package is available (typically installed via pip).
Usage Examples
Basic Usage
Image Embeddings
```python
import birder
from birder.inference.classification import infer_image

# Load the pre-trained encoder in inference mode
(net, model_info) = birder.load_pretrained_model("hiera_abswin_base_mim", inference=True)

# Build the preprocessing transform from the model signature and RGB statistics
size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

image = "path/to/image.jpeg"
(out, embedding) = infer_image(net, image, transform, return_embedding=True)
```
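The returned `embedding` is commonly used for retrieval or similarity search. Below is a minimal sketch of cosine similarity between two embedding vectors, using NumPy on stand-in random arrays (the real embeddings come from `infer_image` above; the 768-dim size here is illustrative):

```python
import numpy as np


def cosine_similarity(a, b):
    # Flatten and compare two vectors by the cosine of their angle
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Dummy embeddings standing in for infer_image outputs
rng = np.random.default_rng(0)
e1 = rng.standard_normal(768)
e2 = e1 + 0.1 * rng.standard_normal(768)  # a slightly perturbed copy

print(cosine_similarity(e1, e2))
```

Values close to 1.0 indicate near-identical images; for large galleries, normalizing embeddings once and using a matrix product is the usual optimization.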
Advanced Usage
Detection Feature Map
```python
from PIL import Image

import birder

(net, model_info) = birder.load_pretrained_model("hiera_abswin_base_mim", inference=True)

size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
# Extract multi-scale feature maps suitable for detection or segmentation heads
features = net.detection_features(transform(image).unsqueeze(0))
print([(k, v.size()) for k, v in features.items()])
```
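As a rough guide to what the printed shapes look like: Hiera is hierarchical, so each stage downsamples the input by an increasing stride. The sketch below computes the expected map shapes for a 224 x 224 input using the standard Hiera-Base stage strides and widths; these stride/channel values are an assumption for illustration, not taken from this card:

```python
# Assumed standard Hiera-Base configuration (not stated in this README)
INPUT = 224
STRIDES = [4, 8, 16, 32]
CHANNELS = [96, 192, 384, 768]

# NCHW shape per stage for a single 224x224 image
shapes = {
    f"stage{i + 1}": (1, c, INPUT // s, INPUT // s)
    for i, (s, c) in enumerate(zip(STRIDES, CHANNELS))
}
for name, shape in shapes.items():
    print(name, shape)
```

A feature pyramid of this form is what detection necks such as FPN consume, one level per stride.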
Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | Image encoder and detection backbone |
| Model Stats | Params (M): 50.5; Input image size: 224 x 224 |

Training Data
Trained on a diverse dataset of approximately 12M images, including:
- iNaturalist 2021 (~3.3M)
- WebVision-2.0 (~1.5M random subset)
- imagenet-w21-webp-wds (~1M random subset)
- SA-1B (~220K random subset of 20 chunks)
- COCO (~120K)
- NABirds (~48K)
- GLDv2 (~40K random subset of 6 chunks)
- Birdsnap v1.1 (~44K)
- CUB-200-2011 (~18K)
- The Birder dataset (~6M, private dataset)

Papers
- Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles: https://arxiv.org/abs/2306.00989
- Window Attention is Bugged: How not to Interpolate Position Embeddings: https://arxiv.org/abs/2311.05613
Technical Details
The model uses the Hiera architecture with an absolute window position embedding strategy and is pre-trained using Masked Image Modeling (MIM). Because it has not been fine-tuned for a specific classification task, it is best suited for general-purpose feature extraction and as a backbone for downstream tasks.
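As a rough illustration of the MIM objective (a simplified toy sketch, not the actual Hiera pre-training code): a large fraction of patch tokens is masked, the model predicts the masked content, and the reconstruction loss is computed only over the masked positions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 196, 768  # e.g. a 14x14 grid of patch tokens

# Target patch representations and a stand-in for the decoder's predictions
patches = rng.standard_normal((num_patches, dim))
pred = patches + 0.05 * rng.standard_normal((num_patches, dim))

# Mask roughly 60% of the patches (MIM typically uses a high mask ratio)
mask = rng.random(num_patches) < 0.6

# MSE reconstruction loss over masked positions only
loss = np.mean((pred[mask] - patches[mask]) ** 2)
print(loss)
```

The key property this illustrates is that unmasked patches contribute nothing to the loss, which forces the encoder to infer content from context.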
License
This project is licensed under the Apache 2.0 license.
Citation
```bibtex
@misc{ryali2023hierahierarchicalvisiontransformer,
    title={Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles},
    author={Chaitanya Ryali and Yuan-Ting Hu and Daniel Bolya and Chen Wei and Haoqi Fan and Po-Yao Huang and Vaibhav Aggarwal and Arkabandhu Chowdhury and Omid Poursaeed and Judy Hoffman and Jitendra Malik and Yanghao Li and Christoph Feichtenhofer},
    year={2023},
    eprint={2306.00989},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2306.00989},
}

@misc{bolya2023windowattentionbuggedinterpolate,
    title={Window Attention is Bugged: How not to Interpolate Position Embeddings},
    author={Daniel Bolya and Chaitanya Ryali and Judy Hoffman and Christoph Feichtenhofer},
    year={2023},
    eprint={2311.05613},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2311.05613},
}
```