hiera_abswin_base_mim Model Card
This is an image encoder named Hiera that uses an absolute window position embedding strategy and is pre-trained via Masked Image Modeling (MIM). The model has not been fine-tuned for a specific classification task; it is designed as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.
Quick Start
The model can be used for image embeddings and detection feature-map extraction. See the Usage Examples section below for detailed code examples.
Features
- General-Purpose Feature Extractor: Produces image embeddings usable for retrieval, similarity search, and custom classification heads.
- Backbone for Downstream Tasks: Exposes multi-scale feature maps suitable for detection, segmentation, and other dense-prediction tasks.
Installation
The original README does not provide installation steps. The usage examples below assume the `birder` Python package is available (typically installed via pip).
Usage Examples
Basic Usage
Image Embeddings
```python
import birder
from birder.inference.classification import infer_image

# Load the pre-trained encoder in inference mode
(net, model_info) = birder.load_pretrained_model("hiera_abswin_base_mim", inference=True)

# Build the preprocessing transform from the model signature and RGB statistics
size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

image = "path/to/image.jpeg"
(out, embedding) = infer_image(net, image, transform, return_embedding=True)
```
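The returned `embedding` is commonly used for retrieval or similarity search. Below is a minimal sketch of cosine similarity between two embedding vectors, using NumPy on stand-in random arrays (the real embeddings come from `infer_image` above; the 768-dim size here is illustrative):

```python
import numpy as np


def cosine_similarity(a, b):
    # Flatten and compare two vectors by the cosine of their angle
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Dummy embeddings standing in for infer_image outputs
rng = np.random.default_rng(0)
e1 = rng.standard_normal(768)
e2 = e1 + 0.1 * rng.standard_normal(768)  # a slightly perturbed copy

print(cosine_similarity(e1, e2))
```

Values close to 1.0 indicate near-identical images; for large galleries, normalizing embeddings once and using a matrix product is the usual optimization.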
Advanced Usage
Detection Feature Map
```python
from PIL import Image

import birder

(net, model_info) = birder.load_pretrained_model("hiera_abswin_base_mim", inference=True)

size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
# Extract multi-scale feature maps suitable for detection or segmentation heads
features = net.detection_features(transform(image).unsqueeze(0))
print([(k, v.size()) for k, v in features.items()])
```
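As a rough guide to what the printed shapes look like: Hiera is hierarchical, so each stage downsamples the input by an increasing stride. The sketch below computes the expected map shapes for a 224 x 224 input using the standard Hiera-Base stage strides and widths; these stride/channel values are an assumption for illustration, not taken from this card:

```python
# Assumed standard Hiera-Base configuration (not stated in this README)
INPUT = 224
STRIDES = [4, 8, 16, 32]
CHANNELS = [96, 192, 384, 768]

# NCHW shape per stage for a single 224x224 image
shapes = {
    f"stage{i + 1}": (1, c, INPUT // s, INPUT // s)
    for i, (s, c) in enumerate(zip(STRIDES, CHANNELS))
}
for name, shape in shapes.items():
    print(name, shape)
```

A feature pyramid of this form is what detection necks such as FPN consume, one level per stride.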
Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | Image encoder and detection backbone |
| Model Stats | Params (M): 50.5; Input image size: 224 x 224 |

Training Data
Trained on a diverse dataset of approximately 12M images, including:
- iNaturalist 2021 (~3.3M)
- WebVision-2.0 (~1.5M random subset)
- imagenet-w21-webp-wds (~1M random subset)
- SA-1B (~220K random subset of 20 chunks)
- COCO (~120K)
- NABirds (~48K)
- GLDv2 (~40K random subset of 6 chunks)
- Birdsnap v1.1 (~44K)
- CUB-200-2011 (~18K)
- The Birder dataset (~6M, private dataset)

Papers
- Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles: https://arxiv.org/abs/2306.00989
- Window Attention is Bugged: How not to Interpolate Position Embeddings: https://arxiv.org/abs/2311.05613
Technical Details
The model uses the Hiera architecture with an absolute window position embedding strategy and is pre-trained using Masked Image Modeling (MIM). Because it has not been fine-tuned for a specific classification task, it is best suited for general-purpose feature extraction and as a backbone for downstream tasks.
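As a rough illustration of the MIM objective (a simplified toy sketch, not the actual Hiera pre-training code): a large fraction of patch tokens is masked, the model predicts the masked content, and the reconstruction loss is computed only over the masked positions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 196, 768  # e.g. a 14x14 grid of patch tokens

# Target patch representations and a stand-in for the decoder's predictions
patches = rng.standard_normal((num_patches, dim))
pred = patches + 0.05 * rng.standard_normal((num_patches, dim))

# Mask roughly 60% of the patches (MIM typically uses a high mask ratio)
mask = rng.random(num_patches) < 0.6

# MSE reconstruction loss over masked positions only
loss = np.mean((pred[mask] - patches[mask]) ** 2)
print(loss)
```

The key property this illustrates is that unmasked patches contribute nothing to the loss, which forces the encoder to infer content from context.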
License
This project is licensed under the Apache 2.0 license.
Citation
```bibtex
@misc{ryali2023hierahierarchicalvisiontransformer,
    title={Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles},
    author={Chaitanya Ryali and Yuan-Ting Hu and Daniel Bolya and Chen Wei and Haoqi Fan and Po-Yao Huang and Vaibhav Aggarwal and Arkabandhu Chowdhury and Omid Poursaeed and Judy Hoffman and Jitendra Malik and Yanghao Li and Christoph Feichtenhofer},
    year={2023},
    eprint={2306.00989},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2306.00989},
}

@misc{bolya2023windowattentionbuggedinterpolate,
    title={Window Attention is Bugged: How not to Interpolate Position Embeddings},
    author={Daniel Bolya and Chaitanya Ryali and Judy Hoffman and Christoph Feichtenhofer},
    year={2023},
    eprint={2311.05613},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2311.05613},
}
```