# 🚀 Model Card for vit_reg4_b16_mim

A ViT reg4 image encoder pre-trained using Masked Image Modeling (MIM). It serves as a general-purpose feature extractor or backbone for downstream tasks and has not been fine-tuned for a specific classification task.
## 🚀 Quick Start

The `vit_reg4_b16_mim` model is a pre-trained image encoder using Masked Image Modeling (MIM). It can be used as a feature extractor or backbone for various downstream tasks.
## ✨ Features

- General-Purpose: This model has not been fine-tuned for a specific classification task and can be used as a general-purpose feature extractor or backbone for downstream tasks like object detection, segmentation, or custom classification.
- Diverse Training Data: Trained on a diverse dataset of approximately 11M images from multiple sources.
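
The "custom classification" use case above is typically a linear probe trained on frozen embeddings. Below is a minimal sketch of that idea using stand-in tensors instead of real encoder outputs; the 768-dimensional embedding width and the 10-class task are assumptions for illustration (ViT-B models commonly use 768, but check `model_info.signature` for the actual value).

```python
import torch
from torch import nn

EMBED_DIM = 768   # assumed ViT-B embedding width (verify against the model signature)
NUM_CLASSES = 10  # hypothetical downstream task

# Stand-ins for embeddings produced by net.embedding(...) and their labels
embeddings = torch.randn(32, EMBED_DIM)
labels = torch.randint(0, NUM_CLASSES, (32,))

# Linear probe: the encoder stays frozen, only this layer is trained
probe = nn.Linear(EMBED_DIM, NUM_CLASSES)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

logits = probe(embeddings)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

In practice you would replace the random tensors with embeddings extracted from your dataset (see the usage examples below for how embeddings are produced).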
## 📚 Documentation

### Model Details
| Property | Details |
|----------|---------|
| Model Type | Image encoder |
| Params (M) | 85.8 |
| Input image size | 224 x 224 |
| Dataset | Trained on a diverse dataset of approximately 11M images, including iNaturalist 2021 (~3.3M), WebVision-2.0 (~1.5M random subset), imagenet-w21-webp-wds (~1M random subset), SA-1B (~220K random subset of 20 chunks), COCO (~120K), NABirds (~48K), Birdsnap v1.1 (~44K), CUB-200 2011 (~18K), The Birder dataset (~5M, private dataset) |
| Papers | An Image is Worth 16x16 Words (arXiv:2010.11929), Vision Transformers Need Registers (arXiv:2309.16588), Masked Autoencoders Are Scalable Vision Learners (arXiv:2111.06377) |
## 💻 Usage Examples

### Basic Usage
```python
import torch
import birder
from PIL import Image

# Load the pre-trained encoder for inference
(net, model_info) = birder.load_pretrained_model("vit_reg4_b16_mim_300", inference=True)

# Derive the expected input size and preprocessing from the model signature
size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
```
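
Extracted embeddings can be compared directly, for example for image retrieval via cosine similarity. The sketch below uses stand-in random tensors in place of real `net.embedding(...)` outputs, and assumes a 768-dimensional embedding (the usual ViT-B width; check the model signature for the actual value).

```python
import torch
import torch.nn.functional as F

# Stand-ins for embeddings as net.embedding(...) would return them
query = torch.randn(1, 768)    # one query image
gallery = torch.randn(5, 768)  # five candidate images

# L2-normalize so the dot product equals cosine similarity
q = F.normalize(query, dim=-1)
g = F.normalize(gallery, dim=-1)

scores = q @ g.T               # cosine similarities, shape (1, 5)
best = scores.argmax(dim=-1)   # index of the most similar gallery image
```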
### Advanced Usage

```python
import torch
import birder
from PIL import Image

# Load the model from explicit config and weights files
(net, cfg) = birder.load_model_with_cfg("models/vit_reg4_b16_mim.json", "models/vit_reg4_b16_mim_300.pt")
net.eval()

size = birder.get_size_from_signature(cfg["signature"])
transform = birder.classification_transform(size, cfg["rgb_stats"])

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
```
## 📄 License

This model is licensed under the Apache-2.0 license.
## 📖 Citation

```bibtex
@misc{dosovitskiy2021imageworth16x16words,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
    year={2021},
    eprint={2010.11929},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2010.11929},
}

@misc{darcet2024visiontransformersneedregisters,
    title={Vision Transformers Need Registers},
    author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
    year={2024},
    eprint={2309.16588},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2309.16588},
}

@misc{he2021maskedautoencodersscalablevision,
    title={Masked Autoencoders Are Scalable Vision Learners},
    author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
    year={2021},
    eprint={2111.06377},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2111.06377},
}
```