# vit_l16_mim Model Card
A ViT-L16 image encoder pre-trained using Masked Image Modeling (MIM), suitable as a general-purpose feature extractor or backbone for downstream tasks.
## Quick Start
This vit_l16_mim model is a ViT-L16 image encoder pre-trained via Masked Image Modeling (MIM). It has not been fine-tuned for a specific classification task, so it is best used as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.
## Features
- General-purpose use: serves as a feature extractor or backbone for a wide range of downstream tasks.
- Diverse training data: trained on a large, diverse dataset of approximately 11M images.
## Installation

Installation steps are not documented in this card; the `birder` library used in the examples below is assumed to be available in your environment (e.g. via `pip install birder`).
## Usage Examples

### Basic Usage

```python
import torch
from PIL import Image

import birder

# Load the pre-trained encoder in inference mode
(net, model_info) = birder.load_pretrained_model("vit_l16_mim_400", inference=True)

# Build the preprocessing transform from the model's signature and RGB statistics
size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

# Preprocess an image and add a batch dimension
image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)

# Extract the image embedding
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
```
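Embeddings produced this way can be compared directly, e.g. for image retrieval or deduplication via cosine similarity. A minimal sketch, with NumPy arrays standing in for the embedding tensors returned by `net.embedding(...)` (the toy vectors below are illustrative, not real model outputs):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flat embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-ins for embeddings extracted from three images
emb_a = np.array([1.0, 0.0, 1.0])
emb_b = np.array([1.0, 0.0, 1.0])
emb_c = np.array([0.0, 1.0, 0.0])

print(cosine_similarity(emb_a, emb_b))  # identical direction -> 1.0
print(cosine_similarity(emb_a, emb_c))  # orthogonal -> 0.0
```

In practice you would call `embedding.squeeze(0).numpy()` on each model output and rank candidate images by similarity to a query embedding.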
## Documentation

### Model Details

| Property | Details |
| --- | --- |
| Model Type | Image encoder |
| Params (M) | 303.3 |
| Input image size | 224 x 224 |

**Training Data**: trained on a diverse dataset of approximately 11M images, including:

- iNaturalist 2021 (~3.3M)
- WebVision-2.0 (~1.5M random subset)
- imagenet-w21-webp-wds (~1M random subset)
- SA-1B (~220K random subset of 20 chunks)
- COCO (~120K)
- NABirds (~48K)
- Birdsnap v1.1 (~44K)
- CUB-200-2011 (~18K)
- The Birder dataset (~5M, private dataset)

**Papers**:

- [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)
- [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
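The MAE-style pre-training described in the second paper masks a large fraction of image patches and trains the model to reconstruct them, so the encoder only ever sees the visible subset. A rough sketch of just the patch-masking step, assuming this model's 224x224 input, 16x16 patches, and the paper's ~75% mask ratio (illustrative only, not the actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)

image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
mask_ratio = 0.75  # MAE masks roughly 75% of patches

# Randomly choose which patches the encoder actually sees
num_visible = int(num_patches * (1 - mask_ratio))
perm = rng.permutation(num_patches)
visible_idx = perm[:num_visible]
masked_idx = perm[num_visible:]

print(num_patches, num_visible, len(masked_idx))  # 196 49 147
```

During pre-training, only the `visible_idx` patches are encoded; a lightweight decoder then reconstructs the pixels at `masked_idx`. The decoder is discarded after pre-training, leaving the encoder distributed here.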
## License

This project is licensed under the Apache-2.0 license.
## Citation

```bibtex
@misc{dosovitskiy2021imageworth16x16words,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
    year={2021},
    eprint={2010.11929},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2010.11929},
}

@misc{he2021maskedautoencodersscalablevision,
    title={Masked Autoencoders Are Scalable Vision Learners},
    author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
    year={2021},
    eprint={2111.06377},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2111.06377},
}
```