vit_reg4_b16_mim: Free, Open-Source Image Encoder for General-Purpose Feature Extraction and Vision Tasks

Vit Reg4 B16 Mim

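Developed by birder-project

A ViT reg4 image encoder pretrained with masked image modeling (MIM), suited to general-purpose feature extraction and downstream vision tasks.

Image classification

PyTorch

Open-source license: Apache-2.0 #Masked image modeling pretraining #General-purpose visual feature extraction #Bird image recognition

Downloads: 70

Released: 4/25/2025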

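Model Overview

This is a Vision Transformer pretrained with masked image modeling. It has not been fine-tuned for any specific classification task. It can serve as a general-purpose image feature extractor or as the backbone network for downstream vision tasks such as object detection and segmentation.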

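Model Features

Masked image modeling pretraining

Self-supervised pretraining with the MAE (Masked Autoencoder) method, which learns strong visual representations.

Register-augmented architecture

Uses the ViT reg4 architecture, which includes register tokens intended to improve model performance.

Diverse training data

Trained on approximately 11 million diverse images covering a range of visual domains, including natural scenes and birds.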
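
The masked image modeling pretraining listed above can be illustrated with a small sketch of MAE-style random patch masking. This is not the training code used for this model: the function name `random_patch_mask` is hypothetical, the 75% mask ratio is the default reported in the MAE paper (the ratio used for this checkpoint is not stated here), and 196 patches assumes a 224 x 224 input split into 16 x 16 patches.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into (visible, masked) lists.

    In MAE-style pretraining the encoder only sees the visible patches,
    and a decoder is trained to reconstruct the masked ones.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)

    # Keep (1 - mask_ratio) of the patches visible to the encoder.
    num_visible = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_visible])
    masked = sorted(indices[num_visible:])
    return visible, masked


# Assumed setup: 224 x 224 image, 16 x 16 patches -> 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

Because most patches are hidden, the encoder processes only a quarter of the tokens during pretraining under these assumptions, which is the main source of MAE's training efficiency.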

Model Capabilities

Image feature extraction

Visual representation learning

Backbone network for downstream tasks

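Use Cases

Computer vision

Bird identification

Feature extractor for a bird identification system

Object detection

Backbone network for object detection tasks

Image segmentation

Encoder for semantic segmentation tasks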

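🚀 Model Card for vit_reg4_b16_mim

A ViT reg4 image encoder pretrained using masked image modeling (MIM). This model has not been fine-tuned for a specific classification task and is intended to be used as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.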

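🚀 Quick Start

The model can be used as a general-purpose feature extractor or as a backbone for downstream tasks. It has not been fine-tuned for a specific classification task.

✨ Key Features

ViT reg4 image encoder pretrained using masked image modeling (MIM)
Not fine-tuned for a specific classification task
Usable as a general-purpose feature extractor or as a backbone for downstream tasks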

📚 Documentation

Model Details

Attribute	Details
Model type	Image encoder
Model stats	Parameters (M): 85.8 Input image size: 224 x 224
Dataset	Trained on a diverse dataset of approximately 11 million images. - iNaturalist 2021 (~3.3M) - WebVision-2.0 (~1.5M random subset) - imagenet-w21-webp-wds (~1M random subset) - SA-1B (~220K random subset of 20 chunks) - COCO (~120K) - NABirds (~48K) - Birdsnap v1.1 (~44K) - CUB-200 2011 (~18K) - The Birder dataset (~5M, private dataset)
Papers	- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929 - Vision Transformers Need Registers: https://arxiv.org/abs/2309.16588 - Masked Autoencoders Are Scalable Vision Learners: https://arxiv.org/abs/2111.06377
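
As a rough illustration of what the 224 x 224 input size implies for the transformer, the sketch below counts the tokens the encoder would process. It rests on assumptions taken from the model name rather than from the table: "b16" is read as a 16 x 16 patch size, "reg4" as four register tokens, and a single class token is assumed to be present. The helper `vit_sequence_length` is hypothetical and not part of birder.

```python
def vit_sequence_length(image_size, patch_size, num_registers, has_class_token=True):
    """Count the tokens a ViT processes for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2

    # Register tokens (and the optional class token) are extra learned
    # tokens appended to the patch tokens.
    return num_patches + num_registers + (1 if has_class_token else 0)


# Assumed values: 224 x 224 input, 16 x 16 patches, 4 registers, 1 class token.
print(vit_sequence_length(224, 16, 4))  # 196 + 4 + 1 = 201
```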

💻 Usage Examples

Basic usage

import torch
import birder
from PIL import Image

(net, model_info) = birder.load_pretrained_model("vit_reg4_b16_mim_300", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
    # embedding is a tensor with shape of (1, 768)
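
One common way to use such embeddings as general-purpose features is to compare two images by cosine similarity. The sketch below uses only the standard library and short stand-in vectors instead of real 768-dimensional embeddings. The function `cosine_similarity` here is an illustrative helper, not a birder API. A real embedding of shape (1, 768) could be converted to a plain list with `embedding[0].tolist()`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Stand-in vectors; real embeddings from this model would have 768 values.
emb_a = [1.0, 0.0, 2.0]
emb_b = [2.0, 0.0, 4.0]  # same direction as emb_a
emb_c = [0.0, 3.0, 0.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0. When working directly with tensors, `torch.nn.functional.cosine_similarity` performs the same computation.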

Advanced usage

import torch
import birder
from PIL import Image

# Must first download the model files
(net, cfg) = birder.load_model_with_cfg("models/vit_reg4_b16_mim.json", "models/vit_reg4_b16_mim_300.pt")
net.eval()

# Get the image size the model was trained on
size = birder.get_size_from_signature(cfg["signature"])

# Create an inference transform
transform = birder.classification_transform(size, cfg["rgb_stats"])

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
    # embedding is a tensor with shape of (1, embedding_size)
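
Since the model is not fine-tuned for classification, a custom classifier has to be built on top of the embeddings. As one simple possibility, the sketch below shows a nearest-centroid classifier over precomputed embeddings. It is an illustration under stated assumptions, not a method recommended by the model authors: the helper names, the class labels, and the 2-dimensional stand-in vectors are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(embeddings_by_label):
    """Compute one centroid per class from labeled embeddings."""
    return {label: mean_vector(vecs) for label, vecs in embeddings_by_label.items()}


def predict(centroids, embedding):
    """Return the label whose centroid is closest to the embedding."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], embedding))


# Stand-in 2-D "embeddings"; real ones would come from net.embedding(...).
train = {
    "sparrow": [[0.9, 0.1], [1.1, -0.1]],
    "heron": [[-1.0, 1.0], [-0.8, 1.2]],
}
centroids = fit_centroids(train)
print(predict(centroids, [1.0, 0.2]))  # sparrow
```

The encoder stays frozen in this setup, so only the centroids need to be recomputed when classes are added or removed.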

📄 License

This model is released under the Apache 2.0 license.

Citation

@misc{dosovitskiy2021imageworth16x16words,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale}, 
      author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
      year={2021},
      eprint={2010.11929},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2010.11929}, 
}

@misc{darcet2024visiontransformersneedregisters,
      title={Vision Transformers Need Registers}, 
      author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
      year={2024},
      eprint={2309.16588},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2309.16588}, 
}

@misc{he2021maskedautoencodersscalablevision,
      title={Masked Autoencoders Are Scalable Vision Learners}, 
      author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
      year={2021},
      eprint={2111.06377},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2111.06377}, 
}