vit_l16_mim: open-source image encoder, free to use for general feature extraction and downstream tasks

Vit L16 Mim

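Developed by birder-project

A ViT-L16 image encoder pre-trained with masked image modeling (MIM), suitable for general-purpose feature extraction and downstream tasks.

Image classification

PyTorch

Open-source license: Apache-2.0 #GeneralImageFeatureExtraction #MaskedImageModelingPretraining #OptimizedForBirdIdentification

Downloads: 73

Release date: 1/24/2025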

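Model overview

This model is an image encoder based on the Vision Transformer architecture. It was pre-trained with masked image modeling and has not been fine-tuned for any specific classification task. It is suitable as a backbone network for object detection, segmentation, or custom classification tasks.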
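
As a rough illustration of what masked image modeling does during pre-training, the sketch below hides a random subset of patch tokens in the style of the Masked Autoencoders paper cited in this card. The 0.75 mask ratio is the default from that paper and is an assumption here, since this card does not state the ratio or the exact recipe used for this model. `random_patch_mask` is a hypothetical helper, not part of the birder API.

```python
import torch


def random_patch_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly hide a fraction of patch tokens.

    patches: tensor of shape (batch, num_patches, dim)
    Returns the visible patches and a boolean mask (True = hidden).
    """
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # Rank patches by random noise; the lowest-ranked ones stay visible
    noise = torch.rand(batch, num_patches, device=patches.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]

    # Select the visible patches for each sample
    index = ids_keep.unsqueeze(-1).expand(-1, -1, dim)
    visible = torch.gather(patches, 1, index)

    # Start with everything hidden (1), then mark the kept patches as visible (0)
    mask = torch.ones(batch, num_patches, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask.bool()


# A 224 x 224 image split into 16 x 16 patches gives 14 * 14 = 196 tokens
patches = torch.randn(2, 196, 1024)
visible, mask = random_patch_mask(patches)
print(visible.shape, mask.sum(dim=1))
```

Only the visible tokens would be fed to the encoder; a separate decoder (not shown) is then trained to reconstruct the hidden patches.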

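Model features

Masked image modeling pre-training

Pre-trained with a self-supervised masked image modeling method, which lets it learn general-purpose image representations without labels.

Large, diverse dataset

Trained on approximately 11 million diverse images covering multiple domains, including natural scenes and birds.

General-purpose feature extraction

Not fine-tuned for any specific task, so it can be used as a backbone network for a variety of vision tasks.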

Model capabilities

Image feature extraction

Image embedding generation

Visual representation learning

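Use cases

Computer vision

Bird identification

Used as the feature extractor in a bird identification system.

Object detection

Used as the backbone network of an object detection model.

Image segmentation

Used as the encoder part of an image segmentation model.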
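
The use cases above share one pattern: the encoder stays frozen and a small task-specific head is trained on its output. A minimal sketch of that pattern for classification (a linear probe) follows. The embeddings are random stand-ins so the block runs without downloading the model, and `NUM_CLASSES`, the batch size, and the optimizer settings are illustrative assumptions rather than recommended values.

```python
import torch
from torch import nn

torch.manual_seed(0)

EMBED_DIM = 1024  # embedding width reported in this card's usage example
NUM_CLASSES = 5   # hypothetical number of target classes

# Stand-ins for embeddings precomputed with the frozen encoder
embeddings = torch.randn(64, EMBED_DIM)
labels = torch.randint(0, NUM_CLASSES, (64,))

# Only this linear head is trained; the encoder itself is never updated
head = nn.Linear(EMBED_DIM, NUM_CLASSES)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    loss.backward()
    optimizer.step()

with torch.inference_mode():
    predictions = head(embeddings).argmax(dim=1)
print(predictions.shape)
```

In practice the random tensors would be replaced by real embeddings computed once and cached, which keeps training cheap because the 303M-parameter encoder runs only in inference mode.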

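🚀 Model Card for vit_l16_mim

A ViT-L16 image encoder pre-trained using Masked Image Modeling (MIM). This model has not been fine-tuned for a specific classification task and is intended to be used as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.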

✨ Main features

Because the encoder is pre-trained with Masked Image Modeling (MIM) and not fine-tuned for a specific classification task, it can serve as a general-purpose feature extractor or as a backbone for downstream tasks.

📚 Documentation

Model details

Model type: Image encoder

Model statistics:
- Parameters (M): 303.3
- Input image size: 224 x 224

Dataset: Trained on a diverse dataset of approximately 11M images, including:
- iNaturalist 2021 (~3.3M)
- WebVision-2.0 (~1.5M random subset)
- imagenet-w21-webp-wds (~1M random subset)
- SA-1B (~220K random subset of 20 chunks)
- COCO (~120K)
- NABirds (~48K)
- Birdsnap v1.1 (~44K)
- CUB-200 2011 (~18K)
- The Birder dataset (~5M, private dataset)

Papers:
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929
- Masked Autoencoders Are Scalable Vision Learners: https://arxiv.org/abs/2111.06377
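
The reported statistics can be sanity-checked with a short calculation. The sketch below assumes the standard ViT-L/16 hyperparameters from the Vision Transformer paper (24 blocks, width 1024, MLP size 4096) and a class token; this card does not state them, so treat the breakdown as an estimate rather than the model's exact layout.

```python
# Assumed standard ViT-L/16 hyperparameters (not stated in this card)
image_size, patch_size, channels = 224, 16, 3
width, depth, mlp_dim = 1024, 24, 4096

# Tokens produced by splitting the image into non-overlapping patches
num_patches = (image_size // patch_size) ** 2  # 14 * 14

# Patch projection (weights + bias), position embeddings, class token
patch_embed = patch_size * patch_size * channels * width + width
pos_embed = (num_patches + 1) * width  # +1 for the class token (assumed)
cls_token = width

# One transformer block
attention = 4 * (width * width + width)  # Q, K, V and output projections
mlp = (width * mlp_dim + mlp_dim) + (mlp_dim * width + width)
layer_norms = 2 * (2 * width)  # two LayerNorms, each with scale and shift
block = attention + mlp + layer_norms

final_norm = 2 * width
total = patch_embed + pos_embed + cls_token + depth * block + final_norm

print(num_patches, round(total / 1e6, 1))  # prints: 196 303.3
```

Under these assumptions the estimate agrees with the 303.3M parameters listed above, and a 224 x 224 input yields 196 patch tokens.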

💻 Usage examples

Basic usage

import torch
import birder
from PIL import Image

(net, model_info) = birder.load_pretrained_model("vit_l16_mim_400", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
    # embedding is a tensor with shape of (1, 1024)
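
One common way to use such embeddings is to compare images by cosine similarity. The sketch below uses random tensors as stand-ins for two outputs of `net.embedding(...)`, so it runs without the model; whether cosine similarity is the best metric for a given task is an assumption to validate on your own data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for two outputs of net.embedding(...), each of shape (1, 1024)
embedding_a = torch.randn(1, 1024)
embedding_b = torch.randn(1, 1024)

# Cosine similarity along the feature dimension gives one score per pair
similarity = F.cosine_similarity(embedding_a, embedding_b, dim=1)
self_similarity = F.cosine_similarity(embedding_a, embedding_a, dim=1)

print(similarity.item(), self_similarity.item())
```

Scores lie in [-1, 1], and an embedding compared with itself scores 1 up to floating-point error.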

📄 License

This model is provided under the Apache-2.0 license.

📚 Citation

@misc{dosovitskiy2021imageworth16x16words,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
      author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
      year={2021},
      eprint={2010.11929},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2010.11929},
}

@misc{he2021maskedautoencodersscalablevision,
      title={Masked Autoencoders Are Scalable Vision Learners},
      author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
      year={2021},
      eprint={2111.06377},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2111.06377},
}