rope_vit_reg4_b14_capi-imagenet21k: open-source image model, free to use for image classification and detection tasks


Rope Vit Reg4 B14 Capi Imagenet21k

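Developed by birder-project

A ViT image classification model with RoPE, trained with CAPI pretraining followed by ImageNet-21K fine-tuning. It is suited to image classification and detection tasks.

Image classification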

PyTorch

Open-source license: Apache-2.0 #RotaryPositionEmbeddingViT #HighResolutionAdaptation #TwoStageTraining

Downloads: 40

Release date: 5/10/2025

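Model Overview

This is an image classification model based on the Vision Transformer (ViT) architecture that uses rotary position embedding (RoPE). It is trained in two stages (CAPI pretraining, then ImageNet-21K fine-tuning) and supports image classification, feature extraction, and detection tasks.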

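Model Features

Rotary position embedding (RoPE)

Uses EVA-style rotary position embedding, which can be configured for inputs whose resolution differs from the training resolution.

Two-stage training process

The model is first pretrained with CAPI and then fine-tuned on the ImageNet-21K dataset.

Multi-task support

Beyond image classification, the model can be used for feature extraction and object detection.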

Model Capabilities

Image classification

Feature extraction

Object detection

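Use Cases

Computer vision

Bird identification

Classify and identify bird images with this model.

Image feature extraction

Extract image features for downstream tasks such as image retrieval and similarity computation.

Object detection

Use the model as the backbone network for object detection tasks.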

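🚀 rope_vit_reg4_b14_capi-imagenet21k Model Card

A RoPE ViT image classification model. It follows a two-stage training process: CAPI pretraining first, then fine-tuning on the ImageNet-21K dataset.

🚀 Quick Start

This model can be used for image classification or as a detection backbone. The sections below cover basic configuration and usage.

✨ Key Features

An image classification model that implements RoPE (Rotary Position Embedding).
A two-stage training process (CAPI pretraining, then fine-tuning on the ImageNet-21K dataset).
Support for different resolutions by setting the pt_grid_size parameter at inference or fine-tuning time.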

📦 Installation

The source README gives no explicit installation steps, so this section is omitted.

💻 Usage Examples

Basic Usage

Image Classification

import birder
from birder.inference.classification import infer_image

(net, model_info) = birder.load_pretrained_model("rope_vit_reg4_b14_capi-imagenet21k", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = "path/to/image.jpeg"  # or a PIL image, must be loaded in RGB format
(out, _) = infer_image(net, image, transform)
# out is a NumPy array with shape of (1, 19167), representing class probabilities.
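
The snippet above returns class probabilities as a NumPy array of shape (1, 19167). As a minimal sketch of what one might do next, the following picks the highest-scoring class indices. The helper name top_k is hypothetical, and a random stand-in array replaces the real model output so the example runs without birder installed. Mapping indices to class names is left out because the source does not show how labels are stored.

```python
import numpy as np

def top_k(probs, k=5):
    """Return (indices, probabilities) of the k highest-scoring classes, best first."""
    row = np.asarray(probs)[0]          # output is assumed to have shape (1, num_classes)
    idx = np.argsort(row)[::-1][:k]     # sort descending, keep the first k
    return idx, row[idx]

# Stand-in for `out` from infer_image: a random softmax over 19167 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 19167))
out = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

indices, scores = top_k(out, k=5)
print(indices, scores)
```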

Image Embeddings

import birder
from birder.inference.classification import infer_image

(net, model_info) = birder.load_pretrained_model("rope_vit_reg4_b14_capi-imagenet21k", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = "path/to/image.jpeg"  # or a PIL image
(out, embedding) = infer_image(net, image, transform, return_embedding=True)
# embedding is a NumPy array with shape of (1, 768)
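
Embeddings like the (1, 768) array above are typically compared with cosine similarity for retrieval or near-duplicate search. The sketch below is one way to do that with plain NumPy. The function name cosine_similarity is illustrative, and random vectors stand in for real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of shape (1, D) or (D,)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for two embeddings returned with return_embedding=True.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1, 768))
emb_b = rng.normal(size=(1, 768))

sim_same = cosine_similarity(emb_a, emb_a)   # identical vectors
sim_diff = cosine_similarity(emb_a, emb_b)   # unrelated random vectors
print(sim_same, sim_diff)
```

Values close to 1 indicate similar images; what threshold counts as a match depends on the application and is not specified by the source.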

Detection Feature Maps

from PIL import Image
import birder

(net, model_info) = birder.load_pretrained_model("rope_vit_reg4_b14_capi-imagenet21k", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
features = net.detection_features(transform(image).unsqueeze(0))
# features is a dict (stage name -> torch.Tensor)
print([(k, v.size()) for k, v in features.items()])
# Output example:
# [('neck', torch.Size([1, 768, 16, 16]))]

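Advanced Usage

RoPE Configuration

This model implements EVA-style RoPE (Rotary Position Embedding). When working at a resolution other than the training resolution (224x224), you can tune the model's behavior by setting the pt_grid_size parameter.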
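
For readers unfamiliar with rotary embeddings, here is a minimal 1D sketch of the general idea in NumPy. It is a generic illustration, not birder's implementation: the function rope_rotate, the base of 10000, and the half-split pairing of channels are all assumptions, and the EVA-style 2D variant used for image patches differs in detail. The key property shown is that the dot product of a rotated query and key depends only on their relative offset.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply a 1D rotary embedding to x of shape (seq, dim); dim must be even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per channel pair
    angles = positions[:, None] * freqs[None, :]      # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                 # channel i is paired with i + half
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(m, n):
    """Attention score between q placed at position m and k placed at position n."""
    qm = rope_rotate(q, np.array([m]))
    kn = rope_rotate(k, np.array([n]))
    return float(np.sum(qm * kn))

# Same offset (2) at different absolute positions gives the same score.
print(score(3, 1), score(7, 5))
```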

Setting pt_grid_size at inference:

# When running inference with a custom resolution (e.g., 336x336)
python predict.py --network rope_vit_reg4_b14 -t capi-imagenet21k --model-config '{"pt_grid_size":[16, 16]}' --size 336 ...

Converting the model with an explicit RoPE configuration:

python tool.py convert-model --network rope_vit_reg4_b14 -t capi-imagenet21k --add-config '{"pt_grid_size":[16, 16]}'

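📚 Documentation

RoPE Configuration

For inference at a higher resolution, or for "shallow" fine-tuning, it is recommended to set pt_grid_size=(16, 16) explicitly (the default grid size used during pretraining).
For aggressive fine-tuning at a higher resolution, leave pt_grid_size as None so that the model can adapt to the new resolution.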
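
The value (16, 16) follows from the patch arithmetic: a 224x224 input with 14-pixel patches gives a 16x16 grid, which matches the 16x16 feature map shown in the detection example. The sketch below works through that arithmetic for 224 and 336 inputs. The patch size of 14 is inferred from "b14" in the model name, and rescaled_coords is a hypothetical helper showing one common way EVA-style implementations use a reference grid (rescaling new patch coordinates into the pretraining range). Whether birder does exactly this is an assumption.

```python
import numpy as np

PATCH_SIZE = 14  # assumption: inferred from "b14" in the model name

def patch_grid(image_size, patch_size=PATCH_SIZE):
    """Number of patches per side for a square input, as a (height, width) tuple."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return (side, side)

def rescaled_coords(grid_side, pt_grid_side):
    """Hypothetical: map patch indices of a new grid into the pretraining coordinate range."""
    return np.arange(grid_side) * (pt_grid_side / grid_side)

train_grid = patch_grid(224)   # grid at the training resolution
infer_grid = patch_grid(336)   # grid at the higher inference resolution

# With a reference grid of 16, the 24 patch positions stay inside [0, 16).
coords = rescaled_coords(infer_grid[0], train_grid[0])
print(train_grid, infer_grid, coords.min(), coords.max())
```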

Model Details

Attribute	Details
Model type	Image classification and detection backbone
Parameters (M)	100.5
Input image size	224 x 224
Dataset	ImageNet-21K (19167 classes)

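🔧 Technical Details

This is an image classification model that implements RoPE (Rotary Position Embedding). It is trained in two stages: CAPI pretraining first, then fine-tuning on the ImageNet-21K dataset.

Setting the pt_grid_size parameter at inference or fine-tuning time lets the model handle different resolutions. Specifically, for inference at a higher resolution or for "shallow" fine-tuning, it is recommended to set pt_grid_size=(16, 16) explicitly. For aggressive fine-tuning at a higher resolution, leave pt_grid_size as None so that the model can adapt to the new resolution.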

📄 License

This model is provided under the Apache 2.0 license.

Citation

@misc{dosovitskiy2021imageworth16x16words,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale}, 
      author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
      year={2021},
      eprint={2010.11929},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2010.11929}, 
}

@misc{heo2024rotarypositionembeddingvision,
      title={Rotary Position Embedding for Vision Transformer},
      author={Byeongho Heo and Song Park and Dongyoon Han and Sangdoo Yun},
      year={2024},
      eprint={2403.13298},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2403.13298},
}

@misc{darcet2024visiontransformersneedregisters,
      title={Vision Transformers Need Registers}, 
      author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
      year={2024},
      eprint={2309.16588},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2309.16588}, 
}

@misc{darcet2025clusterpredictlatentpatches,
      title={Cluster and Predict Latent Patches for Improved Masked Image Modeling},
      author={Timothée Darcet and Federico Baldassarre and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
      year={2025},
      eprint={2502.08769},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.08769},
}