# 🚀 vit_reg4_b16_mim Model Card
A ViT reg4 image encoder pre-trained with Masked Image Modeling (MIM). The model has not been fine-tuned for a specific classification task and is intended to serve as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.
## 🚀 Quick Start

### Model Details
| Attribute | Details |
|-----------|---------|
| Model Type | Image encoder |
| Model Stats | Params (M): 85.8; Input image size: 224 x 224 |
| Dataset | Trained on a diverse dataset of ~11M images, including: iNaturalist 2021 (~3.3M), WebVision-2.0 (~1.5M random subset), imagenet-w21-webp-wds (~1M random subset), SA-1B (~220K random subset of 20 chunks), COCO (~120K), NABirds (~48K), Birdsnap v1.1 (~44K), CUB-200 2011 (~18K), The Birder dataset (~5M, private) |
| Papers | *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*: https://arxiv.org/abs/2010.11929<br>*Vision Transformers Need Registers*: https://arxiv.org/abs/2309.16588<br>*Masked Autoencoders Are Scalable Vision Learners*: https://arxiv.org/abs/2111.06377 |
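To sanity-check the stats in the table locally, here is a minimal sketch (not part of the original card) that reuses the same `birder` calls as the usage examples below:

```python
import birder

# Load the encoder and verify the stats reported in the table above
(net, model_info) = birder.load_pretrained_model("vit_reg4_b16_mim_300", inference=True)

n_params = sum(p.numel() for p in net.parameters())
print(f"Params (M): {n_params / 1e6:.1f}")  # expected ~85.8

# The input size is encoded in the model signature
print(birder.get_size_from_signature(model_info.signature))  # expected 224 x 224
```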
## Model Usage

### 💻 Usage Examples

#### Basic Usage
```python
import torch
import birder
from PIL import Image

# Load the pre-trained encoder in inference mode
(net, model_info) = birder.load_pretrained_model("vit_reg4_b16_mim_300", inference=True)

# Get the input image size from the model signature
size = birder.get_size_from_signature(model_info.signature)

# Build an inference transform matching the training preprocessing
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
```
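Because the encoder is intended as a feature extractor, the resulting embeddings can be compared directly. The sketch below (an illustration, not part of the original card) scores two images by cosine similarity, assuming `net.embedding` returns a `(1, D)` tensor as in the snippet above; the image paths are placeholders:

```python
import torch
import torch.nn.functional as F
import birder
from PIL import Image

# Same loading steps as the basic-usage snippet above
(net, model_info) = birder.load_pretrained_model("vit_reg4_b16_mim_300", inference=True)
size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

def embed(path: str) -> torch.Tensor:
    input_tensor = transform(Image.open(path)).unsqueeze(dim=0)
    with torch.inference_mode():
        return net.embedding(input_tensor)

# Cosine similarity between the two (1, D) embeddings
similarity = F.cosine_similarity(embed("path/to/image_a.jpeg"), embed("path/to/image_b.jpeg"))
print(similarity.item())
```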
#### Advanced Usage
```python
import torch
import birder
from PIL import Image

# Load a model from local config and weights files
(net, cfg) = birder.load_model_with_cfg("models/vit_reg4_b16_mim.json", "models/vit_reg4_b16_mim_300.pt")
net.eval()

size = birder.get_size_from_signature(cfg["signature"])
transform = birder.classification_transform(size, cfg["rgb_stats"])

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
```
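As noted above, one way to adapt the encoder to a custom classification task is to train a small head on top of the frozen embeddings. Below is a minimal linear-probe sketch; `NUM_CLASSES`, the synthetic `train_loader`, and the embedding-dimension probe are hypothetical illustrations, not part of the `birder` API:

```python
import torch
import torch.nn as nn
import birder

NUM_CLASSES = 10  # hypothetical: set to your task's class count

(net, model_info) = birder.load_pretrained_model("vit_reg4_b16_mim_300", inference=True)

# Probe the embedding dimension with a dummy input (224 x 224 per the model stats)
with torch.inference_mode():
    embed_dim = net.embedding(torch.randn(1, 3, 224, 224)).shape[-1]

head = nn.Linear(embed_dim, NUM_CLASSES)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Synthetic stand-in for a real DataLoader yielding preprocessed (images, labels)
train_loader = [(torch.randn(4, 3, 224, 224), torch.randint(0, NUM_CLASSES, (4,)))]

for images, labels in train_loader:
    with torch.no_grad():  # encoder stays frozen; only the head is trained
        features = net.embedding(images)
    loss = criterion(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```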
## Citation
```bibtex
@misc{dosovitskiy2021imageworth16x16words,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
      author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
      year={2021},
      eprint={2010.11929},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2010.11929},
}

@misc{darcet2024visiontransformersneedregisters,
      title={Vision Transformers Need Registers},
      author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
      year={2024},
      eprint={2309.16588},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2309.16588},
}

@misc{he2021maskedautoencodersscalablevision,
      title={Masked Autoencoders Are Scalable Vision Learners},
      author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
      year={2021},
      eprint={2111.06377},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2111.06377},
}
```
## 📄 License

This project is licensed under the Apache-2.0 License.