vit_l16_mim open-source image encoder - free to use for general feature extraction and downstream tasks

Vit L16 Mim

Developed by birder-project

A ViT-L16 image encoder pre-trained with masked image modeling (MIM), suitable for general feature extraction or downstream tasks

Image Classification

PyTorch

Open Source License: Apache-2.0 #general-image-feature-extraction #masked-image-modeling-pretraining #bird-recognition-optimized

Downloads 73

Release Time: 1/24/2025

Model Overview

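This model is an image encoder based on the Vision Transformer architecture and pre-trained with masked image modeling. It has not been fine-tuned for any specific classification task, which makes it suitable as a backbone for object detection, segmentation, or custom classification tasks.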

Model Features

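Masked image modeling pre-training

Pre-trained with self-supervised masked image modeling, which encourages general-purpose image feature representations

Large, diverse dataset

Trained on roughly 11 million diverse images spanning natural scenes, birds, and other domains

General feature extraction

Not fine-tuned for a specific task, so it can serve as a backbone for a wide range of vision tasks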

Model Capabilities

Image feature extraction

Image embedding generation

Visual representation learning

Use Cases

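Computer vision

Bird recognition

As the feature extractor of a bird recognition system

Object detection

As the backbone of an object detection model

Image segmentation

As the encoder of an image segmentation model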

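🚀 vit_l16_mim Model Card

This is a ViT-L16 image encoder pre-trained with masked image modeling (MIM). The model has not been fine-tuned for a specific classification task; it is intended to be used as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.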

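🚀 Quick Start

The model can be used as a general-purpose feature extractor or as a backbone for downstream tasks; a usage example appears further down this card.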

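✨ Key Features

Pre-trained with masked image modeling (MIM), a self-supervised objective that yields general-purpose image features.
Not fine-tuned for a specific classification task, so it can be adapted flexibly to many downstream tasks.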
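
To make the MIM idea concrete, the following is a minimal sketch of the random patch masking step used in MAE-style pre-training. It rests on assumptions: the 0.75 mask ratio is the default from the cited MAE paper, not a value this card confirms for this checkpoint, and `random_patch_mask` is a hypothetical helper, not part of birder.

```python
import random
from typing import List, Optional


def random_patch_mask(num_patches: int, mask_ratio: float,
                      seed: Optional[int] = None) -> List[bool]:
    """Return one flag per patch; True means the patch is hidden from the encoder."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    # Pick the masked patch indices uniformly at random, without replacement.
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_ids for i in range(num_patches)]


# A 224 x 224 input split into 16 x 16 patches gives 14 * 14 = 196 patches.
# mask_ratio=0.75 is an assumption taken from the MAE paper.
mask = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
visible = [i for i, hidden in enumerate(mask) if not hidden]
print(len(visible))  # 49 visible patches are passed to the encoder
```

During pre-training, only the visible patches are encoded and the model is trained to reconstruct the hidden ones, which is why no labels are required.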

📚 Documentation

Model Details

Property	Details
Model type	Image encoder
Model parameters	Parameter count (M): 303.3; input image size: 224 x 224
Training data	Trained on a diverse dataset of approximately 11M images, including: iNaturalist 2021 (~3.3M), WebVision-2.0 (~1.5M random subset), imagenet-w21-webp-wds (~1M random subset), SA-1B (~220K random subset of 20 chunks), COCO (~120K), NABirds (~48K), Birdsnap v1.1 (~44K), CUB-200 2011 (~18K), The Birder dataset (~5M, private dataset)
Papers	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929; Masked Autoencoders Are Scalable Vision Learners: https://arxiv.org/abs/2111.06377
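
As a sanity check on the figures in the table, the sketch below derives the patch count and an approximate parameter count from first principles. It assumes the standard ViT-Large configuration from the cited ViT paper (24 layers, width 1024, MLP size 4096), a class token, and learned position embeddings. The actual birder implementation may differ in such details, and `vit_param_count` is a hypothetical helper.

```python
def vit_param_count(image_size: int = 224, patch_size: int = 16,
                    width: int = 1024, depth: int = 24,
                    mlp_dim: int = 4096, channels: int = 3):
    """Estimate patch count and parameters for a plain ViT encoder (assumed layout)."""
    num_patches = (image_size // patch_size) ** 2

    # Patch embedding: a linear projection of each flattened patch, plus bias.
    patch_embed = channels * patch_size * patch_size * width + width
    cls_token = width
    pos_embed = (num_patches + 1) * width  # +1 for the class token

    # One transformer block: QKV projection, output projection, MLP, two LayerNorms.
    attention = 3 * (width * width + width) + (width * width + width)
    mlp = (width * mlp_dim + mlp_dim) + (mlp_dim * width + width)
    norms = 2 * (2 * width)
    block = attention + mlp + norms

    final_norm = 2 * width
    total = patch_embed + cls_token + pos_embed + depth * block + final_norm
    return num_patches, total


num_patches, total_params = vit_param_count()
print(num_patches)                   # 196
print(f"{total_params / 1e6:.1f}M")  # 303.3M
```

Under these assumptions the estimate agrees with the 303.3M listed above, and the width of 1024 matches the embedding shape returned in the usage example.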

💻 Usage Examples

Basic Usage

import torch
import birder
from PIL import Image

(net, model_info) = birder.load_pretrained_model("vit_l16_mim_400", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
    # embedding is a tensor with shape of (1, 1024)
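
One common way to use such embeddings without any fine-tuning is nearest-neighbor lookup by cosine similarity. The sketch below is dependency-free and illustrative only: the 4-dimensional toy vectors stand in for the 1024-dimensional embeddings (which could be obtained with `embedding.squeeze(0).tolist()`), and the labels and the helpers `cosine_similarity` and `most_similar` are hypothetical, not part of birder.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def most_similar(query: Sequence[float],
                 gallery: Dict[str, Sequence[float]]) -> str:
    """Return the gallery label whose embedding is closest to the query."""
    return max(gallery, key=lambda label: cosine_similarity(query, gallery[label]))


# Toy vectors standing in for real embeddings; the labels are made up.
gallery = {
    "robin": [1.0, 0.0, 0.2, 0.0],
    "heron": [0.0, 1.0, 0.0, 0.3],
}
query = [0.9, 0.1, 0.1, 0.0]
print(most_similar(query, gallery))  # robin
```

For larger galleries, the same computation is usually done in batch with tensor operations rather than Python loops.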

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

@misc{dosovitskiy2021imageworth16x16words,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
      author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
      year={2021},
      eprint={2010.11929},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2010.11929},
}

@misc{he2021maskedautoencodersscalablevision,
      title={Masked Autoencoders Are Scalable Vision Learners},
      author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
      year={2021},
      eprint={2111.06377},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2111.06377},
}