hiera_abswin_base_mim open-source image model - free to use for image feature extraction and downstream tasks

Hiera Abswin Base Mim

Developed by birder-project

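A Hiera image encoder that uses the absolute window position embedding strategy and is pre-trained with masked image modeling (MIM). It can serve as a general-purpose feature extractor or as a backbone for downstream tasks.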

Image Classification

PyTorch

Open Source License: Apache-2.0 #absolute-position-embedding #multi-task-feature-extraction #bird-recognition-optimized

Downloads 72

Release Time : 3/20/2025

Model Overview

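This model is an image encoder based on the Hiera architecture. It uses the absolute window position embedding strategy and is pre-trained with masked image modeling (MIM). It has not been fine-tuned for any specific classification task; it is intended to be used as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.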
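To make the MIM pre-training idea concrete, here is a minimal sketch of the masking step only: the image is divided into square units, a random subset is hidden, and the encoder sees the rest. The function name `sample_mask_units`, the 32-pixel unit size, and the 0.6 mask ratio are illustrative assumptions, not values taken from this model card or from the birder training code.

```python
import random


def sample_mask_units(image_size=224, unit_size=32, mask_ratio=0.6, seed=0):
    """Randomly split the image's mask units into visible and masked sets.

    unit_size and mask_ratio are assumed values for illustration only.
    """
    units_per_side = image_size // unit_size
    num_units = units_per_side ** 2
    num_masked = round(num_units * mask_ratio)

    # Shuffle the unit indices with a local, seeded generator
    rng = random.Random(seed)
    order = list(range(num_units))
    rng.shuffle(order)

    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


visible, masked = sample_mask_units()
print(len(visible), "visible units,", len(masked), "masked units")
```

During pre-training the model would be asked to reconstruct the content of the masked units from the visible ones; that reconstruction step is not shown here.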

Model Features

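Absolute window position embedding

Uses the absolute window position embedding strategy, which addresses the problem of interpolating position embeddings in models that use window attention.

Hierarchical vision Transformer

Built on the Hiera architecture, which extracts hierarchical visual features efficiently by stripping away non-essential components.

Multi-source training data

Trained on a mixed dataset of about 12 million diverse images, drawn from several public datasets and a private bird dataset.

Multi-task applicability

Can be used as a general-purpose feature extractor or as a backbone for downstream tasks such as detection and segmentation.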
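The absolute window position embedding feature listed above can be pictured with a small sketch. As we understand the approach from the "Window Attention is Bugged" paper, the position embedding is the sum of a window-sized embedding tiled across the token grid and a low-resolution global embedding resized to the grid. The sketch below simplifies this in two ways, both assumptions made for readability: each embedding is a single number rather than a vector, and nearest-neighbour upsampling stands in for smooth interpolation. The function name `abs_win_pos_embed` is hypothetical.

```python
def abs_win_pos_embed(window_embed, global_embed, grid_h, grid_w):
    """Combine a tiled window embedding with an upsampled global embedding.

    window_embed and global_embed are 2-D lists of scalars (simplified stand-ins
    for learned embedding vectors).
    """
    win_h, win_w = len(window_embed), len(window_embed[0])
    glob_h, glob_w = len(global_embed), len(global_embed[0])

    out = []
    for r in range(grid_h):
        row = []
        for c in range(grid_w):
            # Window part: repeats with the window period (tiling)
            tiled = window_embed[r % win_h][c % win_w]
            # Global part: nearest-neighbour lookup in the low-resolution grid
            coarse = global_embed[r * glob_h // grid_h][c * glob_w // grid_w]
            row.append(tiled + coarse)
        out.append(row)
    return out


window_embed = [[0, 1], [2, 3]]        # one 2x2 window
global_embed = [[0, 10], [20, 30]]     # 2x2 low-resolution global grid
for row in abs_win_pos_embed(window_embed, global_embed, 4, 4):
    print(row)
# [0, 1, 10, 11]
# [2, 3, 12, 13]
# [20, 21, 30, 31]
# [22, 23, 32, 33]
```

When the input resolution changes, only the global part needs resizing; the window part is re-tiled, so every token keeps the same offset within its window.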

Model Capabilities

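Image feature extraction

Feature extraction for object detection

Feature extraction for image segmentation

Feature extraction for bird recognition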

Use Cases

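Computer vision

Bird recognition

Classify and identify birds using features extracted by the model

Object detection

Use as a backbone for object detection tasks

Image segmentation

Use as a backbone for image segmentation tasks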

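🚀 hiera_abswin_base_mim Model Card

This is a Hiera image encoder that uses the absolute window position embedding strategy and is pre-trained with masked image modeling (MIM). The model has not been fine-tuned for any specific classification task; it is intended as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.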

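🚀 Quick Start

This model can be used as a general-purpose feature extractor or as a backbone for downstream tasks. See the usage examples in this card.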

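✨ Key Features

An image encoder that uses the absolute window position embedding strategy.
Pre-trained with masked image modeling (MIM).
Not fine-tuned for any specific classification task; suited to general feature extraction and downstream tasks.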

📚 Documentation

Model Details

Property	Details
Model type	Image encoder and detection backbone
Model parameters	Parameters (M): 50.5; input image size: 224 x 224
Training dataset	Trained on a diverse dataset of about 12M images: iNaturalist 2021 (~3.3M), WebVision-2.0 (~1.5M random subset), imagenet-w21-webp-wds (~1M random subset), SA-1B (~220K random subset, 20 chunks), COCO (~120K), NABirds (~48K), GLDv2 (~40K random subset, 6 chunks), Birdsnap v1.1 (~44K), CUB-200 2011 (~18K), and The Birder dataset (~6M, private dataset)
Related papers	Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles; Window Attention is Bugged: How not to Interpolate Position Embeddings
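As a quick sanity check on the "about 12M images" figure, the approximate per-dataset counts from the table can be tallied. The numbers below are copied from the training dataset row (in thousands of images); the variable names are our own.

```python
# Approximate image counts in thousands, as listed in the training dataset row
dataset_sizes_k = {
    "iNaturalist 2021": 3300,
    "WebVision-2.0": 1500,
    "imagenet-w21-webp-wds": 1000,
    "SA-1B": 220,
    "COCO": 120,
    "NABirds": 48,
    "GLDv2": 40,
    "Birdsnap v1.1": 44,
    "CUB-200 2011": 18,
    "The Birder dataset": 6000,
}

total_k = sum(dataset_sizes_k.values())
private_share = dataset_sizes_k["The Birder dataset"] / total_k

print(f"total: about {total_k / 1000:.2f}M images")
print(f"private Birder share: about {private_share:.1%}")
# total: about 12.29M images
# private Birder share: about 48.8%
```

The listed components add up to roughly 12.3M, consistent with the stated total, and nearly half of the images come from the private Birder dataset.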

💻 Usage Examples

Basic Usage

Image Embeddings

import birder
from birder.inference.classification import infer_image

(net, model_info) = birder.load_pretrained_model("hiera_abswin_base_mim", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = "path/to/image.jpeg"  # or a PIL image
(out, embedding) = infer_image(net, image, transform, return_embedding=True)
# embedding is a NumPy array with shape of (1, 768)
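Because the model has no task-specific head, downstream use of the embedding is left to the user. One simple option, sketched here as an assumption rather than a birder API, is to compare a query embedding against one reference embedding per class using cosine similarity. The helper names and the class labels are hypothetical, and short toy vectors stand in for the 768-dimensional embeddings (for real outputs you could pass `embedding[0].tolist()`).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_label(query, references):
    """Return the label whose reference embedding is most similar to the query."""
    return max(references, key=lambda label: cosine_similarity(query, references[label]))


# Toy 3-dimensional vectors standing in for (768,) embeddings
references = {
    "robin": [1.0, 0.0, 0.0],
    "sparrow": [0.0, 1.0, 0.0],
}
query = [0.9, 0.1, 0.0]
predicted = nearest_label(query, references)
print(predicted)  # robin
```

A trained linear classifier on top of the embeddings is a common alternative when labeled data is available.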

Detection Feature Maps

from PIL import Image
import birder

(net, model_info) = birder.load_pretrained_model("hiera_abswin_base_mim", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
features = net.detection_features(transform(image).unsqueeze(0))
# features is a dict (stage name -> torch.Tensor)
print([(k, v.size()) for k, v in features.items()])
# Output example:
# [('stage1', torch.Size([1, 96, 56, 56])),
#  ('stage2', torch.Size([1, 192, 28, 28])),
#  ('stage3', torch.Size([1, 384, 14, 14])),
#  ('stage4', torch.Size([1, 768, 7, 7]))]
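The shapes in the example output follow a regular pattern: the first stage has 96 channels at 1/4 of the input resolution, and each later stage doubles the channels and halves the spatial size. The helper below reproduces that pattern for a 224 x 224 input. It is inferred from the example output only; the function name and its parameters are hypothetical and it does not call the model.

```python
def expected_stage_shapes(input_size=224, base_channels=96, first_stride=4, num_stages=4):
    """Return {stage name: (channels, height, width)} following the pattern
    seen in the example output (channels double, resolution halves)."""
    shapes = {}
    for i in range(num_stages):
        stride = first_stride * 2 ** i
        side = input_size // stride
        shapes[f"stage{i + 1}"] = (base_channels * 2 ** i, side, side)
    return shapes


for name, shape in expected_stage_shapes().items():
    print(name, shape)
# stage1 (96, 56, 56)
# stage2 (192, 28, 28)
# stage3 (384, 14, 14)
# stage4 (768, 7, 7)
```

This kind of check is handy when wiring the feature maps into a detection neck that expects specific channel counts per level.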

📄 License

This model is released under the Apache 2.0 license.

📖 Citation

@misc{ryali2023hierahierarchicalvisiontransformer,
      title={Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles},
      author={Chaitanya Ryali and Yuan-Ting Hu and Daniel Bolya and Chen Wei and Haoqi Fan and Po-Yao Huang and Vaibhav Aggarwal and Arkabandhu Chowdhury and Omid Poursaeed and Judy Hoffman and Jitendra Malik and Yanghao Li and Christoph Feichtenhofer},
      year={2023},
      eprint={2306.00989},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2306.00989},
}

@misc{bolya2023windowattentionbuggedinterpolate,
      title={Window Attention is Bugged: How not to Interpolate Position Embeddings},
      author={Daniel Bolya and Chaitanya Ryali and Judy Hoffman and Christoph Feichtenhofer},
      year={2023},
      eprint={2311.05613},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2311.05613},
}