samvit_large_patch16.sa1b: Open-Source Image Feature Model for Feature Extraction and Fine-Tuning

Samvit Large Patch16.sa1b

Developed by timm

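Segment-Anything Vision Transformer (SAM ViT) image feature model. It supports feature extraction and fine-tuning only and does not include a segmentation head.

Image Segmentation

Transformers

Open Source License: Apache-2.0 #large-image feature extraction #SA-1B pretraining #segmentation task adaptation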

Release Time : 5/18/2023

Model Overview

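The model is a Vision Transformer pretrained on the SA-1B dataset. It is intended mainly for image feature extraction and fine-tuning, and its weights were initialized from MAE pretrained weights.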

Model Features

16x16 patching at high resolution

Splits 1024x1024 input images into 16x16-pixel patches, which gives a 64x64 grid of patch tokens.
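
The token grid implied by these numbers can be checked with a little arithmetic. The sketch below assumes a square input split into non-overlapping square patches; the helper name `patch_grid` is illustrative and not part of timm.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image
    split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid(1024, 16)
print(per_side, total)  # 64 4096
```

The 64 x 64 grid matches the spatial size of the unpooled `(1, 256, 64, 64)` feature map shown in the embedding example.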

MAE pretrained initialization

Weights are initialized from MAE (Masked Autoencoder) pretrained weights.

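Compute profile

A forward pass costs 1493.9 GMACs with 2553.8 million activations at 1024x1024 input, so the model is compute-heavy and suited to high-resolution image processing.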

Model Capabilities

Image feature extraction

Image classification

Image embeddings

Use Cases

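Computer Vision

Image classification

Extract image features with the model, then classify on top of them.

Image retrieval

Retrieve similar images by comparing extracted image embeddings.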
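
As a rough illustration of embedding-based retrieval, the sketch below ranks a small gallery by cosine similarity to a query. The 3-dimensional vectors are toy values standing in for real model embeddings, the function names are hypothetical, and cosine similarity is one common choice of metric rather than something the model card prescribes. Vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, emb) for name, emb in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings; real ones would come from the model's pooled output.
gallery = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
query = [2.0, 0.0, 0.0]  # same direction as img_a, different magnitude

ranking = rank_gallery(query, gallery)
print(ranking)  # ['img_a', 'img_b', 'img_c']
```

Because cosine similarity ignores vector length, the query matches `img_a` exactly even though its magnitude differs.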

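🚀 Model card for samvit_large_patch16.sa1b

This is a Segment-Anything Vision Transformer (SAM ViT) image feature model (note: it is intended for feature extraction and fine-tuning, and the segmentation head is not included). It was pretrained for segmentation on the SA-1B dataset by the paper authors, with weights initialized from MAE.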

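🚀 Quick Start

This is a Transformer-based image feature model that can be used for image classification and feature extraction. The sections below show how to use it for image classification and for obtaining image embeddings.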

✨ Key Features

Model type: image classification / feature backbone
Model statistics:
- Parameters (M): 308.3
- GMACs: 1493.9
- Activations (M): 2553.8
- Image size: 1024 x 1024
Papers:
- Segment Anything: https://arxiv.org/abs/2304.02643
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Original repository: https://github.com/facebookresearch/segment-anything
Pretraining dataset: SA-1B

Property	Details
Model type	Image classification / feature backbone
Pretraining dataset	SA-1B
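
The parameter count above gives a quick lower bound on the memory needed just to hold the weights. The sketch below is a back-of-the-envelope estimate that assumes 4 bytes per parameter for float32 and 2 bytes for float16, with 1 GB taken as 1e9 bytes. It ignores activations, optimizer state, and framework overhead, and the helper name is illustrative.

```python
PARAMS_MILLIONS = 308.3  # parameter count from the model statistics


def weight_memory_gb(params_millions: float, bytes_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params_millions * 1e6 * bytes_per_param / 1e9


fp32_gb = weight_memory_gb(PARAMS_MILLIONS, 4)
fp16_gb = weight_memory_gb(PARAMS_MILLIONS, 2)
print(f"float32: {fp32_gb:.2f} GB, float16: {fp16_gb:.2f} GB")
# float32: 1.23 GB, float16: 0.62 GB
```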

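📦 Installation

The source model card does not include installation steps. If needed, refer to the official installation instructions for the timm library.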

💻 Usage Examples

Basic Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('samvit_large_patch16.sa1b', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'samvit_large_patch16.sa1b',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 256, 64, 64) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
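
To make the shape change concrete, the sketch below pools a random tensor with the same shape as the unpooled feature map. It assumes global average pooling over the two spatial dimensions, which is a common choice and only a stand-in here, since the model card does not state which pooling `forward_head` applies. The random tensor replaces real model output so the snippet runs without downloading weights, and the L2 normalization step is an optional extra that is convenient for similarity search.

```python
import torch

# Stand-in for model.forward_features(...): (batch, channels, height, width)
feature_map = torch.randn(1, 256, 64, 64)

# Global average pooling over the spatial dimensions (an assumption, see above)
embedding = feature_map.mean(dim=(2, 3))

# L2-normalize so dot products between embeddings equal cosine similarity
embedding = torch.nn.functional.normalize(embedding, dim=1)

print(embedding.shape)  # torch.Size([1, 256])
```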

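📚 Documentation

You can explore the dataset and runtime metrics of this model in the timm model results.

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation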

@article{kirillov2023segany,
  title={Segment Anything},
  author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
  journal={arXiv:2304.02643},
  year={2023}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}