vit_l16_mim: Open-Source Image Encoder, Free for General Feature Extraction and Downstream Tasks

Vit L16 Mim

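Developed by birder-project

A ViT-L16 image encoder pretrained with masked image modeling (MIM), suitable for general feature extraction or downstream tasks

Image Classification

PyTorch

License: Apache-2.0 #General image feature extraction #Masked image modeling pretraining #Optimized for bird recognition

Downloads: 73

Released: 1/24/2025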

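Model Overview

This model is an image encoder based on the Vision Transformer architecture and pretrained with masked image modeling. It has not been fine-tuned for any specific classification task, which makes it suitable as a backbone for object detection, segmentation, or custom classification.

Model Features

Masked image modeling pretraining

Pretrained with self-supervised masked image modeling, which requires no labels and is intended to yield general-purpose image representations

Large and diverse dataset

Trained on about 11 million diverse images spanning several domains, including natural scenes and birds

General feature extraction

Not fine-tuned for a specific task, so it can serve as a backbone for a range of vision tasks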

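Model Capabilities

Image feature extraction

Image embedding generation

Visual representation learning

Use Cases

Computer vision

Bird recognition

As the feature extractor of a bird recognition system

Object detection

As the backbone of an object detection model

Image segmentation

As the encoder of an image segmentation model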
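
As a rough illustration of the feature-extractor use case, the sketch below classifies an embedding by its nearest class centroid. It is only one simple option, not a method this page prescribes. The helper names (`mean_vector`, `fit_centroids`, `predict`), the labels, and the toy 3-dimensional vectors are all hypothetical stand-ins; real embeddings from this model have 1024 dimensions.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_centroids(embeddings, labels):
    """Group embeddings by label and return {label: centroid}."""
    by_label = {}
    for vec, label in zip(embeddings, labels):
        by_label.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in by_label.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))

# Toy stand-ins for embeddings (hypothetical values and labels).
train_embeddings = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
train_labels = ["robin", "robin", "sparrow", "sparrow"]

centroids = fit_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.8, 0.2, 0.0]))  # prints "robin"
```

Because the encoder stays frozen in this setup, adding a new class only requires computing one more centroid.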

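🚀 vit_l16_mim Model Card

A ViT-L16 image encoder pretrained with masked image modeling (MIM). The model has not been fine-tuned for a specific classification task and is intended to be used as a general-purpose feature extractor or as a backbone for downstream tasks such as object detection, segmentation, or custom classification.

🚀 Quick Start

This model can be used as a general-purpose feature extractor or as a backbone for downstream tasks; a usage example is given below.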

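✨ Key Features

Pretrained with masked image modeling (MIM), a self-supervised objective that learns image features without labels.
Not fine-tuned for a specific classification task, so it can be adapted to a variety of downstream tasks.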
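
This card does not describe the masking scheme used in pretraining. As a minimal sketch of the general idea, the code below follows the random patch masking described in the MAE paper that the card cites. The function name `random_patch_mask` and its parameters are hypothetical, and `mask_ratio=0.75` is the MAE paper's default, not a confirmed setting for this model.

```python
import random

def random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.75, seed=0):
    """Return (visible, masked) patch indices for one image.

    mask_ratio=0.75 is assumed from the MAE paper; the ratio used to
    train vit_l16_mim is not stated in this card.
    """
    grid = image_size // patch_size        # 224 // 16 = 14 patches per side
    num_patches = grid * grid              # 14 * 14 = 196 patches in total
    num_masked = int(num_patches * mask_ratio)

    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)                     # random permutation of patch indices

    masked = sorted(order[:num_masked])    # patches hidden and later reconstructed
    visible = sorted(order[num_masked:])   # patches the encoder sees (MAE formulation)
    return visible, masked

visible, masked = random_patch_mask()
print(len(visible), len(masked))  # prints "49 147"
```

With a 224 x 224 input and 16 x 16 patches, the image splits into 196 patches, so a 75% ratio leaves 49 visible patches.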

📚 Documentation

Model Details

Property	Details
Model type	Image encoder
Model parameters	Parameter count (M): 303.3; input image size: 224 x 224
Training data	Trained on a diverse dataset of about 11M images, comprising: iNaturalist 2021 (~3.3M); WebVision-2.0 (~1.5M random subset); imagenet-w21-webp-wds (~1M random subset); SA-1B (~220K random subset of 20 chunks); COCO (~120K); NABirds (~48K); Birdsnap v1.1 (~44K); CUB-200 2011 (~18K); The Birder dataset (~5M, private dataset)
Papers	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929; Masked Autoencoders Are Scalable Vision Learners: https://arxiv.org/abs/2111.06377
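
As a quick sanity check on the "about 11M images" figure, the snippet below adds up the approximate per-dataset counts from the table above and prints each dataset's share. The counts are the rounded values from the table, in thousands, so the total is approximate as well.

```python
# Approximate image counts from the training data row, in thousands.
dataset_sizes_k = {
    "iNaturalist 2021": 3300,
    "WebVision-2.0 (random subset)": 1500,
    "imagenet-w21-webp-wds (random subset)": 1000,
    "SA-1B (random subset of 20 chunks)": 220,
    "COCO": 120,
    "NABirds": 48,
    "Birdsnap v1.1": 44,
    "CUB-200 2011": 18,
    "The Birder dataset (private)": 5000,
}

total_k = sum(dataset_sizes_k.values())
print(f"total: ~{total_k / 1000:.2f}M images")  # prints "total: ~11.25M images"

# Share of each dataset, largest first.
for name, size_k in sorted(dataset_sizes_k.items(), key=lambda kv: -kv[1]):
    print(f"{name:40s} {100 * size_k / total_k:5.1f}%")
```

The private Birder dataset and iNaturalist 2021 together account for roughly three quarters of the images.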

💻 Usage Examples

Basic Usage

import torch
import birder
from PIL import Image

(net, model_info) = birder.load_pretrained_model("vit_l16_mim_400", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
input_tensor = transform(image).unsqueeze(dim=0)
with torch.inference_mode():
    embedding = net.embedding(input_tensor)
    # embedding is a tensor with shape of (1, 1024)
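
A common next step with such embeddings is to compare two images by cosine similarity. The sketch below is written in plain Python so it runs without the model. The function name and the toy 4-dimensional vectors are hypothetical stand-ins for the 1024-dimensional embeddings, which could be turned into lists with `embedding[0].tolist()`. On tensors, `torch.nn.functional.cosine_similarity` computes the same quantity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings (hypothetical values).
emb_a = [1.0, 2.0, 0.0, 1.0]
emb_b = [2.0, 4.0, 0.0, 2.0]   # same direction as emb_a, similarity close to 1
emb_c = [0.0, 0.0, 3.0, 0.0]   # orthogonal to emb_a, similarity 0

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

Values near 1 indicate embeddings pointing in nearly the same direction; what threshold counts as "similar" depends on the application and is best chosen on held-out data.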

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

@misc{dosovitskiy2021imageworth16x16words,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
      author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
      year={2021},
      eprint={2010.11929},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2010.11929},
}

@misc{he2021maskedautoencodersscalablevision,
      title={Masked Autoencoders Are Scalable Vision Learners},
      author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
      year={2021},
      eprint={2111.06377},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2111.06377},
}