ConvNeXt-V2 Base FCMAE (convnextv2_base.fcmae)
A self-supervised feature representation model based on ConvNeXt-V2, pre-trained using the Fully Convolutional Masked Autoencoder (FCMAE) framework.
Downloads: 629
Release date: 1/5/2023
Model Overview
This model is an image feature extraction backbone network without a pre-trained head, suitable for fine-tuning or feature extraction tasks. It was pre-trained on the ImageNet-1k dataset using self-supervised learning.
Model Features
Self-supervised pre-training
Utilizes the Fully Convolutional Masked Autoencoder (FCMAE) framework for self-supervised pre-training, eliminating the need for manually annotated data
Efficient feature extraction
Optimized for image feature extraction, capable of outputting multi-scale feature maps
Lightweight design
Relatively small model size (87.7M parameters) and compute cost (15.4 GMACs), suitable for practical deployment; a quick way to verify the size is sketched after this list
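As a sanity check of the reported size, here is a minimal sketch that builds the backbone without downloading weights and counts its parameters (the exact printed value may differ slightly from the rounded 87.7M figure):

import timm

# Build the backbone without weights, just to inspect its size.
model = timm.create_model('convnextv2_base.fcmae', pretrained=False)
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M parameters')  # expected to print roughly 87.7M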
Model Capabilities
Image feature extraction
Image classification
Multi-scale feature map generation
Use Cases
Computer vision
Image classification
Can be used for image classification tasks by fine-tuning the model head for the target classes; a minimal fine-tuning sketch follows this list
Object detection
Serves as a feature extractor for object detection systems, providing high-quality feature representations
Image similarity calculation
Computes similarity between images by comparing extracted image embedding vectors; see the cosine-similarity sketch at the end of the Image Embeddings section below
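Because the checkpoint ships without a classification head, fine-tuning typically starts by recreating the model with a fresh head sized for the target classes. A minimal sketch follows; the 10-class setup, dummy data, and learning rate are illustrative assumptions, not values from this card (pretrained=True downloads the backbone weights):

import timm
import torch

# Recreate the backbone with a new, randomly initialized 10-class head.
model = timm.create_model('convnextv2_base.fcmae', pretrained=True, num_classes=10)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training step on random tensors standing in for real data.
images = torch.randn(8, 3, 224, 224)  # batch of 8 RGB images at 224 x 224
labels = torch.randint(0, 10, (8,))   # random targets, sketch only
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()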
ConvNeXt-V2 Base FCMAE Model
This is a self-supervised feature representation model from the ConvNeXt-V2 family, pretrained with the Fully Convolutional Masked Autoencoder (FCMAE) framework. It has no pretrained head and is intended for fine-tuning or feature extraction.
Quick Start
This ConvNeXt-V2 model can be used for image classification, feature map extraction, and image embeddings. You can follow the usage examples below to get started.
Features
- Self-supervised feature representation.
- Pretrained with the FCMAE framework.
- Suitable for fine-tuning and feature extraction.
Installation
The model can be used with the timm library, which you can install with:
pip install timm
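After installing, you can confirm that this variant is available in your timm release (model name coverage varies across versions):

import timm

# List ConvNeXt-V2 base variants with pretrained weights known to this timm release.
print(timm.list_models('convnextv2_base*', pretrained=True))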
Usage Examples
Basic Usage
Image Classification
from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model('convnextv2_base.fcmae', pretrained=True)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # unsqueeze single image into batch of 1
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
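Continuing from the example above (model, transforms, and img are already defined), the same pipeline handles batches by stacking transformed tensors; the same image is stacked twice here purely to keep the sketch self-contained:

# Stack transformed images into a batch of 2 (same image twice, for illustration).
batch = torch.stack([transforms(img), transforms(img)])  # shape (2, 3, 224, 224)
with torch.no_grad():
    batch_output = model(batch)
print(batch_output.shape)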
Feature Map Extraction
from urllib.request import urlopen
from PIL import Image
import timm
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model(
    'convnextv2_base.fcmae',
    pretrained=True,
    features_only=True,
)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # unsqueeze single image into batch of 1
for o in output:
    # print shape of each feature map in output
    # e.g.:
    # torch.Size([1, 128, 56, 56])
    # torch.Size([1, 256, 28, 28])
    # torch.Size([1, 512, 14, 14])
    # torch.Size([1, 1024, 7, 7])
    print(o.shape)
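The features_only wrapper also exposes per-stage metadata through feature_info, and out_indices restricts which stages are returned (standard timm behavior, shown here as a short sketch continuing from the model above):

# Channel counts and downsampling factors of the returned stages.
print(model.feature_info.channels())   # [128, 256, 512, 1024]
print(model.feature_info.reduction())  # [4, 8, 16, 32]

# Keep only the last two stages if that is all a downstream head needs.
model_last2 = timm.create_model(
    'convnextv2_base.fcmae',
    pretrained=True,
    features_only=True,
    out_indices=(2, 3),
)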
Image Embeddings
from urllib.request import urlopen
from PIL import Image
import timm
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model(
    'convnextv2_base.fcmae',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # output is (batch_size, num_features) shaped tensor
# or equivalently (without needing to set num_classes=0)
output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1024, 7, 7) shaped tensor
output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
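With pooled embeddings available, the image-similarity use case reduces to comparing vectors. A minimal cosine-similarity sketch, continuing from the num_classes=0 model above (a second image would normally replace img2; the same image is reused here so the sketch runs as-is):

import torch
import torch.nn.functional as F

# Cosine similarity between pooled embeddings of two images.
img2 = img  # stand-in for a second PIL image
with torch.no_grad():
    emb1 = model(transforms(img).unsqueeze(0))
    emb2 = model(transforms(img2).unsqueeze(0))
similarity = F.cosine_similarity(emb1, emb2)  # shape (1,)
print(similarity.item())  # 1.0 when comparing an image with itself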
Documentation
Model Details
Property | Details |
---|---|
Model Type | Image classification / feature backbone |
Params (M) | 87.7 |
GMACs | 15.4 |
Activations (M) | 28.8 |
Image size | 224 x 224 |
Papers | ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders |
Original | https://github.com/facebookresearch/ConvNeXt-V2 |
Pretrain Dataset | ImageNet-1k |
Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
All timing numbers are from eager-mode PyTorch 1.13 on an RTX 3090 with AMP.
model | top1 | top5 | img_size | param_count | gmacs | macts | samples_per_sec | batch_size |
---|---|---|---|---|---|---|---|---|
convnextv2_huge.fcmae_ft_in22k_in1k_512 | 88.848 | 98.742 | 512 | 660.29 | 600.81 | 413.07 | 28.58 | 48 |
convnextv2_huge.fcmae_ft_in22k_in1k_384 | 88.668 | 98.738 | 384 | 660.29 | 337.96 | 232.35 | 50.56 | 64 |
convnext_xxlarge.clip_laion2b_soup_ft_in1k | 88.612 | 98.704 | 256 | 846.47 | 198.09 | 124.45 | 122.45 | 256 |
convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_384 | 88.312 | 98.578 | 384 | 200.13 | 101.11 | 126.74 | 196.84 | 256 |
convnextv2_large.fcmae_ft_in22k_in1k_384 | 88.196 | 98.532 | 384 | 197.96 | 101.1 | 126.74 | 128.94 | 128 |
convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_320 | 87.968 | 98.47 | 320 | 200.13 | 70.21 | 88.02 | 283.42 | 256 |
convnext_xlarge.fb_in22k_ft_in1k_384 | 87.75 | 98.556 | 384 | 350.2 | 179.2 | 168.99 | 124.85 | 192 |
convnextv2_base.fcmae_ft_in22k_in1k_384 | 87.646 | 98.422 | 384 | 88.72 | 45.21 | 84.49 | 209.51 | 256 |
convnext_large.fb_in22k_ft_in1k_384 | 87.476 | 98.382 | 384 | 197.77 | 101.1 | 126.74 | 194.66 | 256 |
convnext_large_mlp.clip_laion2b_augreg_ft_in1k | 87.344 | 98.218 | 256 | 200.13 | 44.94 | 56.33 | 438.08 | 256 |
convnextv2_large.fcmae_ft_in22k_in1k | 87.26 | 98.248 | 224 | 197.96 | 34.4 | 43.13 | 376.84 | 256 |
convnext_base.clip_laion2b_augreg_ft_in12k_in1k_384 | 87.138 | 98.212 | 384 | 88.59 | 45.21 | 84.49 | 365.47 | 256 |
convnext_xlarge.fb_in22k_ft_in1k | 87.002 | 98.208 | 224 | 350.2 | 60.98 | 57.5 | 368.01 | 256 |
convnext_base.fb_in22k_ft_in1k_384 | 86.796 | 98.264 | 384 | 88.59 | 45.21 | 84.49 | 366.54 | 256 |
convnextv2_base.fcmae_ft_in22k_in1k | 86.74 | 98.022 | 224 | 88.72 | 15.38 | 28.75 | 624.23 | 256 |
convnext_large.fb_in22k_ft_in1k | 86.636 | 98.028 | 224 | 197.77 | 34.4 | 43.13 | 581.43 | 256 |
convnext_base.clip_laiona_augreg_ft_in1k_384 | 86.504 | 97.97 | 384 | 88.59 | 45.21 | 84.49 | 368.14 | 256 |
convnext_base.clip_laion2b_augreg_ft_in12k_in1k | 86.344 | 97.97 | 256 | 88.59 | 20.09 | 37.55 | 816.14 | 256 |
convnextv2_huge.fcmae_ft_in1k | 86.256 | 97.75 | 224 | 660.29 | 115.0 | 79.07 | 154.72 | 256 |
convnext_small.in12k_ft_in1k_384 | 86.182 | 97.92 | 384 | 50.22 | 25.58 | 63.37 | 516.19 | 256 |
convnext_base.clip_laion2b_augreg_ft_in1k | 86.154 | 97.68 | 256 | 88.59 | 20.09 | 37.55 | 819.86 | 256 |
convnext_base.fb_in22k_ft_in1k | 85.822 | 97.866 | 224 | 88.59 | 15.38 | 28.75 | 1037.66 | 256 |
convnext_small.fb_in22k_ft_in1k_384 | 85.778 | 97.886 | 384 | 50.22 | 25.58 | 63.37 | 518.95 | 256 |
convnextv2_large.fcmae_ft_in1k | 85.742 | 97.584 | 224 | 197.96 | 34.4 | 43.13 | 375.23 | 256 |
convnext_small.in12k_ft_in1k | 85.174 | 97.506 | 224 | 50.22 | 8.71 | 21.56 | 1474.31 | 256 |
convnext_tiny.in12k_ft_in1k_384 | 85.118 | 97.608 | 384 | 28.59 | 13.14 | 39.48 | 856.76 | 256 |
convnextv2_tiny.fcmae_ft_in22k_in1k_384 | 85.112 | 97.63 | 384 | 28.64 | 13.14 | 39.48 | 491.32 | 256 |
convnextv2_base.fcmae_ft_in1k | 84.874 | 97.09 | 224 | 88.72 | 15.38 | 28.75 | 625.33 | 256 |
convnext_small.fb_in22k_ft_in1k | 84.562 | 97.394 | 224 | 50.22 | 8.71 | 21.56 | 1478.29 | 256 |
convnext_large.fb_in1k | 84.282 | 96.892 | 224 | 197.77 | 34.4 | 43.13 | 584.28 | 256 |
convnext_tiny.in12k_ft_in1k | 84.186 | 97.124 | 224 | 28.59 | 4.47 | 13.44 | 2433.7 | 256 |
convnext_tiny.fb_in22k_ft_in1k_384 | 84.084 | 97.14 | 384 | 28.59 | 13.14 | 39.48 | 862.95 | 256 |
convnextv2_tiny.fcmae_ft_in22k_in1k | 83.894 | 96.964 | 224 | 28.64 | 4.47 | 13.44 | 1452.72 | 256 |
convnext_base.fb_in1k | 83.82 | 96.746 | 224 | 88.59 | 15.38 | 28.75 | 1054.0 | 256 |
convnextv2_nano.fcmae_ft_in22k_in1k_384 | 83.37 | 96.742 | 384 | 15.62 | 7.22 | 24.61 | 801.72 | 256 |
convnext_small.fb_in1k | 83.142 | 96.434 | 224 | 50.22 | 8.71 | 21.56 | 1464.0 | 256 |
convnextv2_tiny.fcmae_ft_in1k | 82.92 | 96.284 | 224 | 28.64 | 4.47 | 13.44 | 1425.62 | 256 |
convnext_tiny.fb_in22k_ft_in1k | 82.898 | 96.616 | 224 | 28.59 | 4.47 | 13.44 | 2480.88 | 256 |
convnext_nano.in12k_ft_in1k | 82.282 | 96.344 | 224 | 15.59 | 2.46 | 8.37 | 3926.52 | 256 |
convnext_tiny_hnf.a2h_in1k | 82.216 | 95.852 | 224 | 28.59 | 4.47 | 13.44 | 2529.75 | 256 |
convnext_tiny.fb_in1k | 82.066 | 95.854 | 224 | 28.59 | 4.47 | 13.44 | 2346.26 | 256 |
convnextv2_nano.fcmae_ft_in22k_in1k | 82.03 | 96.166 | 224 | 15.62 | 2.46 | 8.37 | 2300.18 | 256 |
convnextv2_nano.fcmae_ft_in1k | 81.83 | 95.738 | 224 | 15.62 | 2.46 | 8.37 | 2321.48 | 256 |
convnext_nano_ols.d1h_in1k | 81.766 | 95.69 | 224 | 15.62 | 2.46 | 8.37 | 2321.48 | 256 |
convnextv2_base.fcmae | 81.63 | 95.594 | 224 | 88.72 | 15.38 | 28.75 | 625.33 | 256 |
convnext_small.fb_in1k | 81.562 | 95.494 | 224 | 50.22 | 8.71 | 21.56 | 1464.0 | 256 |
convnext_tiny.fb_in1k | 81.498 | 95.494 | 224 | 28.59 | 4.47 | 13.44 | 2346.26 | 256 |
convnextv2_tiny.fcmae_ft_in1k | 81.42 | 95.344 | 224 | 28.64 | 4.47 | 13.44 | 1425.62 | 256 |
convnext_nano.in12k_ft_in1k | 81.344 | 95.344 | 224 | 15.59 | 2.46 | 8.37 | 3926.52 | 256 |
convnext_tiny_hnf.a2h_in1k | 81.216 | 95.252 | 224 | 28.59 | 4.47 | 13.44 | 2529.75 | 256 |
convnextv2_nano.fcmae_ft_in22k_in1k | 81.13 | 95.252 | 224 | 15.62 | 2.46 | 8.37 | 2300.18 | 256 |
convnextv2_nano.fcmae_ft_in1k | 81.03 | 95.138 | 224 | 15.62 | 2.46 | 8.37 | 2321.48 | 256 |
License
This model is licensed under the CC BY-NC 4.0 license.