ConvNeXt-V2 Large (FCMAE)
A self-supervised ConvNeXt-V2 feature representation model, pre-trained with the Fully Convolutional Masked Autoencoder (FCMAE) framework and suitable for image classification and feature extraction tasks.
Downloads: 314
Release date: 1/5/2023
Model Overview
This model is a self-supervised, pre-trained convolutional neural network intended mainly for image feature extraction and fine-tuning; it does not include a pre-trained classification head.
Model Features
Self-supervised pre-training
Utilizes the Fully Convolutional Masked Autoencoder (FCMAE) framework for pre-training, eliminating the need for large amounts of labeled data.
Efficient feature extraction
Capable of extracting multi-scale feature maps, suitable for various downstream computer vision tasks.
Large-scale parameters
Boasts 196.4 million parameters, providing powerful feature representation capabilities.
Model Capabilities
Image feature extraction
Image classification
Generating image embeddings
Use Cases
Computer Vision
Image classification
Classify images and identify the main objects within them.
Performs well on the ImageNet-1k dataset.
Feature extraction
Extract multi-level feature representations from images for downstream tasks.
Can output feature maps at different scales.
🚀 convnextv2_large.fcmae Model Card
A ConvNeXt-V2 self-supervised feature representation model, pretrained with a fully convolutional masked autoencoder framework (FCMAE). This model has no pretrained head and is only useful for fine-tuning or feature extraction.
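For background, FCMAE pre-training masks a large fraction of each image and trains the network to reconstruct the missing content, so no labels are needed. The snippet below is a minimal conceptual sketch of that idea in plain PyTorch, not the actual implementation: the real FCMAE encoder processes only the visible patches with sparse convolutions and reconstructs via a lightweight ConvNeXt-block decoder. The 32 x 32 patch size and 0.6 masking ratio follow the paper, while the encoder/decoder here are left as hypothetical placeholders.
import torch
import torch.nn.functional as F

def random_patch_mask(x, patch=32, mask_ratio=0.6):
    # Returns a (B, 1, H, W) mask where 1 marks masked pixels.
    B, _, H, W = x.shape
    gh, gw = H // patch, W // patch
    num_patches = gh * gw
    num_masked = int(num_patches * mask_ratio)
    idx = torch.rand(B, num_patches).argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(B, num_patches).scatter_(1, idx, 1.0)
    return F.interpolate(mask.view(B, 1, gh, gw), scale_factor=patch, mode='nearest')

x = torch.randn(2, 3, 224, 224)        # dummy image batch
mask = random_patch_mask(x)
x_visible = x * (1 - mask)             # zero out the masked patches
# recon = decoder(encoder(x_visible))                     # hypothetical modules
# loss = ((recon - x) ** 2 * mask).sum() / mask.sum()     # loss on masked pixels only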
📚 Documentation
Model Details
Property | Details |
---|---|
Model Type | Image classification / feature backbone |
Params (M) | 196.4 |
GMACs | 34.4 |
Activations (M) | 43.1 |
Image size | 224 x 224 |
Papers | ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders: https://arxiv.org/abs/2301.00808 |
Original | https://github.com/facebookresearch/ConvNeXt-V2 |
Pretrain Dataset | ImageNet-1k |
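Since this checkpoint ships without a classification head, typical use is to fine-tune it on a labelled dataset or to use it as a frozen feature extractor. Below is a minimal fine-tuning sketch using timm; the number of classes, optimizer settings, and dummy batch are placeholders, not recommendations.
import timm
import torch

# Passing num_classes attaches a freshly initialized classifier head;
# the FCMAE-pretrained backbone weights are loaded as usual.
model = timm.create_model('convnextv2_large.fcmae', pretrained=True, num_classes=10)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

# Dummy batch for illustration; in practice, build a DataLoader whose transforms
# come from timm.data.resolve_model_data_config(model) as in the examples below.
images = torch.randn(2, 3, 224, 224)
labels = torch.randint(0, 10, (2,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()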
💻 Usage Examples
Basic Usage
Image Classification
from urllib.request import urlopen
from PIL import Image
import timm
import torch
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model('convnextv2_large.fcmae', pretrained=True)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # unsqueeze single image into batch of 1
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
Feature Map Extraction
from urllib.request import urlopen
from PIL import Image
import timm
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model(
'convnextv2_large.fcmae',
pretrained=True,
features_only=True,
)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # unsqueeze single image into batch of 1
for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 192, 56, 56])
    #  torch.Size([1, 384, 28, 28])
    #  torch.Size([1, 768, 14, 14])
    #  torch.Size([1, 1536, 7, 7])
    print(o.shape)
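If only a subset of the four stages is needed, timm's features_only mode also accepts an out_indices argument. A short sketch (indices 2 and 3 correspond to the last two shapes printed above):
import timm
import torch

model = timm.create_model(
    'convnextv2_large.fcmae',
    pretrained=True,
    features_only=True,
    out_indices=(2, 3),   # keep only the stride-16 and stride-32 feature maps
)
model = model.eval()

with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))   # dummy input at 224 x 224
for f in feats:
    print(f.shape)   # expected: (1, 768, 14, 14) and (1, 1536, 7, 7)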
Image Embeddings
from urllib.request import urlopen
from PIL import Image
import timm
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model(
'convnextv2_large.fcmae',
pretrained=True,
num_classes=0, # remove classifier nn.Linear
)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # output is (batch_size, num_features) shaped tensor
# or equivalently (without needing to set num_classes=0)
output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1536, 7, 7) shaped tensor
output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
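A common downstream use of these pooled embeddings is image-to-image similarity. The sketch below reuses model (created with num_classes=0) and transforms from the block above and compares two embeddings with cosine similarity; the same image is embedded twice purely for illustration.
import torch
import torch.nn.functional as F

def embed(image):
    with torch.no_grad():
        return model(transforms(image).unsqueeze(0))   # (1, 1536) pooled embedding

emb_a = embed(img)
emb_b = embed(img)                                  # in practice, a second, different image
print(F.cosine_similarity(emb_a, emb_b).item())     # 1.0 for identical images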
Model Comparison
Explore the dataset and runtime metrics of this model in the timm [model results](https://github.com/huggingface/pytorch-image-models/tree/main/results).
All timing numbers from eager model PyTorch 1.13 on RTX 3090 w/ AMP.
model | top1 | top5 | img_size | param_count | gmacs | macts | samples_per_sec | batch_size |
---|---|---|---|---|---|---|---|---|
convnextv2_huge.fcmae_ft_in22k_in1k_512 | 88.848 | 98.742 | 512 | 660.29 | 600.81 | 413.07 | 28.58 | 48 |
convnextv2_huge.fcmae_ft_in22k_in1k_384 | 88.668 | 98.738 | 384 | 660.29 | 337.96 | 232.35 | 50.56 | 64 |
convnext_xxlarge.clip_laion2b_soup_ft_in1k | 88.612 | 98.704 | 256 | 846.47 | 198.09 | 124.45 | 122.45 | 256 |
convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_384 | 88.312 | 98.578 | 384 | 200.13 | 101.11 | 126.74 | 196.84 | 256 |
convnextv2_large.fcmae_ft_in22k_in1k_384 | 88.196 | 98.532 | 384 | 197.96 | 101.1 | 126.74 | 128.94 | 128 |
convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_320 | 87.968 | 98.47 | 320 | 200.13 | 70.21 | 88.02 | 283.42 | 256 |
convnext_xlarge.fb_in22k_ft_in1k_384 | 87.75 | 98.556 | 384 | 350.2 | 179.2 | 168.99 | 124.85 | 192 |
convnextv2_base.fcmae_ft_in22k_in1k_384 | 87.646 | 98.422 | 384 | 88.72 | 45.21 | 84.49 | 209.51 | 256 |
convnext_large.fb_in22k_ft_in1k_384 | 87.476 | 98.382 | 384 | 197.77 | 101.1 | 126.74 | 194.66 | 256 |
convnext_large_mlp.clip_laion2b_augreg_ft_in1k | 87.344 | 98.218 | 256 | 200.13 | 44.94 | 56.33 | 438.08 | 256 |
convnextv2_large.fcmae_ft_in22k_in1k | 87.26 | 98.248 | 224 | 197.96 | 34.4 | 43.13 | 376.84 | 256 |
convnext_base.clip_laion2b_augreg_ft_in12k_in1k_384 | 87.138 | 98.212 | 384 | 88.59 | 45.21 | 84.49 | 365.47 | 256 |
convnext_xlarge.fb_in22k_ft_in1k | 87.002 | 98.208 | 224 | 350.2 | 60.98 | 57.5 | 368.01 | 256 |
convnext_base.fb_in22k_ft_in1k_384 | 86.796 | 98.264 | 384 | 88.59 | 45.21 | 84.49 | 366.54 | 256 |
convnextv2_base.fcmae_ft_in22k_in1k | 86.74 | 98.022 | 224 | 88.72 | 15.38 | 28.75 | 624.23 | 256 |
convnext_large.fb_in22k_ft_in1k | 86.636 | 98.028 | 224 | 197.77 | 34.4 | 43.13 | 581.43 | 256 |
convnext_base.clip_laiona_augreg_ft_in1k_384 | 86.504 | 97.97 | 384 | 88.59 | 45.21 | 84.49 | 368.14 | 256 |
convnext_base.clip_laion2b_augreg_ft_in12k_in1k | 86.344 | 97.97 | 256 | 88.59 | 20.09 | 37.55 | 816.14 | 256 |
convnextv2_huge.fcmae_ft_in1k | 86.256 | 97.75 | 224 | 660.29 | 115.0 | 79.07 | 154.72 | 256 |
convnext_small.in12k_ft_in1k_384 | 86.182 | 97.92 | 384 | 50.22 | 25.58 | 63.37 | 516.19 | 256 |
convnext_base.clip_laion2b_augreg_ft_in1k | 86.154 | 97.68 | 256 | 88.59 | 20.09 | 37.55 | 819.86 | 256 |
convnext_base.fb_in22k_ft_in1k | 85.822 | 97.866 | 224 | 88.59 | 15.38 | 28.75 | 1037.66 | 256 |
convnext_small.fb_in22k_ft_in1k_384 | 85.778 | 97.886 | 384 | 50.22 | 25.58 | 63.37 | 518.95 | 256 |
convnextv2_large.fcmae_ft_in1k | 85.742 | 97.584 | 224 | 197.96 | 34.4 | 43.13 | 375.23 | 256 |
convnext_small.in12k_ft_in1k | 85.174 | 97.506 | 224 | 50.22 | 8.71 | 21.56 | 1474.31 | 256 |
convnext_tiny.in12k_ft_in1k_384 | 85.118 | 97.608 | 384 | 28.59 | 13.14 | 39.48 | 856.76 | 256 |
convnextv2_tiny.fcmae_ft_in22k_in1k_384 | 85.112 | 97.63 | 384 | 28.64 | 13.14 | 39.48 | 491.32 | 256 |
convnextv2_base.fcmae_ft_in1k | 84.874 | 97.09 | 224 | 88.72 | 15.38 | 28.75 | 625.33 | 256 |
convnext_small.fb_in22k_ft_in1k | 84.562 | 97.394 | 224 | 50.22 | 8.71 | 21.56 | 1478.29 | 256 |
convnext_large.fb_in1k | 84.282 | 96.892 | 224 | 197.77 | 34.4 | 43.13 | 584.28 | 256 |
convnext_tiny.in12k_ft_in1k | 84.186 | 97.124 | 224 | 28.59 | 4.47 | 13.44 | 2433.7 | 256 |
convnext_tiny.fb_in22k_ft_in1k_384 | 84.084 | 97.14 | 384 | 28.59 | 13.14 | 39.48 | 862.95 | 256 |
convnextv2_tiny.fcmae_ft_in22k_in1k | 83.894 | 96.964 | 224 | 28.64 | 4.47 | 13.44 | 1452.72 | 256 |
convnext_base.fb_in1k | 83.82 | 96.746 | 224 | 88.59 | 15.38 | 28.75 | 1054.0 | 256 |
convnextv2_nano.fcmae_ft_in22k_in1k_384 | 83.37 | 96.742 | 384 | 15.62 | 7.22 | 24.61 | 801.72 | 256 |
convnext_small.fb_in1k | 83.142 | 96.434 | 224 | 50.22 | 8.71 | 21.56 | 1464.0 | 256 |
convnextv2_tiny.fcmae_ft_in1k | 82.92 | 96.284 | 224 | 28.64 | 4.47 | 13.44 | 1425.62 | 256 |
convnext_tiny.fb_in22k_ft_in1k | 82.898 | 96.616 | 224 | 28.59 | 4.47 | 13.44 | 2480.88 | 256 |
convnext_nano.in12k_ft_in1k | 82.282 | 96.344 | 224 | 15.59 | 2.46 | 8.37 | 3926.52 | 256 |
convnext_tiny_hnf.a2h_in1k | 82.216 | 95.852 | 224 | 28.59 | 4.47 | 13.44 | 2529.75 | 256 |
convnext_tiny.fb_in1k | 82.066 | 95.854 | 224 | 28.59 | 4.47 | 13.44 | 2346.26 | 256 |
convnextv2_nano.fcmae_ft_in22k_in1k | 82.03 | 96.166 | 224 | 15.62 | 2.46 | 8.37 | 2300.18 | 256 |
convnextv2_nano.fcmae_ft_in1k | 81.83 | 95.738 | 224 | 15.62 | 2.46 | 8.37 | 2321.48 | 256 |
convnext_nano_ols.d1h_in1k | ... | ... | ... | ... | ... | ... | ... | ... |
📄 License
This model is licensed under the CC BY-NC 4.0 license.