EVA02 Base Vision Model: Open Source, Free for Image Classification and Feature Extraction Tasks

eva02_base_patch14_224.mim_in22k

Developed by timm

Base-size EVA02 visual representation model, pretrained on ImageNet-22k with masked image modeling. Suitable for feature extraction and, after fine-tuning, image classification.

Image Classification

Transformers

Open Source License: MIT #Image Feature Extraction #Masked Image Modeling #High-Precision Classification

Downloads 2,834

Release Time: 3/31/2023

Model Overview

This model adopts an improved Vision Transformer architecture, incorporating techniques such as mean pooling, SwiGLU activation function, and rotary position embeddings, specifically designed for efficient image feature extraction.
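Mean pooling here means averaging the token features into a single vector. The following is a minimal sketch in plain Python, assuming one leading class token that is left out of the average. The `mean_pool` helper and the toy values are illustrative and are not timm's code.

```python
def mean_pool(tokens, num_prefix_tokens=1):
    """Average patch tokens into one feature vector, skipping prefix tokens."""
    patch_tokens = tokens[num_prefix_tokens:]
    dim = len(patch_tokens[0])
    return [sum(tok[d] for tok in patch_tokens) / len(patch_tokens) for d in range(dim)]


# Toy input: 1 class token followed by 3 patch tokens, each of width 2.
tokens = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(tokens)
print(pooled)  # [3.0, 4.0]
```

In the real model the same idea would apply to 768-wide tokens rather than this 2-wide toy.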

Model Features

Improved Transformer Architecture

Uses rotary position embeddings (ROPE) and the SwiGLU activation function to improve positional awareness and nonlinear expressiveness
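To illustrate why rotary embeddings help with positional awareness, here is a small sketch of the general technique in plain Python. It is a 1D simplification with an assumed base of 10000. The model works on a 2D patch grid, so its actual variant will differ. The `rope` and `dot` helpers and the vectors are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]  # made-up query vector
k = [1.5, 0.5, -0.75, 1.0]  # made-up key vector

# The same offset between positions gives the same attention score,
# regardless of the absolute positions involved.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 12), rope(k, 9))
print(abs(score_a - score_b) < 1e-9)
```

The point of the example is that the query-key score depends only on the relative offset between the two positions.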

Efficient Pre-training Strategy

Uses EVA-CLIP as the MIM (Masked Image Modeling) teacher model for knowledge distillation

Unpooled Token Features

Returns unpooled token-level features through the forward_features method: a (1, 257, 768) tensor for a single 224×224 image, i.e. 256 patch tokens plus one class token, each 768-dimensional
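The 257 in that shape follows from the input and patch sizes. A quick arithmetic sketch, assuming a square input, non-overlapping 14×14 patches, and one extra class token (`token_count` is a hypothetical helper):

```python
def token_count(image_size=224, patch_size=14, num_prefix_tokens=1):
    """Number of tokens a ViT-style model sees for a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + num_prefix_tokens


print(token_count())  # 257
```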

Model Capabilities

Image Feature Extraction

Image Classification

Visual Representation Learning

Use Cases

Computer Vision

Image Classification System

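Used as a backbone for image classifiers after fine-tuning, with 224×224 resolution input

This checkpoint is MIM-pretrained only. The 88.23% Top-1 accuracy on ImageNet-1k belongs to the fine-tuned variant eva02_base_patch14_448.mim_in22k_ft_in1k at 448×448 (see Model Comparison)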

Feature Extraction Service

Serves as a visual feature extractor for downstream tasks (e.g., object detection, image retrieval)

Outputs 768-dimensional feature vectors

🚀 Model Card for eva02_base_patch14_224.mim_in22k

This is an EVA02 feature / representation model. It was pretrained on ImageNet-22k with masked image modeling (using EVA-CLIP as a MIM teacher) by the paper authors.

EVA-02 models are vision transformers that incorporate mean pooling, SwiGLU, Rotary Position Embeddings (ROPE), and extra LN in MLP (for Base & Large).
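As a rough illustration of the SwiGLU feed-forward block named above, the sketch below implements the general formula in plain Python: a SiLU-gated branch multiplied elementwise with a linear branch, then projected back. It leaves out biases and the extra LN, and all weights are made-up toy values, so it is not the model's actual layer.

```python
import math


def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def swiglu_mlp(x, w_gate, w_value, w_out):
    """SwiGLU feed-forward: w_out @ (silu(w_gate @ x) * (w_value @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    value = matvec(w_value, x)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(w_out, hidden)


# Toy weights (hypothetical): input width 2, hidden width 3, output width 2.
x = [1.0, -2.0]
w_gate = [[0.5, 0.1], [-0.3, 0.8], [0.2, 0.2]]
w_value = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
w_out = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.4]]
y = swiglu_mlp(x, w_gate, w_value, w_out)
print(len(y))  # 2
```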

⚠️ Important Note

timm checkpoints are float32 for consistency with other models. Original checkpoints are float16 or bfloat16 in some cases, see originals if that's preferred.
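The precision choice mostly affects storage and memory. A back-of-the-envelope sketch, assuming the 85.8M parameter count listed under Model Details and counting weights only (no optimizer state or activations):

```python
PARAMS_MILLIONS = 85.8  # parameter count from the Model Details table

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_size_mb(params_millions, dtype):
    """Approximate weight storage in megabytes (10**6 bytes)."""
    return params_millions * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_size_mb(PARAMS_MILLIONS, dtype):.0f} MB")
```

By this estimate, the float32 weights take roughly twice the space of a half-precision copy.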

✨ Features

An EVA02 feature / representation model.
Pretrained on ImageNet - 22k using masked image modeling with EVA - CLIP as a MIM teacher.
Vision transformers with mean pooling, SwiGLU, ROPE, and extra LN in MLP (for Base & Large).

💻 Usage Examples

Basic Usage

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('eva02_base_patch14_224.mim_in22k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

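output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

# Top-5 over the model output. These are class probabilities only if the checkpoint
# has a classifier head; a MIM-pretrained backbone without one returns features instead.
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)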
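For readers unfamiliar with the last step, the following plain-Python sketch mirrors what a softmax followed by a top-k selection computes. The logits are made up, and `softmax` and `topk` are illustrative helpers rather than the torch functions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


logits = [2.0, 0.5, 3.0, -1.0]  # made-up scores for four classes
percentages = [p * 100 for p in softmax(logits)]
top_values, top_indices = topk(percentages, k=2)
print(top_indices)  # [2, 0]
```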

Advanced Usage

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'eva02_base_patch14_224.mim_in22k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 257, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
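One common way to use pooled features downstream, for example in image retrieval, is to rank a gallery by cosine similarity to a query. The sketch below uses made-up 3-dimensional vectors as stand-ins for the model's 768-dimensional features. The file names and the `rank_gallery` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery keys sorted from most to least similar to `query`."""
    return sorted(
        gallery,
        key=lambda name: cosine_similarity(query, gallery[name]),
        reverse=True,
    )


# Made-up feature vectors; real ones would come from the model output above.
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.2],
    "car.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
ranking = rank_gallery(query, gallery)
print(ranking)  # ['cat.jpg', 'dog.jpg', 'car.jpg']
```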

📚 Documentation

Model Details

Property	Details
Model Type	Image classification / feature backbone
Params (M)	85.8
GMACs	23.2
Activations (M)	36.6
Image size	224 x 224
Papers	EVA-02: A Visual Representation for Neon Genesis: https://arxiv.org/abs/2303.11331; EVA-CLIP: Improved Training Techniques for CLIP at Scale: https://arxiv.org/abs/2303.15389
Original	https://github.com/baaivision/EVA; https://huggingface.co/Yuxin-CV/EVA-02
Pretrain Dataset	ImageNet-22k

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

model	top1	top5	param_count	img_size
eva02_large_patch14_448.mim_m38m_ft_in22k_in1k	90.054	99.042	305.08	448
eva02_large_patch14_448.mim_in22k_ft_in22k_in1k	89.946	99.01	305.08	448
eva_giant_patch14_560.m30m_ft_in22k_in1k	89.792	98.992	1014.45	560
eva02_large_patch14_448.mim_in22k_ft_in1k	89.626	98.954	305.08	448
eva02_large_patch14_448.mim_m38m_ft_in1k	89.57	98.918	305.08	448
eva_giant_patch14_336.m30m_ft_in22k_in1k	89.56	98.956	1013.01	336
eva_giant_patch14_336.clip_ft_in1k	89.466	98.82	1013.01	336
eva_large_patch14_336.in22k_ft_in22k_in1k	89.214	98.854	304.53	336
eva_giant_patch14_224.clip_ft_in1k	88.882	98.678	1012.56	224
eva02_base_patch14_448.mim_in22k_ft_in22k_in1k	88.692	98.722	87.12	448
eva_large_patch14_336.in22k_ft_in1k	88.652	98.722	304.53	336
eva_large_patch14_196.in22k_ft_in22k_in1k	88.592	98.656	304.14	196
eva02_base_patch14_448.mim_in22k_ft_in1k	88.23	98.564	87.12	448
eva_large_patch14_196.in22k_ft_in1k	87.934	98.504	304.14	196
eva02_small_patch14_336.mim_in22k_ft_in1k	85.74	97.614	22.13	336
eva02_tiny_patch14_336.mim_in22k_ft_in1k	80.658	95.524	5.76	336
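A table like this can be filtered programmatically, for instance to find the most accurate model within a parameter budget. The sketch below copies a few rows from the table and assumes `param_count` is in millions. The `parse` and `best_under` helpers are illustrative.

```python
# A few tab-separated rows copied from the comparison table above.
ROWS = """\
eva02_large_patch14_448.mim_m38m_ft_in22k_in1k\t90.054\t99.042\t305.08\t448
eva02_base_patch14_448.mim_in22k_ft_in22k_in1k\t88.692\t98.722\t87.12\t448
eva02_base_patch14_448.mim_in22k_ft_in1k\t88.23\t98.564\t87.12\t448
eva02_small_patch14_336.mim_in22k_ft_in1k\t85.74\t97.614\t22.13\t336
eva02_tiny_patch14_336.mim_in22k_ft_in1k\t80.658\t95.524\t5.76\t336
"""


def parse(rows):
    """Turn tab-separated rows into a list of dicts with typed fields."""
    table = []
    for line in rows.strip().splitlines():
        name, top1, top5, params, size = line.split("\t")
        table.append({
            "model": name,
            "top1": float(top1),
            "top5": float(top5),
            "params_m": float(params),  # assumed to be millions of parameters
            "img_size": int(size),
        })
    return table


def best_under(table, max_params_m):
    """Highest top-1 row whose parameter count fits the budget."""
    candidates = [row for row in table if row["params_m"] <= max_params_m]
    return max(candidates, key=lambda row: row["top1"])


best = best_under(parse(ROWS), max_params_m=100)
print(best["model"], best["top1"])
```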

📄 License

The model is licensed under the MIT license.

Citation

@article{EVA02,
  title={EVA-02: A Visual Representation for Neon Genesis},
  author={Fang, Yuxin and Sun, Quan and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
  journal={arXiv preprint arXiv:2303.11331},
  year={2023}
}

@article{EVA-CLIP,
  title={EVA-CLIP: Improved Training Techniques for CLIP at Scale},
  author={Sun, Quan and Fang, Yuxin and Wu, Ledell and Wang, Xinlong and Cao, Yue},
  journal={arXiv preprint arXiv:2303.15389},
  year={2023}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
