
ViT Base Patch32 224 In21k

Developed by Google
This Vision Transformer (ViT) model is pretrained on the ImageNet-21k dataset at 224x224 resolution and is suitable for image classification tasks.
Downloads: 35.10k
Release Time: 3/2/2022

Model Overview

The Vision Transformer (ViT) is a BERT-like transformer encoder model pretrained in a supervised manner on a large number of images, capable of extracting image features for downstream tasks.
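Below is a minimal feature-extraction sketch using the Hugging Face transformers library; the image path is a placeholder, and on older transformers versions ViTFeatureExtractor plays the role of ViTImageProcessor:

```python
# Minimal sketch: extract ViT features with Hugging Face transformers.
# Assumes transformers, torch, and Pillow are installed; "example.jpg" is a placeholder.
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch32-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch32-224-in21k")

image = Image.open("example.jpg")            # placeholder input image
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# last_hidden_state has shape (batch, 50, 768): a [CLS] token plus 49 patch tokens.
features = outputs.last_hidden_state
cls_embedding = features[:, 0]               # a common choice of pooled image representation
print(features.shape)                        # torch.Size([1, 50, 768])
```

The [CLS] embedding (or a mean over the patch tokens) can then feed a downstream classifier or retrieval index.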

Model Features

Transformer-based vision model
Uses a BERT-like transformer encoder architecture to process images, moving beyond the architectural constraints of traditional CNNs.
Large-scale pretraining
Pretrained on the ImageNet-21k dataset (14 million images, 21,843 classes) to learn rich image feature representations.
Flexible downstream applications
Pretrained features can be extracted for various downstream vision tasks such as image classification and object detection.
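The model name encodes this configuration: with 32x32 patches at 224x224 resolution, an image is split into (224/32)^2 = 49 patches, and a [CLS] token brings the sequence length to 50. A quick sanity check of that arithmetic against the published config (a sketch assuming the transformers library):

```python
# Sanity-check the patch/sequence arithmetic from the model's published config.
from transformers import ViTConfig

config = ViTConfig.from_pretrained("google/vit-base-patch32-224-in21k")
patches_per_side = config.image_size // config.patch_size   # 224 // 32 = 7
num_patches = patches_per_side ** 2                         # 7 * 7 = 49
seq_len = num_patches + 1                                   # +1 for the [CLS] token
print(config.image_size, config.patch_size, seq_len)        # 224 32 50
```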

Model Capabilities

Image feature extraction
Image classification
Visual representation learning

Use Cases

Computer vision
Image classification
Add a classification head on top of the pretrained model for a wide range of image classification tasks; see the sketch after this list.
Fine-tuned variants perform strongly on benchmark datasets such as ImageNet.
Visual feature extraction
Extract high-level image feature representations for other vision tasks such as object detection and image segmentation.
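A hedged sketch of the classification use case above: load the pretrained encoder and attach a freshly initialized head sized for a hypothetical 10-class dataset (the label count is an assumption to replace with your own):

```python
# Sketch: attach a classification head for fine-tuning.
# num_labels=10 is a hypothetical class count; the head is newly initialized
# and must be trained on labeled data (e.g., with the transformers Trainer).
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch32-224-in21k",
    num_labels=10,
)
```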