
ViT Large Patch32 224 In21k

Developed by Google
This Vision Transformer (ViT) model is pre-trained on the ImageNet-21k dataset and is suitable for image classification tasks.
Downloads 4,943
Release Time: 3/2/2022

Model Overview

The Vision Transformer (ViT) is a vision model based on the Transformer architecture, pre-trained on the ImageNet-21k dataset through supervised learning, primarily used for image classification tasks.
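
As a minimal sketch of loading the pre-trained encoder and running it on an image, assuming the Hugging Face transformers library and the google/vit-large-patch32-224-in21k checkpoint:

```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-large-patch32-224-in21k")
model = ViTModel.from_pretrained("google/vit-large-patch32-224-in21k")

inputs = processor(images=image, return_tensors="pt")  # resizes and normalizes to 224x224
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, 50, 1024): 49 patch tokens + 1 [CLS] token
```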

Model Features

Large-scale pre-training
Pre-trained on the ImageNet-21k dataset (14 million images, 21,843 classes) to learn rich image representations.
Transformer architecture
Adopts a BERT-like Transformer encoder, processing an image as a sequence of fixed-size patches (32x32 pixels for this model; see the sketch after this list).
High-resolution support
Supports image inputs at 224x224 pixel resolution and can be extended to higher resolutions (e.g., 384x384) for better performance.
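
A small sketch of the resulting token counts, again assuming the transformers library; passing interpolate_pos_encoding=True is what lets the 224x224 position embeddings stretch to a larger input grid:

```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image

model_id = "google/vit-large-patch32-224-in21k"
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTModel.from_pretrained(model_id)

# At 224x224 with 32x32 patches: (224 // 32) ** 2 = 49 patches, +1 [CLS] = 50 tokens.
image = Image.new("RGB", (640, 480))  # placeholder image; use a real photo in practice

# Feed a 384x384 input by resizing in the processor and interpolating the
# pre-trained position embeddings to the larger patch grid.
inputs = processor(images=image, size={"height": 384, "width": 384}, return_tensors="pt")
outputs = model(**inputs, interpolate_pos_encoding=True)
print(outputs.last_hidden_state.shape)  # (1, 145, 1024): (384 // 32) ** 2 + 1 tokens
```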

Model Capabilities

Image classification
Feature extraction

Use Cases

Computer vision
Image classification
Can be used to classify images, identifying objects or scenes within them, and achieves strong results on benchmarks such as ImageNet. Note that this checkpoint does not include a fine-tuned classification head, so it is typically fine-tuned on the target label set first.
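
A sketch of preparing the model for fine-tuning on a custom label set, assuming the transformers library; the label names here are hypothetical placeholders:

```python
from transformers import ViTForImageClassification

labels = ["cat", "dog"]  # hypothetical label set, for illustration only
model = ViTForImageClassification.from_pretrained(
    "google/vit-large-patch32-224-in21k",
    num_labels=len(labels),
    id2label={i: name for i, name in enumerate(labels)},
    label2id={name: i for i, name in enumerate(labels)},
)
# The classification head is freshly initialized (the checkpoint ships without
# a fine-tuned head), so train on labeled data before trusting the predictions.
```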
Feature extraction for downstream tasks
Can serve as a feature extractor, providing foundational features for other computer vision tasks such as object detection and image segmentation.
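
For example, a minimal feature-extraction sketch, assuming the transformers library; the two pooling choices shown are common conventions rather than something the model card prescribes:

```python
import torch
from transformers import ViTImageProcessor, ViTModel
from PIL import Image

model_id = "google/vit-large-patch32-224-in21k"
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTModel.from_pretrained(model_id)
model.eval()

image = Image.new("RGB", (640, 480))  # placeholder image; use a real photo in practice
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Two common 1024-d feature choices: the [CLS] token, or the mean of the patch tokens.
cls_features = outputs.last_hidden_state[:, 0]
mean_features = outputs.last_hidden_state[:, 1:].mean(dim=1)
```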