
ViT Base Patch16 224

Developed by Google
Vision Transformer model pre-trained on ImageNet-21k and fine-tuned on ImageNet for image classification tasks.
Downloads 4.8M
Release Time: 3/2/2022

Model Overview

Vision Transformer (ViT) is a BERT-like transformer encoder model that processes an image by splitting it into a sequence of fixed-size patches, making it well suited to image classification tasks.
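A minimal classification sketch using the Hugging Face `transformers` library with this checkpoint; the synthetic gray-green image is a stand-in for a real photo, so the predicted label itself is not meaningful:

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# A synthetic 224x224 RGB image stands in for a real photo here.
image = Image.new("RGB", (224, 224), color=(128, 200, 90))

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, 1000], one score per ImageNet class
predicted_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_idx])
```

The processor handles resizing and normalization, so arbitrary input image sizes are accepted.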

Model Features

Transformer-based Vision Model
Processes images as patch sequences and utilizes transformer architecture for efficient feature extraction
Large-scale Pre-training
Pre-trained on ImageNet-21k (14 million images, 21k classes) with strong feature learning capabilities
High-resolution Processing
Supports 224x224 pixel resolution input, capable of capturing fine-grained image features
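The patch arithmetic behind these features can be checked directly: a 224x224 image divided into 16x16 patches yields 196 patches, and prepending the [CLS] token gives the 197-token sequence the transformer actually sees (the variable names below are illustrative):

```python
image_size = 224
patch_size = 16
channels = 3  # RGB

num_patches = (image_size // patch_size) ** 2   # (224/16)^2 = 14^2 = 196
seq_len = num_patches + 1                       # +1 for the [CLS] token -> 197
patch_dim = channels * patch_size ** 2          # 3*16*16 = 768 values per flattened patch
print(num_patches, seq_len, patch_dim)          # 196 197 768
```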

Model Capabilities

Image Classification
Feature Extraction
Visual Representation Learning

Use Cases

General Image Recognition
Object Classification
Classifies images into one of 1000 ImageNet categories
Achieves high accuracy on the ImageNet validation set
Feature Extraction
Extracts image features for downstream tasks
Can serve as a pre-trained model for other vision tasks
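For feature extraction, the classification head can be dropped by loading the backbone alone; a hedged sketch with `transformers`, again using a synthetic image (only the weight download needs the network):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Synthetic image so the example runs without fetching a photo.
image = Image.new("RGB", (224, 224), color=(128, 128, 128))

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

with torch.no_grad():
    out = model(**processor(images=image, return_tensors="pt"))
features = out.last_hidden_state  # shape: [1, 197, 768] -> [CLS] + 196 patch tokens
cls_embedding = features[:, 0]    # [CLS] token, usable as a global image descriptor
```

The per-patch tokens can likewise feed dense downstream tasks, while `cls_embedding` works as a single-vector representation for retrieval or linear probing.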