
ViT Base Patch32 384

Developed by Google
Vision Transformer (ViT) is an image classification model based on the Transformer architecture. It achieves strong image recognition performance by pre-training on ImageNet-21k and fine-tuning on ImageNet.
Downloads 24.92k
Release Time: 3/2/2022

Model Overview

The ViT model divides images into fixed-size patches and extracts features through a Transformer encoder, making it suitable for image classification tasks. The model is pre-trained on ImageNet-21k and fine-tuned on ImageNet, supporting high-resolution image processing.

Model Features

Transformer-based image processing
Divides images into fixed-size patches and extracts features through a Transformer encoder, breaking the limitations of traditional CNNs.
High-resolution fine-tuning
Fine-tuned on ImageNet at 384x384 resolution, improving the model's classification performance on high-resolution images.
Large-scale pre-training
Pre-trained on ImageNet-21k (14 million images, 21,843 classes), learning rich image feature representations.
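The patch-based input described above can be illustrated with a small NumPy sketch (not the official implementation): a 384x384 RGB image is split into non-overlapping 32x32 patches, each flattened into a vector before being projected into the Transformer encoder.

```python
import numpy as np

# Dummy image standing in for a real 384x384 RGB input.
image = np.random.rand(384, 384, 3)
patch = 32

# Reshape into a 12x12 grid of 32x32 patches, then flatten each patch
# to a vector of 32*32*3 = 3072 values.
grid = image.reshape(384 // patch, patch, 384 // patch, patch, 3)
patches = grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

print(patches.shape)  # (144, 3072): 144 patches of 3072 values each
```

Each of the 144 patch vectors is then linearly projected to the model dimension, and a learnable [CLS] token is prepended, so the encoder sees a sequence of 145 tokens.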

Model Capabilities

Image classification
Feature extraction

Use Cases

Computer vision
ImageNet image classification
Classifies images into one of the 1,000 ImageNet categories.
Performs strongly on the ImageNet benchmark; see the original paper for detailed metrics.
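A minimal inference sketch using the Hugging Face Transformers library and the `google/vit-base-patch32-384` checkpoint, assuming `transformers`, `torch`, and `Pillow` are installed (the blank image is a placeholder for a real photo):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Load the processor and fine-tuned classification model for this checkpoint.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch32-384")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch32-384")

# Placeholder image; replace with Image.open("your_photo.jpg") in practice.
image = Image.new("RGB", (384, 384), color="white")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one logit per ImageNet class

pred = logits.argmax(-1).item()
print(model.config.id2label[pred])  # predicted ImageNet class label
```

The model outputs logits over the 1,000 ImageNet classes; `argmax` selects the most likely label.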