
ViT Large Patch32 384

Developed by Google
This Vision Transformer (ViT) model is pre-trained on the ImageNet-21k dataset and then fine-tuned on ImageNet, making it suitable for image classification tasks.
Downloads 118.37k
Release date: 3/2/2022

Model Overview

This model is a BERT-like Transformer encoder, pre-trained in a supervised manner on the large-scale ImageNet-21k dataset and subsequently fine-tuned on ImageNet at a higher resolution (384x384).

Model Features

Large-scale pre-training
The model is pre-trained on the ImageNet-21k dataset (14 million images, 21,843 categories) to learn intrinsic image representations.
High-resolution fine-tuning
Fine-tuned on the ImageNet dataset at 384x384 resolution to enhance classification performance.
Transformer encoder
Uses a BERT-like Transformer encoder that splits each image into a sequence of fixed-size patches, which are linearly embedded before being fed to the encoder.
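The patch-embedding step above can be sketched as follows. This is a minimal NumPy illustration, not the model's actual implementation: a 384x384 RGB image is split into 144 non-overlapping 32x32 patches, each flattened and projected to the encoder's hidden size (1024 for ViT-Large). The random projection matrix stands in for the learned linear embedding.

```python
import numpy as np

def patchify(image, patch_size=32):
    """Split an HxWxC image into flattened non-overlapping patches."""
    H, W, C = image.shape
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group patch rows/cols together
    return patches.reshape(-1, patch_size * patch_size * C)

rng = np.random.default_rng(0)
img = rng.random((384, 384, 3), dtype=np.float32)   # dummy input image
tokens = patchify(img)                              # (144, 3072): 12x12 patches
W_embed = rng.random((3072, 1024), dtype=np.float32)  # stand-in for the learned projection
seq = tokens @ W_embed                              # (144, 1024) patch sequence
```

In the real model a learnable [CLS] token and position embeddings are added to this sequence before it enters the Transformer encoder.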

Model Capabilities

Image classification
Feature extraction

Use Cases

Image classification
ImageNet classification
Classify images into one of the 1,000 ImageNet categories.
Performs well on the ImageNet classification benchmark.
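A minimal inference sketch for this use case with the Hugging Face transformers library, assuming the `google/vit-large-patch32-384` checkpoint can be downloaded; the COCO sample image URL is an illustrative choice, not part of the model card.

```python
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

# Illustrative sample image (two cats on a couch, from COCO)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-large-patch32-384")
model = ViTForImageClassification.from_pretrained("google/vit-large-patch32-384")

# The processor resizes/normalizes the image to 384x384 as the model expects
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# One logit per ImageNet-1k class; argmax gives the predicted label index
predicted_class_idx = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
```

The checkpoint's classification head covers the 1,000 ImageNet classes mentioned above; for feature extraction, `ViTModel` can be used instead to obtain the encoder's hidden states.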