
ViT Base Patch16 384

Developed by Google
Vision Transformer (ViT) is an image classification model based on the Transformer architecture, pre-trained on ImageNet-21k and fine-tuned on ImageNet.
Downloads: 30.30k
Released: 3/2/2022

Model Overview

This model performs image classification by dividing images into fixed-size patches and applying a Transformer encoder, supporting 1,000 ImageNet categories.
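A minimal inference sketch with the Hugging Face `transformers` library, assuming the checkpoint id `google/vit-base-patch16-384` (the blank placeholder image stands in for a real photo; downloading the weights requires network access):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Assumed checkpoint id for this model card
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-384")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-384")

image = Image.new("RGB", (500, 400), color=(120, 90, 60))  # placeholder for a real image
inputs = processor(images=image, return_tensors="pt")      # resizes/normalizes to 384x384
with torch.no_grad():
    logits = model(**inputs).logits                        # shape (1, 1000)
label = model.config.id2label[logits.argmax(-1).item()]
print(label)
```

The processor handles resizing and normalization, so any RGB image size can be passed in.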

Model Features

Transformer-based Image Processing
Splits images into 16x16 patches and processes the resulting sequence with a Transformer encoder, moving beyond the locality constraints of traditional CNNs in image processing.
Large-scale Pre-training
Pre-trained on ImageNet-21k (14 million images) and fine-tuned on ImageNet (1 million images), learning rich image feature representations.
High-resolution Fine-tuning
Uses 384x384 resolution during fine-tuning, capturing finer image features compared to the pre-training resolution of 224x224.
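To make the patch arithmetic above concrete, here is an illustrative NumPy sketch (not the model's actual embedding code): at 384x384 resolution with 16x16 patches, the image becomes 24x24 = 576 tokens, each flattened to 16*16*3 = 768 values; the encoder also prepends one [CLS] token, giving a sequence length of 577.

```python
import numpy as np

# Toy array standing in for a 384x384 RGB image
image = np.random.rand(384, 384, 3)
patch = 16
n = 384 // patch  # 24 patches per side

# Cut into non-overlapping 16x16 patches and flatten each one
patches = (
    image.reshape(n, patch, n, patch, 3)
    .transpose(0, 2, 1, 3, 4)
    .reshape(n * n, patch * patch * 3)
)
print(patches.shape)  # (576, 768): 576 patch tokens of dimension 768
```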

Model Capabilities

Image classification
Feature extraction
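For the feature-extraction use, the same checkpoint can be loaded without the classification head via `ViTModel`; a sketch under the same checkpoint-id assumption, again with a placeholder image:

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-384")
model = ViTModel.from_pretrained("google/vit-base-patch16-384")  # backbone only, no head

image = Image.new("RGB", (384, 384), color=(200, 200, 200))  # placeholder for a real image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

features = outputs.last_hidden_state  # (1, 577, 768): 576 patch tokens + [CLS]
```

The per-token features (or the [CLS] vector alone) can then feed a downstream classifier or retrieval index.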

Use Cases

Computer Vision
Image Classification
Classifies an input image into one of the 1,000 ImageNet categories.
Achieves strong accuracy on the ImageNet benchmark.