
ViT Large Patch16 384

Developed by Google
Vision Transformer (ViT) is an image classification model based on the transformer architecture, pre-trained on ImageNet-21k and fine-tuned on ImageNet.
Downloads 161.29k
Release Time: 3/2/2022

Model Overview

This model uses a transformer encoder structure, dividing images into fixed-size patches for processing, primarily for image classification tasks.
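The patch-based processing described above follows directly from the model's name: a 384x384 input image split into 16x16 patches. A minimal sketch of that arithmetic (the constants come from the model name; the extra [CLS] token is standard in ViT):

```python
# Patch-embedding arithmetic for vit-large-patch16-384.
# Constants follow from the model name: 384x384 input, 16x16 patches.
image_size = 384
patch_size = 16

patches_per_side = image_size // patch_size   # 24 patches along each side
num_patches = patches_per_side ** 2           # 576 patches total
seq_len = num_patches + 1                     # +1 for the [CLS] token

print(patches_per_side, num_patches, seq_len)  # 24 576 577
```

Each patch is linearly embedded and fed to the transformer encoder as one token, so the encoder sees a sequence of 577 tokens per image.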

Model Features

Transformer-based vision model
Applies the successful transformer architecture from natural language processing to computer vision tasks.
Large-scale pre-training
Pre-trained on ImageNet-21k (14 million images) and fine-tuned on ImageNet (1 million images).
High-resolution processing
Uses 384x384 resolution during fine-tuning, higher than the 224x224 resolution used in pre-training.

Model Capabilities

Image classification
Feature extraction
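For the feature-extraction capability, a minimal sketch using the Hugging Face transformers library (assumed installed, along with PyTorch and Pillow); `ViTModel` returns one hidden-state vector per patch plus the [CLS] token:

```python
# Feature-extraction sketch for google/vit-large-patch16-384 using the
# Hugging Face transformers library (an assumption: transformers, torch,
# and Pillow are installed and the checkpoint can be downloaded).
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-large-patch16-384")
model = ViTModel.from_pretrained("google/vit-large-patch16-384")

# Solid-gray placeholder image; substitute a real RGB photo in practice.
image = Image.new("RGB", (384, 384), color=(128, 128, 128))
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One embedding per 16x16 patch (24 * 24 = 576) plus the [CLS] token,
# each of dimension 1024 (the ViT-Large hidden size).
features = outputs.last_hidden_state  # shape: (1, 577, 1024)
```

The pooled [CLS] embedding (`features[:, 0]`) is a common choice as a single fixed-size image representation for downstream tasks.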

Use Cases

Computer vision
Image classification
Classifies images into one of the 1000 ImageNet categories.
Achieves strong accuracy on the ImageNet benchmark.
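The classification use case above can be sketched with the Hugging Face transformers library (assumed installed, along with PyTorch and Pillow); the model outputs one logit per ImageNet class:

```python
# Image-classification sketch for google/vit-large-patch16-384 using the
# Hugging Face transformers library (an assumption: transformers, torch,
# and Pillow are installed and the checkpoint can be downloaded).
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained("google/vit-large-patch16-384")
model = ViTForImageClassification.from_pretrained("google/vit-large-patch16-384")

# Placeholder image; substitute any RGB photo, e.g. Image.open("cat.jpg").
image = Image.new("RGB", (384, 384), color=(200, 180, 160))
inputs = processor(images=image, return_tensors="pt")  # resized and normalized to 384x384

with torch.no_grad():
    logits = model(**inputs).logits  # one score per ImageNet class: (1, 1000)

predicted = model.config.id2label[logits.argmax(-1).item()]
print(predicted)
```

The processor handles resizing and normalization to the 384x384 resolution the model was fine-tuned at, so arbitrary input sizes work.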