
ViT Large Patch16 224

Developed by Google
A large-scale image classification model based on the Transformer architecture, pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k
Downloads: 188.47k
Release Time: 3/2/2022

Model Overview

Vision Transformer (ViT) is an image classification model built on Transformer encoders: an image is split into fixed-size patches that are processed as a token sequence. This model is pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k, making it well suited to image classification tasks.
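The patch-splitting step above can be sketched in a few lines. This is an illustrative NumPy sketch of the general ViT recipe, not this checkpoint's actual code: each non-overlapping 16x16 patch is flattened into one vector, and the real model then projects each vector to its hidden size (1024 for ViT-Large) with a learned linear layer, which is omitted here.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # (H/p, p, W/p, p, C) -> (num_patches, p*p*C), row-major patch order
    blocks = image.reshape(h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, patch * patch * c)

image = np.random.rand(224, 224, 3).astype(np.float32)
tokens = patchify(image)
# 224/16 = 14 patches per side -> 14*14 = 196 tokens,
# each of length 16*16*3 = 768 before the learned projection
print(tokens.shape)  # (196, 768)
```

A [CLS] token is prepended to this sequence in the real model, giving 197 tokens at 224x224 resolution.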

Model Features

Transformer-based Visual Processing
Divides images into sequences of 16x16 patches and processes them using a BERT-like Transformer architecture
Large-scale Pre-training
Pre-trained on the ImageNet-21k dataset containing 14 million images
High-resolution Support
Accepts 224x224 pixel input; better performance is typically achievable by fine-tuning at higher resolutions (e.g., 384x384)

Model Capabilities

Image Classification
Visual Feature Extraction

Use Cases

Computer Vision
Image Classification
Classifies images into the 1,000 ImageNet-1k categories
Strong accuracy reported on ImageNet benchmarks
Feature Extraction
Extracts image features for downstream tasks
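For the classification use case above, the final step can be sketched as follows. This is a minimal illustration with random weights (assumed shapes only, no real checkpoint): the encoder's output for the [CLS] token (hidden size 1024 in ViT-Large) is mapped by a linear head to 1000 ImageNet logits, and softmax converts them to class probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
cls_token = rng.standard_normal(1024)               # [CLS] output (illustrative)
W = rng.standard_normal((1000, 1024)) * 0.02        # classification head weights
b = np.zeros(1000)                                  # classification head bias

logits = W @ cls_token + b                          # (1000,) class scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # numerically stable softmax
pred = int(probs.argmax())                          # predicted ImageNet class index
print(pred, float(probs.sum()))
```

For feature extraction, the same [CLS] vector (or the mean of all patch tokens) is used directly as an image embedding for downstream tasks instead of passing through this head.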