
ViT Huge Patch14 224 In21k

Developed by Google
A Vision Transformer model pretrained on ImageNet-21k, featuring an extra-large architecture suitable for visual tasks like image classification.
Downloads 47.78k
Release Time: 3/2/2022

Model Overview

This Vision Transformer (ViT) model is pretrained on the ImageNet-21k dataset. It processes an image as a sequence of fixed-size patches, which makes it well suited to extracting image features for downstream tasks.
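As a rough illustration of that workflow, the sketch below loads the checkpoint with the Hugging Face transformers library and extracts hidden-state features from a single image. The checkpoint id and the example image URL are assumptions based on the model name, not taken from this page.

```python
# Minimal feature-extraction sketch (assumes the Hugging Face checkpoint id
# "google/vit-huge-patch14-224-in21k" and an example image URL).
import requests
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-huge-patch14-224-in21k")
model = ViTModel.from_pretrained("google/vit-huge-patch14-224-in21k")

inputs = processor(images=image, return_tensors="pt")  # resize + normalize to 224x224
outputs = model(**inputs)

# One hidden state per token: 1 [CLS] token + 256 patch tokens, width 1280 for ViT-Huge.
print(outputs.last_hidden_state.shape)  # torch.Size([1, 257, 1280])
```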

Model Features

Large-scale pretraining
Pretrained on ImageNet-21k (14 million images, 21,843 classes) to learn rich image feature representations.
Transformer architecture
Uses a BERT-like Transformer encoder to process the sequence of image patches, moving beyond the locality constraints of traditional CNN designs.
Patch-based processing
Accepts 224x224 pixel input images, splitting each into 14x14-pixel patches, i.e. a 16x16 grid of 256 patch tokens (see the arithmetic sketch after this list).
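To make the patch arithmetic concrete, here is a small sketch; the 14-pixel patch size and the 1280 hidden width are inferred from the "patch14" and ViT-Huge naming rather than stated on this page.

```python
# Patch-layout arithmetic for a ViT-Huge/14 model at 224x224 input
# (patch size 14 and hidden width 1280 inferred from the model name).
image_size = 224
patch_size = 14
hidden_size = 1280

grid = image_size // patch_size          # 16 patches per side
num_patches = grid * grid                # 256 patch tokens
seq_len = num_patches + 1                # 257 tokens including [CLS]
raw_patch_dim = 3 * patch_size ** 2      # 588 pixel values per RGB patch

# Each flattened 588-value patch is linearly projected to a 1280-dim embedding.
print(grid, num_patches, seq_len, raw_patch_dim)  # 16 256 257 588
```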

Model Capabilities

Image feature extraction
Image classification

Use Cases

Computer vision
Image classification
Can be used to classify images, identifying the main objects or scenes within them, typically after fine-tuning a classification head on a labeled dataset.
Reported to perform well on benchmarks such as ImageNet (the source provides no specific metrics).
Feature extraction
Can serve as a feature extractor for downstream vision tasks such as object detection and image segmentation.
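As a sketch of the feature-extraction use case, the helper below pools the [CLS] embedding into a fixed-length descriptor that a downstream detector, segmentation head, or linear classifier could consume. The function name extract_features is illustrative, and the checkpoint id is the same assumption as above.

```python
# Sketch: [CLS] pooling for downstream tasks. The function name and checkpoint
# id are illustrative assumptions, not taken from this page.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-huge-patch14-224-in21k")
model = ViTModel.from_pretrained("google/vit-huge-patch14-224-in21k")
model.eval()

@torch.no_grad()
def extract_features(image: Image.Image) -> torch.Tensor:
    """Return a (1, 1280) [CLS] embedding as a fixed-length image descriptor."""
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] token embedding
```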