ViT Huge Patch14 224 (orig_in21k)
Large-scale image feature extraction model based on the Vision Transformer (ViT) architecture, pre-trained on the ImageNet-21k dataset
Downloads: 3,214
Release Time: 12/22/2022
Model Overview
This is a Vision Transformer model without a classification head, intended primarily for image feature extraction and for fine-tuning on downstream tasks. The model uses a 14x14 patch size and a 224x224 input resolution.
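The following is a minimal feature-extraction sketch. It assumes the Hugging Face checkpoint `google/vit-huge-patch14-224-in21k` corresponds to this model and uses an illustrative image path; adapt both to your setup.

```python
# Minimal sketch: extract image features with the ViT-Huge/14 in21k checkpoint.
# Assumptions: checkpoint name and image path are illustrative, not prescribed by this card.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

checkpoint = "google/vit-huge-patch14-224-in21k"  # assumed matching checkpoint
processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTModel.from_pretrained(checkpoint)
model.eval()

image = Image.open("example.jpg").convert("RGB")            # illustrative path
inputs = processor(images=image, return_tensors="pt")       # resizes/normalizes to 224x224

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, 1 + num_patches, hidden_size);
# index 0 is the [CLS] token, often used as the image embedding.
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)
```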
Model Features
Large-scale pre-training
Pre-trained on the ImageNet-21k dataset, which covers roughly 21,000 classes, giving the model strong feature extraction capabilities
Transformer architecture
Processes images with a pure Transformer architecture rather than a traditional CNN backbone
High-resolution processing
Supports a 224x224 pixel input resolution with a 14x14 patch size
Flexible application
Can be used as a feature extractor or fine-tuned for downstream tasks; the classification head can be removed (see the sketch after this list)
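For the headless, feature-extractor use, one option is timm. The sketch below assumes the timm model name `vit_huge_patch14_224.orig_in21k` matches this card; `num_classes=0` drops the classification head so the model returns pooled features.

```python
# Hedged sketch: load the model without a classification head via timm.
# Assumptions: the timm model name and image path are illustrative.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_huge_patch14_224.orig_in21k",  # assumed timm name for this checkpoint
    pretrained=True,
    num_classes=0,                      # remove the classification head
)
model.eval()

# Build the preprocessing pipeline the checkpoint expects (resize, normalize, etc.).
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

img = Image.open("example.jpg").convert("RGB")   # illustrative path
x = transform(img).unsqueeze(0)                  # (1, 3, 224, 224)

with torch.no_grad():
    features = model(x)                          # pooled feature vector, (1, embed_dim)
print(features.shape)
```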
Model Capabilities
Image feature extraction
Image classification
Transfer learning
Computer vision tasks
Use Cases
Computer vision
Image classification
With a classification head attached, supports large-scale image classification; the pre-training itself covers roughly 21,000 classes
Feature extraction
Extract image features for downstream tasks such as object detection and image segmentation
Transfer learning
Fine-tune the model on domain-specific datasets to adapt it to specific task requirements (a minimal sketch follows below)
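A minimal fine-tuning sketch, again assuming the `google/vit-huge-patch14-224-in21k` checkpoint; the number of labels, batch, and data here are hypothetical stand-ins for a real labeled dataset. Because the checkpoint has no classification head, a new head is initialized when the model is loaded for classification.

```python
# Hedged sketch: attach a fresh classification head and run one training step.
# Assumptions: checkpoint name, num_labels=10, and the random batch are illustrative.
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-huge-patch14-224-in21k",  # assumed matching checkpoint
    num_labels=10,                        # hypothetical number of target classes
)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Random tensors stand in for a preprocessed image batch and its labels.
pixel_values = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 10, (4,))

outputs = model(pixel_values=pixel_values, labels=labels)  # loss is cross-entropy
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))
```

In practice this step would run inside a training loop over a domain-specific dataset, typically with a small learning rate since the backbone is already pre-trained.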