Cvt 13 384 22k
CvT-13 is a vision model that combines convolutions with a Transformer architecture. It was pre-trained on ImageNet-22k and fine-tuned on ImageNet-1k for image classification.
Downloads 508
Release Time: 4/4/2022
Model Overview
This model extends the Vision Transformer with convolutional operations. It classifies images at 384x384 input resolution into the 1,000 ImageNet-1k categories.
Model Features
Combination of Convolution and Transformer
Adds convolutional operations to the standard Vision Transformer design to improve local feature extraction.
High-resolution processing
Supports 384x384 resolution input, suitable for fine-grained image classification.
Large-scale pre-training
Pre-trained on the ImageNet-22k dataset before fine-tuning on ImageNet-1k, so the learned representations draw on a much larger label set than the 1,000 target categories.
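To make the combination of convolution and high-resolution input more concrete, the sketch below works out how large the token grid becomes after each convolutional embedding stage for a 384x384 image. The per-stage kernel, stride, and padding values are assumptions (they are not stated on this page) taken to be the usual CvT-13 settings; the helper names are hypothetical.

```python
from typing import List, Tuple

# Assumed (kernel, stride, padding) for the three convolutional token
# embedding stages of CvT-13. These values are NOT given on this page.
STAGES: List[Tuple[int, int, int]] = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid_sizes(input_size: int = 384) -> List[int]:
    """Side length of the square token grid after each stage."""
    sizes = []
    size = input_size
    for kernel, stride, padding in STAGES:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    for stage, side in enumerate(token_grid_sizes(384), start=1):
        print(f"stage {stage}: {side}x{side} grid, {side * side} tokens")
```

Under these assumed settings, each stage shrinks the grid, so the later Transformer blocks attend over far fewer tokens than the first stage does.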
Model Capabilities
Image classification
Visual feature extraction
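As a rough illustration of the image classification capability, the sketch below loads the model through the Hugging Face transformers library and returns the top-k predicted labels. The Hub identifier "microsoft/cvt-13-384-22k" is an assumption based on the model name and is not confirmed by this page; the helper functions softmax, top_k, and classify are hypothetical names introduced for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed Hugging Face Hub identifier; not stated on this page.
MODEL_ID = "microsoft/cvt-13-384-22k"


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits: Sequence[float], id2label: Dict[int, str],
          k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


def classify(image_path: str, k: int = 5) -> List[Tuple[str, float]]:
    """Classify one image file; downloads the weights on first use."""
    # Local imports so the helpers above work without these packages.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, CvtForImageClassification

    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    model = CvtForImageClassification.from_pretrained(MODEL_ID)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_k(logits, model.config.id2label, k)
```

Calling classify("photo.jpg") with a local image path (a placeholder here) would return up to five (label, probability) pairs, with labels drawn from the model's own ImageNet-1k label map.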
Use Cases
Computer vision
Object recognition
Identify object categories in images (e.g., animals, everyday objects)
Classifies images into the 1,000 ImageNet-1k categories
Scene understanding
Analyze image scene content (e.g., natural landscapes, buildings)
© 2025 AIbase