CvT-13
CvT-13 (Convolutional vision Transformer) is a hybrid model that combines convolutional neural networks with vision transformers. It is pre-trained on the ImageNet-1k dataset and is suited to image classification tasks.
Downloads 21.80k
Release Time: 4/4/2022
Model Overview
This model adds convolutional operations to the vision transformer, which strengthens local feature extraction while keeping the transformer's global modeling ability. It is used primarily for image classification.
Model Features
Convolution-Transformer Hybrid Architecture
Combines CNN's local feature extraction capability with the global modeling advantages of transformers
Efficient Image Processing
Pre-trained on ImageNet-1k, supports image classification at 224x224 resolution
Lightweight Design
Has fewer parameters and lower computational requirements than pure transformer models (this page does not state the exact parameter count)
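To make the hybrid design above more concrete, the sketch below computes how many tokens each stage would produce when a strided convolution, rather than a fixed patch split, turns the image into tokens. The per-stage kernel, stride, padding, and channel values are assumptions taken from the CvT-13 configuration in the CvT paper, not from this page, and the function names are hypothetical.

```python
# Sketch: token grid produced by a convolutional token embedding at each stage.
# STAGES values are assumptions (CvT-13 configuration as reported in the CvT paper).

def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output-size formula (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding, embedding_dim) for each of the three stages (assumed)
STAGES = [(7, 4, 2, 64), (3, 2, 1, 192), (3, 2, 1, 384)]

def token_grid_sizes(image_size: int = 224):
    """Return (grid_side, token_count, embedding_dim) for every stage."""
    sizes = []
    side = image_size
    for kernel, stride, padding, dim in STAGES:
        side = conv_out_size(side, kernel, stride, padding)
        sizes.append((side, side * side, dim))
    return sizes

if __name__ == "__main__":
    for stage, (side, tokens, dim) in enumerate(token_grid_sizes(), start=1):
        print(f"stage {stage}: {side}x{side} grid, {tokens} tokens, dim {dim}")
```

Under these assumed settings, the token count shrinks at every stage while the embedding width grows, which is the CNN-style pyramid that a pure transformer with a fixed patch grid lacks.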
Model Capabilities
Image Classification
Visual Feature Extraction
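A minimal sketch of the image classification capability, assuming the checkpoint is published on the Hugging Face Hub as microsoft/cvt-13 and that the torch, Pillow, and transformers packages are installed. The function name classify_image and the file path in the usage comment are hypothetical.

```python
# Sketch: single-image classification with a CvT-13 checkpoint.
# The model id "microsoft/cvt-13" is an assumption, not stated on this page.

def classify_image(image_path: str, model_id: str = "microsoft/cvt-13") -> str:
    """Return the predicted ImageNet-1k label for one image file."""
    # Imports are kept local so the function can be defined without the
    # heavy dependencies being loaded until it is actually called.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, CvtForImageClassification

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = CvtForImageClassification.from_pretrained(model_id)
    model.eval()

    # The processor resizes and normalizes the image for the model.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # one score per class

    predicted_index = int(logits.argmax(-1).item())
    return model.config.id2label[predicted_index]

# Example call (hypothetical path):
# print(classify_image("photo.jpg"))
```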
Use Cases
Computer Vision
General Object Recognition
Classifies and recognizes everyday objects
Recognizes the 1,000 categories of ImageNet-1k
Scene Understanding
Identifies scene types in images (e.g., palaces or natural landscapes)
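For both use cases above, the classifier's raw scores are usually turned into a short ranked list of labels. The sketch below shows one way to do that with a softmax and a top-k selection in plain Python. The logits and label names in the example are made up for illustration, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, labels, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(labels[i], probs[i]) for i in ranked[:k]]

if __name__ == "__main__":
    # Illustrative values only; a real model would emit 1,000 logits.
    example_logits = [2.0, 1.0, 0.1]
    example_labels = ["palace", "castle", "monastery"]
    for label, prob in top_k(example_logits, example_labels, k=2):
        print(f"{label}: {prob:.3f}")
```

Reporting the top few labels rather than a single answer is often more useful for scene-type images, where several ImageNet-1k categories can plausibly apply.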