
ViT Hybrid Base BiT 384

Developed by Google
The hybrid Vision Transformer (ViT) model combines a convolutional backbone with a Transformer encoder for image classification, with strong results on ImageNet.
Downloads 992.28k
Release Time: 12/6/2022

Model Overview

This model is a hybrid version of the Vision Transformer (ViT), achieving efficient image classification by utilizing features from a convolutional backbone network (BiT) as initial tokens for the Transformer.

Model Features

Combining Convolutional and Transformer Advantages
Extracts features through a convolutional backbone network and then inputs them into a Transformer encoder, combining local feature extraction with global relationship modeling capabilities.
Efficient Training
Significantly reduces computational resources required for training compared to pure convolutional networks while maintaining excellent performance.
High-resolution Support
Accepts 384x384 input; the best results are obtained when the model is fine-tuned at this higher resolution.
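
As an illustration of how convolutional features can serve as initial tokens, the following minimal sketch flattens a feature map into a token sequence and estimates the sequence length for a 384x384 input. It is not the model's actual implementation: the function names are hypothetical, and the backbone downsampling factor of 16 is an assumption chosen for the example, not a value stated on this page.

```python
def feature_map_to_tokens(feature_map):
    """Flatten a [C][H][W] feature map into H*W tokens of length C (row-major)."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def token_count(image_size, backbone_stride, add_class_token=True):
    """Sequence length the encoder would see for a square input image."""
    if image_size % backbone_stride != 0:
        raise ValueError("image_size must be divisible by backbone_stride")
    side = image_size // backbone_stride
    return side * side + (1 if add_class_token else 0)


# Toy feature map: 2 channels over a 2x2 spatial grid.
toy = [
    [[1, 2], [3, 4]],      # channel 0
    [[10, 20], [30, 40]],  # channel 1
]
tokens = feature_map_to_tokens(toy)
print(tokens)  # [[1, 10], [2, 20], [3, 30], [4, 40]]

# ASSUMPTION: the backbone reduces each spatial side by a factor of 16.
print(token_count(384, 16))  # 577 (24 * 24 positions plus one class token)
```

Each spatial position of the feature map becomes one token, so a larger input resolution produces a longer sequence for the Transformer encoder.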

Model Capabilities

Image Classification
Feature Extraction

Use Cases

Computer Vision
ImageNet Image Classification
Classifies images into one of 1000 ImageNet categories.
Performs excellently on the ImageNet benchmark.
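
Choosing one of the 1000 categories typically comes down to ranking the classifier's output scores. The sketch below shows that final step in plain Python, assuming the model returns one logit per class; the logit values and class indices are made up for illustration, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=5):
    """Return the k most probable (class_index, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Hypothetical logits for 1000 classes; indices 281 and 285 are arbitrary.
logits = [0.0] * 1000
logits[281] = 8.0
logits[285] = 5.0

best = top_k(logits, k=2)
print([index for index, _ in best])  # [281, 285]
```

Mapping the winning index to a human-readable label requires the class list that ships with the model, which is not reproduced here.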