Vision Transformer (base-sized model) - Hybrid
The hybrid Vision Transformer (ViT) is a model that combines a convolutional backbone with a Transformer encoder, achieving excellent results in image classification tasks with fewer computational resources.
Quick Start
The hybrid Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy et al. It's the first paper that successfully trains a Transformer encoder on ImageNet, outperforming familiar convolutional architectures. ViT hybrid is a variant of the plain Vision Transformer, using a convolutional backbone (BiT) to generate initial "tokens" for the Transformer.
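As a quick illustration of this hybrid design, the sketch below loads the checkpoint's configuration and shows that the Transformer encoder settings sit alongside a separate configuration for the convolutional (BiT) backbone. This assumes the ViTHybridConfig API available in recent versions of Transformers:
from transformers import ViTHybridConfig
# Load the configuration of the hybrid checkpoint (illustrative sketch)
config = ViTHybridConfig.from_pretrained('google/vit-hybrid-base-bit-384')
print(config.hidden_size, config.num_hidden_layers)  # Transformer encoder settings
print(type(config.backbone_config).__name__)         # configuration of the convolutional (BiT) backbone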
Disclaimer: The team releasing ViT did not write a model card for this model, so this model card is written by the Hugging Face team.
Features
- Breakthrough in Vision: Demonstrates that a pure Transformer on image patches can excel in image classification, reducing reliance on CNNs.
- Resource-Efficient: Achieves excellent results on multiple benchmarks with fewer computational resources for training.
Installation
No specific installation steps are provided in the original document.
Usage Examples
Basic Usage
Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
from transformers import ViTHybridImageProcessor, ViTHybridForImageClassification
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
feature_extractor = ViTHybridImageProcessor.from_pretrained('google/vit-hybrid-base-bit-384')
model = ViTHybridForImageClassification.from_pretrained('google/vit-hybrid-base-bit-384')
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
# Expected output: Predicted class: tabby, tabby cat
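As an illustrative extension of the snippet above (not part of the original card), the logits can also be converted to probabilities to inspect the top-5 predicted classes:
import torch

with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and report the five most likely ImageNet classes
probs = outputs.logits.softmax(dim=-1)[0]
top5 = probs.topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{model.config.id2label[idx.item()]}: {score.item():.3f}")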
For more code examples, refer to the documentation.
Documentation
Model description
While the Transformer architecture has become the standard in natural language processing, its use in computer vision has been more limited. This model shows that a pure Transformer applied to image patches can perform well in image classification. When pre-trained on large amounts of data and transferred to multiple benchmarks, it outperforms state-of-the-art CNNs while requiring fewer resources to train.
Intended uses & limitations
You can use the raw model for image classification. Check the model hub for fine-tuned versions for your specific task.
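If you want to fine-tune the model yourself, the sketch below shows one common pattern: reload the checkpoint with a freshly initialized classification head sized for your own labels. The label names here are placeholders and the approach is illustrative, not taken from the original card.
from transformers import ViTHybridForImageClassification

labels = ['cat', 'dog']  # placeholder labels for a hypothetical downstream task
model = ViTHybridForImageClassification.from_pretrained(
    'google/vit-hybrid-base-bit-384',
    num_labels=len(labels),
    id2label={i: label for i, label in enumerate(labels)},
    label2id={label: i for i, label in enumerate(labels)},
    ignore_mismatched_sizes=True,  # discard the 1,000-class ImageNet head and initialize a new one
)
# The model can then be trained with the Trainer API or a plain PyTorch loop
# on batches of (pixel_values, labels).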
Technical Details
Training data
The ViT-Hybrid model was pretrained on ImageNet-21k, a dataset with 14 million images and 21k classes, and fine-tuned on ImageNet, which has 1 million images and 1k classes.
Training procedure
Preprocessing
The exact preprocessing details used during training and validation can be found here. Images are resized/rescaled to 224x224 and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
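For illustration, the normalization described above corresponds roughly to the following torchvision pipeline; in practice ViTHybridImageProcessor applies these steps for you, and the exact resize behaviour may differ per checkpoint:
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                                      # resize to the training resolution
    transforms.ToTensor(),                                              # rescale pixel values to [0, 1]
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),    # per-channel normalization
])

pixel_values = preprocess(image).unsqueeze(0)  # 'image' is a PIL image; result shape: (1, 3, 224, 224)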
Pretraining
The model was trained on TPUv3 hardware (8 cores). All variants are trained with a batch size of 4096 and a 10k-step learning rate warmup. For ImageNet, gradient clipping at global norm 1 is beneficial. The training resolution is 224.
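Expressed as a PyTorch sketch, the reported settings look roughly like the following. Only the batch size, the 10k-step warmup, and the global-norm clipping come from the card; the optimizer choice and learning rate are assumptions for illustration.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)  # optimizer and learning rate are illustrative assumptions
warmup_steps = 10_000  # 10k-step learning rate warmup

def lr_lambda(step):
    # Linear warmup from 0 to the base learning rate over the first 10k steps, then constant
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at global norm 1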
Evaluation results
Refer to tables 2 and 5 of the original paper for evaluation results on several image classification benchmarks. Higher resolution (384x384) yields better fine-tuning results, and larger model sizes perform better.
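To check the resolution a given checkpoint expects at inference time, you can inspect its image processor (a small illustrative check; the exact contents of the size attribute may vary across Transformers versions):
from transformers import ViTHybridImageProcessor

processor = ViTHybridImageProcessor.from_pretrained('google/vit-hybrid-base-bit-384')
print(processor.size)  # for this checkpoint, a 384-pixel target resolution is expected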
BibTeX entry and citation info
@misc{wu2020visual,
  title={Visual Transformers: Token-based Image Representation and Processing for Computer Vision},
  author={Bichen Wu and Chenfeng Xu and Xiaoliang Dai and Alvin Wan and Peizhao Zhang and Zhicheng Yan and Masayoshi Tomizuka and Joseph Gonzalez and Kurt Keutzer and Peter Vajda},
  year={2020},
  eprint={2006.03677},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
@inproceedings{deng2009imagenet,
  title={Imagenet: A large-scale hierarchical image database},
  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
  booktitle={2009 IEEE conference on computer vision and pattern recognition},
  pages={248--255},
  year={2009},
  organization={IEEE}
}
License
This model is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Hybrid Vision Transformer |
| Training Data | Pretrained on ImageNet-21k, fine-tuned on ImageNet |