Vit-base-patch32-384 Open-source Image Classification Model - Achieve Efficient Image Recognition for Free

Vit Base Patch32 384

Developed by Google

Vision Transformer (ViT) is an image classification model based on the Transformer architecture. It is pre-trained on ImageNet-21k and fine-tuned on ImageNet.

Image Classification | Open Source | License: Apache-2.0 | Tags: high-resolution image classification, Transformer vision model, ImageNet fine-tuning

Downloads: 24.92k

Release Time: 3/2/2022

Model Overview

The ViT model divides images into fixed-size patches and extracts features through a Transformer encoder, making it suitable for image classification tasks. The model is pre-trained on ImageNet-21k and fine-tuned on ImageNet, supporting high-resolution image processing.

Model Features

Transformer-based image processing

Divides images into fixed-size patches and extracts features through a Transformer encoder instead of convolutional layers.

High-resolution fine-tuning

Fine-tuned on ImageNet at 384x384 resolution, improving the model's classification performance on high-resolution images.

Large-scale pre-training

Pre-trained on ImageNet-21k (14 million images, 21,843 classes), learning rich image feature representations.

Model Capabilities

Image classification

Feature extraction

Use Cases

Computer vision

ImageNet image classification

Classifies images into one of the 1,000 ImageNet categories.

Evaluation results on the ImageNet dataset are reported in the original paper.

🚀 Vision Transformer (base-sized model)

The Vision Transformer (ViT) is a pre-trained model for image recognition. It was first pre-trained on ImageNet-21k and then fine-tuned on ImageNet 2012. The model extracts image features that are useful for various downstream image-related tasks.

🚀 Quick Start

The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). It was pre-trained on a large image collection (ImageNet-21k) at 224x224 resolution and then fine-tuned on ImageNet (ILSVRC2012) at 384x384 resolution.

Here is an example of using this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:

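from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests
import torch

# Download a sample image from the COCO 2017 validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessor and the fine-tuned classifier
# (recent transformers releases deprecate ViTFeatureExtractor in favor of ViTImageProcessor)
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch32-384')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch32-384')

# Preprocess the image and return PyTorch tensors
inputs = feature_extractor(images=image, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1,000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])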

Currently, both the feature extractor and the model support PyTorch. TensorFlow and JAX/Flax support is coming soon, and the API of ViTFeatureExtractor might change.
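
The classification example above reports only the single highest-scoring class. As a minimal sketch of how the logits could be turned into a ranked list of probabilities, the following pure-Python helpers apply a softmax and keep the top k entries. The function names `softmax` and `top_k` are hypothetical and not part of the transformers API.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


# Toy scores for three made-up classes
toy_logits = [1.0, 3.0, 2.0]
toy_labels = {0: "cat", 1: "dog", 2: "bird"}
for label, prob in top_k(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, `outputs.logits[0].tolist()` and `model.config.id2label` could be passed in place of the toy values.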

✨ Features

Pre-trained on large-scale datasets: The model is pre-trained on ImageNet-21k and fine-tuned on ImageNet 2012, enabling it to learn rich image representations.
Flexible for downstream tasks: By adding a linear layer on top of the pre-trained encoder, it can be used for various image-related downstream tasks.
High-resolution support: The model was fine-tuned at a higher resolution of 384x384, which can potentially improve performance.

📚 Documentation

Model description

The Vision Transformer (ViT) is a transformer encoder model (BERT-like). It was pre-trained on ImageNet-21k at 224x224 resolution and then fine-tuned on ImageNet (ILSVRC2012) at 384x384 resolution.

Images are presented to the model as a sequence of fixed-size patches (32x32), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for classification tasks, and absolute position embeddings are added before feeding the sequence to the Transformer encoder layers.
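
To make the patch arithmetic concrete, here is a small sketch that computes how many tokens the encoder receives for a square image. The helper names are hypothetical, and the three-channel (RGB) default is an assumption based on the normalization described in this card.

```python
def vit_sequence_length(image_size: int, patch_size: int, add_cls_token: bool = True) -> int:
    """Number of tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if add_cls_token else 0)


def patch_vector_length(patch_size: int, channels: int = 3) -> int:
    """Length of one flattened patch before the linear embedding."""
    return patch_size * patch_size * channels


print(vit_sequence_length(384, 32))  # 145: 12 x 12 patches plus [CLS]
print(vit_sequence_length(224, 32))  # 50: 7 x 7 patches plus [CLS]
print(patch_vector_length(32))       # 3072 values per flattened patch
```

Fine-tuning at 384x384 therefore yields a longer sequence (145 tokens) than pre-training at 224x224 (50 tokens) with the same 32x32 patch size.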

Intended uses & limitations

You can use the raw model for image classification. Check the model hub for fine-tuned versions on tasks that interest you.

Training data

The ViT model was pre-trained on ImageNet-21k, a dataset with 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset with 1 million images and 1k classes.

Training procedure

Preprocessing

The exact preprocessing details during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224 during pre-training, 384x384 during fine-tuning) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
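
The effect of this normalization can be sketched in a few lines. The example assumes 8-bit pixel values that are first scaled to [0, 1] by dividing by 255, which is an assumption about the rescaling step and not stated in this card. The function names are hypothetical.

```python
def normalize_channel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Scale an 8-bit channel value to [0, 1], then apply (x - mean) / std."""
    scaled = value / 255.0
    return (scaled - mean) / std


def normalize_pixel(rgb, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Normalize one (R, G, B) pixel channel by channel."""
    return tuple(normalize_channel(c, m, s) for c, m, s in zip(rgb, mean, std))


print(normalize_pixel((0, 255, 0)))  # (-1.0, 1.0, -1.0)
```

Under these assumptions, a mean and standard deviation of 0.5 map the input range [0, 255] onto [-1, 1].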

Pretraining

The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and a learning rate warm-up of 10k steps. For ImageNet, gradient clipping at global norm 1 was additionally applied. The pre-training resolution is 224.
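
The two optimization details mentioned here, learning rate warm-up and gradient clipping at a global norm, can be illustrated with a short pure-Python sketch. It assumes a linear warm-up shape and holds the rate constant afterwards, which simplifies whatever schedule was actually used. The base learning rate in the example is a made-up value, and the function names are hypothetical.

```python
import math


def warmup_lr(step: int, base_lr: float, warmup_steps: int = 10_000) -> float:
    """Linearly ramp the learning rate from 0 to base_lr over warmup_steps."""
    if step >= warmup_steps:
        return base_lr  # any decay after warm-up is not modelled here
    return base_lr * step / warmup_steps


def clip_by_global_norm(grads, max_norm: float = 1.0):
    """Rescale all gradients together so their joint L2 norm is at most max_norm.

    grads is a list of flat lists of floats, one list per parameter tensor.
    """
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        return [list(tensor) for tensor in grads]
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads]


# Hypothetical base learning rate, halfway through warm-up
print(warmup_lr(5_000, base_lr=1e-3))

# Two toy "tensors" whose joint norm is 5 are scaled down to norm 1
print(clip_by_global_norm([[3.0], [4.0]]))
```

Clipping by the global norm scales every gradient by the same factor, so the update direction is preserved while its magnitude is bounded.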

Evaluation results

For evaluation results on several image classification benchmarks, refer to Tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Increasing the model size will generally lead to better performance.

🔧 Technical Details

The Vision Transformer (ViT) uses a transformer encoder architecture. It processes images by dividing them into fixed-size patches, linearly embedding these patches, and then adding position embeddings. By pre-training on large-scale image datasets, it can learn rich image representations, which can be used for downstream tasks by adding a linear layer on top of the pre-trained encoder.
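
As a rough illustration of what "adding a linear layer on top of the pre-trained encoder" means, the sketch below applies a linear classification head to a feature vector standing in for the encoder's [CLS] output. The tiny dimensions and hand-picked weights are made up for readability, and `linear_head` is a hypothetical name. In practice the weights would be learned and the feature vector would come from the encoder.

```python
def linear_head(cls_embedding, weights, bias):
    """Compute one logit per class: logits[c] = dot(weights[c], x) + bias[c]."""
    return [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, bias)
    ]


# Toy example: a 2-dimensional feature vector and 3 output classes
cls_embedding = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.5, -1.0]

logits = linear_head(cls_embedding, weights, bias)
print(logits)  # [1.0, 2.5, 2.0]
print("Predicted class index:", logits.index(max(logits)))  # 1
```

Only the head needs to change when the number of target classes changes, which is what makes the pre-trained encoder reusable across downstream tasks.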

📄 License

The model is released under the Apache-2.0 license.

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2010.11929,
  doi = {10.48550/ARXIV.2010.11929},
  url = {https://arxiv.org/abs/2010.11929},
  author = {Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  publisher = {arXiv},
  year = {2020},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

@inproceedings{deng2009imagenet,
  title={Imagenet: A large-scale hierarchical image database},
  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
  booktitle={2009 IEEE conference on computer vision and pattern recognition},
  pages={248--255},
  year={2009},
  organization={IEEE}
}
