Vision Transformer (base-sized model)
The Vision Transformer (ViT) is a transformer-based model for image classification. It is pre-trained on a large-scale image dataset and then fine-tuned on a labeled classification dataset, which lets it extract useful features from images for downstream tasks.
Quick Start
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). It was first pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224, and then fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at a resolution of 384x384. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer).
Usage Examples
Basic Usage
Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests

# Load an example image from the COCO 2017 validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the feature extractor and the fine-tuned classification model
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-384')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-384')

# Preprocess the image and run a forward pass
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1,000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
Currently, both the feature extractor and the model support PyTorch. TensorFlow and JAX/Flax support is coming soon, and the API of ViTFeatureExtractor might change.
Features
- Large-scale pre-training: Pretrained on ImageNet-21k, which contains 14 million images and 21,843 classes, to learn rich image representations.
- Fine-tuning: Fine-tuned on ImageNet 2012 at a higher resolution (384x384), improving performance on image classification tasks.
- Flexible downstream use: Can be used as a feature extractor for various downstream tasks by adding a linear layer on top of the pre-trained encoder.
Installation
No specific installation steps are provided in the original README. The usage example above only requires the transformers library (with PyTorch) together with Pillow and requests.
Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for classification tasks, and absolute position embeddings are added before feeding the sequence to the Transformer encoder layers.
By pre-training, the model learns an inner representation of images. For downstream tasks, a linear layer can be placed on top of the pre-trained encoder, typically on top of the [CLS] token, as the last hidden state of this token represents the entire image.
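As a concrete illustration of this idea, the sketch below uses the plain ViTModel encoder (without a classification head) to obtain the last hidden state of the [CLS] token and feeds it to a downstream linear layer. This is a minimal example added here for clarity, not part of the original README; the downstream head and the number of task classes are purely illustrative.

```python
import torch
from transformers import ViTFeatureExtractor, ViTModel
from PIL import Image
import requests

# Load an example image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Plain encoder without the classification head
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-384')
encoder = ViTModel.from_pretrained('google/vit-base-patch16-384')

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# At 384x384 with 16x16 patches the sequence is 24*24 = 576 patches plus the
# [CLS] token; its last hidden state serves as a representation of the image.
cls_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, 768)

# Hypothetical downstream head: a linear layer mapping the embedding to task classes
num_task_classes = 10  # illustrative
classifier = torch.nn.Linear(cls_embedding.shape[-1], num_task_classes)
task_logits = classifier(cls_embedding)
```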
Intended uses & limitations
You can use the raw model for image classification. Check the model hub for fine - tuned versions on tasks that interest you.
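For a quick check without writing the pre- and post-processing yourself, the image-classification pipeline from transformers can also load this checkpoint. This snippet is an illustrative addition, not part of the original README.

```python
from transformers import pipeline

# The pipeline wraps the feature extractor and the model in a single call
classifier = pipeline("image-classification", model="google/vit-base-patch16-384")

predictions = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
for p in predictions:
    print(p["label"], round(p["score"], 4))
```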
Technical Details
Training data
The ViT model was pretrained on [ImageNet-21k](http://www.image-net.org/), a dataset with 14 million images and 21k classes, and fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/), a dataset with 1 million images and 1k classes.
Training procedure
Preprocessing
The exact preprocessing details of images during training/validation can be found [here](https://github.com/google-research/vision_transformer/blob/master/vit_jax/input_pipeline.py). Images are resized/rescaled to the same resolution (224x224 during pre-training, 384x384 during fine-tuning) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
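To make this normalization concrete, the snippet below reproduces the resize-and-normalize step with torchvision transforms, assuming the 384x384 fine-tuning resolution. It is only a sketch of the equivalent inference-time preprocessing, not the exact training pipeline linked above; in practice ViTFeatureExtractor applies this preprocessing for you.

```python
from torchvision import transforms
from PIL import Image
import requests

# Resize to the fine-tuning resolution and normalize each RGB channel
# with mean 0.5 and std 0.5, so pixel values end up in [-1, 1].
preprocess = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)  # shape: (1, 3, 384, 384)
```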
Pretraining
The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and a learning rate warm-up of 10k steps. For ImageNet, gradient clipping at global norm 1 was additionally applied. The pre-training resolution is 224.
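To make these hyperparameters concrete, here is a minimal PyTorch-style sketch of one optimization step using the values mentioned above (10k warm-up steps, gradient clipping at global norm 1; the batch size of 4096 would be reached via data parallelism). The optimizer choice, peak learning rate, and the linear warm-up schedule are illustrative assumptions, not details taken from this README.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR
from transformers import ViTForImageClassification

# Values stated above; the optimizer and peak learning rate are assumptions.
WARMUP_STEPS = 10_000
MAX_GRAD_NORM = 1.0
PEAK_LR = 3e-3  # assumed, not from the README

model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-384')
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)

# Linear learning-rate warm-up over the first 10k steps
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS))

def training_step(pixel_values, labels):
    """One optimization step with gradient clipping at global norm 1."""
    outputs = model(pixel_values=pixel_values, labels=labels)
    outputs.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```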
Evaluation results
For evaluation results on several image classification benchmarks, refer to tables 2 and 5 of the original paper. Note that for fine - tuning, better results are obtained with a higher resolution (384x384), and increasing the model size will lead to better performance.
BibTeX entry and citation info
@misc{dosovitskiy2020image,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
year={2020},
eprint={2010.11929},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{deng2009imagenet,
title={Imagenet: A large-scale hierarchical image database},
author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
booktitle={2009 IEEE conference on computer vision and pattern recognition},
pages={248--255},
year={2009},
organization={IEEE}
}
License
This model is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (base-sized model) |
| Training Data | Pretrained on ImageNet-21k (14 million images, 21,843 classes), fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) |