
ViT Large Patch16 224

Developed by Google
A large-scale image classification model based on the Transformer architecture, pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k
Downloads: 188.47k
Release Time: 3/2/2022

Model Overview

Vision Transformer (ViT) is an image classification model built on Transformer encoders: an image is split into fixed-size patches that are processed as a token sequence. This model is pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k, making it well suited to image classification tasks.
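The patch-splitting step above can be sketched in a few lines. This is an illustrative NumPy sketch of the general ViT recipe, not this checkpoint's actual code: each non-overlapping 16x16 patch is flattened into one vector, and the real model then projects each vector to its hidden size (1024 for ViT-Large) with a learned linear layer, which is omitted here.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # (H/p, p, W/p, p, C) -> (num_patches, p*p*C), row-major patch order
    blocks = image.reshape(h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, patch * patch * c)

image = np.random.rand(224, 224, 3).astype(np.float32)
tokens = patchify(image)
# 224/16 = 14 patches per side -> 14*14 = 196 tokens,
# each of length 16*16*3 = 768 before the learned projection
print(tokens.shape)  # (196, 768)
```

A [CLS] token is prepended to this sequence in the real model, giving 197 tokens at 224x224 resolution.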

Model Features

Transformer-based Visual Processing
Divides images into sequences of 16x16 patches and processes them using a BERT-like Transformer architecture
Large-scale Pre-training
Pre-trained on the ImageNet-21k dataset containing 14 million images
High-resolution Support
Accepts 224x224 pixel input; better performance is typically achievable by fine-tuning at higher resolutions (e.g., 384x384)

Model Capabilities

Image Classification
Visual Feature Extraction

Use Cases

Computer Vision
Image Classification
Classifies images into the 1,000 ImageNet-1k categories
Strong accuracy reported on ImageNet benchmarks
Feature Extraction
Extracts image features for downstream tasks
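For the classification use case above, the final step can be sketched as follows. This is a minimal illustration with random weights (assumed shapes only, no real checkpoint): the encoder's output for the [CLS] token (hidden size 1024 in ViT-Large) is mapped by a linear head to 1000 ImageNet logits, and softmax converts them to class probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
cls_token = rng.standard_normal(1024)               # [CLS] output (illustrative)
W = rng.standard_normal((1000, 1024)) * 0.02        # classification head weights
b = np.zeros(1000)                                  # classification head bias

logits = W @ cls_token + b                          # (1000,) class scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # numerically stable softmax
pred = int(probs.argmax())                          # predicted ImageNet class index
print(pred, float(probs.sum()))
```

For feature extraction, the same [CLS] vector (or the mean of all patch tokens) is used directly as an image embedding for downstream tasks instead of passing through this head.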