Vision Transformer (large-sized model)
The Vision Transformer (ViT) is a pre-trained model for image classification. It is first pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224, and then fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at a resolution of 384x384.
🚀 Quick Start
You can use the raw model for image classification. Check out the model hub to find fine-tuned versions for tasks that interest you.
Here is an example of using this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests

# Load a test image from the COCO 2017 validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# The feature extractor resizes and normalizes the image for the model
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-large-patch32-384')
model = ViTForImageClassification.from_pretrained('google/vit-large-patch32-384')

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1,000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
Currently, both the feature extractor and model support PyTorch. TensorFlow and JAX/Flax support will be added soon, and the API of ViTFeatureExtractor might change.
✨ Features
- Powerful Representation: By pre-training on a large-scale image dataset, the model learns an inner representation of images that can be used for various downstream tasks.
- Flexible for Downstream Tasks: You can place a linear layer on top of the pre-trained encoder to train a standard classifier for your specific dataset (see the sketch after this list).
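As an illustration of that second point, here is a minimal sketch of such a classifier: it wraps the plain ViTModel encoder and puts a randomly initialized linear head on top of the [CLS] token's final hidden state. The ViTClassifier class and num_labels value are hypothetical placeholders; ViTForImageClassification already implements essentially this pattern for you.

import torch.nn as nn
from transformers import ViTModel

class ViTClassifier(nn.Module):  # hypothetical wrapper, for illustration only
    def __init__(self, num_labels):
        super().__init__()
        self.encoder = ViTModel.from_pretrained('google/vit-large-patch32-384')
        # Linear head mapping the encoder's hidden size to your label count
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, pixel_values):
        outputs = self.encoder(pixel_values=pixel_values)
        cls_state = outputs.last_hidden_state[:, 0]  # final hidden state of [CLS]
        return self.head(cls_state)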
📚 Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). It is pretrained on ImageNet-21k at a resolution of 224x224 pixels in a supervised fashion, and then fine-tuned on ImageNet (ILSVRC2012) at a higher resolution of 384x384.
Images are presented to the model as a sequence of fixed-size patches (32x32 resolution), which are linearly embedded. A [CLS] token is added at the beginning of the sequence for classification tasks. Absolute position embeddings are also added before feeding the sequence to the Transformer encoder layers.
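To make the sequence layout concrete, here is a quick back-of-the-envelope check using only the numbers above:

image_size = 384  # fine-tuning resolution of this checkpoint
patch_size = 32   # fixed patch size
num_patches = (image_size // patch_size) ** 2  # (384/32)^2 = 12 * 12 = 144
seq_len = num_patches + 1                      # +1 for the [CLS] token
print(num_patches, seq_len)                    # 144 145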
Intended uses & limitations
You can use the raw model for image classification. Note, however, that it currently supports only PyTorch; TensorFlow and JAX/Flax support is coming soon.
📦 Installation
The quick-start example above only needs the Transformers library plus PyTorch, Pillow, and Requests:

pip install transformers torch pillow requests
🔧 Technical Details
Training data
The ViT model was pretrained on ImageNet-21k, a dataset of 14 million images and 21k classes, and fine-tuned on ImageNet, which consists of 1 million images and 1k classes.
Training procedure
Preprocessing
The detailed preprocessing steps for images during training/validation can be found here. Images are resized/rescaled to a fixed resolution (224x224 for pre-training, 384x384 for fine-tuning) and normalized across the RGB channels with a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5).
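ViTFeatureExtractor applies this pipeline for you, but the same preprocessing is easy to reproduce by hand. A minimal torchvision sketch, assuming the 384x384 fine-tuning resolution:

from torchvision import transforms

# Resize to the target resolution, convert to a [0, 1] tensor,
# then normalize each RGB channel with mean 0.5 and std 0.5.
preprocess = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
pixel_values = preprocess(image).unsqueeze(0)  # `image` is a PIL image; shape (1, 3, 384, 384)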
Pretraining
The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and a learning-rate warm-up of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at a global norm of 1. The pre-training resolution is 224x224.
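The actual training code is not part of this card; purely as an illustration, here is a rough PyTorch sketch of the warm-up and gradient-clipping settings above. The optimizer choice, base learning rate, and dummy batch are assumptions, not the authors' exact recipe:

import torch
from torch.optim.lr_scheduler import LambdaLR
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained('google/vit-large-patch32-384')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder base LR
warmup_steps = 10_000
# Linear learning-rate warm-up over the first 10k steps
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

pixel_values = torch.randn(1, 3, 384, 384)  # dummy batch standing in for real data
labels = torch.tensor([0])
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()
# Clip gradients at a global norm of 1, as stated above
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()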
Evaluation results
For evaluation results on several image classification benchmarks, refer to Tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained at a higher resolution (384x384), and that increasing the model size leads to better performance.
📄 License
This model is licensed under the Apache-2.0 license.
BibTeX entry and citation info
@misc{wu2020visual,
  title={Visual Transformers: Token-based Image Representation and Processing for Computer Vision},
  author={Bichen Wu and Chenfeng Xu and Xiaoliang Dai and Alvin Wan and Peizhao Zhang and Zhicheng Yan and Masayoshi Tomizuka and Joseph Gonzalez and Kurt Keutzer and Peter Vajda},
  year={2020},
  eprint={2006.03677},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
@inproceedings{deng2009imagenet,
  title={ImageNet: A large-scale hierarchical image database},
  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
  booktitle={2009 IEEE Conference on Computer Vision and Pattern Recognition},
  pages={248--255},
  year={2009},
  organization={IEEE}
}