
DeiT Base Distilled Patch16 224

Developed by Facebook
The distilled version of the Data-efficient Image Transformer (DeiT) was pre-trained and fine-tuned on ImageNet-1k at 224x224 resolution, learning from a teacher model through knowledge distillation.
Downloads 35.53k
Release Time: 3/2/2022

Model Overview

This model is a distilled version of the Vision Transformer (ViT) that learns from a teacher CNN through a dedicated distillation token, and is suited to image classification tasks.

Model Features

Distillation Learning
Learns from a teacher CNN through a dedicated distillation token, improving accuracy over training on labels alone.
Efficient Training
Trained for 3 days on a single 8-GPU node at a resolution of 224x224.
High-Resolution Support
Can be fine-tuned at 384x384 resolution for further gains in classification accuracy.
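
The distillation-token training described above uses DeiT's hard-label distillation objective: the class token is supervised by the ground-truth label while the distillation token is supervised by the teacher's predicted class. Below is a minimal toy sketch of that objective on made-up logits; the function names (`cross_entropy`, `hard_distillation_loss`) are illustrative, not taken from the DeiT codebase:

```python
import math

def cross_entropy(logits, target):
    # Softmax cross-entropy for a single example (toy, numerically naive).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def hard_distillation_loss(cls_logits, dist_logits, label, teacher_logits):
    # Hard-label distillation: the class token is supervised by the
    # ground-truth label, the distillation token by the teacher's argmax,
    # and the two losses are averaged.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return 0.5 * cross_entropy(cls_logits, label) + \
           0.5 * cross_entropy(dist_logits, teacher_label)

# Toy 3-class example: the teacher disagrees with the ground-truth label,
# so the distillation term dominates the combined loss.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],
    dist_logits=[1.5, 0.2, -0.5],
    label=0,
    teacher_logits=[0.1, 3.0, -2.0],  # teacher predicts class 1
)
```

Because the two tokens receive different supervision signals, they learn complementary representations that are combined at inference time.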

Model Capabilities

Image Classification
Visual Feature Extraction

Use Cases

Computer Vision
ImageNet Image Classification
Classifies images into one of the 1000 ImageNet categories.
Top-1 accuracy 83.4%, Top-5 accuracy 96.5% on the ImageNet-1k validation set.
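
At inference time the model combines its two heads: the logits from the class token and the distillation token are averaged before taking the argmax over the 1000 ImageNet classes (in the Hugging Face `transformers` library this is what `DeiTForImageClassificationWithTeacher` does internally). A minimal sketch on toy logits, with the helper name `predict` being illustrative:

```python
def predict(cls_logits, dist_logits):
    # Average the class-token and distillation-token logits element-wise,
    # then return the index of the highest averaged logit.
    avg = [(c + d) / 2 for c, d in zip(cls_logits, dist_logits)]
    return max(range(len(avg)), key=avg.__getitem__)

# Toy 4-class example: both heads lean toward class index 2.
pred = predict([0.1, 0.3, 2.0, -1.0], [0.0, 0.5, 1.8, -0.7])
```

Averaging the two heads lets the final prediction benefit from both the label-supervised and the teacher-supervised representations.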