🚀 Vision Transformer (small-sized model, patch size 16) trained using DINO
A Vision Transformer (ViT) model trained with the DINO method, offering valuable image representation for various vision tasks.
🚀 Quick Start
The Vision Transformer (ViT) model presented here was trained using the DINO method. It was first introduced in the paper Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, and initially released in this repository.
Disclaimer: The team releasing DINO did not write a model card for this model, so this model card is created by the Hugging Face team.
✨ Features
- Self-supervised Pretraining: The model is pretrained on ImageNet-1k in a self-supervised manner, learning rich image representations.
- Patch-based Input: Images are processed as a sequence of fixed-size patches (16x16), enabling efficient encoding.
- Feature Extraction: Suitable for extracting features for downstream vision tasks.
📚 Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). It is pretrained on a large collection of images, namely ImageNet-1k, at a resolution of 224x224 pixels in a self-supervised fashion.
Images are fed to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added at the start of the sequence for classification tasks. Absolute position embeddings are also added before the sequence is fed into the Transformer encoder layers.
Note that this model does not include any fine-tuned heads.
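As a quick illustration (not part of the original model card), a 224x224 input split into 16x16 patches yields 14 x 14 = 196 patches; with the [CLS] token prepended, the encoder therefore processes a sequence of 197 tokens:
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
sequence_length = num_patches + 1              # +1 for the [CLS] token -> 197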
Through pre-training, the model learns an internal representation of images. These representations can be used to extract features for downstream tasks: for example, if you have a labeled image dataset, you can train a standard classifier by adding a linear layer on top of the pre-trained encoder. The linear layer is usually placed on top of the [CLS] token, as the last hidden state of this token can be regarded as a representation of the entire image.
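Below is a minimal sketch of such a linear probe in PyTorch; the number of classes, the dataset, and the training loop are assumptions for illustration and are not prescribed by this model card:
import torch
from torch import nn
from transformers import ViTModel

# Load the pretrained DINO backbone and freeze it
backbone = ViTModel.from_pretrained('facebook/dino-vits16')
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 10  # hypothetical number of labels in your dataset
classifier = nn.Linear(backbone.config.hidden_size, num_classes)  # hidden_size is 384 for ViT-S/16

def extract_logits(pixel_values):
    # Run the frozen encoder, take the [CLS] token, and classify it
    with torch.no_grad():
        hidden = backbone(pixel_values=pixel_values).last_hidden_state
    cls_token = hidden[:, 0]
    return classifier(cls_token)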
Intended uses & limitations
You can use the raw model for image classification. Check the model hub to find fine-tuned versions for tasks that interest you.
How to use
Here is how to use this model:
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Download an example image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the pretrained DINO ViT-S/16 backbone
processor = ViTImageProcessor.from_pretrained('facebook/dino-vits16')
model = ViTModel.from_pretrained('facebook/dino-vits16')

# Preprocess the image and extract the patch-level features
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
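Here, last_hidden_states has shape (batch_size, sequence_length, hidden_size), i.e. (1, 197, 384) for this checkpoint. If you need a single vector per image, one common choice (an assumption, not prescribed by this model card) is to take the [CLS] token:
cls_embedding = last_hidden_states[:, 0]  # shape (1, 384): image-level representation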
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-2104-14294,
author = {Mathilde Caron and
Hugo Touvron and
Ishan Misra and
Herv{\'{e}} J{\'{e}}gou and
Julien Mairal and
Piotr Bojanowski and
Armand Joulin},
title = {Emerging Properties in Self-Supervised Vision Transformers},
journal = {CoRR},
volume = {abs/2104.14294},
year = {2021},
url = {https://arxiv.org/abs/2104.14294},
archivePrefix = {arXiv},
eprint = {2104.14294},
timestamp = {Tue, 04 May 2021 15:12:43 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2104-14294.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
📄 License
This model is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (small-sized model, patch size 16) trained using DINO |
| Training Data | ImageNet-1k |