🚀 Vision Transformer (small-sized model, patch size 8) trained using DINO
A Vision Transformer (ViT) model trained with the self-supervised DINO method, producing image representations that can be reused for downstream tasks.
🚀 Quick Start
The Vision Transformer (ViT) model in this repository is trained using the DINO method. It was introduced in the paper Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, and first released in this repository.
Disclaimer: The team releasing DINO did not write a model card for this model, so this model card is created by the Hugging Face team.
✨ Features
- Self-Supervised Learning: Pretrained on ImageNet-1k in a self-supervised manner, learning useful image representations.
- Patch-Based Input: Processes images as a sequence of fixed-size 8x8 patches.
- Feature Extraction: Can be used to extract features for downstream tasks.
📚 Documentation
Model description
The Vision Transformer (ViT) is a transformer encoder model (similar to BERT). It is pretrained on a large collection of images (ImageNet-1k) at a resolution of 224x224 pixels in a self-supervised way.
Images are fed into the model as a sequence of fixed-size patches (resolution 8x8), which are linearly embedded. A [CLS] token is added at the start of the sequence for classification tasks. Absolute position embeddings are also added before the sequence is fed into the Transformer encoder layers.
Note that this model does not include any fine-tuned heads.
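To make the patch arithmetic concrete: at 224x224 resolution with 8x8 patches, the encoder sees 28 x 28 = 784 patch tokens plus the [CLS] token, giving a sequence length of 785. A minimal sketch of that calculation (assuming the processor's default 224x224 resize):

```python
# Sketch: expected sequence length for dino-vits8 (assumes the default
# 224x224 input resolution and the 8x8 patch size described above).
image_size, patch_size = 224, 8
num_patches = (image_size // patch_size) ** 2  # 28 * 28 = 784
seq_len = num_patches + 1                      # +1 for the [CLS] token
print(seq_len)  # 785, matching outputs.last_hidden_state.shape[1]
```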
Through pre-training, the model learns an internal representation of images. These representations can be used to extract features for downstream tasks. For example, if you have a labeled image dataset, you can train a standard classifier by adding a linear layer on top of the pre-trained encoder. Usually, a linear layer is placed on top of the [CLS] token, as the last hidden state of this token can be regarded as a representation of the entire image.
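As an illustration of that setup, here is a minimal, hypothetical linear-probe sketch in PyTorch: the DINO backbone is frozen and a single linear layer is trained on the [CLS] token. The label count and training loop are placeholders, not part of this card:

```python
import torch
from transformers import ViTModel

# Hypothetical linear probe: frozen DINO backbone + one linear layer
# on the [CLS] token. num_labels is a placeholder for your dataset.
backbone = ViTModel.from_pretrained('facebook/dino-vits8')
for p in backbone.parameters():
    p.requires_grad = False  # keep the pretrained encoder frozen

num_labels = 10  # placeholder: number of classes in your dataset
classifier = torch.nn.Linear(backbone.config.hidden_size, num_labels)

def forward(pixel_values):
    with torch.no_grad():
        hidden = backbone(pixel_values=pixel_values).last_hidden_state
    cls_token = hidden[:, 0]      # last hidden state of the [CLS] token
    return classifier(cls_token)  # logits for the linear classifier
```

Only `classifier` has trainable parameters, so a standard cross-entropy training loop over your labeled dataset completes the probe.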
Intended uses & limitations
You can use the raw model for image classification. Check the model hub to find fine-tuned versions for tasks that interest you.
How to use
Here is how to use this model:
```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Load an example image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained('facebook/dino-vits8')
model = ViTModel.from_pretrained('facebook/dino-vits8')

# Preprocess the image and extract the encoder's hidden states
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
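The pooled image representation is usually taken from the [CLS] token at position 0 of the sequence; for example:

```python
# The [CLS] token (first position) serves as a whole-image representation.
cls_embedding = last_hidden_states[:, 0]  # shape: (batch_size, hidden_size)
```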
BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-2104-14294,
  author        = {Mathilde Caron and Hugo Touvron and Ishan Misra and
                   Herv{\'{e}} J{\'{e}}gou and Julien Mairal and
                   Piotr Bojanowski and Armand Joulin},
  title         = {Emerging Properties in Self-Supervised Vision Transformers},
  journal       = {CoRR},
  volume        = {abs/2104.14294},
  year          = {2021},
  url           = {https://arxiv.org/abs/2104.14294},
  archivePrefix = {arXiv},
  eprint        = {2104.14294},
  timestamp     = {Tue, 04 May 2021 15:12:43 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-2104-14294.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```
📄 License
This model is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (small-sized model, patch size 8) trained using DINO |
| Training Data | ImageNet-1k |