DINOv2 with Registers Base
A Vision Transformer model trained with DINOv2, augmented with register tokens to clean up attention maps and improve feature extraction.
Downloads 22.74k
Release Time: 12/20/2024
Model Overview
This is the base-sized Vision Transformer (ViT) with registers, trained using the DINOv2 self-supervised method. It extracts high-quality feature representations from images that can serve a wide range of computer vision tasks.
Model Features
Register mechanism
Adds dedicated register tokens that absorb global computation, eliminating artifacts in attention maps and yielding cleaner attention distributions
Self-supervised learning
Trained using the DINOv2 method, capable of learning meaningful image feature representations without labeled data
Attention optimization
The improved attention mechanism produces more interpretable attention maps, which helps in understanding the model's decision process
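The register mechanism above can be sketched in a few lines: register tokens are concatenated to the [CLS] and patch tokens, participate in self-attention like any other token, and are simply discarded at the output. This is a minimal single-head sketch with identity projections (no learned weights); the dimensions and the choice of 4 registers mirror the base model per the registers approach, but are otherwise illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vit_block_with_registers(patch_tokens, cls_token, registers):
    """One self-attention pass over [CLS] + patch + register tokens.

    Registers attend and are attended to, soaking up global computation
    that would otherwise show up as artifacts in patch attention maps.
    Sketch only: identity Q/K/V projections, single head, no MLP.
    """
    x = np.concatenate([cls_token, patch_tokens, registers], axis=0)  # (1+N+R, D)
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))   # (1+N+R, 1+N+R) attention weights
    out = attn @ x
    n = patch_tokens.shape[0]
    # keep [CLS] and patch outputs; register outputs are discarded
    return out[:1], out[1:1 + n]

rng = np.random.default_rng(0)
D, N, R = 16, 9, 4  # toy embed dim and patch count; 4 registers as in the base model
cls_out, patch_out = vit_block_with_registers(
    rng.standard_normal((N, D)),
    rng.standard_normal((1, D)),
    rng.standard_normal((R, D)))
print(cls_out.shape, patch_out.shape)  # (1, 16) (9, 16)
```

Note that the register outputs never leave the block: they exist only to give the attention mechanism a place to park global information.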
Model Capabilities
Image feature extraction
Self-supervised learning
Foundation model for computer vision tasks
Use Cases
Computer vision
Image classification
Serves as a backbone; adding a classification head on top of its features enables image classification
Object detection
Extracted image features can be used for object detection tasks
Image similarity calculation
Uses extracted feature vectors to compute similarity between images
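The similarity use case typically reduces to cosine similarity between pooled feature vectors. A minimal sketch, assuming 768-dimensional features (the base variant's embedding size) and using random stand-in vectors in place of real model outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in features; in practice these would be the model's pooled
# outputs (768-dim for the base variant).
rng = np.random.default_rng(1)
f1 = rng.standard_normal(768)
f2 = f1 + 0.1 * rng.standard_normal(768)  # simulates a near-duplicate image
f3 = rng.standard_normal(768)             # simulates an unrelated image

print(cosine_similarity(f1, f2) > cosine_similarity(f1, f3))  # True
```

Because self-supervised DINOv2 features cluster semantically similar images, nearest-neighbor search over such vectors works well for retrieval and deduplication.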