MiDaS 3.1 DPT (Intel/dpt-swinv2-large-384 using SwinV2 backbone)
The DPT model, trained on 1.4 million images, is designed for monocular depth estimation. It was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in the isl-org/MiDaS repository.
🚀 Quick Start
Model Information
- License: MIT
- Tags: vision, depth-estimation
Model Index
- Name: dpt-swinv2-large-384
- Results:
  - Task:
    - Type: monocular-depth-estimation
    - Name: Monocular Depth Estimation
  - Dataset:
    - Type: MIX-6
    - Name: MIX-6
  - Metrics:
    - Type: Zero-shot transfer
    - Value: 10.82
    - Name: Zero-shot transfer
    - Config: Zero-shot transfer
    - Verified: false
Overview of Monocular Depth Estimation
The goal of monocular depth estimation is to infer detailed depth from a single image or camera view. It has applications in fields such as generative AI, 3D reconstruction, and autonomous driving. However, deriving depth from individual pixels in a single image is challenging due to the under-constrained nature of the problem. Recent progress is attributed to learning-based methods, especially MiDaS, which uses dataset mixing and a scale- and shift-invariant loss. MiDaS has evolved, with releases featuring more powerful backbones and lightweight variants for mobile use. With the rise of transformer architectures in computer vision, there has been a shift towards using them for depth estimation. Inspired by this, MiDaS v3.1 incorporates promising transformer-based encoders along with traditional convolutional ones, aiming to comprehensively investigate depth estimation techniques. The paper focuses on integrating these backbones into MiDaS, comparing different v3.1 models, and guiding the use of future backbones with MiDaS.
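As a rough illustration of the scale- and shift-invariant idea mentioned above, the sketch below aligns a relative depth prediction to ground truth with a least-squares scale and shift before measuring the error. It is a simplification, not the actual MiDaS training loss, and the function name `scale_shift_invariant_rmse` is purely illustrative.

```python
import numpy as np

def scale_shift_invariant_rmse(pred, gt):
    """Align `pred` to `gt` with a least-squares scale and shift, then compute RMSE.

    Simplified sketch of the scale- and shift-invariant idea; the real MiDaS
    loss additionally involves masking, trimming, and multi-scale terms.
    """
    pred = pred.ravel().astype(np.float64)
    gt = gt.ravel().astype(np.float64)

    # Solve min_{s,t} || s * pred + t - gt ||^2 via linear least squares.
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)

    aligned = s * pred + t
    return float(np.sqrt(np.mean((aligned - gt) ** 2)))

# A prediction that differs from ground truth only by scale and shift
# has (numerically) zero error after alignment.
gt = np.random.rand(10, 10)
pred = 3.0 * gt + 0.5
print(scale_shift_invariant_rmse(pred, gt))  # ~0.0
```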
The Swin Transformer, introduced in the paper "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (arXiv), can serve as a general-purpose backbone for computer vision. It is a hierarchical Transformer whose shifted windowing scheme limits self-attention computation to non-overlapping local windows while still allowing cross-window connections, bringing greater efficiency. It achieves strong performance on COCO object detection and ADE20K semantic segmentation, surpassing previous models.
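To make the windowing idea concrete, here is a minimal sketch (not the actual Swin implementation; `window_partition` is an illustrative helper) of splitting a feature map into non-overlapping windows, plus the cyclic shift that lets the next block mix information across window borders.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows.

    Returns a tensor of shape (num_windows * B, window_size * window_size, C),
    so self-attention can be computed independently inside each window.
    Simplified sketch; assumes H and W are divisible by window_size.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# A 1x8x8x32 feature map with 4x4 windows -> 4 windows of 16 tokens each.
x = torch.randn(1, 8, 8, 32)
print(window_partition(x, 4).shape)  # torch.Size([4, 16, 32])

# "Shifted" windows in the next block: cyclically roll the map by half a window
# before partitioning, so tokens near window borders can attend across windows.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
print(window_partition(shifted, 4).shape)  # torch.Size([4, 16, 32])
```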
Example: an input image alongside the model's predicted depth image.
Videos
MiDaS Depth Estimation is a machine learning model from Intel Labs for monocular depth estimation. It was trained on up to 12 datasets and covers both indoor and outdoor scenes. Multiple MiDaS models are available, ranging from high-quality depth estimation to lightweight models for mobile downstream tasks (https://github.com/isl-org/MiDaS).
✨ Features
Model Description
This MiDaS 3.1 DPT model uses the SwinV2 transformer as the backbone and takes a different approach to vision compared to BEiT: Swin backbones focus on a hierarchical approach.
The previous release, MiDaS v3.0, only used the vanilla vision transformer ViT. MiDaS v3.1 offers additional models based on BEiT, Swin, SwinV2, Next-ViT, and LeViT.
MiDaS 3.1 DPT Model (Swin backbone)
This model is Intel/dpt-swinv2-large-384, based on the SwinV2 backbone. The arXiv paper compares both the BEiT and Swin backbones; the highest-quality depth estimation is achieved using the BEiT transformer. Variants such as Swin-L, SwinV2-L, SwinV2-B, and SwinV2-T are provided, where the numbers in the model names signify the training resolution (512x512 or 384x384), while the letters L, B, and T denote large, base, and tiny models respectively.
The DPT (Dense Prediction Transformer) model was trained on 1.4 million images for monocular depth estimation. It was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in the isl-org/MiDaS repository.
This model card refers to the SwinV2 variant in the paper, released as dpt-swinv2-large-384. A more recent paper from 2023 that specifically discusses the Swin and SwinV2 backbones is MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation.
The model card was written jointly by the Hugging Face team and Intel.
Property | Details |
---|---|
Model Authors - Company | Intel |
Date | March 18, 2024 |
Version | 1 |
Type | Computer Vision - Monocular Depth Estimation |
Paper or Other Resources | MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation and GitHub Repo |
License | MIT |
Questions or Comments | Community Tab and Intel Developers Discord |
Property | Details |
---|---|
Primary Intended Uses | You can use the raw model for zero-shot monocular depth estimation. See the model hub to look for fine-tuned versions on a task that interests you. |
Primary Intended Users | Anyone doing monocular depth estimation |
Out-of-scope Uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
📦 Installation
Be sure to update PyTorch and Transformers, as version mismatches can generate errors such as: "TypeError: unsupported operand type(s) for //: 'NoneType' and 'NoneType'".
As tested by this contributor, the following versions ran correctly:
```python
import torch
import transformers

print(torch.__version__)         # out: '2.2.1+cpu'
print(transformers.__version__)  # out: '4.37.2'
```
To install:
```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```
💻 Usage Examples
Basic Usage
Here is how to use this model for zero-shot depth estimation on an image:
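The original snippet below picks up from a `prediction` tensor, so here is a hedged sketch of the steps that produce it. It assumes the `DPTImageProcessor` and `DPTForDepthEstimation` classes from Transformers and reuses the sample COCO image URL from the pipeline example further down; adapt the URL to your own image.

```python
import requests
import numpy as np
import torch
from PIL import Image
from transformers import DPTForDepthEstimation, DPTImageProcessor

# Load a sample image (same COCO image as in the pipeline example below).
url = "http://images.cocodataset.org/val2017/000000181816.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-swinv2-large-384")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-swinv2-large-384")

# Prepare the image and run inference.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# Interpolate the predicted depth back to the original image size.
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)
```

The original post-processing code below then converts `prediction` into an 8-bit depth image (the `np` and `Image` imports come from the sketch above):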
```python
# Convert the prediction to an 8-bit grayscale depth image for visualization.
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth
```
Advanced Usage
One can use the pipeline API:
```python
from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="Intel/dpt-swinv2-large-384")
result = pipe("http://images.cocodataset.org/val2017/000000181816.jpg")
result["depth"]
```
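The depth-estimation pipeline returns the rendered depth map as a PIL image under the "depth" key (and the raw tensor under "predicted_depth"), so saving it to disk should look roughly like the snippet below; the output path depth.png is just a placeholder.

```python
# `result["depth"]` is a PIL image; `result["predicted_depth"]` is the raw tensor.
result["depth"].save("depth.png")  # "depth.png" is an illustrative output path
```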
📚 Documentation
Quantitative Analyses
Model | Square Resolution HRWSI RMSE | Square Resolution Blended MVS REL | Square Resolution ReDWeb RMSE |
---|---|---|---|
BEiT 384-L | 0.068 | 0.070 | 0.076 |
Swin-L Training 1 | 0.0708 | 0.0724 | 0.0826 |
Swin-L Training 2 | 0.0713 | 0.0720 | 0.0831 |
ViT-L | 0.071 | 0.072 | 0.082 |
--- | --- | --- | --- |
Next-ViT-L-1K-6M | 0.075 | 0.073 | 0.085 |
DeiT3-L-22K-1K | 0.070 | 0.070 | 0.080 |
ViT-L Hybrid | 0.075 | 0.075 | 0.085 |
DeiT3-L | 0.077 | 0.075 | 0.087 |
--- | --- | --- | --- |
ConvNeXt-XL | 0.075 | 0.075 | 0.085 |
ConvNeXt-L | 0.076 | 0.076 | 0.087 |
EfficientNet-L2 | 0.165 | 0.277 | 0.219 |
--- | --- | --- | --- |
ViT-L Reversed | 0.071 | 0.073 | 0.081 |
Swin-L Equidistant | 0.072 | 0.074 | 0.083 |
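For orientation, RMSE (root-mean-square error) and REL (mean absolute relative error) are the standard depth-evaluation metrics reported above, typically computed after the relative prediction has been aligned to the ground truth. A minimal sketch of the two formulas, assuming already-aligned depth arrays `pred` and `gt`:

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error between two depth maps."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return float(np.mean(np.abs(pred - gt) / gt))

# Toy example with synthetic data (ground truth kept away from zero).
gt = np.random.rand(384, 384) + 0.1
pred = gt + 0.01 * np.random.randn(384, 384)
print(rmse(pred, gt), abs_rel(pred, gt))
```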
Ethical Considerations and Limitations
dpt-swinv2-large-384 can produce factually incorrect output and should not be relied on to produce factually accurate information. Due to the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.
Therefore, before deploying any applications of dpt-swinv2-large-384, developers should perform safety testing.
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-2307-14460,
  author     = {Reiner Birkl and Diana Wofk and Matthias M{\"{u}}ller},
  title      = {MiDaS v3.1 -- {A} Model Zoo for Robust Monocular Relative Depth Estimation},
  journal    = {CoRR},
  volume     = {abs/2307.14460},
  year       = {2023},
  url        = {https://arxiv.org/abs/2307.14460},
  eprinttype = {arXiv},
  eprint     = {2307.14460},
  timestamp  = {Wed, 26 Jul 2023},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2307-14460.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
📄 License
This project is licensed under the MIT license.