MiDaS 3.1 DPT (Intel/dpt-swinv2-large-384 using SwinV2 backbone)
The DPT model, trained on 1.4 million images, is designed for monocular depth estimation. It was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in the isl-org/MiDaS repository.
🚀 Quick Start
Model Information
- License: MIT
- Tags: vision, depth-estimation
Model Index
- Name: dpt-swinv2-large-384
- Results:
  - Task:
    - Type: monocular-depth-estimation
    - Name: Monocular Depth Estimation
  - Dataset:
    - Type: MIX-6
    - Name: MIX-6
  - Metrics:
    - Type: Zero-shot transfer
    - Value: 10.82
    - Name: Zero-shot transfer
    - Config: Zero-shot transfer
    - Verified: false
Overview of Monocular Depth Estimation
The goal of monocular depth estimation is to infer detailed depth from a single image or camera view. It has applications in fields such as generative AI, 3D reconstruction, and autonomous driving. However, deriving depth from individual pixels in a single image is challenging due to the under-constrained nature of the problem. Recent progress is attributed to learning-based methods, especially MiDaS, which uses dataset mixing and a scale- and shift-invariant loss. MiDaS has evolved, with releases featuring more powerful backbones and lightweight variants for mobile use. With the rise of transformer architectures in computer vision, there has been a shift towards using them for depth estimation. Inspired by this, MiDaS v3.1 incorporates promising transformer-based encoders along with traditional convolutional ones, aiming to comprehensively investigate depth estimation techniques. The paper focuses on integrating these backbones into MiDaS, comparing different v3.1 models, and guiding the use of future backbones with MiDaS.
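As a rough illustration of the scale- and shift-invariant idea mentioned above, the sketch below aligns a relative depth prediction to ground truth with a least-squares scale and shift before measuring the error. It is a simplification, not the actual MiDaS training loss, and the function name `scale_shift_invariant_rmse` is purely illustrative.

```python
import numpy as np

def scale_shift_invariant_rmse(pred, gt):
    """Align `pred` to `gt` with a least-squares scale and shift, then compute RMSE.

    Simplified sketch of the scale- and shift-invariant idea; the real MiDaS
    loss additionally involves masking, trimming, and multi-scale terms.
    """
    pred = pred.ravel().astype(np.float64)
    gt = gt.ravel().astype(np.float64)

    # Solve min_{s,t} || s * pred + t - gt ||^2 via linear least squares.
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)

    aligned = s * pred + t
    return float(np.sqrt(np.mean((aligned - gt) ** 2)))

# A prediction that differs from ground truth only by scale and shift
# has (numerically) zero error after alignment.
gt = np.random.rand(10, 10)
pred = 3.0 * gt + 0.5
print(scale_shift_invariant_rmse(pred, gt))  # ~0.0
```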
The Swin Transformer, introduced in the paper "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (arXiv), can serve as a general-purpose backbone for computer vision. It is a hierarchical Transformer whose shifted windowing scheme limits self-attention computation to non-overlapping local windows while still allowing cross-window connections, bringing greater efficiency. It achieves strong performance on COCO object detection and ADE20K semantic segmentation, surpassing previous models.
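To make the windowing idea concrete, here is a minimal sketch (not the actual Swin implementation; `window_partition` is an illustrative helper) of splitting a feature map into non-overlapping windows, plus the cyclic shift that lets the next block mix information across window borders.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows.

    Returns a tensor of shape (num_windows * B, window_size * window_size, C),
    so self-attention can be computed independently inside each window.
    Simplified sketch; assumes H and W are divisible by window_size.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# A 1x8x8x32 feature map with 4x4 windows -> 4 windows of 16 tokens each.
x = torch.randn(1, 8, 8, 32)
print(window_partition(x, 4).shape)  # torch.Size([4, 16, 32])

# "Shifted" windows in the next block: cyclically roll the map by half a window
# before partitioning, so tokens near window borders can attend across windows.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
print(window_partition(shifted, 4).shape)  # torch.Size([4, 16, 32])
```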
Example: an input image alongside the model's predicted depth image.
Videos
MiDaS Depth Estimation is a machine learning model from Intel Labs for monocular depth estimation. It was trained on up to 12 datasets and covers both indoor and outdoor scenes. Multiple MiDaS models are available, ranging from high-quality depth estimation to lightweight models for mobile downstream tasks (https://github.com/isl-org/MiDaS).
✨ Features
Model Description
This MiDaS 3.1 DPT model uses the SwinV2 transformer as the backbone and takes a different approach to vision compared to BEiT: Swin backbones focus on a hierarchical approach.
The previous release, MiDaS v3.0, only used the vanilla vision transformer ViT. MiDaS v3.1 offers additional models based on BEiT, Swin, SwinV2, Next-ViT, and LeViT.
MiDaS 3.1 DPT Model (Swin backbone)
This model is Intel/dpt-swinv2-large-384, based on the SwinV2 backbone. The arXiv paper compares both the BEiT and Swin backbones; the highest-quality depth estimation is achieved using the BEiT transformer. Variants such as Swin-L, SwinV2-L, SwinV2-B, and SwinV2-T are provided, where the numbers in the model names signify the training resolution (512x512 or 384x384), while the letters L, B, and T denote large, base, and tiny models respectively.
The DPT (Dense Prediction Transformer) model was trained on 1.4 million images for monocular depth estimation. It was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in the isl-org/MiDaS repository.
This model card refers to the SwinV2 variant in the paper, released as dpt-swinv2-large-384. A more recent paper from 2023 that specifically discusses the Swin and SwinV2 backbones is MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation.
The model card was written jointly by the Hugging Face team and Intel.
Property | Details |
---|---|
Model Authors - Company | Intel |
Date | March 18, 2024 |
Version | 1 |
Type | Computer Vision - Monocular Depth Estimation |
Paper or Other Resources | MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation and GitHub Repo |
License | MIT |
Questions or Comments | Community Tab and Intel Developers Discord |
Property | Details |
---|---|
Primary Intended Uses | You can use the raw model for zero-shot monocular depth estimation. See the model hub to look for fine-tuned versions on a task that interests you. |
Primary Intended Users | Anyone doing monocular depth estimation |
Out-of-scope Uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
📦 Installation
Be sure to update PyTorch and Transformers, as version mismatches can generate errors such as: "TypeError: unsupported operand type(s) for //: 'NoneType' and 'NoneType'".
As tested by this contributor, the following versions ran correctly:
```python
import torch
import transformers

print(torch.__version__)         # out: '2.2.1+cpu'
print(transformers.__version__)  # out: '4.37.2'
```
To install:
```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```
💻 Usage Examples
Basic Usage
Here is how to use this model for zero-shot depth estimation on an image:
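The original snippet below picks up from a `prediction` tensor, so here is a hedged sketch of the steps that produce it. It assumes the `DPTImageProcessor` and `DPTForDepthEstimation` classes from Transformers and reuses the sample COCO image URL from the pipeline example further down; adapt the URL to your own image.

```python
import requests
import numpy as np
import torch
from PIL import Image
from transformers import DPTForDepthEstimation, DPTImageProcessor

# Load a sample image (same COCO image as in the pipeline example below).
url = "http://images.cocodataset.org/val2017/000000181816.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-swinv2-large-384")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-swinv2-large-384")

# Prepare the image and run inference.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# Interpolate the predicted depth back to the original image size.
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)
```

The original post-processing code below then converts `prediction` into an 8-bit depth image (the `np` and `Image` imports come from the sketch above):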
```python
# Convert the prediction to an 8-bit grayscale depth image for visualization.
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth
```
Advanced Usage
One can use the pipeline API:
```python
from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="Intel/dpt-swinv2-large-384")
result = pipe("http://images.cocodataset.org/val2017/000000181816.jpg")
result["depth"]
```
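The depth-estimation pipeline returns the rendered depth map as a PIL image under the "depth" key (and the raw tensor under "predicted_depth"), so saving it to disk should look roughly like the snippet below; the output path depth.png is just a placeholder.

```python
# `result["depth"]` is a PIL image; `result["predicted_depth"]` is the raw tensor.
result["depth"].save("depth.png")  # "depth.png" is an illustrative output path
```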
📚 Documentation
Quantitative Analyses
Model | Square Resolution HRWSI RMSE | Square Resolution Blended MVS REL | Square Resolution ReDWeb RMSE |
---|---|---|---|
BEiT 384-L | 0.068 | 0.070 | 0.076 |
Swin-L Training 1 | 0.0708 | 0.0724 | 0.0826 |
Swin-L Training 2 | 0.0713 | 0.0720 | 0.0831 |
ViT-L | 0.071 | 0.072 | 0.082 |
--- | --- | --- | --- |
Next-ViT-L-1K-6M | 0.075 | 0.073 | 0.085 |
DeiT3-L-22K-1K | 0.070 | 0.070 | 0.080 |
ViT-L Hybrid | 0.075 | 0.075 | 0.085 |
DeiT3-L | 0.077 | 0.075 | 0.087 |
--- | --- | --- | --- |
ConvNeXt-XL | 0.075 | 0.075 | 0.085 |
ConvNeXt-L | 0.076 | 0.076 | 0.087 |
EfficientNet-L2 | 0.165 | 0.277 | 0.219 |
--- | --- | --- | --- |
ViT-L Reversed | 0.071 | 0.073 | 0.081 |
Swin-L Equidistant | 0.072 | 0.074 | 0.083 |
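For orientation, RMSE (root-mean-square error) and REL (mean absolute relative error) are the standard depth-evaluation metrics reported above, typically computed after the relative prediction has been aligned to the ground truth. A minimal sketch of the two formulas, assuming already-aligned depth arrays `pred` and `gt`:

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error between two depth maps."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return float(np.mean(np.abs(pred - gt) / gt))

# Toy example with synthetic data (ground truth kept away from zero).
gt = np.random.rand(384, 384) + 0.1
pred = gt + 0.01 * np.random.randn(384, 384)
print(rmse(pred, gt), abs_rel(pred, gt))
```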
Ethical Considerations and Limitations
dpt-swinv2-large-384 can produce factually incorrect output and should not be relied on to produce factually accurate information. Due to the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.
Therefore, before deploying any applications of dpt-swinv2-large-384, developers should perform safety testing.
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-2307-14460,
  author     = {Reiner Birkl and Diana Wofk and Matthias M{\"{u}}ller},
  title      = {MiDaS v3.1 -- {A} Model Zoo for Robust Monocular Relative Depth Estimation},
  journal    = {CoRR},
  volume     = {abs/2307.14460},
  year       = {2023},
  url        = {https://arxiv.org/abs/2307.14460},
  eprinttype = {arXiv},
  eprint     = {2307.14460},
  timestamp  = {Wed, 26 Jul 2023},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2307-14460.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
📄 License
This project is licensed under the MIT license.