DPT-BEiT-Large-512: Monocular Depth Estimation Model
This model focuses on monocular depth estimation, aiming to infer detailed depth from a single image. It has wide applications in generative AI, 3D reconstruction, and autonomous driving.
Quick Start
Prerequisites
Update PyTorch and Transformers before running the examples. Version mismatches can cause errors such as "TypeError: unsupported operand type(s) for //: 'NoneType' and 'NoneType'". The contributor tested the following versions, which ran correctly:
import torch
import transformers
print(torch.__version__)
print(transformers.__version__)
out: '2.2.1+cpu'
out: '4.37.2'
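The check above can also be scripted. The following is a minimal sketch that assumes the tested versions can be treated as a lower bound; the source only states that these versions ran correctly, not that older ones fail. The names `TESTED`, `parse_release`, `is_at_least`, and `check_versions` are hypothetical, and only `importlib.metadata` from the standard library is used.

```python
from importlib import metadata

# Versions reported as working in this model card (treated here as a minimum: an assumption).
TESTED = {"torch": "2.2.1", "transformers": "4.37.2"}


def parse_release(version):
    """Turn a version string such as '2.2.1+cpu' into a tuple like (2, 2, 1)."""
    core = version.split("+")[0]  # drop local build tags such as '+cpu'
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev2024'
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed, tested):
    """Return True if the installed release is not older than the tested one."""
    return parse_release(installed) >= parse_release(tested)


def check_versions():
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, tested in TESTED.items():
        try:
            report[name] = is_at_least(metadata.version(name), tested)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_versions().items():
        print(name, "not installed" if ok is None else ("ok" if ok else "older than tested"))
```

A `False` entry is only a hint to upgrade, not proof that the model will fail.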
Installation
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Usage
Zero-shot Depth Estimation
from transformers import DPTImageProcessor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
processor = DPTImageProcessor.from_pretrained("Intel/dpt-beit-large-512")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-beit-large-512")
# prepare image for the model
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth
# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)
# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth
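The visualization step divides by the maximum value, which assumes the depth map has a positive maximum. A minimal sketch of a min-max variant that also handles a constant map is shown below. `to_uint8` is a hypothetical helper, not part of Transformers, and the choice to return zeros for a constant map is an assumption made for illustration.

```python
import numpy as np


def to_uint8(depth_map):
    """Rescale a 2-D array of relative depth values to the 0-255 range."""
    depth_map = np.asarray(depth_map, dtype=np.float64)
    lo, hi = depth_map.min(), depth_map.max()
    if hi == lo:
        # A constant map has no contrast; return zeros rather than divide by zero.
        return np.zeros(depth_map.shape, dtype=np.uint8)
    scaled = (depth_map - lo) / (hi - lo)  # values now lie in [0, 1]
    return np.round(scaled * 255).astype(np.uint8)


# Tiny stand-in for the `output` array produced in the example above.
example = np.array([[2.0, 4.0], [6.0, 10.0]])
print(to_uint8(example))
```

In the example above, `to_uint8(output)` could replace the `formatted = ...` line before the call to `Image.fromarray`.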
Using the Pipeline API
from transformers import pipeline
pipe = pipeline(task="depth-estimation", model="Intel/dpt-beit-large-512")
result = pipe("http://images.cocodataset.org/val2017/000000181816.jpg")
result["depth"]
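For several images, the pipeline call can be wrapped in a small loop that writes each depth map to disk. The sketch below relies only on what the example above shows, namely that the result exposes an image under the `"depth"` key. `save_depth_maps`, the `depth_out` directory, and the file naming scheme are hypothetical.

```python
from pathlib import Path


def save_depth_maps(pipe, sources, out_dir):
    """Run `pipe` on each source and save result["depth"] as a PNG file.

    `pipe` is any callable returning a dict whose "depth" entry has a
    `save(path)` method, such as the depth-estimation pipeline above.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, source in enumerate(sources):
        result = pipe(source)
        target = out_dir / f"depth_{index:03d}.png"
        result["depth"].save(target)
        saved.append(target)
    return saved


# Usage with the pipeline created above (downloads the model on first run):
# pipe = pipeline(task="depth-estimation", model="Intel/dpt-beit-large-512")
# save_depth_maps(pipe, ["http://images.cocodataset.org/val2017/000000181816.jpg"], "depth_out")
```

Passing the pipeline in as an argument keeps the helper independent of how the model is loaded.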
Features
- Transformer Backbone: Uses the BEiT model as backbone, combined with a neck + head for monocular depth estimation.
- Large-scale Training: Trained on 1.4 million images for monocular depth estimation.
- Multiple Variants: Offers variants such as BEiT512-L, BEiT384-L, and BEiT384-B.
Documentation
Model Overview
Monocular depth estimation aims to infer detailed depth from a single image or camera view. This DPT model uses the BEiT model as backbone and adds a neck + head on top for monocular depth estimation.
Model Details
Property | Details |
---|---|
Model Type | Computer Vision - Monocular Depth Estimation |
Model Authors - Company | Intel |
Date | March 7, 2024 |
Version | 1 |
Paper or Other Resources | MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation and GitHub Repo |
License | MIT |
Questions or Comments | Community Tab and Intel Developers Discord |
Intended Use
Intended Use | Description |
---|---|
Primary intended uses | You can use the raw model for zero-shot monocular depth estimation. See the model hub to look for fine-tuned versions on a task that interests you. |
Primary intended users | Anyone doing monocular depth estimation |
Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
Quantitative Analyses
Model | Square Resolution HRWSI RMSE | Square Resolution Blended MVS REL | Square Resolution ReDWeb RMSE |
---|---|---|---|
BEiT 384-L | 0.068 | 0.070 | 0.076 |
Swin-L Training 1 | 0.0708 | 0.0724 | 0.0826 |
Swin-L Training 2 | 0.0713 | 0.0720 | 0.0831 |
ViT-L | 0.071 | 0.072 | 0.082 |
Next-ViT-L-1K-6M | 0.075 | 0.073 | 0.085 |
DeiT3-L-22K-1K | 0.070 | 0.070 | 0.080 |
ViT-L-Hybrid | 0.075 | 0.075 | 0.085 |
DeiT3-L | 0.077 | 0.075 | 0.087 |
ConvNeXt-XL | 0.075 | 0.075 | 0.085 |
ConvNeXt-L | 0.076 | 0.076 | 0.087 |
EfficientNet-L2 | 0.165 | 0.277 | 0.219 |
ViT-L Reversed | 0.071 | 0.073 | 0.081 |
Swin-L Equidistant | 0.072 | 0.074 | 0.083 |
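The table reports RMSE and REL, which this model card does not define. The sketch below uses the common definitions (root mean squared error and mean absolute relative error) as an assumption. The paper's exact evaluation protocol, including any alignment of predictions to ground truth, may differ. The function names `rmse` and `abs_rel` are hypothetical.

```python
import math


def _check(pred, target):
    """Validate that both sequences are non-empty and of equal length."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")


def rmse(pred, target):
    """Root mean squared error over paired values."""
    _check(pred, target)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def abs_rel(pred, target):
    """Mean absolute relative error; target values must be non-zero."""
    _check(pred, target)
    return sum(abs(p - t) / abs(t) for p, t in zip(pred, target)) / len(pred)


# Toy values for illustration only, not taken from any dataset.
predicted = [2.0, 4.0]
ground_truth = [1.0, 5.0]
print(rmse(predicted, ground_truth), abs_rel(predicted, ground_truth))
```

Lower values are better for both metrics.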
Technical Details
This DPT model was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in this repository. This model card refers specifically to the BEiT512-L variant described in the paper, released as dpt-beit-large-512.
License
This model is licensed under the MIT license.
Important Note
dpt-beit-large-512 can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs. Therefore, before deploying any applications of dpt-beit-large-512, developers should perform safety testing.
Usage Tip
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Here are a couple of useful links to learn more about Intel's AI software:
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-2307-14460,
author = {Reiner Birkl and Diana Wofk and Matthias M{\"{u}}ller},
title = {MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation},
journal = {CoRR},
volume = {abs/2307.14460},
year = {2023},
url = {https://arxiv.org/abs/2307.14460},
eprinttype = {arXiv},
eprint = {2307.14460},
timestamp = {Wed, 26 Jul 2023},
biburl = {https://dblp.org/rec/journals/corr/abs-2307.14460.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}




