DPT-BEiT-Large-512 Open-Source Model - Accurately Infer Fine Depth Information from a Single Image

DPT-BEiT-Large-512

Developed by Intel

A monocular depth estimation model based on BEiT Transformer, capable of inferring fine depth information from a single image

3D Vision

Transformers

Open Source License: MIT

Tags: Zero-shot depth estimation, BEiT backbone network, High-precision depth map

Downloads 2,794

Release Time : 11/28/2023

Model Overview

This DPT model uses the BEiT model as its backbone network, with added neck and head structures for monocular depth estimation, applicable in fields such as generative AI, 3D reconstruction, and autonomous driving.

Model Features

High-quality depth estimation

Uses a BEiT Transformer backbone; in the Quantitative Analyses table below, the BEiT 384-L variant has the lowest or tied-lowest error in every column

Multi-resolution support

Offers variants like BEiT512-L, BEiT384-L, and BEiT384-B, supporting different training resolutions

Zero-shot transfer capability

Features zero-shot transfer capability with a metric value of 10.82

Model Capabilities

Monocular depth estimation

Image depth information inference

Zero-shot transfer

Use Cases

Computer vision

3D reconstruction

Infers depth information from a single image for 3D scene reconstruction

Autonomous driving

Provides environmental depth perception for autonomous driving systems

Generative AI

Supplies depth information as input for generative AI models

🚀 DPT-BEiT-Large-512: Monocular Depth Estimation Model

This model focuses on monocular depth estimation, aiming to infer detailed depth from a single image. It has wide applications in generative AI, 3D reconstruction, and autonomous driving.

🚀 Quick Start

Prerequisites

Be sure to update PyTorch and Transformers, as version mismatches can cause errors such as: "TypeError: unsupported operand type(s) for //: 'NoneType' and 'NoneType'". As tested by this contributor, the following versions ran correctly:

import torch
import transformers

print(torch.__version__)         # tested: 2.2.1+cpu
print(transformers.__version__)  # tested: 4.37.2
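
A version check along these lines can catch a mismatch before the model is loaded. This is only a minimal sketch: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and treating the tested versions above as minimums is an assumption rather than a documented requirement.

```python
import re
from importlib import metadata

# Versions reported as working above; using them as minimums is an assumption.
TESTED_VERSIONS = {"torch": "2.2.1", "transformers": "4.37.2"}


def parse_version(version: str) -> tuple:
    """Turn '2.2.1+cpu' into (2, 2, 1), ignoring local tags such as '+cpu'."""
    public = version.split("+", 1)[0]
    numbers = []
    for part in public.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break  # stop at non-numeric parts such as 'dev0'
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release numbers after padding both to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for package, minimum in TESTED_VERSIONS.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = None
    return report


if __name__ == "__main__":
    print(check_environment())
```

A `False` or `None` entry in the report suggests upgrading or installing that package before running the examples.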

Installation

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip3 install transformers pillow requests numpy

Usage

Zero-shot Depth Estimation

from transformers import DPTImageProcessor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-beit-large-512")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-beit-large-512")

# prepare image for the model
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size (PIL's image.size is (width, height); interpolate expects (height, width))
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth
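
Two details of the post-processing above are easy to get wrong: PIL reports `image.size` as (width, height) while `interpolate` expects (height, width), and the 8-bit conversion divides by the maximum only. The sketch below isolates both steps using NumPy alone. The helper names `pil_size_to_hw` and `depth_to_uint8` are hypothetical, and the min-max scaling is a swapped-in alternative to the max-only scaling in the snippet, not the model card's method.

```python
import numpy as np


def pil_size_to_hw(size):
    """PIL gives (width, height); torch's interpolate wants (height, width)."""
    width, height = size
    return (height, width)


def depth_to_uint8(depth) -> np.ndarray:
    """Min-max scale a relative depth map into the 0..255 range."""
    depth = np.asarray(depth, dtype=np.float64)
    lo, hi = depth.min(), depth.max()
    if hi - lo < 1e-12:
        # A constant map has no contrast; return zeros instead of dividing by 0.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8)
```

Min-max scaling uses the full 8-bit range even when the smallest predicted value is far from zero, at the cost of discarding the absolute offset of the prediction.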

Using the Pipeline API

from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="Intel/dpt-beit-large-512")
result = pipe("http://images.cocodataset.org/val2017/000000181816.jpg")
result["depth"]
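
Outside a notebook, `result["depth"]` has to be written to disk to be inspected. The following is a hedged sketch of one way to do that. It assumes the pipeline result is a dict whose "depth" entry is a PIL image, as the snippet above implies, and that image references are URLs or forward-slash paths. The helper names and the `depth_maps` directory are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def depth_output_path(image_ref: str, out_dir: str = "depth_maps") -> Path:
    """Derive '<out_dir>/<image stem>_depth.png' from a URL or local path."""
    stem = PurePosixPath(urlparse(image_ref).path).stem
    return Path(out_dir) / f"{stem}_depth.png"


def save_depth(result: dict, image_ref: str, out_dir: str = "depth_maps") -> Path:
    """Save result['depth'] (any object with a PIL-style .save) and return the path."""
    target = depth_output_path(image_ref, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    result["depth"].save(target)
    return target
```

With the pipeline above, `save_depth(pipe(url), url)` would write the depth map next to a name derived from the input image.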

✨ Features

Transformer Backbone: Uses the BEiT model as backbone, combined with a neck + head for monocular depth estimation.
Large-scale Training: Trained on 1.4 million images for monocular depth estimation.
Multiple Variants: Offers variants such as BEiT512-L, BEiT384-L, and BEiT384-B.

📚 Documentation

Model Overview

Monocular depth estimation aims to infer detailed depth from a single image or camera view. This DPT model uses the BEiT model as backbone and adds a neck + head on top for monocular depth estimation.

Model Details

Property	Details
Model Type	Computer Vision - Monocular Depth Estimation
Model Authors - Company	Intel
Date	March 7, 2024
Version	1
Paper or Other Resources	MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation and GitHub Repo
License	MIT
Questions or Comments	Community Tab and Intel Developers Discord

Intended Use

Intended Use	Description
Primary intended uses	You can use the raw model for zero-shot monocular depth estimation. See the model hub to look for fine-tuned versions on a task that interests you.
Primary intended users	Anyone doing monocular depth estimation
Out-of-scope uses	This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people.

Quantitative Analyses

Model	HRWSI RMSE (square resolution)	BlendedMVS REL (square resolution)	ReDWeb RMSE (square resolution)
BEiT 384-L	0.068	0.070	0.076
Swin-L Training 1	0.0708	0.0724	0.0826
Swin-L Training 2	0.0713	0.0720	0.0831
ViT-L	0.071	0.072	0.082
---	---	---	---
Next-ViT-L-1K-6M	0.075	0.073	0.085
DeiT3-L-22K-1K	0.070	0.070	0.080
ViT-L-Hybrid	0.075	0.075	0.085
DeiT3-L	0.077	0.075	0.087
---	---	---	---
ConvNeXt-XL	0.075	0.075	0.085
ConvNeXt-L	0.076	0.076	0.087
EfficientNet-L2	0.165	0.277	0.219
---	---	---	---
ViT-L Reversed	0.071	0.073	0.081
Swin-L Equidistant	0.072	0.074	0.083
---	---	---	---
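
For readers unfamiliar with the column headings, RMSE and REL are commonly defined as the root-mean-square error and the mean absolute relative error between predicted and reference depth. The sketch below uses those generic definitions in plain Python. It is an assumption that they match the evaluation protocol behind the table, which may additionally align scale and shift or mask invalid pixels.

```python
import math


def _check(pred, target):
    """Both sequences must be non-empty and of equal length."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")


def rmse(pred, target) -> float:
    """Root-mean-square error between predicted and reference values."""
    _check(pred, target)
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))


def abs_rel(pred, target) -> float:
    """Mean of |pred - target| / target; reference values must be positive."""
    _check(pred, target)
    if any(t <= 0 for t in target):
        raise ValueError("reference depths must be positive")
    return sum(abs(p - t) / t for p, t in zip(pred, target)) / len(pred)
```

Lower is better for both metrics, which is how the table rows should be read.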

🔧 Technical Details

The DPT architecture was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in the accompanying repository. This model card refers specifically to the BEiT512-L variant described in the MiDaS v3.1 paper listed under Model Details, published as dpt-beit-large-512.

📄 License

This model is licensed under the MIT license.

⚠️ Important Note

dpt-beit-large-512 can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs. Therefore, before deploying any applications of dpt-beit-large-512, developers should perform safety testing.

💡 Usage Tip

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Here are a couple of useful links to learn more about Intel's AI software:

Intel Neural Compressor
Intel Extension for Transformers

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-2307-14460,
  author    = {Reiner Birkl and Diana Wofk and Matthias M{\"u}ller},
  title     = {MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation},
  journal   = {CoRR},
  volume    = {abs/2307.14460},
  year      = {2023},
  url       = {https://arxiv.org/abs/2307.14460},
  eprinttype = {arXiv},
  eprint    = {2307.14460},
  timestamp = {Wed, 26 Jul 2023},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2307-14460.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
