DepthPro-hf Open-Source Depth Estimation Model - Free Generation of High-Resolution and High-Accuracy Depth Maps

DepthPro-hf

Developed by Apple

DepthPro is a foundational model for zero-shot metric monocular depth estimation, capable of generating high-resolution, high-precision depth maps.

3D Vision

Transformers

#Zero-shot depth estimation #High-resolution depth maps #Metric scale prediction

Downloads 13.96k

Release Time : 11/27/2024

Model Overview

DepthPro is a multi-scale Vision Transformer (ViT)-based model specifically designed for monocular depth estimation tasks. It can generate high-resolution depth maps with exceptional clarity and fine-grained details, and the predictions are metric with absolute scale.

Model Features

High-resolution depth maps

Capable of generating high-resolution depth maps with exceptional clarity and fine-grained details.

Zero-shot metric

Predictions are metric with absolute scale, without relying on metadata such as camera intrinsics.

Fast processing

Generates a 2.25-megapixel depth map in 0.3 seconds on a standard GPU.

Multi-scale Vision Transformer

Employs a multi-scale Vision Transformer (ViT) architecture that combines a shared DINOv2 encoder with a DPT-like fusion stage.

Model Capabilities

Monocular depth estimation

High-resolution depth map generation

Zero-shot metric

Fast processing

Use Cases

Computer vision

Scene depth estimation

Used to estimate per-pixel depth for a scene from a single image.

Generates high-resolution, high-precision depth maps.

3D reconstruction

Used to reconstruct 3D scenes from a single image.

Provides accurate depth information to assist 3D modeling.

🚀 DepthPro: Monocular Depth Estimation

DepthPro is a foundation model for zero-shot metric monocular depth estimation, capable of generating sharp, high-resolution depth maps with fine-grained details.

🚀 Quick Start

Use the code below to get started with the model.

Basic Usage

import requests
from PIL import Image
import torch
from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

url = 'https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg'
image = Image.open(requests.get(url, stream=True).raw)

image_processor = DepthProImageProcessorFast.from_pretrained("apple/DepthPro-hf")
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf").to(device)

inputs = image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

post_processed_output = image_processor.post_process_depth_estimation(
    outputs, target_sizes=[(image.height, image.width)],
)

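# Camera parameters estimated by the model for this image.
field_of_view = post_processed_output[0]["field_of_view"]
focal_length = post_processed_output[0]["focal_length"]

# Predicted depth map, resized to the original image size.
depth = post_processed_output[0]["predicted_depth"]

# For visualization only: min-max normalize to [0, 255] and save as an 8-bit image.
# This discards the metric scale, so keep the raw tensor if absolute depth is needed.
depth = (depth - depth.min()) / (depth.max() - depth.min())
depth = depth * 255.
depth = depth.detach().cpu().numpy()
depth = Image.fromarray(depth.astype("uint8"))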
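
Because the predictions are metric and the post-processed output includes a focal length estimate, each pixel can in principle be back-projected to a 3D point with the pinhole camera model. The following is a minimal sketch and not part of the original model card: the helper name `backproject_to_points` is hypothetical, and it assumes that the focal length is expressed in pixels, that the principal point lies at the image center, and that the depth array is the raw prediction taken before the visualization normalization.

```python
import numpy as np


def backproject_to_points(depth: np.ndarray, focal_length_px: float) -> np.ndarray:
    """Back-project an (H, W) metric depth map to an (H, W, 3) array of XYZ points.

    Assumptions (not from the model card): pinhole camera, square pixels,
    principal point at the image center, focal length given in pixels.
    """
    height, width = depth.shape
    cx = (width - 1) / 2.0
    cy = (height - 1) / 2.0
    # Pixel coordinate grids: u runs along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    x = (u - cx) * depth / focal_length_px
    y = (v - cy) * depth / focal_length_px
    return np.stack([x, y, depth], axis=-1)


# Toy example: a flat wall 2 units away, seen with a 2-pixel focal length.
points = backproject_to_points(np.full((3, 3), 2.0), focal_length_px=2.0)
```

With the Quick Start variables, the inputs would be the `predicted_depth` tensor converted to a NumPy array and the `focal_length` value converted to a float.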

✨ Features

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared DINOv2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

The abstract from the paper is the following:

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions.

This is the model card of a 🤗 transformers model that has been pushed on the Hub.

📚 Documentation

Model Details

Developed by: Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, Vladlen Koltun.
Model type: DepthPro
License: Apple-ASCL

Model Sources

HF Docs: DepthPro
Repository: https://github.com/apple/ml-depth-pro
Paper: https://arxiv.org/abs/2410.02073

Training Details

Training Data

The DepthPro model was trained on the following datasets:

[Image: list of training datasets, not reproduced here]

Preprocessing

Images go through the following preprocessing steps:

rescaled by 1/255
normalized with mean=[0.5, 0.5, 0.5] and std=[0.5, 0.5, 0.5]
resized to 1536x1536 pixels
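
The rescaling and normalization steps can be illustrated with a few lines of NumPy. This is a sketch rather than the image processor's actual code: the function name `rescale_and_normalize` is hypothetical, the rescale factor is assumed to be 1/255 (the usual convention for 8-bit images), and the resize to 1536x1536 is omitted.

```python
import numpy as np


def rescale_and_normalize(image_uint8: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) uint8 image to float32 values in [-1, 1]."""
    mean = np.array([0.5, 0.5, 0.5], dtype=np.float32)
    std = np.array([0.5, 0.5, 0.5], dtype=np.float32)
    scaled = image_uint8.astype(np.float32) / 255.0  # rescale to [0, 1]
    return (scaled - mean) / std  # normalize to [-1, 1]


# One pixel whose channels hold the darkest, a mid-gray, and the brightest value.
pixels = np.array([[[0, 127, 255]]], dtype=np.uint8)
normalized = rescale_and_normalize(pixels)
```

With mean and standard deviation both set to 0.5, the two steps together map the 8-bit range [0, 255] linearly onto [-1, 1].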

Training Hyperparameters

[Image: training hyperparameters, not reproduced here]

Evaluation

[Image: evaluation results, not reproduced here]

Model Architecture and Objective

[Image: model architecture diagram, not reproduced here]

The DepthProForDepthEstimation model uses a DepthProEncoder to encode the input image and a FeatureFusionStage to fuse the encoder's output features.

The DepthProEncoder further uses two encoders:

patch_encoder
- The input image is scaled by multiple ratios, as specified in the scaled_images_ratios configuration.
- Each scaled image is split into smaller patches of size patch_size, with overlapping areas determined by scaled_images_overlap_ratios.
- These patches are processed by the patch_encoder.
image_encoder
- The input image is also rescaled to patch_size and processed by the image_encoder.
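
The multi-scale patch layout described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper `patch_starts` is hypothetical, the stride rule `patch_size * (1 - overlap_ratio)` is an assumed interpretation of the overlap ratio, and the numeric values (a 1536-pixel input, 384-pixel patches, and the ratio and overlap lists) are illustrative rather than taken from this document.

```python
def patch_starts(image_size: int, patch_size: int, overlap_ratio: float) -> list[int]:
    """Return the start offsets of overlapping patches along one image axis."""
    if image_size == patch_size:
        return [0]  # the scaled image is already a single patch
    # Assumed rule: neighbouring patches share overlap_ratio of their width.
    stride = int(patch_size * (1 - overlap_ratio))
    return list(range(0, image_size - patch_size + 1, stride))


# Illustrative values only (assumptions, see the lead-in).
base_size, patch_size = 1536, 384
scaled_images_ratios = [0.25, 0.5, 1.0]
scaled_images_overlap_ratios = [0.0, 0.5, 0.25]

patches_per_scale = []
for ratio, overlap in zip(scaled_images_ratios, scaled_images_overlap_ratios):
    starts = patch_starts(int(base_size * ratio), patch_size, overlap)
    patches_per_scale.append(len(starts) ** 2)  # same tiling on both axes

total_patches = sum(patches_per_scale)
```

Under these assumed values, the smallest scale yields a single patch and the larger scales are tiled into overlapping grids, all of which share one patch encoder.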

Both encoders can be configured via patch_model_config and image_model_config, respectively; by default, each is a separate Dinov2Model.

Outputs from both encoders (last_hidden_state) and selected intermediate states (hidden_states) from patch_encoder are fused by a DPT-based FeatureFusionStage for depth estimation.

The network is supplemented with a focal length estimation head. A small convolutional head ingests frozen features from the depth estimation network and task-specific features from a separate ViT image encoder to predict the horizontal angular field-of-view.
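
A horizontal field of view and a focal length in pixels are related through the standard pinhole camera geometry, so either can be derived from the other once the image width is known. The sketch below shows that conversion; the function names are hypothetical, and it assumes the field of view is given in degrees and the principal point is centered.

```python
import math


def fov_to_focal_length_px(fov_degrees: float, image_width: int) -> float:
    """Pinhole relation: f = (W / 2) / tan(FOV / 2), with f and W in pixels."""
    return 0.5 * image_width / math.tan(0.5 * math.radians(fov_degrees))


def focal_length_px_to_fov(focal_length_px: float, image_width: int) -> float:
    """Inverse relation: FOV = 2 * atan((W / 2) / f), returned in degrees."""
    return math.degrees(2.0 * math.atan(0.5 * image_width / focal_length_px))


# A 90-degree horizontal field of view on a 1536-pixel-wide image.
focal = fov_to_focal_length_px(90.0, 1536)
fov_back = focal_length_px_to_fov(focal, 1536)
```

A wider field of view corresponds to a shorter focal length for the same image width, which is why the two quantities are reported together.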

Citation

BibTeX:

@misc{bochkovskii2024depthprosharpmonocular,
      title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
      author={Aleksei Bochkovskii and Amaël Delaunoy and Hugo Germain and Marcel Santos and Yichao Zhou and Stephan R. Richter and Vladlen Koltun},
      year={2024},
      eprint={2410.02073},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.02073},
}

Model Card Authors

Armaghan Shakir

📄 License

The model is released under the Apple-ASCL license.
