🚀 Depth Anything V2 (Fine-tuned for Metric Depth Estimation) - Transformers Version
This model is a fine-tuned variant of Depth Anything V2 for indoor metric depth estimation, trained on the synthetic Hypersim dataset. The checkpoint is compatible with the transformers library.
Depth Anything V2 was introduced in the paper of the same name by Lihe Yang et al. It shares the same architecture as the original Depth Anything release but uses synthetic data and a larger-capacity teacher model to achieve more precise and robust depth predictions. This fine-tuned version for metric depth estimation was first released in this repository.
✨ Features
- Six metric depth models: checkpoints at three scales (Small, Base, Large) are available for indoor and outdoor scenes.
📦 Installation
The model requires transformers>=4.45.0. You can either install that specific version or install the latest version from source:
pip install git+https://github.com/huggingface/transformers
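If you are unsure which version is installed, a quick check such as the following can help (a minimal sketch; the packaging module ships as a transformers dependency):
import transformers
from packaging import version
# This checkpoint needs transformers 4.45.0 or newer.
assert version.parse(transformers.__version__) >= version.parse("4.45.0"), (
    f"Found transformers {transformers.__version__}; please upgrade to >= 4.45.0."
)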
💻 Usage Examples
Basic Usage
Here is how to use this model to perform zero-shot depth estimation:
from transformers import pipeline
from PIL import Image
import requests
pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Metric-Indoor-Small-hf")
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
depth = pipe(image)["depth"]
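The pipeline returns a dictionary whose "depth" entry is a PIL image, alongside the raw "predicted_depth" tensor. A minimal sketch for saving and inspecting the result (the output filename is just an example):
result = pipe(image)
result["depth"].save("depth.png")       # rendered depth map (PIL image); example filename
print(result["predicted_depth"].shape)  # raw depth tensor returned by the model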
Advanced Usage
Alternatively, you can use the model and processor classes:
from transformers import AutoImageProcessor, AutoModelForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = AutoImageProcessor.from_pretrained("depth-anything/Depth-Anything-V2-Metric-Indoor-Small-hf")
model = AutoModelForDepthEstimation.from_pretrained("depth-anything/Depth-Anything-V2-Metric-Indoor-Small-hf")
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)
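Continuing from the snippet above, the interpolated tensor can be converted to a NumPy array for downstream use or rendered for a quick visual check; a minimal sketch (the min-max normalization is only for visualization, and the filename is an example):
depth_map = prediction.squeeze().cpu().numpy()  # per-pixel depth values from the metric model
depth_vis = (depth_map - depth_map.min()) / (depth_map.max() - depth_map.min()) * 255.0
Image.fromarray(depth_vis.astype(np.uint8)).save("depth_metric.png")  # example filename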
For more code examples, please refer to the documentation.
📚 Documentation
Model description
Depth Anything V2 leverages the DPT architecture with a DINOv2 backbone. The model is trained on ~600K synthetic labeled images and ~62 million real unlabeled images, achieving state-of-the-art results for both relative and absolute depth estimation.

Figure: Depth Anything overview, taken from the original paper.
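To see how these pieces are wired together in the transformers implementation, you can inspect the checkpoint's configuration. A small sketch (attribute names follow the DepthAnything config class in transformers; the commented values are what this checkpoint is expected to report):
from transformers import AutoConfig
config = AutoConfig.from_pretrained("depth-anything/Depth-Anything-V2-Metric-Indoor-Small-hf")
print(type(config).__name__)              # expected: DepthAnythingConfig
print(config.backbone_config.model_type)  # expected: dinov2
print(config.depth_estimation_type)       # expected: "metric" for this fine-tuned checkpoint
print(config.max_depth)                   # depth range used to scale the metric head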
Available Models
Six metric depth models are available, covering three scales (Small, Base, Large) for indoor and outdoor scenes:
- depth-anything/Depth-Anything-V2-Metric-Indoor-Small-hf
- depth-anything/Depth-Anything-V2-Metric-Indoor-Base-hf
- depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf
- depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf
- depth-anything/Depth-Anything-V2-Metric-Outdoor-Base-hf
- depth-anything/Depth-Anything-V2-Metric-Outdoor-Large-hf
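Any of these checkpoints can be dropped into the same pipeline call; for example, to switch to an outdoor variant (a sketch reusing the pipeline API shown above):
from transformers import pipeline
outdoor_pipe = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf",
)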
📄 Citation
If you use this model in your research, please cite the following papers:
@article{depth_anything_v2,
  title={Depth Anything V2},
  author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Zhao, Zhen and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
  journal={arXiv:2406.09414},
  year={2024}
}

@inproceedings{depth_anything_v1,
  title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data},
  author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
  booktitle={CVPR},
  year={2024}
}