Sapiens Depth 0.3b

Developed by Meta (facebook on Hugging Face)

Sapiens is a Vision Transformer model pre-trained on 300 million high-resolution human images, specializing in human-centric vision tasks.

3D Vision, English
#High-Resolution Depth Estimation #Human Image Specialized #Synthetic Data Generalization

Release Time: 9/10/2024

Model Overview

This model estimates relative depth in human images, natively supports 1K high-resolution inference, and generalizes well to real-world data.

Model Features

High-Resolution Support

Natively supports 1K high-resolution inference at an input size of 1024 x 768 (H x W).

Exceptional Generalization

Generalizes to real-world data even when labeled training data is scarce or entirely synthetic.

Efficient Computation

Computational cost of 1.242 trillion FLOPs (listed as 1.242 TFLOPs in the model card).

Model Capabilities

Human Image Depth Estimation

High-Resolution Image Processing

Use Cases

Computer Vision

Human Depth Perception

Estimates relative depth in human images, with applications in augmented reality and virtual reality.

Demonstrates exceptional generalization in real-world scenarios.

🚀 Depth-Sapiens-0.3B

A vision transformer model for depth estimation on human images, offering high-resolution inference and strong generalization.

🚀 Quick Start

The Depth-Sapiens-0.3B model is designed for depth estimation on human images. It is part of the Sapiens family of vision transformers, which are pretrained on a large dataset of human images.

✨ Features

Sapiens is a family of vision transformers pretrained on 300 million human images at 1024 x 1024 resolution. When finetuned for human-centric vision tasks, these pretrained models generalize well to in-the-wild conditions.
Sapiens-0.3B natively supports 1K high-resolution inference. The finetuned models show remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic.

📚 Documentation

Model Details

Property	Details
Developed by	Meta
Model Type	Vision Transformer
Training Data	300 million human images at 1024 x 1024 resolution
License	Creative Commons Attribution-NonCommercial 4.0
Task	Depth estimation
Format	original
File	sapiens_0.3b_render_people_epoch_100.pth

Model Card

Property	Details
Image Size	1024 x 768 (H x W)
Num Parameters	0.336 B
FLOPs	1.242 TFLOPs
Patch Size	16 x 16
Embedding Dimensions	1024
Num Layers	24
Num Heads	16
Feedforward Channels	4096
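
The figures in the table above can be cross-checked with a little arithmetic. The sketch below derives the patch grid, the per-head dimension, and a rough count of transformer weights from the listed values. It assumes a standard ViT layout (non-overlapping patches, four attention projection matrices and a two-layer MLP per block); the helper names `patch_grid` and `approx_encoder_params` are hypothetical and not part of the Sapiens repository.

```python
# Architecture values copied from the Model Card table.
IMAGE_HEIGHT, IMAGE_WIDTH = 1024, 768
PATCH_SIZE = 16
EMBED_DIM = 1024
NUM_LAYERS = 24
NUM_HEADS = 16
FFN_DIM = 4096


def patch_grid(height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (rows, cols) of non-overlapping patches; sizes must divide evenly."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def approx_encoder_params(dim: int, ffn: int, layers: int) -> int:
    """Count only the attention (4 * dim^2) and MLP (2 * dim * ffn) weight matrices.

    Assumption: a standard ViT block. Biases, layer norms, embeddings, and any
    task-specific head are ignored.
    """
    return layers * (4 * dim * dim + 2 * dim * ffn)


rows, cols = patch_grid(IMAGE_HEIGHT, IMAGE_WIDTH, PATCH_SIZE)
num_tokens = rows * cols
head_dim = EMBED_DIM // NUM_HEADS
encoder_params = approx_encoder_params(EMBED_DIM, FFN_DIM, NUM_LAYERS)

print(f"patch grid: {rows} x {cols} = {num_tokens} tokens")
print(f"per-head dimension: {head_dim}")
print(f"approx. transformer weights: {encoder_params / 1e9:.3f} B")
```

Under these assumptions a 1024 x 768 input yields a 64 x 48 grid (3072 patch tokens), each head works on 64 dimensions, and the block weights alone come to roughly 0.302 B. The estimate deliberately omits several components, so it is expected to fall somewhat below the 0.336 B listed in the table.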

More Resources

Repository: https://github.com/facebookresearch/sapiens
Paper: https://arxiv.org/abs/2408.12569
Demo: https://huggingface.co/spaces/facebook/sapiens-depth
Project Page: https://about.meta.com/realitylabs/codecavatars/sapiens
Additional Results: https://rawalkhirodkar.github.io/sapiens
HuggingFace Collection: https://huggingface.co/collections/facebook/sapiens-66d22047daa6402d565cb2fc

💻 Usage Examples

Basic Usage

The Depth 0.3B model can be used to estimate relative depth on human images.
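
Because the output is relative rather than metric depth, the raw values are usually rescaled before they are visualized or compared. The following is a minimal sketch of one common choice, min-max normalization restricted to the person pixels. It is not taken from the Sapiens repository: the function name `normalize_relative_depth`, the foreground mask, and the toy values are all assumptions for illustration, and plain Python lists stand in for the arrays an image library would normally provide.

```python
from typing import List, Optional

Grid = List[List[float]]
Mask = List[List[bool]]


def normalize_relative_depth(depth: Grid, mask: Mask) -> List[List[Optional[float]]]:
    """Min-max scale depth values to [0, 1] over pixels where mask is True.

    Background pixels (mask False) are returned as None.
    """
    foreground = [
        d
        for depth_row, mask_row in zip(depth, mask)
        for d, m in zip(depth_row, mask_row)
        if m
    ]
    if not foreground:
        raise ValueError("mask selects no pixels")
    lo, hi = min(foreground), max(foreground)
    span = hi - lo
    normalized = []
    for depth_row, mask_row in zip(depth, mask):
        normalized.append([
            # A constant foreground (span == 0) maps to 0.0 to avoid dividing by zero.
            (0.0 if span == 0 else (d - lo) / span) if m else None
            for d, m in zip(depth_row, mask_row)
        ])
    return normalized


# Toy 2 x 3 "prediction"; the values are made up for illustration.
depth = [[2.0, 4.0, 9.0],
         [3.0, 6.0, 9.0]]
mask = [[True, True, False],
        [True, True, False]]
normalized = normalize_relative_depth(depth, mask)
```

Restricting the minimum and maximum to the masked region keeps background values from compressing the range that the person occupies. With real model output the same logic would typically be written with an array library instead of nested lists.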

📄 License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 license.
