Sapiens Depth 2B (bfloat16)
Sapiens-2B is a vision Transformer pre-trained on 300 million high-resolution human images and optimized for human depth estimation. It supports 1K-resolution inference and generalizes well to real-world scenarios.
Release date: 9/10/2024
Model Overview
This model is a 2.1-billion-parameter vision Transformer developed by Meta for relative depth estimation on human images. It performs well on both synthetic and real-world data.
Model Features
High-Resolution Support
Natively supports 1K-resolution input, processing human images at up to 1024×768 pixels.
Synthetic Data Generalization
Maintains excellent generalization capabilities for real-world data even when trained entirely on synthetic data.
Efficient Computation
Stored and run in the bfloat16 data format; a single forward pass requires roughly 8.709 trillion floating-point operations (8.709 TFLOPs).
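The bfloat16 format mentioned above keeps float32's full 8-bit exponent but truncates the mantissa to 7 bits, halving memory per weight while preserving dynamic range. A minimal pure-Python sketch of the conversion (bfloat16 is simply the top 16 bits of a float32; this is an illustration, not code from the Sapiens release):

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16

def from_bfloat16_bits(b: int) -> float:
    """Expand 16 bfloat16 bits back to a float32 value (low bits zeroed)."""
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x

# Exactly representable values round-trip unchanged...
print(from_bfloat16_bits(to_bfloat16_bits(1.0)))
# ...and very large magnitudes stay finite because the exponent is intact,
# with at most ~0.4% relative error from the truncated mantissa.
print(from_bfloat16_bits(to_bfloat16_bits(3e38)))
```

Because the exponent field is unchanged, converting float32 weights to bfloat16 never overflows, which is why it is a popular inference format for large models like this one.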
Model Capabilities
Human Depth Estimation
High-Resolution Image Processing
Transfer Learning from Synthetic to Real-World Scenarios
Use Cases
Virtual Reality
3D Human Modeling
Generates human depth information from a single image for 3D modeling.
Can produce accurate relative depth maps.
Film Special Effects
Depth-Aware Effects
Provides human depth information for post-production in films.
Supports more realistic depth-of-field effects and virtual scene integration.
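One way a depth map drives a depth-of-field effect is by assigning each pixel a blur radius that grows with its distance from the focal plane. A deliberately simplified circle-of-confusion sketch (a toy model for illustration, not the pipeline used in production tools):

```python
def blur_radius(depth: float, focus: float, max_radius: float = 8.0) -> float:
    """Blur radius (pixels) for a pixel at `depth`, focused at `focus`.

    Pixels on the focal plane stay sharp (radius 0); blur grows linearly
    with distance from the plane and is clamped at max_radius.
    """
    return min(max_radius, max_radius * abs(depth - focus))

# Pixel on the focal plane is sharp; a far pixel hits the clamp.
print(blur_radius(0.5, focus=0.5))
print(blur_radius(1.5, focus=0.5))
```

A compositor would then apply a per-pixel blur kernel of that radius, which is what makes the synthetic depth of field track the human subject correctly.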