Sapiens-depth-0.3b-bfloat16 Open Source Model - A Practical Choice for Human-Centric Visual Tasks

Sapiens Depth 0.3b Bfloat16

Developed by facebook

Sapiens is a series of vision transformer models pre-trained on 300 million human images at 1024x1024 resolution, focusing on human-centric vision tasks.

Tags: 3D Vision, English, High-resolution depth estimation, Human image specialized, 1K resolution support

Downloads 22

Release Time: 9/10/2024

Model Overview

This model is used to estimate relative depth information in human images, supports 1K high-resolution inference, and demonstrates exceptional generalization capabilities for real-world data.

Model Features

High-resolution support

Natively supports 1K high-resolution inference at an image size of 1024 x 768 (H x W).

Strong generalization capability

Generalizes well to real-world data even when labeled data is scarce or entirely synthetic.

Efficient computation

Computational load of 1.242 TFLOPs with 336 million parameters, balancing performance and efficiency.

Model Capabilities

Depth estimation

High-resolution image processing

Human image analysis

Use Cases

Computer vision

Human image depth estimation

Used to estimate relative depth information in human images, suitable for virtual reality, augmented reality, and similar scenarios.

Demonstrates outstanding generalization capabilities in complex scenes.
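To make "relative depth" concrete: one common reading is that the predicted values are meaningful only up to an unknown scale and offset, so downstream code usually rescales them before display or comparison. The sketch below is illustrative only; the function name `normalize_relative_depth` and the optional foreground mask are assumptions, not part of the Sapiens API.

```python
import numpy as np


def normalize_relative_depth(depth, mask=None):
    """Min-max scale a depth map to [0, 1], optionally only inside a mask.

    Pixels outside the mask are set to 0. The mask is hypothetical here,
    for example a person segmentation produced by another model.
    """
    depth = np.asarray(depth, dtype=np.float32)
    if mask is None:
        mask = np.ones(depth.shape, dtype=bool)
    foreground = depth[mask]
    lo, hi = foreground.min(), foreground.max()
    out = np.zeros_like(depth)
    if hi > lo:  # avoid division by zero on a constant map
        out[mask] = (foreground - lo) / (hi - lo)
    return out


raw = np.array([[1.0, 2.0], [3.0, 5.0]])
a = normalize_relative_depth(raw)
b = normalize_relative_depth(2.0 * raw + 10.0)  # same scene, other scale/offset
print(np.allclose(a, b))  # True: the normalized maps agree
```

Because the normalized map is unchanged by a positive rescaling or a shift of the raw output, it is suitable for visualization or depth ordering, but not for reading off metric distances.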

🚀 Depth-Sapiens-0.3B-Bfloat16

Sapiens is a vision transformer family pretrained on 300 million high-resolution human images. This model estimates relative depth on human images.

🚀 Quick Start

The Depth-Sapiens-0.3B-Bfloat16 model is designed for depth estimation on human images. It is based on the Sapiens family of vision transformers, which are pretrained on a large-scale human image dataset.

✨ Features

Sapiens is pretrained on 300 million human images at 1024 x 1024 resolution, enabling it to generalize well to in-the-wild conditions when fine-tuned for human-centric vision tasks.
The Sapiens-0.3B model natively supports 1K high-resolution inference and shows excellent generalization even with scarce or synthetic labeled data.

📚 Documentation

Model Details

Developed by: Meta
Model type: Vision Transformer
License: Creative Commons Attribution-NonCommercial 4.0
Task: depth
Format: bfloat16
File: sapiens_0.3b_render_people_epoch_100_bfloat16.pt2

Sapiens is a family of vision transformers pretrained on 300 million human images at 1024 x 1024 image resolution. The pretrained models, when fine-tuned for human-centric vision tasks, generalize to in-the-wild conditions. Sapiens-0.3B natively supports 1K high-resolution inference. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic.

Model Card

Property	Details
Image Size	1024 x 768 (H x W)
Num Parameters	0.336 B
FLOPs	1.242 TFLOPs
Patch Size	16 x 16
Embedding Dimensions	1024
Num Layers	24
Num Heads	16
Feedforward Channels	4096
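The figures in the table can be cross-checked with a little arithmetic. The sketch below derives the token count per image and a rough parameter estimate for the transformer blocks, assuming a standard ViT layout (four attention projection matrices plus a two-layer feedforward per block). It ignores biases, layer norms, patch and position embeddings, and any task-specific head, so it is expected to land below the reported 0.336 B.

```python
# Values taken from the model card table above.
height, width, patch = 1024, 768, 16
embed_dim, num_layers, num_heads, ffn_dim = 1024, 24, 16, 4096

# Number of 16 x 16 patches (tokens) per input image: 64 * 48.
tokens = (height // patch) * (width // patch)

# Per-block weights under the assumed standard ViT layout.
attn_params = 4 * embed_dim * embed_dim   # Q, K, V and output projections
ffn_params = 2 * embed_dim * ffn_dim      # two linear layers
block_params = num_layers * (attn_params + ffn_params)

head_dim = embed_dim // num_heads

print(tokens)                        # 3072
print(head_dim)                      # 64
print(f"{block_params / 1e9:.3f}")   # 0.302 (billions, blocks only)
```

The estimate of about 0.302 B for the blocks alone is consistent with the 0.336 B total once the omitted components are accounted for, although the exact breakdown is not stated in the model card.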

More Resources

Repository: https://github.com/facebookresearch/sapiens
Paper: https://arxiv.org/abs/2408.12569
Demo: https://huggingface.co/spaces/facebook/sapiens-depth
Project Page: https://about.meta.com/realitylabs/codecavatars/sapiens
Additional Results: https://rawalkhirodkar.github.io/sapiens
HuggingFace Collection: https://huggingface.co/collections/facebook/sapiens-66d22047daa6402d565cb2fc

💻 Usage Examples

Basic Usage

The Depth 0.3B model can be used to estimate relative depth on human images. The specific code implementation may depend on the actual application scenario and the framework used.
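As a starting point, a minimal inference sketch in Python might look like the following. It rests on several assumptions that should be verified against the official repository: that the `.pt2` file was saved with `torch.export` (and can therefore be loaded with `torch.export.load`), that the model expects a normalized 1 x 3 x 1024 x 768 input and returns a 1 x 1 x H x W depth map, that a CUDA GPU with bfloat16 support is available, and that the normalization constants below match the ones used in training. The helper names `preprocess` and `estimate_depth` are hypothetical.

```python
import numpy as np

# Assumed per-channel normalization constants on a 0-255 scale;
# check the official preprocessing before relying on these values.
MEAN = np.array([123.5, 116.5, 103.5], dtype=np.float32)
STD = np.array([58.5, 57.0, 57.5], dtype=np.float32)


def preprocess(image_rgb: np.ndarray) -> np.ndarray:
    """Turn an H x W x 3 uint8 RGB image (already resized to 1024 x 768)
    into a normalized 1 x 3 x H x W float32 array."""
    x = (image_rgb.astype(np.float32) - MEAN) / STD
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None])  # HWC -> NCHW


def estimate_depth(checkpoint_path: str, image_rgb: np.ndarray) -> np.ndarray:
    """Run the exported model on one image and return an H x W depth map.

    Assumes a torch.export checkpoint and a bfloat16-capable CUDA GPU.
    """
    import torch  # imported lazily so preprocess() works without PyTorch

    model = torch.export.load(checkpoint_path).module()
    x = torch.from_numpy(preprocess(image_rgb)).to("cuda", dtype=torch.bfloat16)
    with torch.inference_mode():
        out = model(x)  # assumed output shape: 1 x 1 x H x W
    return out[0, 0].float().cpu().numpy()


# Shape check on a dummy image; no checkpoint or GPU is needed for this part.
dummy = np.zeros((1024, 768, 3), dtype=np.uint8)
print(preprocess(dummy).shape)  # (1, 3, 1024, 768)
```

Calling `estimate_depth("sapiens_0.3b_render_people_epoch_100_bfloat16.pt2", image)` would then return the raw relative depth map under these assumptions. The reference inference scripts in the repository remain the authoritative source for loading and preprocessing details.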

📄 License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 license.
