Open-source Sapiens-depth-0.6b-torchscript Model - Efficiently Process Human-centered Visual Tasks

Sapiens Depth 0.6b Torchscript

Developed by facebook

Sapiens is a family of vision transformers pre-trained on 300 million human images at 1024 x 1024 resolution, targeting human-centric vision tasks.

Tags: 3D Vision, English, High-resolution depth estimation, Human-centric vision, 300 million image pre-training

Downloads 34

Release Time: 9/9/2024

Model Overview

This model estimates relative depth in images of humans, supports high-resolution inference, and generalizes well to real-world data.

Model Features

High-resolution support

Natively supports 1K high-resolution inference, suitable for high-quality image processing.

Exceptional generalization capability

Generalizes to real-world data even when labeled training data is scarce or entirely synthetic.

Large-scale pre-training

Pre-trained on 300 million human images at 1024 x 1024 resolution, which gives the model strong feature representations.

Model Capabilities

Human image depth estimation

High-resolution image processing

Use Cases

Computer vision

Human depth estimation

Estimates relative depth in images of humans for applications such as virtual reality and augmented reality, and generalizes well to real-world data.

🚀 Depth-Sapiens-0.6B-Torchscript

Depth-Sapiens-0.6B-Torchscript is a vision transformer model for depth estimation. It is pretrained on a large-scale human image dataset and generalizes well to real-world scenarios.

🚀 Quick Start

This model is designed for depth estimation on human images. It has been pretrained on 300 million human images at a resolution of 1024 x 1024 and can be fine-tuned for human-centric vision tasks.

✨ Features

Sapiens is a family of vision transformers pretrained on a large number of high-resolution human images.
Sapiens-0.6B natively supports 1K high-resolution inference and generalizes well even when labeled data is scarce or synthetic.

📚 Documentation

Model Details

Developed by: Meta
Model type: Vision Transformer
License: Creative Commons Attribution-NonCommercial 4.0
Task: depth
Format: torchscript
File: sapiens_0.6b_render_people_epoch_70_torchscript.pt2

Sapiens is a family of vision transformers pretrained on 300 million human images at 1024 x 1024 image resolution. Sapiens-0.6B natively supports 1K high-resolution inference. When finetuned for human-centric vision tasks, the pretrained models generalize to in-the-wild data, even when labeled data is scarce or entirely synthetic.

Model Card

Property	Details
Image Size	1024 x 768 (H x W)
Num Parameters	0.664 B
FLOPs	2.583 TFLOPs
Patch Size	16 x 16
Embedding Dimensions	1280
Num Layers	32
Num Heads	16
Feedforward Channels	5120
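
The figures in the table can be cross-checked with a little arithmetic. The sketch below derives the token grid, the per-head dimension, and a rough encoder weight count from the listed values. It assumes a standard ViT block (four attention projection matrices plus a two-layer feedforward network) and ignores biases, normalization layers, embeddings, and the depth head, so it is only an estimate and is expected to land somewhat below the listed 0.664 B.

```python
# Values copied from the Model Card table.
IMAGE_H, IMAGE_W = 1024, 768
PATCH = 16
EMBED_DIM = 1280
NUM_LAYERS = 32
NUM_HEADS = 16
FFN_DIM = 5120

# Token grid produced by non-overlapping 16 x 16 patches.
tokens_h = IMAGE_H // PATCH
tokens_w = IMAGE_W // PATCH
num_tokens = tokens_h * tokens_w

# Channels handled by each attention head.
head_dim = EMBED_DIM // NUM_HEADS

# Rough weight count per block, ASSUMING a standard ViT layout:
# Q, K, V and output projections, then two feedforward matrices.
attn_params = 4 * EMBED_DIM * EMBED_DIM
ffn_params = 2 * EMBED_DIM * FFN_DIM
encoder_params = NUM_LAYERS * (attn_params + ffn_params)

print(f"token grid: {tokens_h} x {tokens_w} = {num_tokens} tokens")
print(f"head dim:   {head_dim}")
print(f"encoder weights (approx.): {encoder_params / 1e9:.3f} B")
```

Under these assumptions the encoder blocks account for roughly 0.63 B weights; the gap to the listed 0.664 B presumably comes from the parts the estimate leaves out.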

More Resources

Repository: https://github.com/facebookresearch/sapiens
Paper: https://arxiv.org/abs/2408.12569
Demo: https://huggingface.co/spaces/facebook/sapiens-depth
Project Page: https://about.meta.com/realitylabs/codecavatars/sapiens/
Additional Results: https://rawalkhirodkar.github.io/sapiens/
HuggingFace Collection: https://huggingface.co/collections/facebook/sapiens-66d22047daa6402d565cb2fc

💻 Usage Examples

Basic Usage

The Depth 0.6B model can be used to estimate relative depth on human images.
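
The source does not include inference code, so the following is only a minimal sketch of how the TorchScript file might be run. The helper names (`preprocess`, `normalize_depth`, `estimate_depth`) are hypothetical, the ImageNet-style normalization constants are an assumption rather than something stated on this page, and the output is assumed to be a single-channel depth tensor. The image is expected to be resized to 1024 x 768 (H x W) beforehand, matching the Model Card.

```python
import numpy as np

INPUT_H, INPUT_W = 1024, 768  # H x W from the Model Card

# ASSUMPTION: ImageNet-style mean/std on a 0..1 scale; verify against
# the official repository before relying on these values.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_rgb: np.ndarray) -> np.ndarray:
    """Convert a uint8 H x W x 3 RGB image into a float32 1 x 3 x H x W batch."""
    if image_rgb.shape != (INPUT_H, INPUT_W, 3):
        raise ValueError(f"expected {(INPUT_H, INPUT_W, 3)}, got {image_rgb.shape}")
    x = image_rgb.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    # Channels first, then add a batch dimension.
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None])


def normalize_depth(depth: np.ndarray) -> np.ndarray:
    """Min-max scale relative depth to [0, 1] for visualization."""
    d_min, d_max = float(depth.min()), float(depth.max())
    if d_max - d_min < 1e-8:
        return np.zeros_like(depth, dtype=np.float32)
    return ((depth - d_min) / (d_max - d_min)).astype(np.float32)


def estimate_depth(model_path: str, image_rgb: np.ndarray) -> np.ndarray:
    """Run the TorchScript model on CPU and return the raw depth map."""
    import torch  # imported lazily so the helpers above work without it

    model = torch.jit.load(model_path, map_location="cpu").eval()
    with torch.inference_mode():
        out = model(torch.from_numpy(preprocess(image_rgb)))
    # ASSUMPTION: the output is one tensor holding a single depth channel.
    return out.squeeze().cpu().numpy()
```

Because the model predicts relative rather than metric depth, the min-max scaling in `normalize_depth` is meant for display only. A call might look like `normalize_depth(estimate_depth("sapiens_0.6b_render_people_epoch_70_torchscript.pt2", image))`, where `image` is a resized RGB array.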

📄 License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 license.
