Open-source Sapiens-pose-0.3b-torchscript Model - Precise Pose Estimation with 308-Keypoint Detection

Sapiens Pose 0.3b Torchscript

Developed by facebook

Sapiens is a family of vision transformers pretrained on 300 million high-resolution human images. This checkpoint is fine-tuned for pose estimation and detects 308 keypoints.

Pose Estimation, English
#High-resolution pose estimation #Full-body keypoint detection #300 million image pre-training

Downloads 55

Release Time: 9/13/2024

Model Overview

This model estimates full-body keypoints (body, face, hands, and feet) in a single image and runs natively at 1024 x 768 (H x W) resolution.

Model Features

High-resolution support

Natively supports 1024x768 high-resolution input, suitable for fine-grained pose analysis

Multi-part keypoint detection

Simultaneously detects 308 keypoints across body, face, hands, and feet

Strong generalization capability

Pre-trained on 300 million images, demonstrating excellent performance in real-world scenarios

Efficient inference

About 1.242 trillion FLOPs of compute (as listed in the model card), balancing accuracy and efficiency

Model Capabilities

Full-body pose estimation

Multi-part keypoint detection

High-resolution image processing

Use Cases

Motion analysis

Sports pose analysis

Used for athlete motion capture and posture correction

Can accurately identify 308 keypoints

Human-computer interaction

Gesture recognition

Recognizes complex hand movements

Includes hand keypoint detection
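
For motion-analysis use cases such as posture correction, a common downstream step is turning detected keypoints into joint angles. The sketch below is illustrative only: the `joint_angle` helper and the example coordinates are hypothetical and are not part of Sapiens, and which of the 308 keypoint indices correspond to a given joint must be taken from the official repository.

```python
import math


def joint_angle(a, b, c):
    """Return the angle at point b (in degrees) formed by segments b->a and b->c.

    Each point is an (x, y) pair in pixel coordinates.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm1 = math.hypot(*v1)
    norm2 = math.hypot(*v2)
    if norm1 == 0 or norm2 == 0:
        raise ValueError("Degenerate joint: two keypoints coincide")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (norm1 * norm2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical shoulder, elbow, and wrist positions (pixels).
shoulder, elbow, wrist = (100.0, 100.0), (100.0, 200.0), (200.0, 200.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
print(f"Elbow angle: {elbow_angle:.1f} degrees")
```

Tracking such an angle over consecutive frames is one simple way to compare a movement against a reference.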

🚀 Pose-Sapiens-0.3B-Torchscript

Sapiens is a vision transformer family pretrained on 300 million high-resolution human images. It generalizes well for human-centric vision tasks, even in challenging real-world scenarios.

🚀 Quick Start

The Pose-Sapiens-0.3B-Torchscript model offers high-performance keypoint detection capabilities. It's based on a vision transformer architecture pretrained on a large-scale human image dataset.
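
The card itself gives no usage snippet, so the following is a minimal sketch of how a TorchScript checkpoint like this one could be loaded and run. It assumes PyTorch is installed, that the checkpoint file named in the Model Details table has already been downloaded to the working directory, and that the model takes a normalized float tensor of shape (1, 3, 1024, 768) showing a single person and returns one heatmap per keypoint. The normalization constants (plain ImageNet statistics used as a placeholder), the helper names, and the output layout are all assumptions; check them against the official repository before relying on them.

```python
from pathlib import Path

# File name taken from the Model Details table; its local location is an assumption.
CHECKPOINT = Path("sapiens_0.3b_goliath_best_goliath_AP_573_torchscript.pt2")
INPUT_HEIGHT, INPUT_WIDTH = 1024, 768  # H x W from the Model Card table

# Assumed per-channel RGB normalization on the 0-255 scale (ImageNet statistics).
# Placeholder values: verify against the official repository before use.
MEAN = (123.675, 116.28, 103.53)
STD = (58.395, 57.12, 57.375)


def load_model(checkpoint=CHECKPOINT, device="cpu"):
    """Load the TorchScript module and switch it to inference mode."""
    import torch  # imported lazily so the constants above are usable without PyTorch

    model = torch.jit.load(str(checkpoint), map_location=device)
    return model.eval()


def preprocess(image_rgb, device="cpu"):
    """Turn an (H, W, 3) uint8 RGB tensor into a normalized (1, 3, 1024, 768) float tensor."""
    import torch
    import torch.nn.functional as F

    x = image_rgb.permute(2, 0, 1).unsqueeze(0).float().to(device)
    x = F.interpolate(
        x, size=(INPUT_HEIGHT, INPUT_WIDTH), mode="bilinear", align_corners=False
    )
    mean = torch.tensor(MEAN, device=device).view(1, 3, 1, 1)
    std = torch.tensor(STD, device=device).view(1, 3, 1, 1)
    return (x - mean) / std


def predict_heatmaps(model, image_rgb, device="cpu"):
    """Run one forward pass; the output is assumed to be (1, 308, h, w) heatmaps."""
    import torch

    with torch.inference_mode():
        return model(preprocess(image_rgb, device))
```

Usage would then be `model = load_model()` followed by `heatmaps = predict_heatmaps(model, image)`, where `image` is an (H, W, 3) uint8 RGB tensor.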

✨ Features

The Sapiens family is pretrained on 300 million human images at 1024 x 1024 resolution, enabling excellent generalization for human-centric vision tasks.
Sapiens-0.3B natively supports 1K high-resolution inference, and shows remarkable performance even with scarce or synthetic labeled data.
The Pose 0.3B model can estimate 308 keypoints (body + face + hands + feet) on a single image.
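
If the model's output is a stack of per-keypoint heatmaps (an assumption, since the card does not describe the output format), each keypoint can be read off as the location of its heatmap's maximum and rescaled to the input resolution. The `decode_heatmaps` helper below is a hypothetical pure-Python sketch on nested lists; in practice this would be vectorized with tensor operations.

```python
def decode_heatmaps(heatmaps, input_size=(1024, 768)):
    """Return one (x, y, score) triple per heatmap.

    heatmaps: list of 2D lists (rows of floats), one per keypoint.
    input_size: (height, width) of the model input; coordinates are
    rescaled from heatmap cells to this resolution.
    """
    in_h, in_w = input_size
    keypoints = []
    for hm in heatmaps:
        hm_h, hm_w = len(hm), len(hm[0])
        # Locate the peak response.
        score, row, col = max(
            (value, r, c) for r, line in enumerate(hm) for c, value in enumerate(line)
        )
        # Map the cell centre back to input-image pixels.
        x = (col + 0.5) * in_w / hm_w
        y = (row + 0.5) * in_h / hm_h
        keypoints.append((x, y, score))
    return keypoints


# Toy example: two 4 x 3 heatmaps instead of 308 full-size ones.
toy = [
    [[0.0, 0.1, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.2, 0.0, 0.7]],
]
for x, y, score in decode_heatmaps(toy):
    print(f"x={x:.0f}, y={y:.0f}, score={score:.1f}")
```

The peak value doubles as a rough confidence score, which is useful for discarding keypoints that are occluded or out of frame.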

📚 Documentation

Model Details

Sapiens is a family of vision transformers pretrained on 300 million human images at 1024 x 1024 image resolution. The pretrained models, when finetuned for human-centric vision tasks, generalize to in-the-wild conditions. Sapiens-0.3B natively supports 1K high-resolution inference. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic.

Property	Details
Developed by	Meta
Model Type	Vision Transformer
License	Creative Commons Attribution-NonCommercial 4.0
Task	pose
Format	torchscript
File	sapiens_0.3b_goliath_best_goliath_AP_573_torchscript.pt2

Model Card

Property	Details
Image Size	1024 x 768 (H x W)
Num Parameters	0.336 B
FLOPs	1.242 TFLOPs
Patch Size	16 x 16
Embedding Dimensions	1024
Num Layers	24
Num Heads	16
Feedforward Channels	4096
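
The figures in the Model Card table can be cross-checked with a little arithmetic. The sketch below derives the token count from the image and patch sizes, the per-head dimension, and a rough count of the encoder's weight matrices. It assumes a standard ViT block (four d x d attention projections plus a two-layer MLP) and ignores biases, normalization layers, embeddings, and the pose head, so it is expected to land below the reported 0.336 B.

```python
# Values copied from the Model Card table.
IMAGE_H, IMAGE_W = 1024, 768
PATCH = 16
EMBED_DIM = 1024
NUM_LAYERS = 24
NUM_HEADS = 16
FFN_DIM = 4096

# Number of patch tokens the encoder processes per image.
tokens_h, tokens_w = IMAGE_H // PATCH, IMAGE_W // PATCH
num_tokens = tokens_h * tokens_w

# Channels handled by each attention head.
head_dim = EMBED_DIM // NUM_HEADS

# Weight matrices per block under the standard-ViT assumption:
# Q, K, V and output projections (4 * d^2) plus the two MLP layers (2 * d * ffn).
attn_weights = 4 * EMBED_DIM * EMBED_DIM
mlp_weights = 2 * EMBED_DIM * FFN_DIM
encoder_weights = NUM_LAYERS * (attn_weights + mlp_weights)

print(f"token grid: {tokens_h} x {tokens_w} = {num_tokens} tokens")
print(f"head dimension: {head_dim}")
print(f"encoder weight matrices: {encoder_weights / 1e9:.3f} B parameters")
```

Under these assumptions the transformer blocks account for roughly 0.30 B of the 0.336 B parameters; the remainder would come from the terms left out above.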

More Resources

Repository: https://github.com/facebookresearch/sapiens
Paper: https://arxiv.org/abs/2408.12569
Demo: https://huggingface.co/spaces/facebook/sapiens-pose
Project Page: https://about.meta.com/realitylabs/codecavatars/sapiens
Additional Results: https://rawalkhirodkar.github.io/sapiens
HuggingFace Collection: https://huggingface.co/collections/facebook/sapiens-66d22047daa6402d565cb2fc

📄 License

This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 license.
