Sapiens Pretrain 2B (bfloat16)
Sapiens is a family of Vision Transformer models pre-trained on 300 million human images at 1024x1024 resolution, supporting high-resolution inference and generalization to real-world scenarios.
Release date: 9/10/2024
Model Overview
Sapiens-2B is a pre-trained model based on the Vision Transformer architecture, designed specifically for human-centric vision tasks. It generalizes well to real data even when annotations are scarce or the training data is fully synthetic.
Model Features
High-resolution support
Natively processes images at 1024x1024 resolution, making it well suited to high-quality visual data.
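Native 1024x1024 processing matters because a Vision Transformer's sequence length grows quadratically with input resolution. As a rough sketch (the 16x16 patch size below is a common ViT default and an assumption here, not a confirmed Sapiens hyperparameter), the token count can be computed as:

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    # The image must tile evenly into non-overlapping patches.
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    return (image_size // patch_size) ** 2

# 1024x1024 input with an assumed 16x16 patch size:
print(vit_token_count(1024, 16))  # 4096 tokens
# versus a standard 224x224 ViT input at the same patch size:
print(vit_token_count(224, 16))   # 196 tokens
```

At the same patch size, the 1024x1024 input produces roughly 21x more tokens than a standard 224x224 input, which is why native high-resolution pre-training, rather than upsampling a low-resolution model, is a distinguishing feature.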
Large-scale pre-training
Pre-trained on 300 million human images, featuring powerful feature extraction capabilities.
Real-world generalization
Generalizes strongly to real-world data even when annotations are scarce or the training data is fully synthetic.
Efficient computation
Ships in the bfloat16 numeric format, halving memory use relative to float32 while preserving its full dynamic range.
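The trade-off bfloat16 makes is easy to see numerically: it keeps float32's 8-bit exponent (so the dynamic range is unchanged) but drops the low 16 mantissa bits, leaving roughly 2-3 decimal digits of precision. A minimal stdlib-only sketch of that truncation:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Truncate a float32 value to bfloat16 precision (keep the top 16 bits)."""
    # Reinterpret the float32 bit pattern as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # Zero the low 16 bits: the exponent survives, most mantissa bits do not.
    bits &= 0xFFFF0000
    (y,) = struct.unpack(">f", struct.pack(">I", bits))
    return y

print(to_bfloat16(1.0))         # 1.0 -- exactly representable
print(to_bfloat16(3.14159265))  # 3.140625 -- only ~3 digits survive
```

(Real hardware rounds to nearest rather than truncating, but the precision budget is the same.) This is why bfloat16 halves memory and bandwidth at a small accuracy cost, which large pre-trained checkpoints like this one exploit.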
Model Capabilities
High-resolution image processing
Human image feature extraction
Vision task fine-tuning
Real-world scenario generalization
Use Cases
Computer vision
Human pose estimation
Utilizes pre-trained features for human pose recognition and analysis.
Face recognition
Extracts and recognizes facial features from high-resolution images.
Augmented reality
Virtual avatar generation
Used to generate realistic virtual human avatars.