Pi3 (π³) Open-source Visual Geometry Learning Model - Revolutionizing Visual Geometry Reconstruction Methods with Scalable Applications

Pi3

Developed by yyfz233

π³ is a scalable, permutation-equivariant visual geometry learning model that reconstructs visual geometry without relying on a fixed reference view.

3D Vision

PyTorch

#Permutation-equivariant geometric learning #Reference-frame-free reconstruction #Multi-view 3D reconstruction

Downloads 229

Release Time: 7/14/2025

Model Overview

By eliminating the need for a fixed reference view and adopting a fully permutation-equivariant architecture, π³ can directly predict affine-invariant camera poses and scale-invariant local point maps from an unordered set of images. It is robust to the input order and highly scalable.

Model Features

Permutation equivariance

Adopting a fully permutation-equivariant architecture, it is robust to the input order and does not require a fixed reference view.

Highly scalable

The simple, bias-free design can handle large, unordered image sets.

Affine invariance

It can directly predict affine-invariant camera poses and scale-invariant local point maps.
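Scale invariance means a predicted point map is only defined up to an unknown global scale, so comparing it with metric geometry requires a scale alignment first. The following is a minimal sketch of one common choice, a closed-form least-squares scale. It is an illustration under stated assumptions, not the alignment procedure used by π³; the function names `optimal_scale` and `apply_scale` and the toy data are hypothetical.

```python
def optimal_scale(pred, target):
    """Least-squares scale s minimising sum ||s * p - g||^2 over paired 3D points."""
    num = sum(p[k] * g[k] for p, g in zip(pred, target) for k in range(3))
    den = sum(p[k] * p[k] for p in pred for k in range(3))
    if den == 0.0:
        raise ValueError("predicted points are all zero; scale is undefined")
    return num / den


def apply_scale(points, s):
    """Multiply every coordinate of every point by the scalar s."""
    return [tuple(s * c for c in p) for p in points]


# Toy data (hypothetical): the "prediction" is the reference geometry at half scale.
reference = [(0.0, 0.0, 2.0), (1.0, 0.0, 4.0), (0.0, 1.0, 6.0)]
prediction = [(0.0, 0.0, 1.0), (0.5, 0.0, 2.0), (0.0, 0.5, 3.0)]

scale = optimal_scale(prediction, reference)
aligned = apply_scale(prediction, scale)
print(scale)    # recovered scale factor
print(aligned)  # prediction after alignment
```

Setting the derivative of the squared error with respect to s to zero gives s = Σ p·g / Σ p·p, which is what `optimal_scale` computes.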

Model Capabilities

Camera pose estimation

Monocular depth estimation

Video depth estimation

Dense point cloud estimation

Use Cases

3D reconstruction

Reconstruct 3D scenes from videos

Use video frames as input to reconstruct 3D point cloud scenes.

Achieve state-of-the-art reconstruction performance

Reconstruct from unordered image sets

Reconstruct 3D scenes from an unordered set of images without a fixed reference view.

Robust to the input order

🚀 🌌 $\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning

$\pi^3$ is a novel feed-forward neural network that revolutionizes visual geometry reconstruction by eliminating the need for a fixed reference view, achieving robust and state-of-the-art performance.

✨ Features

We introduce $\pi^3$ (Pi-Cubed), a novel feed-forward neural network that revolutionizes visual geometry reconstruction by eliminating the need for a fixed reference view. Traditional methods, which rely on a designated reference frame, are often prone to instability and failure if the reference is suboptimal.

In contrast, $\pi^3$ employs a fully permutation-equivariant architecture. This allows it to directly predict affine-invariant camera poses and scale-invariant local point maps from an unordered set of images, breaking free from the constraints of a reference frame. This design makes our model inherently robust to input ordering and highly scalable.
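To make the term concrete: a function f over a set of views is permutation-equivariant if reordering the inputs reorders the outputs in the same way, i.e. f(P·x) = P·f(x). The toy sketch below checks this property on scalar "view features" and contrasts it with a layer that treats the first input as a reference. Both layers are hypothetical stand-ins chosen for illustration and are not the π³ architecture.

```python
import math


def toy_set_layer(features):
    """Equivariant toy layer: each output is its own feature plus the mean of all features."""
    context = sum(features) / len(features)
    return [f + context for f in features]


def toy_reference_layer(features):
    """Non-equivariant toy layer: every output is expressed relative to the FIRST input."""
    return [f - features[0] for f in features]


def permute(items, order):
    """Reorder items so that position i holds items[order[i]]."""
    return [items[i] for i in order]


def is_equivariant(layer, features, order):
    """Check layer(permute(x)) == permute(layer(x)) up to floating-point tolerance."""
    a = layer(permute(features, order))
    b = permute(layer(features), order)
    return all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(a, b))


views = [0.5, -1.0, 2.0, 3.5]  # hypothetical per-view features
order = [2, 0, 3, 1]           # an arbitrary reordering of the views

set_layer_ok = is_equivariant(toy_set_layer, views, order)
reference_layer_ok = is_equivariant(toy_reference_layer, views, order)
print(set_layer_ok, reference_layer_ok)
```

The reference-based layer fails the check because changing which view comes first changes every output, which is the kind of order dependence a reference-free design avoids.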

A key emergent property of our simple, bias-free design is the learning of a dense and structured latent representation of the camera pose manifold. Without complex priors or training schemes, $\pi^3$ achieves state-of-the-art performance 🏆 on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map estimation.

🚀 Quick Start

1. Clone & Install Dependencies

First, clone the repository and install the required packages.

git clone https://github.com/yyfz/Pi3.git
cd Pi3
pip install -r requirements.txt

2. Run Inference from Command Line

Try our example inference script. You can run it on a directory of images or a video file.

If the automatic download from Hugging Face is slow, you can download the model checkpoint manually from https://huggingface.co/yyfz233/Pi3/resolve/main/model.safetensors and pass its local path with the --ckpt argument.

# Run with default example video
python example.py

# Run on your own data (image folder or .mp4 file)
python example.py --data_path <path/to/your/images_dir_or_video.mp4>

Optional Arguments:

--data_path: Path to the input image directory or a video file. (Default: examples/skating.mp4)
--save_path: Path to save the output .ply point cloud. (Default: examples/result.ply)
--interval: Frame sampling interval. (Default: 1 for images, 10 for video)
--ckpt: Path to a custom model checkpoint file.
--device: Device to run inference on. (Default: cuda)
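As a reading aid, the documented flags and defaults can be mirrored in a small argparse sketch. This is a hypothetical reconstruction from the list above, not the actual contents of example.py; in particular, representing the type-dependent --interval default as `None` and resolving it by file extension is an assumption, as are the helper names `build_parser`, `resolve_interval`, and `sample_frames`.

```python
import argparse

# Assumption: only .mp4 is named in the documentation, so only it counts as video here.
VIDEO_EXTENSIONS = (".mp4",)


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI flags")
    parser.add_argument("--data_path", default="examples/skating.mp4")
    parser.add_argument("--save_path", default="examples/result.ply")
    parser.add_argument("--interval", type=int, default=None)  # None = pick by input type
    parser.add_argument("--ckpt", default=None)
    parser.add_argument("--device", default="cuda")
    return parser


def resolve_interval(args):
    """Use the explicit interval if given, else 10 for video input and 1 for images."""
    if args.interval is not None:
        return args.interval
    return 10 if args.data_path.lower().endswith(VIDEO_EXTENSIONS) else 1


def sample_frames(frames, interval):
    """Keep every interval-th frame, starting with the first."""
    return frames[::interval]


default_args = build_parser().parse_args([])
image_args = build_parser().parse_args(["--data_path", "my_images/"])
print(resolve_interval(default_args), resolve_interval(image_args))
print(sample_frames(list(range(25)), resolve_interval(default_args)))
```

A larger interval trades temporal density for fewer input views.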

3. Run with Gradio Demo

You can also launch a local Gradio demo for an interactive experience.

# Install demo-specific requirements
pip install -r requirements_demo.txt

# Launch the demo
python demo_gradio.py

📚 Documentation

Model Input & Output

The model takes a tensor of images and outputs a dictionary containing the reconstructed geometry.

Input: A torch.Tensor of shape $B \times N \times 3 \times H \times W$ with pixel values in the range [0, 1].
Output: A dict with the following keys:
- points: Global point cloud, obtained by transforming local_points into the world frame with camera_poses (torch.Tensor, $B \times N \times H \times W \times 3$).
- local_points: Per-view local point maps (torch.Tensor, $B \times N \times H \times W \times 3$).
- conf: Confidence scores for local points (values in [0, 1], higher is better) (torch.Tensor, $B \times N \times H \times W \times 1$).
- camera_poses: Camera-to-world transformation matrices (4x4 in OpenCV format) (torch.Tensor, $B \times N \times 4 \times 4$).
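The relationship between local_points, camera_poses, and points can be illustrated for a single pixel: a camera-to-world matrix maps a point in camera coordinates to world coordinates via x_world = R·x_local + t. The sketch below uses plain nested lists rather than tensors, and the pose values and the helper name `camera_to_world` are made up for illustration.

```python
def camera_to_world(pose, local_point):
    """Apply a 4x4 camera-to-world matrix (nested lists) to a 3D point in camera coordinates."""
    x, y, z = local_point
    return tuple(
        pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
        for r in range(3)
    )


# Hypothetical pose: rotate 90 degrees about the Y axis, then translate by (1, 0, 2).
pose = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 1.0],
]

# In the OpenCV convention the camera looks along +Z, so this point is 3 units in front.
local_point = (0.0, 0.0, 3.0)

world_point = camera_to_world(pose, local_point)
print(world_point)  # (4.0, 0.0, 2.0)
```

Applying each view's pose to every pixel of its local point map in this way places all views in one shared world frame.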

💻 Usage Examples

Basic Usage

import torch
from pi3.models.pi3 import Pi3
from pi3.utils.basic import load_images_as_tensor # Assuming you have a helper function

# --- Setup ---
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Pi3.from_pretrained("yyfz233/Pi3").to(device).eval()
# or download checkpoints from `https://huggingface.co/yyfz233/Pi3/resolve/main/model.safetensors`

# --- Load Data ---
# Load a sequence of N images into a tensor
# imgs shape: (N, 3, H, W).
# imgs value: [0, 1]
imgs = load_images_as_tensor('examples/skating.mp4', interval=10).to(device)

# --- Inference ---
print("Running model inference...")
# Use mixed precision for better performance on compatible GPUs
dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

with torch.no_grad():
    with torch.amp.autocast('cuda', dtype=dtype):
        # Add a batch dimension -> (1, N, 3, H, W)
        results = model(imgs[None])

print("Reconstruction complete!")
# Access outputs: results['points'], results['camera_poses'] and results['local_points'].
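A typical next step is to drop low-confidence points and save the rest as a point cloud. The sketch below shows that idea on plain Python lists with an ASCII PLY writer. It is not the export code used by example.py; the helper names are hypothetical and the 0.1 threshold is an arbitrary assumption to tune per scene.

```python
import os
import tempfile


def filter_by_confidence(points, confidences, threshold):
    """Keep only the points whose confidence is strictly above the threshold."""
    return [p for p, c in zip(points, confidences) if c > threshold]


def write_ascii_ply(path, points):
    """Write (x, y, z) tuples as a minimal ASCII PLY file with vertex positions only."""
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ]
    with open(path, "w", encoding="ascii") as fh:
        fh.write("\n".join(header) + "\n")
        for x, y, z in points:
            fh.write(f"{x} {y} {z}\n")


# Toy stand-ins for flattened results['points'] and results['conf'] values.
points = [(0.0, 0.0, 1.0), (0.5, 0.2, 2.0), (9.0, 9.0, 9.0)]
confidences = [0.9, 0.6, 0.02]

kept = filter_by_confidence(points, confidences, threshold=0.1)  # 0.1 is an assumption
out_path = os.path.join(tempfile.mkdtemp(), "result.ply")
write_ascii_ply(out_path, kept)
print(f"wrote {len(kept)} points to {out_path}")
```

With real model outputs, the tensors would first need to be flattened to lists of points and scores, for example with reshape and tolist; for large scenes a binary PLY writer or a dedicated point cloud library is likely a better fit.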

🙏 Acknowledgements

Our work builds upon several fantastic open-source projects. We'd like to express our gratitude to the authors of:

📜 Citation

If you find our work useful, please consider citing:

@misc{wang2025pi3,
      title={$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning}, 
      author={Yifan Wang and Jianjun Zhou and Haoyi Zhu and Wenzheng Chang and Yang Zhou and Zizun Li and Junyi Chen and Jiangmiao Pang and Chunhua Shen and Tong He},
      year={2025},
      eprint={2507.13347},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.13347}, 
}

📄 License

For academic use, this project is licensed under the 2-clause BSD License. See the LICENSE file for details. For commercial use, please contact the authors.