Yifan Wang¹*, Jianjun Zhou¹²³*, Haoyi Zhu¹, Wenzheng Chang¹, Yang Zhou¹, Zizun Li¹, Junyi Chen¹, Jiangmiao Pang¹, Chunhua Shen², Tong He¹³†

¹Shanghai AI Lab ²ZJU ³SII

\* Equal Contribution † Corresponding Author
π³ reconstructs visual geometry without a fixed reference view, achieving robust, state-of-the-art performance.
- [December 28, 2025] 🚀 Pi3X Released! We have upgraded the model to Pi3X. This improved version eliminates grid artifacts (smoother point clouds), supports conditional injection (camera pose, intrinsics, depth), and enables approximate metric scale reconstruction.
- [September 3, 2025] ⭐️ Training code is updated! See the `training` branch for details.
- [July 29, 2025] 📈 Evaluation code is released! See the `evaluation` branch for details.
- [July 16, 2025] 🚀 Hugging Face Demo and inference code are released!
We introduce π³, a feed-forward network that reconstructs visual geometry without relying on a conventional fixed reference view. In contrast to prior methods that anchor their reconstruction to a designated viewpoint, π³ uses a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps, treating every input view equally.
Building upon the original framework, we present Pi3X, an enhanced version focused on flexibility and reconstruction quality:
- Smoother Reconstruction: We replaced the original output head with a Convolutional Head, significantly reducing grid-like artifacts and producing much smoother point clouds.
- Flexible Conditioning: Pi3X supports the optional injection of camera poses, intrinsics, and depth. This allows for more controlled reconstruction when partial priors are available.
- Metric Scale: The model now supports metric scale reconstruction (approximate), moving beyond purely scale-invariant predictions.
Overall, Pi3X offers slightly better reconstruction quality than the original Pi3.
A key emergent property of our simple, bias-free design is the learning of a dense and structured latent representation of the camera pose manifold. Without complex priors or training schemes, π³ acquires this representation directly from data.
First, clone the repository and install the required packages.
git clone https://github.com/yyfz/Pi3.git
cd Pi3
pip install -r requirements.txt

Try our example inference script. You can run it on a directory of images or a video file.
If the automatic download from Hugging Face is slow, you can download the model checkpoint manually from here and specify its local path using the --ckpt argument.
# Run with the default example video
# python example.py # Inference with Pi3 (Original)
python example_mm.py # [New] Inference with Pi3X (Recommended)
# Run on your own data (image folder or .mp4 file)
# python example.py --data_path <path/to/data> # Pi3
python example_mm.py --data_path <path/to/data> # Pi3X

To utilize additional input modalities (e.g., camera poses, intrinsics, or depth), please refer to example_mm.py for specific data formatting details.
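For a rough picture of what such a conditions file might contain, the sketch below packs per-frame priors into a single `.npz`. The key names and shapes here are hypothetical, chosen only for illustration; the authoritative format is whatever `example_mm.py` actually reads, so consult that script before building your own file.

```python
import numpy as np

# Hypothetical sketch: pack per-frame priors for N frames of size H x W.
# Key names and shapes may differ from what example_mm.py expects -- check that script.
N, H, W = 32, 480, 640
conditions = {
    'camera_poses': np.tile(np.eye(4, dtype=np.float32), (N, 1, 1)),  # (N, 4, 4) camera-to-world
    'intrinsics':   np.tile(np.eye(3, dtype=np.float32), (N, 1, 1)),  # (N, 3, 3) pinhole K
    'depth':        np.zeros((N, H, W), dtype=np.float32),            # (N, H, W) metric depth
}
np.savez('examples/room/condition.npz', **conditions)
```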
Below is an example comparing reconstruction with and without condition injection. You can compare the resulting point clouds to observe the improvements brought by multimodal inputs.
# 1. Inference WITH conditioning (poses, intrinsics, etc.)
python example_mm.py --data_path examples/room/rgb --conditions_path examples/room/condition.npz --save_path examples/room_with_conditions.ply
# 2. Inference WITHOUT conditioning (image only)
python example_mm.py --data_path examples/room/rgb --save_path examples/room_no_conditions.ply

Optional Arguments:
- `--data_path`: Path to the input image directory or a video file. (Default: `examples/skating.mp4`)
- `--save_path`: Path to save the output `.ply` point cloud. (Default: `examples/result.ply`)
- `--interval`: Frame sampling interval. (Default: `1` for images, `10` for video)
- `--ckpt`: Path to a custom model checkpoint file.
- `--device`: Device to run inference on. (Default: `cuda`)
You can also launch a local Gradio demo for an interactive experience.
# Install demo-specific requirements
pip install -r requirements_demo.txt
# Launch the demo
python demo_gradio.py

The model takes a tensor of images and outputs a dictionary containing the reconstructed geometry.
- Input: A `torch.Tensor` of shape $B \times N \times 3 \times H \times W$ with pixel values in the range `[0, 1]`.
- Output: A `dict` with the following keys:
  - `points`: Global point cloud, obtained by unprojecting `local_points` with `camera_poses` (see the sketch below) (`torch.Tensor`, $B \times N \times H \times W \times 3$).
  - `local_points`: Per-view local point maps (`torch.Tensor`, $B \times N \times H \times W \times 3$).
  - `conf`: Confidence scores for local points (values in `[0, 1]` after `torch.sigmoid()`, higher is better) (`torch.Tensor`, $B \times N \times H \times W \times 1$).
  - `camera_poses`: Camera-to-world transformation matrices (4×4, OpenCV convention) (`torch.Tensor`, $B \times N \times 4 \times 4$).
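For intuition, the geometric outputs are related by a rigid transform: each per-view local point map lives in its own camera frame and is carried into the shared world frame by the corresponding camera-to-world matrix. The snippet below is a minimal sketch of that relationship, assuming the shapes listed above; applied to the model outputs, it should closely reproduce `results['points']` from `results['local_points']` and `results['camera_poses']`.

```python
import torch

def unproject_local_points(local_points: torch.Tensor,
                           camera_poses: torch.Tensor) -> torch.Tensor:
    """Map per-view local points into the shared world frame.

    local_points: (B, N, H, W, 3) points in each camera's frame.
    camera_poses: (B, N, 4, 4) camera-to-world matrices (OpenCV convention).
    Returns:      (B, N, H, W, 3) points in the world frame.
    """
    R = camera_poses[..., :3, :3]   # (B, N, 3, 3) rotation
    t = camera_poses[..., :3, 3]    # (B, N, 3) translation
    # x_world = R @ x_local + t, applied to every pixel of every view.
    world = torch.einsum('bnij,bnhwj->bnhwi', R, local_points)
    return world + t[:, :, None, None, :]
```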
Here is a minimal example of how to run the model on a batch of images.
import torch
# from pi3.models.pi3 import Pi3 # old version
from pi3.models.pi3x import Pi3X # new version (Recommended)
from pi3.utils.basic import load_images_as_tensor # Assuming you have a helper function
# --- Setup ---
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# model = Pi3.from_pretrained("yyfz233/Pi3").to(device).eval()
model = Pi3X.from_pretrained("yyfz233/Pi3X").to(device).eval()
# or download checkpoints from `https://huggingface.co/yyfz233/Pi3/resolve/main/model.safetensors`
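# If you downloaded the checkpoint manually, you could also load it from disk.
# (Illustrative sketch, not from the official docs: it assumes the file is a plain
#  safetensors state dict and that Pi3X() builds with its default arguments.)
# from safetensors.torch import load_file
# model = Pi3X().to(device).eval()
# model.load_state_dict(load_file('path/to/model.safetensors'))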
# --- Load Data ---
# Load a sequence of N images into a tensor
# imgs shape: (N, 3, H, W).
# imgs value: [0, 1]
imgs = load_images_as_tensor('path/to/your/data', interval=10).to(device)
# --- Inference ---
print("Running model inference...")
# Use mixed precision for better performance on compatible GPUs
dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
with torch.no_grad():
    with torch.amp.autocast('cuda', dtype=dtype):
        # Add a batch dimension -> (1, N, 3, H, W)
        results = model(imgs[None])
print("Reconstruction complete!")
# Access outputs: results['points'], results['camera_poses'], and results['local_points'].

Our work builds upon several fantastic open-source projects, and we'd like to express our gratitude to their authors.
If you find our work useful, please consider citing:
@article{wang2025pi,
title={$\pi^3$: Permutation-Equivariant Visual Geometry Learning},
author={Wang, Yifan and Zhou, Jianjun and Zhu, Haoyi and Chang, Wenzheng and Zhou, Yang and Li, Zizun and Chen, Junyi and Pang, Jiangmiao and Shen, Chunhua and He, Tong},
journal={arXiv preprint arXiv:2507.13347},
year={2025}
}

For academic use, this project is licensed under the 2-clause BSD License. See the LICENSE file for details. For commercial use, please contact the authors.