Authors: Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang, Zhiyang Dou, Cheng Lin, Sai-Kit Yeung, Yuan Liu

NeurIPS 2025
TrackingWorld is a novel approach for dense, world-centric 3D tracking from monocular videos. Our method estimates accurate camera poses and disentangles 3D trajectories of both static and dynamic components — not limited to a single foreground object. It supports dense tracking of nearly all pixels, enabling robust 3D scene understanding from monocular inputs.
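To make the world-centric formulation concrete, the sketch below illustrates the standard geometry such a pipeline builds on: a tracked pixel with an estimated depth is lifted into camera coordinates using the intrinsics, then mapped into a shared world frame with the camera-to-world pose. This is a generic illustration with our own function and variable names, not TrackingWorld's actual implementation.

```python
import numpy as np

def pixel_to_world(u, v, depth, K, cam_to_world):
    """Lift a tracked pixel (u, v) with estimated depth into world coordinates.

    Generic illustration of world-centric lifting; not TrackingWorld's code.
    K            : (3, 3) camera intrinsics
    cam_to_world : (4, 4) camera-to-world pose for this frame
    """
    # Unproject to camera coordinates: X_cam = depth * K^-1 [u, v, 1]^T
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    x_cam = depth * ray
    # Transform into the shared world frame
    x_world = cam_to_world @ np.append(x_cam, 1.0)
    return x_world[:3]

# Example: lift the same pixel observed under two camera poses. Once both
# observations live in the world frame, any remaining displacement is scene
# motion rather than camera motion.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pose_t0 = np.eye(4)                        # camera at the origin in frame 0
pose_t1 = np.eye(4); pose_t1[0, 3] = 0.1   # camera moved 10 cm along x in frame 1
print(pixel_to_world(320, 240, 2.0, K, pose_t0))
print(pixel_to_world(320, 240, 2.0, K, pose_t1))
```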
TrackingWorld relies on several visual foundation model repositories included as submodules for comprehensive preprocessing.
Use the --recursive flag to clone the main repository and all necessary submodules:
```bash
git clone --recursive https://github.com/IGL-HKUST/TrackingWorld.git
cd TrackingWorld
```

An installation script is provided; it has been tested with CUDA Toolkit 12.1 and Python 3.10:
```bash
conda create -n trackingworld python=3.10
conda activate trackingworld
bash scripts/install.sh
```

Download the model weights for the visual foundation models used in the pipeline:
```bash
bash scripts/download.sh
```

Our preprocessing uses GPT via the OpenAI API (minimal credit usage is expected). Please set your API key as an environment variable in a .env file:
echo "OPENAI_API_KEY=sk-your_api_key_here" > .envFind your API key here.
We've included the dog sequence from the DAVIS dataset as a demonstration. You can run the entire processing pipeline using the following convenience script:
```bash
bash scripts/demo.sh
```

The demo writes a comprehensive set of intermediate and final results to the data/demo_data/ directory, showing the progression from foundation model outputs to the final 4D representation. You can also download a preprocessed version of the results here.
```text
data/demo_data/
└── dog/                            # 🐾 Demo Sequence Name (e.g., DAVIS 'dog')
    ├── color/                      # Original RGB Images
    │   └── 00000.jpg, ...          # Sequential RGB frames
    ├── deva/                       # DEVA Model Outputs (Video Segmentation)
    │   └── pred.json, Annotations/, ...
    ├── ram/                        # RAM Model Outputs (Image Tagging)
    │   └── tags.json               # Contains RAM tags, GPT filtering results, and detected classes
    ├── unidepth/                   # Depth Estimation Results
    │   ├── depth.npy               # Raw depth maps
    │   └── intrinsics.npy          # Camera intrinsic parameters
    ├── gsm2/                       # GSM2 Model Outputs (Instance/Semantic Segmentation)
    │   └── mask/, vis/, ...
    ├── densetrack3d_efep/          # DenseTrack3D / CoTracker Outputs
    │   └── results.npz             # Dense tracklet data
    └── uni4d/                      # Final Uni4D Reconstruction Outputs
        └── experiment_name/        # Experiment Name (e.g., base_delta_ds2)
            ├── fused_track_4d_full.npz    # 🔑 Fused 4D Representation (Main Output)
            └── training_info.log          # Training metadata
```
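As a quick sanity check after running the demo, the snippet below loads a few of the files listed above with NumPy and prints their shapes and stored array names. The paths come from the directory listing; the internal schema of the .npz archives is not assumed here.

```python
# Quick sanity check of the demo outputs listed above (a sketch; adjust the
# experiment name if it differs from `base_delta_ds2`).
import numpy as np

root = "data/demo_data/dog"

depth = np.load(f"{root}/unidepth/depth.npy")        # per-frame depth maps
K = np.load(f"{root}/unidepth/intrinsics.npy")       # camera intrinsics
print("depth:", depth.shape, "intrinsics:", K.shape)

# .npz archives: print the stored array names without assuming their schema
tracks = np.load(f"{root}/densetrack3d_efep/results.npz")
fused = np.load(f"{root}/uni4d/base_delta_ds2/fused_track_4d_full.npz")
print("dense tracklet arrays:", tracks.files)
print("fused 4D arrays:", fused.files)
```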
To visualize the dense 4D trajectories and the reconstructed scene, run the provided visualization script, pointing it to the main output file:
```bash
python visualizer/vis_trackingworld.py --filepath data/demo_data/dog/uni4d/base_delta_ds2/fused_track_4d_full.npz
```

This visualization helps interpret the world-centric motion and the disentangled trajectories produced by TrackingWorld.
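If the interactive viewer cannot be used (for example on a headless server), a rough quick-look of the fused output can be rendered with matplotlib. This is only a sketch under an assumed schema: it picks the first array in the archive whose last dimension is 3 and treats it as XYZ coordinates, which may not match the actual key layout.

```python
import numpy as np
import matplotlib.pyplot as plt

fused = np.load("data/demo_data/dog/uni4d/base_delta_ds2/fused_track_4d_full.npz")
# Heuristic: take the first array whose last dimension is 3 (assumed XYZ);
# the real key names inside the archive are not documented here.
name = next(k for k in fused.files if fused[k].ndim >= 2 and fused[k].shape[-1] == 3)
pts = fused[name].reshape(-1, 3)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(pts[::50, 0], pts[::50, 1], pts[::50, 2], s=1)  # subsample for speed
ax.set_title(f"'{name}' from fused_track_4d_full.npz")
fig.savefig("fused_points_preview.png", dpi=150)
```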
We plan to release more features and data soon.
- Release demo code
- Provide evaluation benchmark and metrics
If you find TrackingWorld useful for your research or applications, please consider citing our paper:
```bibtex
@inproceedings{lu2025trackingworld,
title={TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels},
author={Jiahao Lu and Weitao Xiong and Jiacheng Deng and Peng Li and Tianyu Huang and Zhiyang Dou and Cheng Lin and Sai-Kit Yeung and Yuan Liu},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=vDV912fa3t}
}
```

Our codebase is based on Uni4D. Our preprocessing relies on DELTA, CoTrackerV3, UniDepth, Tracking-Anything-with-DEVA, Grounded-SAM-2, and Recognize-Anything. We thank the authors for their excellent work!