Dianbing Xi1,2,*, Jiepeng Wang2,*,‡, Yuanzhi Liang2, Xi Qiu2, Yuchi Huo1, Rui Wang1,†, Chi Zhang2,†, Xuelong Li2,†
*Equal contribution. †Corresponding author. ‡Project leader.
1State Key Laboratory of CAD&CG, Zhejiang University
2Institute of Artificial Intelligence, China Telecom (TeleAI)
📄 Paper · 🌐 Project Page · 🤗 ModelScope
OmniVDiff enables controllable video generation and understanding in a unified video diffusion framework.
- Create a conda environment named `ovdiff`:

  ```bash
  conda create -n ovdiff python=3.10.9
  conda activate ovdiff
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Install our modified version of the `diffusers` library. Navigate to the `diffusers` directory and run (a quick import check is sketched after this list):

  ```bash
  pip install -e .
  ```
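To confirm that Python picks up the modified, editable `diffusers` install rather than a previously installed PyPI version, a minimal sanity check like the one below can help; it only prints the resolved version and install path.

```python
# Quick sanity check: verify the editable (modified) diffusers install is the
# one being imported. The printed path should point into this repository's
# diffusers/ directory, not into site-packages.
import diffusers

print("diffusers version:", diffusers.__version__)
print("loaded from:", diffusers.__file__)
```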
The OmniVDiff model weights are available on the ModelScope Hub.
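If you prefer to fetch the weights programmatically, the `modelscope` SDK's `snapshot_download` can be used as in the minimal sketch below. Note that the repository ID `TeleAI/OmniVDiff` and the `./checkpoints` cache directory are placeholders; substitute the actual model ID from the ModelScope page.

```python
# Minimal sketch: download the OmniVDiff weights from the ModelScope Hub.
# NOTE: "TeleAI/OmniVDiff" is a placeholder model ID; replace it with the
# repository ID listed on the project's ModelScope page.
from modelscope import snapshot_download

model_dir = snapshot_download("TeleAI/OmniVDiff", cache_dir="./checkpoints")
print(f"Weights downloaded to: {model_dir}")
```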
- Navigate to the `inference` directory:

  ```bash
  cd inference
  ```

- Run batch inference:

  ```bash
  python batch_infer.py
  ```

- To choose the conditioning modality, pass `--idx_cond_modality` (a small Python wrapper that runs the same sweep is sketched after this list):

  ```bash
  # -1: no condition, 0: rgb, 1: depth, 2: canny, 3: segment
  python batch_infer.py --idx_cond_modality -1 --output_dir "./output_cond=-1"
  python batch_infer.py --idx_cond_modality 0 --output_dir "./output_cond=0"
  python batch_infer.py --idx_cond_modality 1 --output_dir "./output_cond=1"
  python batch_infer.py --idx_cond_modality 2 --output_dir "./output_cond=2"
  python batch_infer.py --idx_cond_modality 3 --output_dir "./output_cond=3"
  ```
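If you would rather drive the sweep from Python (e.g., to add logging or skip modalities), a minimal wrapper around the same commands could look like the sketch below. It uses only the flags shown above and assumes it is run from the `inference` directory.

```python
# Convenience sketch: sweep batch inference over all conditioning modalities.
# Uses only the flags shown above (--idx_cond_modality, --output_dir); run it
# from the inference/ directory.
import subprocess

MODALITIES = {-1: "no condition", 0: "rgb", 1: "depth", 2: "canny", 3: "segment"}

for idx, name in MODALITIES.items():
    print(f"Running batch inference with condition modality {idx} ({name})")
    subprocess.run(
        [
            "python", "batch_infer.py",
            "--idx_cond_modality", str(idx),
            "--output_dir", f"./output_cond={idx}",
        ],
        check=True,
    )
```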
We provide an example configuration for training on 2 GPUs with `batch_size=1`.
You can modify the configuration file (`.yaml`) to adjust the number of GPUs for different hardware setups.
- Navigate to the `finetune` directory:

  ```bash
  cd finetune
  ```

- Enable cached latents before training. Before starting the actual training, enable the following option in `train.sh` to use cached latents:

  ```bash
  --check_cache "true"
  ```

  This will generate and store latent representations for faster training:

  ```bash
  bash train.sh
  ```

- Disable the option and start training. After the latent cache has been prepared, disable the option (set it to `"false"` or comment it out) and begin training:

  ```bash
  bash train.sh
  ```

We sincerely thank the developers of the open-source repositories our work builds on; their contributions have been invaluable to our research.
If you find our work helpful in your research, please consider citing it using the BibTeX entries below.
```bibtex
@article{xdb2025OmniVDiff,
  author  = {Xi, Dianbing and Wang, Jiepeng and Liang, Yuanzhi and Qiu, Xi and Huo, Yuchi and Wang, Rui and Zhang, Chi and Li, Xuelong},
  title   = {OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding},
  journal = {arXiv preprint arXiv:2504.10825},
  year    = {2025},
}

@misc{xdb2025CtrlVDiff,
  title         = {CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion},
  author        = {Dianbing Xi and Jiepeng Wang and Yuanzhi Liang and Xi Qiu and Jialun Liu and Hao Pan and Yuchi Huo and Rui Wang and Haibin Huang and Chi Zhang and Xuelong Li},
  year          = {2025},
  eprint        = {2511.21129},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2511.21129},
}
```