Tele-AI/OmniVDiff

OmniVDiff:
Omni Controllable Video Diffusion for Generation and Understanding

Dianbing Xi1,2,*, Jiepeng Wang2,*,‡, Yuanzhi Liang2, Xi Qiu2, Yuchi Huo1, Rui Wang1,†, Chi Zhang2,†, Xuelong Li2,†

*Equal contribution.   †Corresponding author.   ‡Project leader.

1State Key Laboratory of CAD&CG, Zhejiang University
2Institute of Artificial Intelligence, China Telecom (TeleAI)

📄 Paper   ·   🌐 Project Page   ·   🤗 ModelScope

AAAI 2026

📌 Intro

OmniVDiff enables controllable video generation and understanding in a unified video diffusion framework.

📦 Environment Setup

  1. Create a conda environment named ovdiff:

    conda create -n ovdiff python=3.10.9
    conda activate ovdiff
  2. Install the required packages:

    pip install -r requirements.txt
  3. Install our modified version of the diffusers library.
    Navigate to the diffusers directory and run:

    pip install -e .
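After step 3, you can sanity-check that Python resolves the editable install. This snippet is a generic check, not part of the repository:

```python
# Verify that the locally installed diffusers package is importable and
# report where it was resolved from (an editable install should point
# into the cloned diffusers directory rather than site-packages).
import importlib.util

spec = importlib.util.find_spec("diffusers")
if spec is None:
    print("diffusers not found -- re-run 'pip install -e .' in the diffusers dir")
else:
    print(f"diffusers resolved from: {spec.origin}")
```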

🤗 Model Zoo

OmniVDiff is available in the ModelScope Hub.

🔍 Inference

  1. Navigate to the inference directory:

    cd inference
  2. Run batch inference:

    python batch_infer.py
    # -1 no condition, 0:rgb, 1:depth, 2:canny, 3:segment
    python batch_infer.py --idx_cond_modality -1 --output_dir "./output_cond=-1"
    python batch_infer.py --idx_cond_modality 0 --output_dir "./output_cond=0"
    python batch_infer.py --idx_cond_modality 1 --output_dir "./output_cond=1"
    python batch_infer.py --idx_cond_modality 2 --output_dir "./output_cond=2"
    python batch_infer.py --idx_cond_modality 3 --output_dir "./output_cond=3"
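The `--idx_cond_modality` flag selects the conditioning modality, per the comment above. A minimal sketch that reproduces those five commands (the dict and helper are illustrative, not part of the repo):

```python
# Map each --idx_cond_modality value to its conditioning modality
# (taken from the comment in the commands above).
COND_MODALITIES = {-1: "no condition", 0: "rgb", 1: "depth", 2: "canny", 3: "segment"}

def infer_command(idx: int) -> str:
    """Build the batch_infer.py invocation for one conditioning modality."""
    return f'python batch_infer.py --idx_cond_modality {idx} --output_dir "./output_cond={idx}"'

for idx, name in COND_MODALITIES.items():
    print(f"{name:>12}: {infer_command(idx)}")
```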

🏋️‍♂️ Training

We provide an example configuration for training on 2 GPUs with batch_size=1.
You can modify the configuration file (.yaml) to adjust the number of GPUs for different hardware setups.

  1. Navigate to the finetune directory:

    cd finetune
  2. Enable cached latents before training. Before starting the actual training, enable the following option in train.sh to use cached latents:

    -check_cache "true"

    This will generate and store latent representations for faster training.

    bash train.sh
  3. Disable the option and start training. After the latent cache has been prepared, disable the option (set it to "false" or comment it out) and begin training:

    bash train.sh
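The flag toggle between the two phases can be scripted. A minimal sketch, assuming the flag appears in the launch line exactly as spelled above (the helper and the example launch line are hypothetical; in the repo the flag lives in train.sh):

```python
import re

def set_check_cache(launch_line: str, enabled: bool) -> str:
    """Return launch_line with its -check_cache flag set to "true" or "false"."""
    value = "true" if enabled else "false"
    return re.sub(r'-check_cache "(?:true|false)"', f'-check_cache "{value}"', launch_line)

line = 'python train.py -check_cache "true"'   # phase 1: build the latent cache
line = set_check_cache(line, enabled=False)    # phase 2: actual training
print(line)  # python train.py -check_cache "false"
```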

🙏 Acknowledgements

We sincerely thank the developers of the open-source repositories our work builds upon; their contributions have been invaluable to our research.

📜 Citation

If you find our work helpful in your research, please consider citing it using the BibTeX entry below.

@article{xdb2025OmniVDiff,
  author    = {Xi, Dianbing and Wang, Jiepeng and Liang, Yuanzhi and Qiu, Xi and Huo, Yuchi and Wang, Rui and Zhang, Chi and Li, Xuelong},
  title     = {OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding},
  journal   = {arXiv preprint arXiv:2504.10825},
  year      = {2025},
}

@misc{xdb2025CtrlVDiff,
  author    = {Xi, Dianbing and Wang, Jiepeng and Liang, Yuanzhi and Qiu, Xi and Liu, Jialun and Pan, Hao and Huo, Yuchi and Wang, Rui and Huang, Haibin and Zhang, Chi and Li, Xuelong},
  title     = {CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion},
  year      = {2025},
  eprint    = {2511.21129},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url       = {https://arxiv.org/abs/2511.21129},
}
