Pytorch implementation for our ICME2025 submission "STSA: Spatial-Temporal Semantic Alignment for Facial Visual Dubbing".

SCAILab-USTC/STSA

STSA: Spatial-Temporal Semantic Alignment for Facial Visual Dubbing

PyTorch implementation for our ICME 2025 submission "STSA: Spatial-Temporal Semantic Alignment for Facial Visual Dubbing".

YouTube framework

News: 🎉 This paper was selected as an ICME 2025 Oral!

Todo:

  • inference code
  • paper & supplementary material
  • youtube demo
  • training code
  • fine-tuning code

Demo:

Multilingual Generation

chinese.mp4
korean.mp4
japanese.mp4
spanish.mp4

Long Video Generation Compared with SOTA Methods

We compare our method with DiffTalk (CVPR'23), DINet (AAAI'23), IP-LAP (CVPR'23), MuseTalk (arXiv 2024), PC-AVS (CVPR'21), TalkLip (CVPR'23), and Wav2Lip (MM'20).

Ours.mp4
DiffTalk.mp4
DINet.mp4
IP-LAP.mp4
MuseTalk.mp4
PC-AVS.mp4
TalkLIp.mp4
Wav2Lip.mp4

Inference:

Requirements

  • Python 3.8.7
  • torch 1.12.1
  • torchvision 0.13.1
  • librosa 0.9.2
  • ffmpeg

Prepare Environment

First, create a conda environment:

conda create -n stsa python=3.8
conda activate stsa

PyTorch 1.12.1 is used; the other requirements are listed in requirements.txt. Install them with:

pip install -r requirements.txt
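After installation, it can help to confirm that the environment matches the versions listed under Requirements. The following is a minimal sketch, not part of the repository: the function name check_environment is hypothetical, and it only reports what it finds rather than enforcing anything. Note that an installed torch build may carry a suffix such as a CUDA tag after the version number.

```python
import shutil
import sys
from importlib import metadata

# Versions listed under "Requirements" in this README.
EXPECTED = {"torch": "1.12.1", "torchvision": "0.13.1", "librosa": "0.9.2"}


def check_environment(expected=EXPECTED):
    """Return a report mapping each requirement to (found, expected)."""
    report = {"python": (sys.version.split()[0], "3.8.7")}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed in this environment
        report[name] = (found, wanted)
    # ffmpeg is a command-line tool, so look for it on PATH instead.
    report["ffmpeg"] = (shutil.which("ffmpeg"), "on PATH")
    return report


if __name__ == "__main__":
    for name, (found, wanted) in check_environment().items():
        print(f"{name}: found {found}, expected {wanted}")
```

A value of None in the "found" column means the package or tool could not be located.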

Quick Start

Download the pretrained weights and put them under ./checkpoints. Then run the following command:

python inference.py --video_path "demo_templates/video/speakerine.mp4" --audio_path "demo_templates/audio/education.wav"

Set the --video_path and --audio_path options to run inference on other videos and audio files.
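To dub one template video with several audio files, the command above can be wrapped in a small loop. This is a sketch under assumptions: the helper names build_inference_command and run_batch are hypothetical, only the two flags shown in Quick Start are used, and the script is assumed to be launched from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(video_path, audio_path):
    """Build the inference.py command line shown in Quick Start."""
    return [
        sys.executable, "inference.py",
        "--video_path", str(video_path),
        "--audio_path", str(audio_path),
    ]


def run_batch(video_path, audio_dir, dry_run=True):
    """Dub one video with every .wav file found in audio_dir."""
    commands = [
        build_inference_command(video_path, wav)
        for wav in sorted(Path(audio_dir).glob("*.wav"))
    ]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))  # only show what would be executed
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands
```

For example, run_batch("demo_templates/video/speakerine.mp4", "demo_templates/audio") prints one command per audio file; pass dry_run=False to actually execute them.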

Training:

Dataset Pre-process

  1. Download the LRS2 dataset and move the LRS2/mvlrs_v1/main/ folder into the ./processed_lrs2 folder.

  2. Extract audio from the LRS2 videos by running:

python preprocess/preprocess_audio.py --data_root ./processed_lrs2/main/ --out_root ./processed_lrs2/lrs2_audio

  3. Extract Wav2Vec 2.0 features by running:

python preprocess/extract_wav2vec_feature.py

  4. Extract faces, sketches, and landmarks by running:

python preprocess/preprocess_face.py

  5. Convert the sketches into heatmaps by running:

python preprocess/preprocess_heatmap.py
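The four preprocessing commands can also be chained so that a failure in one step stops the rest. The sketch below is an illustration rather than a script shipped with the repository: run_preprocessing and its runner parameter are hypothetical names, and the commands are copied verbatim from the steps in this section.

```python
import subprocess
import sys

# The four preprocessing commands from this section, in order.
PREPROCESS_STEPS = [
    ["preprocess/preprocess_audio.py",
     "--data_root", "./processed_lrs2/main/",
     "--out_root", "./processed_lrs2/lrs2_audio"],
    ["preprocess/extract_wav2vec_feature.py"],
    ["preprocess/preprocess_face.py"],
    ["preprocess/preprocess_heatmap.py"],
]


def run_preprocessing(steps=PREPROCESS_STEPS, runner=subprocess.run):
    """Run each step in order; check=True aborts on a non-zero exit code."""
    for args in steps:
        cmd = [sys.executable, *args]
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)

# Call run_preprocessing() from the repository root once LRS2 is in place.
```

The runner argument exists only so the sequence can be exercised without launching the real scripts.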

After preprocessing, the processed_lrs2 folder has the following structure:

./processed_lrs2/
├── lrs2_audio/
├── lrs2_face/
├── lrs2_heat_img_lower/
├── lrs2_heat_img_upper/
├── lrs2_heat_img_whole/
├── lrs2_heatmap_lower/
├── lrs2_heatmap_upper/
├── lrs2_heatmap_whole/
├── lrs2_landmarks/
├── lrs2_sketch_lower/
├── lrs2_sketch_upper/
├── lrs2_sketch_whole/
└── main/
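Before starting training, a quick check that every folder in the tree above exists can catch a skipped preprocessing step. This is a minimal sketch; missing_dirs is a hypothetical helper, and it checks only for the presence of the directories, not their contents.

```python
from pathlib import Path

# Subfolders listed in the directory tree above.
EXPECTED_DIRS = [
    "lrs2_audio", "lrs2_face",
    "lrs2_heat_img_lower", "lrs2_heat_img_upper", "lrs2_heat_img_whole",
    "lrs2_heatmap_lower", "lrs2_heatmap_upper", "lrs2_heatmap_whole",
    "lrs2_landmarks",
    "lrs2_sketch_lower", "lrs2_sketch_upper", "lrs2_sketch_whole",
    "main",
]


def missing_dirs(root="./processed_lrs2", expected=EXPECTED_DIRS):
    """Return the expected subfolders that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    print("All folders present." if not missing else f"Missing: {missing}")
```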

Stage1: Heatmap Predictor Training

Run the following command. During training, adjust the learning rate lr (line 56 of train_stage1.py) to 1e-5 at step 75k and to 1e-6 at step 130k.

python train_stage1.py
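The README describes changing the learning rate by hand. As an alternative, the same schedule could be expressed as a step-dependent function. The sketch below is an assumption about how that might look, not the repository's code: stage1_lr and apply_lr are hypothetical names, and the starting learning rate is not stated in this README, so it is left as the base_lr parameter.

```python
def stage1_lr(step, base_lr):
    """Piecewise-constant learning rate for the stage-1 schedule.

    base_lr is whatever value train_stage1.py starts with; it is not
    given in this README, so the caller must supply it.
    """
    if step >= 130_000:
        return 1e-6
    if step >= 75_000:
        return 1e-5
    return base_lr


def apply_lr(optimizer, lr):
    """Set lr on every parameter group of a torch.optim optimizer."""
    for group in optimizer.param_groups:
        group["lr"] = lr
```

Inside a training loop this would be called once per step, for example apply_lr(optimizer, stage1_lr(global_step, base_lr)), where optimizer and global_step stand for whatever the training script actually uses.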

Stage2: Face Synthesizer Training

First, download the pretrained SyncNet weights from here and put them under ./checkpoints/syncnet/. Then run the following command:

python train_stage2.py

Stage3: End-to-end Fine-tuning

In train_stage3.py, set finetune_path (line 60) and finetune_path_disc (line 61) to the paths of the face synthesizer and discriminator weights you trained in stage 2, and set heatmap_finetune_path (line 62) to the path of the heatmap predictor weights you trained in stage 1. Then run the following command:

python train_stage3.py
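Since stage 3 depends on three checkpoints from the earlier stages, it may be worth confirming that the files exist before a long run starts. The helper below is a hypothetical sketch; only the three variable names come from train_stage3.py as described above, and the checkpoint file names are whatever your stage 1 and stage 2 runs produced.

```python
from pathlib import Path


def check_stage3_paths(finetune_path, finetune_path_disc, heatmap_finetune_path):
    """Return the names of the stage-3 variables whose files are missing."""
    paths = {
        "finetune_path": finetune_path,                  # stage-2 face synthesizer
        "finetune_path_disc": finetune_path_disc,        # stage-2 discriminator
        "heatmap_finetune_path": heatmap_finetune_path,  # stage-1 heatmap predictor
    }
    return [name for name, p in paths.items() if not Path(p).is_file()]
```

An empty list means all three checkpoints were found.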

Acknowledgements:

We thank IP-LAP, Wav2Lip, DINet, LAB and DIM for making their open-source resources available, which supported the development of this work.

Citation:

If you find this project useful, please consider citing us:

@article{ding2025stsa,
  title={STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing},
  author={Ding, Zijun and Xiong, Mingdie and Zhu, Congcong and Chen, Jingrun},
  journal={arXiv preprint arXiv:2503.23039},
  year={2025}
}
