[wan2.2] follow-up #12024

Open · wants to merge 7 commits into main

Conversation

@yiyixuxu (Collaborator) commented Jul 30, 2025

To use only the high-noise stage (transformer), set boundary_ratio to 0.
To use only the low-noise stage (transformer_2), set boundary_ratio to 1.
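
As a rough illustration (a sketch based on the slow-test example further down, not an exhaustive API reference), the two single-stage configurations look like this:

```python
# Minimal sketch: boundary_ratio=1.0 puts the boundary at timestep 1000, so the
# high-noise stage never runs and the high-noise transformer need not be loaded.
import torch
from diffusers import WanImageToVideoPipeline

low_noise_only = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16,
    transformer=None,      # skip loading the high-noise transformer
    boundary_ratio=1.0,    # every timestep falls in the low-noise stage
)

# The symmetric setup (assumed, not covered by the slow test below):
# high_noise_only = WanImageToVideoPipeline.from_pretrained(
#     "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
#     torch_dtype=torch.bfloat16,
#     transformer_2=None,   # skip loading the low-noise transformer
#     boundary_ratio=0.0,   # every timestep falls in the high-noise stage
# )
```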

Two-Stage Denoising Loop

boundary_ratio = 0.9 (90%)

Timestep:     1000 ─────── 900 ──────────────────────────────► 0
Noise Level:  High ──────────────────────────────────────────► Low
                              │
                  boundary_timestep (0.9 * 1000 = 900)
                              │
              ┌───────────────┼─────────────────────────────────┐
              │               │                                 │
         HIGH NOISE STAGE  BOUNDARY              LOW NOISE STAGE
        (t >= 900)            │                   (t < 900)
              │               │                                 │
        Uses: transformer     │               Uses: transformer_2
        Scale: guidance_scale │               Scale: guidance_scale_2
              │               │                                 │
              └───────────────┼─────────────────────────────────┘
                              │
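
A minimal sketch of the selection logic this diagram describes (variable names are illustrative, not the pipeline's internal names):

```python
# Illustrative only: how a two-stage denoising loop can pick the model and
# guidance scale at each timestep from boundary_ratio.
num_train_timesteps = 1000
boundary_ratio = 0.9
boundary_timestep = boundary_ratio * num_train_timesteps  # 900

def select_stage(t, transformer, transformer_2, guidance_scale, guidance_scale_2):
    if t >= boundary_timestep:
        return transformer, guidance_scale      # high-noise stage
    return transformer_2, guidance_scale_2      # low-noise stage
```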

boundary_ratio = 1.0 (100% - Single Stage)

Timestep:     1000 ────────────────────────────────────────────► 0
Noise Level:  High ────────────────────────────────────────────► Low

boundary_timestep (1.0 * 1000 = 1000)

              ┌─────────────────────────────────────────────────┐
              │                                                 │
              │           LOW NOISE STAGE (ONLY)                │
              │                                                 │
              │           Uses: transformer_2 (only)            │
              │           Scale: guidance_scale_2 (only)        │
              │                                                 │
              └─────────────────────────────────────────────────┘

Stage Breakdown

boundary_ratio = 0.9

| Stage | Timestep Range | Model Used | Guidance Scale | Duration |
| --- | --- | --- | --- | --- |
| High Noise | t >= 900 | transformer | guidance_scale | ~10% of steps |
| Low Noise | t < 900 | transformer_2 | guidance_scale_2 | ~90% of steps |

boundary_ratio = 1.0

| Stage | Timestep Range | Model Used | Guidance Scale | Duration |
| --- | --- | --- | --- | --- |
| High Noise | t >= 1000 (never true) | N/A | N/A | 0% of steps |
| Low Noise (Only) | t < 1000 (always true) | transformer_2 | guidance_scale_2 | 100% of steps |
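
As a back-of-the-envelope check of the step split above (assuming timesteps are spread roughly evenly between 1000 and 0; the actual scheduler spacing may differ):

```python
# Rough check of the ~10% / ~90% split for boundary_ratio = 0.9 with 40 steps,
# under an evenly spaced timestep assumption.
num_inference_steps = 40
boundary_timestep = 0.9 * 1000

timesteps = [1000 * (1 - i / num_inference_steps) for i in range(num_inference_steps)]
high_noise_steps = sum(t >= boundary_timestep for t in timesteps)
low_noise_steps = num_inference_steps - high_noise_steps
print(high_noise_steps, low_noise_steps)  # 5 and 35 under this spacing assumption
```
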
# wan 2.2 i2v slow test (using only the low-noise stage transformer_2)
import torch
from diffusers import WanImageToVideoPipeline, ModularPipeline
from diffusers.utils import export_to_video


model_id = "Wan-AI/Wan2.2-I2V-A14B-Diffusers"
dtype = torch.bfloat16
device = "cuda:1"

pipe = WanImageToVideoPipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    transformer=None,     # skip loading the high-noise transformer
    boundary_ratio=1.0,   # all timesteps fall in the low-noise stage (transformer_2)
)
pipe.to(device)


image_processor = ModularPipeline.from_pretrained("YiYiXu/WanImageProcessor14B", trust_remote_code=True)
image = image_processor(
    image="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG",
    output="processed_image"
)


prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

# standard Wan negative prompt (in Chinese): vivid colors, overexposed, static, blurry details,
# subtitles, style, artwork, painting, still frame, overall gray, worst quality, low quality,
# JPEG compression artifacts, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn
# face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered
# background, three legs, crowded background, walking backwards
negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
generator = torch.Generator(device=device).manual_seed(0)

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=image.height,
    width=image.width,
    num_frames=81,
    guidance_scale=3.5,
    num_inference_steps=40,
    generator=generator,
).frames[0]


export_to_video(output, "output.mp4", fps=16)

@JoeGaffney

Hey,

With both transformers being accessed in one loop, would enable_model_cpu_offload still work? For example, when it reaches transformer_2, could it offload transformer?

Cheers,
Joe

@okaris (Contributor) commented Jul 31, 2025

@JoeGaffney Interesting question. I was trying to debug a similar issue with enable_model_cpu_offload yesterday, and I think I will need to open an issue.

For Wan 2.2 I2V the execution path is

text_encoder -> vae (encode) -> transformer -> transformer_2 -> vae (decode)

while the offload sequence is defined as

text_encoder -> transformer -> transformer_2 -> vae

For me this causes the text_encoder to stay on GPU after encoding the initial image. Changing the sequence is also problematic, since a component can currently only appear in a single position. I know this deserves its own issue; I'm still collecting examples.

Coming back to your comment, I think it's a valid point: if the sequence is defined as transformer -> transformer_2 and transformer_2 never runs, transformer might get stuck on the GPU.
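
For reference, the offload mechanism under discussion is enabled like this (a minimal sketch; the per-component order comes from the pipeline's built-in offload sequence, not from anything set here):

```python
# Minimal sketch of the setup being discussed: enable_model_cpu_offload keeps
# each component on CPU and moves it onto the GPU only while it is called; the
# order in which components displace each other is fixed by the pipeline's
# predefined offload sequence mentioned above.
import torch
from diffusers import WanImageToVideoPipeline

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
```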

@JoeGaffney

Hey @okaris
I'm wondering if the behaviours have diverged enough between 2.1 and 2.2 that these should maybe be separate pipelines. I'm not an expert in authoring diffusers pipelines, but there seem to be significant changes in expectations and data flow. Maybe there are common components that could be lifted out to avoid duplication, while still reflecting the divergence.

Also, I think referencing something like this in the example:

image_processor = ModularPipeline.from_pretrained("YiYiXu/WanImageProcessor14B", trust_remote_code=True)
image = image_processor(
    image="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG",
    output="processed_image",
)

...feels a bit fragile. It relies on remote, opaque behavior. I get that it's convenient, but it’s not ideal for production integration or debugging. It would be great to also provide a minimal example that shows how to prepare inputs manually.
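
A rough sketch of what such a manual preparation step could look like (the multiple-of-16 constraint and the 480x832 pixel budget here are assumptions for illustration; the remote processor's exact rules may differ):

```python
# Hypothetical manual input preparation, for illustration only; it may not
# reproduce the exact resizing logic of YiYiXu/WanImageProcessor14B.
import math
from diffusers.utils import load_image

image = load_image(
    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG"
)

max_area = 480 * 832                       # assumed target pixel budget
aspect_ratio = image.height / image.width
mod = 16                                   # assumed dimension multiple
height = round(math.sqrt(max_area * aspect_ratio)) // mod * mod
width = round(math.sqrt(max_area / aspect_ratio)) // mod * mod
image = image.resize((width, height))
```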

Cheers,
Joe

@ukaprch

ukaprch commented Jul 31, 2025

Some have mentioned that in their testing the first stage is relatively useless. Has anyone on the team done enough testing on Wan 2.2 to determine which approach works best?

@yiyixuxu (Collaborator, Author) commented Aug 1, 2025

> For me it causes the text_encoder to stay on GPU after encoding the initial image, but also changing the sequence is problematic and currently it supports a single position. I know this deserves its own issue, still collecting examples.

So yes, text_encoder will stay on GPU after encoding the initial image, but it will get offloaded once transformer is used. It does not matter that it stayed on GPU, since the memory bottleneck is transformer (i.e. it does not increase the overall memory requirement).

vae would stay on GPU until it's used again, including while the transformers are loaded, but it's relatively small so it does not make much difference. We could force-offload vae if needed, though.

mayankagrawal10198 referenced this pull request Aug 3, 2025
@mayankagrawal10198

mayankagrawal10198 commented Aug 3, 2025

@yiyixuxu What boundary_ratio should I keep for the lightx2v LoRA when the number of inference steps is just 4? Please refer to 9a2eaed.
