In teacache_forward():
Here is the accumulated metric in code:
self.accumulated_rel_l1_distance_even += rescale_func(((modulated_inp-self.previous_e0_even).abs().mean() / self.previous_e0_even.abs().mean()).cpu().item())
The modulated input is defined here:
modulated_inp = e0 if self.use_ref_steps else e
e and e0 are defined here:
e = self.time_embedding(sinusoidal_embedding_1d(self.freq_dim, t).float())
e0 = self.time_projection(e).unflatten(1, (6, self.dim))
So the accumulated metric is a function of the time embedding alone. Why is the noisy input not incorporated here, as specified in the paper? The practical effect is that the accumulated metric follows the same pattern no matter what the prompt is.
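To make the concern concrete, here is a minimal pure-Python sketch of the behavior described above. It is not the repository's code: `sinusoidal_embedding`, `rel_l1`, and `accumulated_trace` are hypothetical stand-ins, and the learned layers, `rescale_func`, and any threshold or reset logic are left out. It only illustrates that a metric computed from the timestep embedding cannot change when the latents (and therefore the prompt) change.

```python
import math

def sinusoidal_embedding(dim, t):
    # Toy stand-in for sinusoidal_embedding_1d (assumption): depends only on t.
    half = dim // 2
    freqs = [10000 ** (-i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

def rel_l1(cur, prev):
    # mean(|cur - prev|) / mean(|prev|), the same ratio as in the quoted line.
    num = sum(abs(c - p) for c, p in zip(cur, prev)) / len(cur)
    den = sum(abs(p) for p in prev) / len(prev)
    return num / den

def accumulated_trace(timesteps, latents, dim=16):
    # `latents` is accepted but never read, mirroring how the noisy input
    # does not enter the quoted metric.
    acc, prev, trace = 0.0, None, []
    for t in timesteps:
        emb = sinusoidal_embedding(dim, t)
        if prev is not None:
            acc += rel_l1(emb, prev)
        trace.append(acc)
        prev = emb
    return trace

# Hypothetical 20-step schedule and two unrelated "noisy inputs".
timesteps = [1000 - 50 * i for i in range(20)]
trace_a = accumulated_trace(timesteps, latents=[0.1, 0.2, 0.3])
trace_b = accumulated_trace(timesteps, latents=[9.0, -4.0, 2.5])
print(trace_a == trace_b)  # True: the trace depends only on the timesteps
```

With a fixed timestep schedule, the two traces are identical, so any skip decision based on this quantity would land on the same steps for every prompt.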