
[feat] Add option to use (Scheduled) Huber Loss in all diffusion training pipelines to improve resilience to data corruption and better image quality #7488

@kabachuha

Description


Is your feature request related to a problem? Please describe.

Diffusion models are known to be vulnerable to outliers in their training data. A relatively small number of corrupting samples can therefore "poison" a model, making it unable to produce the desired output; this has been exploited by programs such as Nightshade.

One of the reasons for this vulnerability may lie in the commonly used L2 (Mean Squared Error) loss, a fundamental part of diffusion/flow models, which is itself highly sensitive to outliers; see Anscombe's quartet for some examples.
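For intuition (standard regression math, not a result from the paper): the gradient of the squared error grows linearly with the residual, so a single badly corrupted sample can dominate a whole batch update, whereas the gradient of the absolute error stays bounded:

```latex
% L2 gradient is unbounded in the residual r; L1 gradient is bounded.
\frac{\partial}{\partial r}\left(\tfrac{1}{2}r^{2}\right) = r
\qquad\text{vs.}\qquad
\frac{\partial}{\partial r}\,\lvert r\rvert = \operatorname{sign}(r)
```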

Describe the solution you'd like.

In our new paper (also my first paper 🥳), "Improving Diffusion Models's Data-Corruption Resistance using Scheduled Pseudo-Huber Loss" (https://arxiv.org/abs/2403.16728), we present a novel scheme to improve score-matching models' resilience to corruption of parts of their datasets, introducing the Huber loss -- long used in robust regression, e.g. for restoring a contour in heavily noised computer vision tasks -- and a Scheduled Huber loss. The Huber loss behaves exactly like L2 around zero and like L1 (Mean Absolute Error) as the residual tends towards infinity, so it penalizes outliers less harshly than the quadratic MSE. However, a common concern is that it may hinder the model's capability to learn diverse concepts and small details. That is why we introduce the Scheduled Pseudo-Huber loss with a decreasing parameter: the loss behaves like the Huber loss at early reverse-diffusion timesteps, when the image is only beginning to form and is most vulnerable to being led astray, and like L2 at the final timesteps, to learn the fine details of the images.
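A minimal sketch of the mechanism (my own illustration, not the exact PR code; the function names, the linear schedule shape and the delta range are placeholders, and the paper's actual schedule may differ):

```python
import torch

def pseudo_huber_loss(pred: torch.Tensor, target: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    # Pseudo-Huber: ~0.5 * r^2 for |r| << delta (L2-like), ~delta * |r| for
    # |r| >> delta (L1-like), so large residuals from corrupted samples are
    # penalized sub-quadratically.
    r2 = (pred - target) ** 2
    return delta**2 * (torch.sqrt(1.0 + r2 / delta**2) - 1.0)

def scheduled_delta(timesteps: torch.Tensor, num_train_timesteps: int,
                    delta_min: float = 1e-2, delta_max: float = 1.0) -> torch.Tensor:
    # Hypothetical linear schedule: delta shrinks as the timestep index grows,
    # so the loss is Huber-like at the noisiest (early reverse-diffusion)
    # steps and L2-like near t = 0, where fine details are learned.
    frac = timesteps.float() / num_train_timesteps
    return delta_max - (delta_max - delta_min) * frac

# Usage inside a training step, with model_pred/target/timesteps as in the
# existing Diffusers scripts:
#   delta = scheduled_delta(timesteps, noise_scheduler.config.num_train_timesteps)
#   loss = pseudo_huber_loss(model_pred.float(), target.float(),
#                            delta.view(-1, 1, 1, 1)).mean()
```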

Describe alternatives you've considered.

We ran tests with the Pseudo-Huber loss, the Scheduled Pseudo-Huber loss (SPHL) and L2, and SPHL beats the rest in nearly all cases. (In the plots, Resilience is the similarity to clean pictures on partially corrupted runs minus the similarity to clean pictures on clean runs; see the paper for more details.)

[Figure: resilience comparison plots from the paper, arXiv:2403.16728]

Other alternatives are data filtration, image recaptioning (which may itself be vulnerable to adversarial noise) and/or "diffusion purification". These require additional resources and may be impractical when training large models, and false negatives can be drastic outliers with high corrupting potential.

👀 We also found that the Diffusers LCM training script uses a wrong Pseudo-Huber loss coefficient proportionality (a mistake that was already present in OpenAI's original article), giving it wrong asymptotics as the parameter tends to 0 or to infinity, with the most negative impact when the parameter is timestep-scheduled. It would be nice to fix this as well (maybe with a compatibility option for previously trained LCMs).
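For reference, here is how the two parameterizations behave in the limits (standard calculus; r is the residual and c, δ the Huber parameters, the naming is mine):

```latex
% Form currently in the LCM script:
%   c -> 0:        L -> |r|           (exact L1)
%   c -> infinity: L -> r^2 / (2c)    (L2 shape, but scaled down by 1/c)
L_{\mathrm{LCM}}(r) = \sqrt{r^{2} + c^{2}} - c

% Standard Pseudo-Huber parameterization:
%   delta -> infinity: L -> r^2 / 2   (exact L2)
%   delta -> 0:        L -> delta |r| (L1 shape, scaled by delta)
L_{\delta}(r) = \delta^{2}\left(\sqrt{1 + r^{2}/\delta^{2}} - 1\right)
```

The shapes agree up to scaling, but the relative weighting across timesteps changes once the parameter is scheduled, which is where the wrong proportionality hurts.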

We show that our scheme works in the text-to-speech diffusion domain as well, further supporting the claims.

[Figure: text-to-speech resilience plots from the paper, arXiv:2403.16728]

Additional context.

As a side effect (which I remembered after publishing, while looking through the sampled pictures), the Huber loss also seems to improve the "vibrancy" of pictures on clean runs, though the mechanism behind it is unknown (maybe better concept disentanglement?). I think it would be nice to include it, if only for this effect 🙃

[Image: vanilla vs. Huber loss sample comparison]


Since I was behind this idea and ran the experiments with a modified Diffusers library, I have all the code at hand and will make a PR soon.


We also tried extensively to prove a theorem claiming that, when corrupting samples are present in the dataset (i.e. the third moment, the skewness, of the distribution is greater than zero), using the Scheduled Pseudo-Huber loss with a timestep-decreasing parameter yields a smaller KL divergence between the clean data and the distribution generated by an ideal score-matching (e.g. diffusion) model than using L2. However, there was a mistake in the proof and we got stuck. If you'd like to take a look at our proof attempt, PM me.
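For clarity, the claimed statement in symbols (my paraphrase of the above, not a proven result):

```latex
% p_clean: clean data distribution; the training set contains corrupting
% samples (skewness > 0). q_SPHL, q_L2: distributions generated by an ideal
% score-matching model trained with Scheduled Pseudo-Huber loss and with L2.
\mathbb{E}\!\left[(X - \mu)^{3}\right] > 0
\;\Longrightarrow\;
D_{\mathrm{KL}}\!\left(p_{\mathrm{clean}} \,\|\, q_{\mathrm{SPHL}}\right)
< D_{\mathrm{KL}}\!\left(p_{\mathrm{clean}} \,\|\, q_{L2}\right)
```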
