Step-distillation for diffusion: making models run fast on a laptop
But can it run faster?
A teaser on the upcoming full tutorial on step-distillation
Step-distillation for diffusion models in simple terms
Some call it a paradox. Consistency Models was a bit of a misnomer, and step-distillation is not exactly a great name either…
From the initial paper on Progressive Distillation to more recent techniques such as Distribution Matching Distillation (DMD) and its second version (DMD 2), step-distillation is sometimes mistakenly seen as shrouded in mystery.
If your base diffusion model requires 150 steps to give you good outputs, how can that possibly be reduced to 4 without a loss in quality? And why 4 and not 2?
To understand what step-distillation does, it’s important to consider 3 key aspects of diffusion:
The base model is trained with an all-to-all independent coupling of the data distribution and the seed Gaussian. During step-distillation, we instead train a few-step generator by leveraging the ODE coupling, as in Reflow. If you think the coupling does not really matter, look at Immiscible Diffusion;
Training a denoiser with an L2 loss is a crude way of obtaining a generator, which is why so many denoising steps are needed by default. Step-distillation methods such as DMD, DMD 2 or Moment Matching introduce an additional, much more informative loss, which can be interpreted as a proxy to an adversarial low-quality generator;
To fully understand step-distillation, it is key to look at the algorithms as much as the math that inspired them. This is how we will interpret Consistency Models and Multistep Consistency Models.
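To make the point about the coupling concrete, here is a toy 1D sketch. It is not Reflow or any published distillation method, only an illustration under loud assumptions: the data distribution is two points at -1 and +1; the ODE coupling is replaced by the monotone map sign(z), which is what a non-crossing 1D flow from a Gaussian to this data would have to produce; and the one-step "generator" is a hypothetical binned regressor (`fit_one_step_generator`) fitted with an L2 loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Seed Gaussian samples.
z = rng.standard_normal(n)

# Independent coupling: each seed is paired with a data point drawn
# without looking at z (toy data: two points at -1 and +1).
x_indep = rng.choice([-1.0, 1.0], size=n)

# ODE-style coupling (assumption): in 1D, ODE trajectories cannot cross,
# so the seed-to-data map is monotone. For this toy data that is sign(z).
x_ode = np.where(z >= 0, 1.0, -1.0)


def fit_one_step_generator(z, x, n_bins=30):
    """Hypothetical one-step generator: the L2-optimal piecewise-constant
    map z -> x, i.e. the mean of x inside each bin of z."""
    edges = np.linspace(-3.0, 3.0, n_bins + 1)

    def bin_index(values):
        # Values outside [-3, 3) are clipped into the first or last bin.
        return np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)

    idx = bin_index(z)
    sums = np.bincount(idx, weights=x, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    table = sums / np.maximum(counts, 1)
    return lambda z_new: table[bin_index(z_new)]


g_indep = fit_one_step_generator(z, x_indep)
g_ode = fit_one_step_generator(z, x_ode)

# Generate in a single step from fresh seeds.
z_test = rng.standard_normal(5_000)
samples_indep = g_indep(z_test)
samples_ode = g_ode(z_test)

# Real data always has |x| = 1, so this is a crude quality check.
print("mean |sample|, independent coupling:", np.abs(samples_indep).mean())
print("mean |sample|, ODE-style coupling:  ", np.abs(samples_ode).mean())
```

With the independent coupling, the L2 fit can only return the conditional mean of the data given the seed, which here sits near 0 and looks nothing like a data point. With the deterministic coupling, the very same one-step fit lands on the data. That is also a small-scale view of why an L2-trained denoiser alone needs many steps.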
More details on this soon.