{1559} revision 0 modified: 12-24-2021 05:50 gmt |

Some investigations into denoising models & their intellectual lineage: Deep Unsupervised Learning using Nonequilibrium Thermodynamics 2015 -
*Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli* - The original derivation of how to train a generative model via diffusion.
- Verrry roughly, the idea is to destroy the structure in an image by adding diagonal (per-pixel) Gaussian noise, and to train an inverse-diffusion model to remove the noise at each step. Then start with Gaussian noise and reverse-diffuse an image.
- Diffusion can take 100s - 1000s of steps; steps are made small to preserve the assumption that the reverse conditional, $p(x_{t-1}|x_t)$, is itself (approximately) Gaussian.
- The time variable here goes from 0 (uncorrupted data) to T (fully corrupted / Gaussian noise)
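The forward (corrupting) half of this can be sketched in a few lines of NumPy. This is only a sketch, not the paper's code: the linear `betas` schedule, `T = 1000`, the random stand-in "image", and the name `forward_diffuse` are all assumptions made for illustration. Each step scales the previous sample by $\sqrt{1-\beta_t}$ before adding noise of variance $\beta_t$, so the chain settles toward unit-variance Gaussian noise instead of wandering off.

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Run the forward chain x_0 -> x_T and return every intermediate sample.

    Each step draws x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I),
    i.e. independent (diagonal) Gaussian noise per pixel.
    """
    xs = [x0]
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        xs.append(x)
    return xs

rng = np.random.default_rng(0)
T = 1000                                     # assumed number of steps
betas = np.linspace(1e-4, 0.02, T)           # assumed linear schedule
x0 = rng.uniform(-1.0, 1.0, size=(32, 32))   # stand-in for an image in [-1, 1]
xs = forward_diffuse(x0, betas, rng)

# By t = T the sample should look like N(0, I) and carry almost no trace of x0.
print(len(xs), xs[-1].mean(), xs[-1].std())
```

The trained model has to run this chain backwards, which is the hard part and is not shown here.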
Generative Modeling by Estimating Gradients of the Data Distribution July 2019 -
*Yang Song, Stefano Ermon*
Denoising Diffusion Probabilistic Models June 2020 -
*Jonathan Ho, Ajay Jain, Pieter Abbeel* - A diffusion model that can output 'realistic' images (low FID, though its log-likelihood is not competitive with other likelihood-based models)
Improved Denoising Diffusion Probabilistic Models Feb 2021 -
*Alex Nichol, Prafulla Dhariwal* - This is directly based on Ho 2020 and Sohl-Dickstein 2015, but with tweaks
- The objective is no longer the log-likelihood of the data given the parameters (per pixel); it's now mostly the MSE between the corrupting noise (which is known) and the estimated noise.
- That is, the neural network model attempts, given $x_t$ to estimate the noise which corrupted it, which then can be used to produce $x_{t-1}$
- Simplicity. Satisfying.
- They also include a reweighted version of the log-likelihood loss, which puts more emphasis on the first few steps of noising. These steps are more important for NLL; reweighting also smooths the loss.
- I *think* that, per Ho above, the simple MSE loss is sufficient to generate good images, but the reweighted LL improves the likelihood of the parameters.
- There are some good crunchy mathematical details on exactly how the mean and variance of the estimated Gaussian distributions are handled -- at each noising step, you need to scale the mean down to prevent a Brownian / random walk.
- Taking these further, you can sample the noised image at *any point* t in the forward diffusion chain directly, without stepping through the chain. They use this fact to optimize the function approximator (a neural network; more later) using a (random but re-weighted/scheduled) t and the LL loss + simple loss.
- Ho 2020 above treats the variance of the noising Gaussian as fixed -- that is, $\beta$; this paper improves the likelihood by adjusting the noise variance mostly at the last steps by a $\tilde{\beta}_t$, and then further allowing the function approximator to tune the variance (a multiplicative factor) per inverse-diffusion timestep.
- TBH I'm still slightly foggy on how you go from estimating noise (this seems like samples, concrete) to then estimating variance (which is variational?). hmm.
- Finally, they schedule the forward noising with a cosine^2, rather than a linear ramp. This makes the last phases of corruption more useful.
- Because they have an explicit parameterization of the noise variance, they can run the inverse diffusion (e.g. image generation) faster -- rather than 4000 steps, which can take a few minutes on a GPU, they can step up the variance and run it only for 50 steps and get nearly as good images.
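To make the list above concrete for myself, here is a rough NumPy sketch of how the cosine^2 schedule, the jump to any timestep t, the simple MSE loss, and one reverse step fit together. Assumptions, loudly: the schedule uses the paper's form with offset `s = 0.008` and betas clipped at 0.999; `predict_noise` is a hypothetical placeholder for the neural network (it just returns zeros); the reverse step uses the fixed variance $\beta_t$ as in Ho 2020, not the learned variance; t is drawn uniformly, so the reweighted sampling and the LL term are not shown; all function names are mine.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal fraction alpha_bar_t for t = 0..T (cosine^2 schedule)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step noise variances beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}."""
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

def q_sample(x0, t, alpha_bar, noise):
    """Jump straight to step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def predict_noise(x_t, t):
    """Hypothetical placeholder for the network; a real model is trained."""
    return np.zeros_like(x_t)

def simple_loss(x0, t, alpha_bar, rng):
    """MSE between the known corrupting noise and the model's estimate of it."""
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bar, noise)
    return np.mean((noise - predict_noise(x_t, t)) ** 2)

def reverse_step(x_t, t, betas, alpha_bar, rng):
    """One inverse-diffusion step x_t -> x_{t-1} with fixed variance beta_t."""
    beta_t = betas[t - 1]                      # betas[0] holds beta_1
    eps = predict_noise(x_t, t)
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(1.0 - beta_t)
    if t == 1:
        return mean                            # no noise added on the final step
    return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
T = 1000                                       # assumed number of steps
betas = betas_from_alpha_bar(cosine_alpha_bar(T))
# Recompute alpha_bar from the clipped betas; alpha_bar[0] = 1 is uncorrupted data.
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

x0 = rng.uniform(-1.0, 1.0, size=(32, 32))     # stand-in for an image in [-1, 1]
t = int(rng.integers(1, T + 1))                # uniform random timestep
loss = simple_loss(x0, t, alpha_bar, rng)

x_t = q_sample(x0, t, alpha_bar, rng.standard_normal(x0.shape))
x_prev = reverse_step(x_t, t, betas, alpha_bar, rng)
```

With the zero placeholder the loss is just the mean squared noise (close to 1); training a real network is what pushes it below that.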
Diffusion Models Beat GANs on Image Synthesis May 2021 -
*Prafulla Dhariwal, Alex Nichol*
In all of the above, it seems that the inverse-diffusion function approximator is a minor player in the paper -- but of course, it's vitally important to making the system work. In some sense, this 'diffusion model' is as much a means of -
*Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma*
which is an improvement to (e.g. add self-attention layers) Conditional Image Generation with PixelCNN Decoders -
*Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu*
Most recently, GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models -
*Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen*
Added text-conditional generation + many more parameters + much more compute to yield very impressive image results + in-painting. This last effect is enabled by the fact that it's a full generative denoising probabilistic model -- you can condition on other parts of the image!