Some investigations into denoising models & their intellectual lineage:
Deep Unsupervised Learning using Nonequilibrium Thermodynamics 2015
 Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
 The starting point: the first derivation of diffusion models as a training method.
 Verrry roughly, the idea is to destroy the structure in an image by adding diagonal Gaussian noise per pixel, and to train an inverse-diffusion model to remove the noise at each step. Then start with Gaussian noise and reverse-diffuse an image.
 Diffusion can take 100s to 1000s of steps; steps are made small to preserve the assumption that the reverse conditional probability, $p(x_{t-1} \mid x_t)$, is approximately a diagonal Gaussian.
 The time variable here goes from 0 (uncorrupted data) to T (fully corrupted / Gaussian noise)
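As a rough sketch of the forward (corrupting) half of this, here is a toy version in plain Python. The linear ramp of per-step variances (`beta_start`, `beta_end`), the 1000-step chain, and the flat 1000-pixel "image" are my assumptions for illustration, not values taken from the paper. After T steps the pixels should look like unit Gaussian noise, whatever they started as.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances on a linear ramp (assumed values, T >= 2)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def forward_diffuse(x0, betas, rng):
    """Run the forward chain x_0 -> x_T, adding independent Gaussian noise per pixel."""
    x = list(x0)
    for beta in betas:
        # Shrink the signal slightly, then add noise of variance beta to every pixel.
        shrink, spread = math.sqrt(1.0 - beta), math.sqrt(beta)
        x = [shrink * xi + spread * rng.gauss(0.0, 1.0) for xi in x]
    return x

rng = random.Random(0)
x0 = [1.0] * 1000                 # a flat "image": all structure, no noise
betas = linear_betas(1000)
xT = forward_diffuse(x0, betas, rng)

# Empirical mean and variance of the fully corrupted pixels.
mean_T = sum(xT) / len(xT)
var_T = sum((xi - mean_T) ** 2 for xi in xT) / len(xT)
print(f"mean={mean_T:.3f} var={var_T:.3f}")
```

The reverse direction is the learned part: a model is trained to undo one of these small steps at a time.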
Generative Modeling by Estimating Gradients of the Data Distribution July 2019
Denoising Diffusion Probabilistic Models June 2020
 Jonathan Ho, Ajay Jain, Pieter Abbeel
 A diffusion model that can output 'realistic' images (low FID / low negative log-likelihood).
Improved Denoising Diffusion Probabilistic Models Feb 2021
 Alex Nichol, Prafulla Dhariwal
 This is directly based on Ho 2020 and Sohl-Dickstein 2015, but with tweaks.
 The objective is no longer the log-likelihood of the data given the parameters (per pixel); it's now mostly the MSE between the corrupting noise (which is known) and the estimated noise.
 That is, the neural network model attempts, given $x_t$, to estimate the noise which corrupted it, which can then be used to produce $x_{t-1}$.
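A minimal sketch of that noise-prediction objective, under stated assumptions: `eps_model` is a hypothetical stand-in for the neural network (here a dummy that always predicts zero noise), and the linear `betas` are assumed values. `reverse_mean` uses the standard DDPM expression for the mean of $x_{t-1}$ given the predicted noise; an actual sampler would add Gaussian noise around that mean.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def simple_loss(eps_model, x0, t, betas, rng):
    """MSE between the true corrupting noise and the model's estimate of it."""
    abar = alpha_bars(betas)[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    # Noised input: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    x_t = [math.sqrt(abar) * a + math.sqrt(1.0 - abar) * e for a, e in zip(x0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat)) / len(x0)

def reverse_mean(x_t, eps_hat, t, betas):
    """Mean of x_{t-1} given x_t and the predicted noise."""
    beta = betas[t]
    coef = beta / math.sqrt(1.0 - alpha_bars(betas)[t])
    return [(xi - coef * eh) / math.sqrt(1.0 - beta) for xi, eh in zip(x_t, eps_hat)]

rng = random.Random(0)
betas = [1e-4 + (0.02 - 1e-4) * t / 999 for t in range(1000)]   # assumed linear ramp
x0 = [0.5] * 1000
zero_model = lambda x_t, t: [0.0] * len(x_t)                    # dummy "network"
loss = simple_loss(zero_model, x0, 500, betas, rng)             # about 1 for this dummy
```

A model that predicts nothing scores a loss near 1 (the variance of the noise); training pushes that toward 0.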
 They also include a reweighted version of the log-likelihood loss, which puts more emphasis on the first few steps of noising. These steps are more important for NLL; reweighting also smooths the loss.
 I think that, per Ho above, the simple MSE loss is sufficient to generate good images, but the reweighted LL improves the likelihood of the parameters.
 There are some good crunchy mathematical details on how exactly the mean and variance of the estimated Gaussian distributions are handled: at each noising step, you need to scale the mean down (by $\sqrt{1 - \beta_t}$) so that the total variance stays bounded instead of growing like a Brownian / random walk.
 Taking these further, you can sample a noised image at any point t in the forward diffusion chain directly from the clean image. They use this fact to optimize the function approximator (a neural network; more later) using a (random but reweighted/scheduled) t and the LL loss + simple loss.
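To convince myself of the "any point t" shortcut, here is a small check that composing the small steps gives the same Gaussian as jumping straight there. It tracks $x_t = c_t x_0 + \text{noise of variance } v_t$ both ways; the closed form is $c_t = \sqrt{\bar\alpha_t}$, $v_t = 1 - \bar\alpha_t$ with $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$. The linear `betas` are assumed values for illustration.

```python
import math

def chain_moments(betas):
    """Compose the small steps: returns (mean coefficient, noise variance) per t."""
    c, v, out = 1.0, 0.0, []
    for b in betas:
        c = math.sqrt(1.0 - b) * c      # the mean is scaled down at every step
        v = (1.0 - b) * v + b           # old noise shrinks, fresh noise is added
        out.append((c, v))
    return out

def closed_form_moments(betas):
    """Same quantities directly from abar_t: c_t = sqrt(abar_t), v_t = 1 - abar_t."""
    abar, out = 1.0, []
    for b in betas:
        abar *= 1.0 - b
        out.append((math.sqrt(abar), 1.0 - abar))
    return out

betas = [1e-4 + (0.02 - 1e-4) * t / 999 for t in range(1000)]   # assumed linear ramp
stepwise = chain_moments(betas)
direct = closed_form_moments(betas)

# Largest disagreement between the two routes, over all t and both moments.
max_gap = max(
    max(abs(c1 - c2), abs(v1 - v2))
    for (c1, v1), (c2, v2) in zip(stepwise, direct)
)
max_var = max(v for _, v in stepwise)   # never exceeds 1, thanks to the mean scaling
```

Without the $\sqrt{1 - \beta_t}$ factor, `v` would simply accumulate the sum of the betas and keep growing, which is the random-walk behavior the scaling avoids.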
 Ho 2020 above treats the variance of the noising Gaussian as fixed, that is, $\beta_t$; this paper improves the likelihood by adjusting the noise variance, mostly at the last steps, by a $\tilde{\beta}_t$, and then further allowing the function approximator to tune the variance (a multiplicative factor) per inverse-diffusion timestep.
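My reading of the learned-variance part, as a hedged sketch: the paper interpolates in the log domain between $\tilde{\beta}_t$ and $\beta_t$, with the network outputting an interpolation weight `v` per timestep. Here `v` is just a fixed number standing in for the network output, and the linear `betas` are assumed values.

```python
import math

def posterior_variances(betas):
    """beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t, with abar_0 = 1."""
    out, abar_prev = [], 1.0
    for b in betas:
        abar = abar_prev * (1.0 - b)
        out.append((1.0 - abar_prev) / (1.0 - abar) * b)
        abar_prev = abar
    return out

def interpolated_variance(v, beta, beta_tilde):
    """Log-domain interpolation: v = 0 gives beta_tilde, v = 1 gives beta."""
    return math.exp(v * math.log(beta) + (1.0 - v) * math.log(beta_tilde))

betas = [1e-4 + (0.02 - 1e-4) * t / 999 for t in range(1000)]   # assumed linear ramp
beta_tildes = posterior_variances(betas)

t = 500
v = 0.5                                  # stand-in for the network's output
sigma2 = interpolated_variance(v, betas[t], beta_tildes[t])

# The two candidate variances only differ noticeably close to the data (small t).
ratio_near_data = beta_tildes[1] / betas[1]
ratio_mid_chain = beta_tildes[t] / betas[t]
```

Note that `beta_tildes[0]` is exactly 0, so its log is undefined; a real implementation has to clip or special-case the first step.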
 TBH I'm still slightly foggy on how you go from estimating noise (this seems like samples, concrete) to then estimating variance (which is variational?). hmm.
 Finally, they schedule the forward noising with a $\cos^2$ curve rather than a linear ramp. This makes the last phases of corruption more useful.
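A sketch of the cosine schedule as I understand it from the paper: $\bar\alpha_t = f(t) / f(0)$ with $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$, a small offset $s = 0.008$, and the per-step betas clipped at 0.999. The linear ramp used for comparison (1e-4 to 0.02 over 1000 steps) is an assumed baseline.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Fraction of signal variance left at step t under the cosine schedule."""
    f = lambda u: math.cos((u / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)

def cosine_betas(T, s=0.008, max_beta=0.999):
    """beta_t = 1 - abar_t / abar_{t-1}, clipped to avoid a singularity near t = T."""
    return [
        min(1.0 - cosine_alpha_bar(t, T, s) / cosine_alpha_bar(t - 1, T, s), max_beta)
        for t in range(1, T + 1)
    ]

T = 1000
betas_cos = cosine_betas(T)

# Signal left halfway through the chain, cosine vs. an assumed linear ramp.
cos_mid = cosine_alpha_bar(T // 2, T)
linear_betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
linear_mid = math.prod(1.0 - b for b in linear_betas[: T // 2])
print(f"abar at T/2: cosine={cos_mid:.3f} linear={linear_mid:.3f}")
```

Halfway through, the cosine schedule still retains about half of the signal while the linear ramp has already destroyed most of it, which is why the later steps of the linear chain contribute so little.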
 Because they have an explicit parameterization of the noise variance, they can run the inverse diffusion (i.e. image generation) faster: rather than 4000 steps, which can take a few minutes on a GPU, they can step up the variance and run it for only 50 steps and get nearly as good images.
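"Stepping up the variance" can be sketched like this: pick an evenly strided subset of the original timesteps and recompute the betas so that the shortened chain passes through the same $\bar\alpha$ values, i.e. $\beta_{S_i} = 1 - \bar\alpha_{S_i} / \bar\alpha_{S_{i-1}}$. The function name `respace_betas`, the even striding, and the scaled linear ramp over 4000 steps are my assumptions for illustration.

```python
import math

def respace_betas(betas, num_steps):
    """Betas for a sub-chain that visits only num_steps of the original timesteps."""
    T = len(betas)
    # Evenly spaced 0-based timesteps, always ending on the last one.
    keep = [round((i + 1) * T / num_steps) - 1 for i in range(num_steps)]
    abars, abar = [], 1.0
    for b in betas:
        abar *= 1.0 - b
        abars.append(abar)
    new_betas, prev = [], 1.0
    for t in keep:
        new_betas.append(1.0 - abars[t] / prev)   # one big step covers many small ones
        prev = abars[t]
    return keep, new_betas

T = 4000
scale = 1000 / T                                   # assumed rescaling of the linear ramp
betas = [scale * (1e-4 + (0.02 - 1e-4) * t / (T - 1)) for t in range(T)]
keep, short_betas = respace_betas(betas, 50)

# Both chains end with the same amount of signal left.
abar_full = math.prod(1.0 - b for b in betas)
abar_short = math.prod(1.0 - b for b in short_betas)
```

Each of the 50 steps carries a larger variance than the original step at the same timestep, so the sampler needs a sensible variance for these bigger jumps.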
Diffusion Models Beat GANs on Image Synthesis May 2021
 Prafulla Dhariwal, Alex Nichol
In all of the above, it seems that the inverse-diffusion function approximator is a minor player in the paper, but of course, it's vitally important to making the system work. In some sense, this 'diffusion model' is as much a means of training the neural network as it is a (rather inefficient, compared to GANs) way of sampling from the data distribution. In Nichol & Dhariwal Feb 2021, they use a U-net convolutional network (e.g. start with few channels, downsample and double the channels until there are 128-256 channels, then upsample x2 and halve the channels) including multi-headed attention. Ho 2020 used single-headed attention only at the 16x16 level. Ho 2020 in turn was based on PixelCNN++
PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications Jan 2017
 Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma
which is an improvement to (e.g. add self-attention layers)
Conditional Image Generation with PixelCNN Decoders
 Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu
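Going back to the U-net shape described above (few channels at full resolution, double the channels on each downsample, then mirror it on the way up), here is a toy bookkeeping sketch. All the numbers (`image_size`, `base_channels`, `max_channels`, `min_size`) are hypothetical; the actual papers use their own channel multipliers and attention placements.

```python
def unet_plan(image_size=64, base_channels=32, max_channels=256, min_size=8):
    """List (stage, resolution, channels) for a hypothetical U-net encoder/decoder."""
    down, size, ch = [], image_size, base_channels
    while True:
        down.append(("down", size, ch))
        if size <= min_size or ch >= max_channels:
            break
        size //= 2                         # downsample by 2
        ch = min(ch * 2, max_channels)     # double the channels
    # The decoder mirrors the encoder: upsample x2 and halve the channels.
    up = [("up", s, c) for _, s, c in reversed(down[:-1])]
    return down + up

plan = unet_plan()
for stage, size, ch in plan:
    print(f"{stage:>4}: {size}x{size}, {ch} channels")
```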
Most recently,
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
 Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen
Added text-conditional generation + many more parameters + much more compute to yield very impressive image results + inpainting. This last effect is enabled by the fact that it's a full generative denoising probabilistic model: you can condition on other parts of the image!
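One generic way to condition on other parts of the image, sketched under assumptions: at every reverse step, overwrite the known pixels with a freshly noised copy of the known image at the current noise level, and let the sampler fill in the rest. This is the plain replacement trick, not a claim about GLIDE's own procedure; the function name and the flat-list "image" are hypothetical.

```python
import math
import random

def condition_on_known(x_t, known_x0, mask, abar_t, rng):
    """Overwrite known pixels of x_t with a noised copy of the known image.

    mask[i] is True where the pixel is known, False where it must be generated.
    abar_t is the fraction of signal variance left at the current timestep.
    """
    out = []
    for xi, ki, known in zip(x_t, known_x0, mask):
        if known:
            # Sample the known pixel at noise level t: sqrt(abar) * x0 + sqrt(1 - abar) * eps
            out.append(math.sqrt(abar_t) * ki + math.sqrt(1.0 - abar_t) * rng.gauss(0.0, 1.0))
        else:
            # Leave the sampler's current guess alone.
            out.append(xi)
    return out

rng = random.Random(0)
x_t = [0.1, -0.4, 0.9, 0.0]            # current sample in the reverse chain
known_x0 = [1.0, 1.0, 0.0, 0.0]        # clean image (only masked entries are trusted)
mask = [True, True, False, False]
x_cond = condition_on_known(x_t, known_x0, mask, abar_t=0.5, rng=rng)
```

The unknown half of the image then gets denoised in the context of the known half, since the network sees both at every step.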
