An introduction to Diffusion Probabilistic Models

Generative modelling is one of the seminal tasks for understanding the distribution of natural data. The VAE, GAN and Flow families of models have dominated the field for the last few years due to their practical performance. Despite their commercial success, their theoretical and design shortcomings (intractable likelihood computation, restrictive architectures, unstable training dynamics etc.) have led to the development of a new class of generative models named “Diffusion Probabilistic Models” or DPMs. Diffusion Models, first proposed by Sohl-Dickstein et al. (2015), take inspiration from the thermodynamic diffusion process and learn a noise-to-data mapping in discrete steps, very similar to Flow models. Lately, DPMs have been shown to have some intriguing connections to Score Based Models (SBMs) and Stochastic Differential Equations (SDEs). These connections have further been leveraged to create their continuous-time analogues. In this article, I will describe the general framework of DPMs and their recent advancements, and explore the connections to other frameworks. For the sake of readers, I will avoid gory details and rigorous mathematical derivations, and use subtle simplifications in order to maintain focus on the core idea.

In case you haven’t checked the first part of this two-part blog, please read Score Based Models (SBMs).

What exactly do we mean by “Diffusion” ?

In thermodynamics, “Diffusion” refers to the flow of particles from high-density regions towards low-density regions. In the context of statistics, the meaning of Diffusion is quite similar, i.e. the process of transforming a complex distribution $p_{complex}$ on $R^{d}$ to a simple (predefined) distribution $p_{prior}$ on the same domain. Succinctly, it is a transformation $T : R^{d} \to R^{d}$ such that

$\begin{matrix} (1) & x_{0} \sim p_{complex} ⟹ T (x_{0}) \sim p_{prior} \end{matrix}$

where the symbol $⟹$ means “implies”. There is a formal way to come up with a specific $\{T, p_{prior}\}$ pair that satisfies Equation 1 for any distribution $p_{complex}$ . In simple terms, we can take any distribution and transform it into a known (simple) density by means of a known transformation $T$ . By “formal way”, I am referring to a Markov chain and its stationary distribution: repeated application of a transition kernel $q (x | x^{'})$ to the samples of any distribution leads to samples from $p_{prior} (x)$ if the following holds

$p_{prior} (x) = \int q (x | x^{'}) p_{prior} (x^{'}) d x^{'}$

We can relate our original diffusion process in Equation 1 to a Markov chain by defining $T$ to be the repeated application of the transition kernel $q (x | x^{'})$ over discrete time $t$

$\begin{matrix} (2) & x_{t} \sim q (x | x^{'} = x_{t - 1}), \forall t > 0 \end{matrix}$

From the properties of stationary distribution, we have $x_{\infty} \sim p_{prior}$ . In practice, we can keep the iterations to a sufficiently large finite number $t = T$ .

So far, we confirmed that there is indeed an iterative way (refer to Equation 2) to convert the samples from a complex distribution to a known (simple) prior. Even though we talked only in terms of generic densities, there is one very attractive choice of $\{q, p_{prior}\}$ pair (shown in Sohl-Dickstein et al. (2015)) due to its simplicity and tractability

$\begin{matrix} (3) & q (x_{t} | x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I), \qquad q (x_{T}) = p_{prior} (x_{T}) = N (x_{T}; 0, I) \end{matrix}$

For obvious reasons, it is known as Gaussian Diffusion. I purposefully changed the notation of the random variables to make it more explicit. $β_{t} \in R$ (with $0 < β_{t} < 1$ ) is a predefined schedule of noise variances proposed by Sohl-Dickstein et al. (2015). A pictorial depiction of the diffusion process is shown in Figure 1.
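To make the chain in Equation 3 concrete, here is a minimal 1D sketch in plain Python. The linear schedule (`T = 200`, $β_t$ from `1e-3` to `0.1`) and the two-point "complex" distribution are assumptions chosen for illustration, not values from Sohl-Dickstein et al. (2015). It repeatedly applies the Gaussian kernel and then checks that the final samples look like $N (0, 1)$.

```python
import math
import random

def forward_diffusion(x0, betas, rng):
    """Run the chain of Equation 3 on a scalar sample and return x_T."""
    x = x0
    for beta in betas:
        # x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t)
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
T = 200
# Hypothetical linear schedule; the end points are illustrative only.
betas = [1e-3 + (0.1 - 1e-3) * t / (T - 1) for t in range(T)]

# A clearly non-Gaussian "complex" distribution: two point masses at -3 and +3.
x0_samples = [rng.choice([-3.0, 3.0]) for _ in range(2000)]
xT_samples = [forward_diffusion(x0, betas, rng) for x0 in x0_samples]

mean_T = sum(xT_samples) / len(xT_samples)
var_T = sum((x - mean_T) ** 2 for x in xT_samples) / len(xT_samples)
print(f"mean of x_T: {mean_T:.3f}, variance of x_T: {var_T:.3f}")
```

With this schedule the empirical mean and variance of $x_T$ should come out close to 0 and 1, even though the starting distribution is bimodal.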

Generative modelling by undoing the diffusion

We argued for the existence of a stochastic transform $T$ that guarantees the diffusion process in Eq 1. Please realize that the diffusion process does not depend on the initial density $p_{complex}$ (as $t \to \infty$ ) and the only requirement is to be able to sample from it. This is the core idea behind Diffusion Models - we use any data distribution (let’s say $p_{data}$ ) of our choice as the complex initial density. This leads to the forward diffusion process

$x_{0} \sim p_{data} ⟹ x_{T} = T (x_{0}) \sim N (0, I)$

This process is responsible for “destructuring” the data and turning it into an isotropic gaussian (almost structureless). Please refer to the figure below (red part) for a visual demonstration.

However, this isn’t very useful by itself. What would be useful is doing the opposite, i.e. starting from isotropic gaussian noise and turning it into $p_{data}$ - that is generative modelling (blue part of the figure above). Since the forward process is fixed (non-parametric) and guaranteed to exist, it is very much possible to invert it. Once inverted, we can use it as a generative model as follows

$x_{T} \sim N (0, I) ⟹ T^{- 1} (x_{T}) \sim p_{data}$

Fortunately, the theory of Markov chains guarantees that for gaussian diffusion, there indeed exists a reverse diffusion process $T^{- 1}$ . The original paper from Sohl-Dickstein et al. (2015) shows how a parametric model of diffusion $T_{θ}^{- 1}$ can be learned from data itself.

Graphical model and training

Figure 2: Diffusion Model’s underlying graphical model

The stochastic “forward diffusion” and “reverse diffusion” processes described so far can be well expressed in terms of Probabilistic Graphical Models (PGMs). A series of $T$ random variables define each of them; with the forward process being fully described by Equation 3. The reverse process is expressed by a parametric graphical model very similar to that of the forward process, but in reverse

$\begin{matrix} (4) & p (x_{T}) = N (x_{T}; 0, I), \qquad p_{θ} (x_{t - 1} | x_{t}) = N (x_{t - 1}; μ_{θ} (x_{t}, t), Σ_{θ} (x_{t}, t)) \end{matrix}$

Each of the reverse conditionals $p_{θ} (x_{t - 1} | x_{t})$ is structurally gaussian and responsible for learning to revert the corresponding step in the forward process, i.e. $q (x_{t} | x_{t - 1})$ . The means and covariances of these reverse conditionals are neural networks with parameters $θ$ , shared over timesteps. Just like with any other probabilistic model, we wish to minimize the negative log-likelihood of the model distribution under the expectation of the data distribution

$L = E_{x_{0} \sim p_{data}} [- \log p_{θ} (x_{0})]$

which isn’t quite computable in practice due to its dependence on $(T - 1)$ more random variables. With a fair bit of mathematical manipulation, Sohl-Dickstein et al. (2015) (section 2.3) showed that $L$ is upper-bounded by another, easily computable quantity

$L \leq E_{x_{0} \sim p_{data}, x_{1 : T} \sim q (x_{1 : T} | x_{0})} [- \log p (x_{T}) - \sum_{t \geq 1} \log \frac{p_{θ} (x_{t - 1} | x_{t})}{q (x_{t} | x_{t - 1})}]$

which is easy to compute and optimize. The expectation is over the joint distribution of the entire forward process. Getting a sample $x_{1 : T} \sim q (\cdot | x_{0})$ boils down to executing the forward diffusion on one sample $x_{0} \sim p_{data}$ . All quantities inside the expectation are tractable and available to us in closed form.

Further simplification: Variance Reduction

Even though we can train the model with the lower-bound shown above, a few more simplifications are possible. The first one is due to Sohl-Dickstein et al. (2015) and is an attempt to reduce variance. They showed that the lower-bound can be further simplified and re-written as the following

$L \leq E_{x_{0}, x_{1 : T}} [\underbrace{D_{KL} [q (x_{T} | x_{0}) \| p (x_{T})]}_{\text{independent of } θ} + \sum_{t = 1}^{T} D_{KL} [q (x_{t - 1} | x_{t}, x_{0}) \| p_{θ} (x_{t - 1} | x_{t})]]$

There is a subtle approximation involved (the edge case of $t = 1$ in the summation) in the above expression which is due to Ho, Jain, and Abbeel (2020) (sections 3.3 and 3.4). The noticeable change in this version is the fact that all conditionals $q (\cdot | \cdot)$ of the forward process are now additionally conditioned on $x_{0}$ . Earlier, the corresponding quantities had high uncertainty/variance due to different possible choices of the starting point $x_{0}$ , which is now suppressed by the additional knowledge of $x_{0}$ . Moreover, it turns out that $q (x_{t - 1} | x_{t}, x_{0})$ has a closed form

$q (x_{t - 1} | x_{t}, x_{0}) = N (x_{t - 1}; {\tilde{μ}}_{t} (x_{t}, x_{0}), \tilde{β_{t}} I)$

The exact form (refer to Equation 7 of Ho, Jain, and Abbeel (2020)) is not important for a holistic understanding of the algorithm. The only thing to note is that ${\tilde{μ}}_{t} (x_{t}, x_{0})$ additionally contains $β_{t}$ (fixed numbers) and $\tilde{β_{t}}$ is a function of $β_{t}$ only. Moving on, we do the following on the last expression for $L$

Use the closed form of $p_{θ} (x_{t - 1} | x_{t})$ in Equation 4 with $Σ_{θ} (x_{t}, t) = \tilde{β_{t}} I$ (design choice for making things simple)
Expand the KL divergence formula
Convert $\sum_{t = 1}^{T}$ into expectation (over $t \sim U [1, T]$ ) by scaling with a constant $1 / T$

.. and arrive at a simpler form

$\begin{matrix} (5) & L \leq E_{x_{0}, x_{1 : T}, t} [\frac{1}{2 \tilde{β_{t}}} \| {\tilde{μ}}_{t} (x_{t}, x_{0}) - μ_{θ} (x_{t}, t) \|^{2}] \end{matrix}$
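For readers who do want to see the closed form, the sketch below writes out ${\tilde{μ}}_{t}$ and $\tilde{β_{t}}$ as given in Equation 7 of Ho, Jain, and Abbeel (2020), translated to scalars and to the convention that `alphas[t]` is the cumulative product $\prod_{s \leq t} (1 - β_{s})$ with `alphas[0] = 1`. The function names and the four-step schedule are hypothetical, chosen only for illustration.

```python
import math

def cumulative_alphas(betas):
    """alphas[t] = prod_{s=1..t} (1 - beta_s), with alphas[0] = 1 (no noise yet)."""
    alphas = [1.0]
    for beta in betas:
        alphas.append(alphas[-1] * (1.0 - beta))
    return alphas

def posterior_mean_var(x_t, x_0, t, betas, alphas):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for scalar inputs, 1 <= t <= T."""
    beta_t = betas[t - 1]                      # betas is 0-indexed, t is 1-indexed
    a_t, a_prev = alphas[t], alphas[t - 1]
    coef_x0 = math.sqrt(a_prev) * beta_t / (1.0 - a_t)
    coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - a_prev) / (1.0 - a_t)
    beta_tilde = (1.0 - a_prev) / (1.0 - a_t) * beta_t
    return coef_x0 * x_0 + coef_xt * x_t, beta_tilde

def kl_term(x_t, x_0, t, mu_theta, betas, alphas):
    """Integrand of Equation 5; needs t >= 2 because beta_tilde is 0 at t = 1."""
    mu_tilde, beta_tilde = posterior_mean_var(x_t, x_0, t, betas, alphas)
    return (mu_tilde - mu_theta) ** 2 / (2.0 * beta_tilde)

betas = [0.01, 0.02, 0.03, 0.04]               # hypothetical schedule
alphas = cumulative_alphas(betas)
mu_tilde, beta_tilde = posterior_mean_var(x_t=0.5, x_0=1.0, t=3, betas=betas, alphas=alphas)
print("posterior mean and variance:", mu_tilde, beta_tilde)
print("KL term for a guess mu_theta = 0.6:", kl_term(0.5, 1.0, 3, 0.6, betas, alphas))
```

Note that $\tilde{β_{t}}$ vanishes at $t = 1$ , which is one way to see why that step is treated as an edge case.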

Further simplification: Forward re-parameterization

For the second simplification, we look at the forward process in a bit more detail. There is an amazing property of the forward diffusion with gaussian noise - the distribution of the noisy sample $x_{t}$ can be readily calculated given real data $x_{0}$ without touching any other steps.

$q (x_{t} | x_{0}) = N (x_{t}; \sqrt{\underbrace{\prod_{s = 1}^{t} (1 - β_{s})}_{α_{t}}} \cdot x_{0}, (1 - \underbrace{\prod_{s = 1}^{t} (1 - β_{s})}_{α_{t}}) \cdot I)$

This is a consequence of the forward process being completely known and having well-defined probabilistic structure (gaussian noise). By means of (gaussian) reparameterization, we can also derive an easy way of sampling any $x_{t}$ only from standard gaussian noise vector $ϵ \sim N (0, I)$

$\begin{matrix} (6) & x_{t} (x_{0}, ϵ) = \sqrt{α_{t}} \cdot x_{0} + \sqrt{1 - α_{t}} \cdot ϵ \end{matrix}$

As a result, $x_{1 : T}$ need not be sampled with ancestral sampling (refer to Equation 2 & 3), but only requires computing Equation 6 for all $t$ in any order. This further simplifies the expectation in Equation 5 to the following, where the only change is that $x_{t}$ is now written as $x_{t} (x_{0}, ϵ)$ and the expectation is over $ϵ$ instead of $x_{1 : T}$

$\begin{matrix} (7) & L \leq E_{x_{0}, ϵ, t} [\frac{1}{2 \tilde{β_{t}}} \| {\tilde{μ}}_{t} (x_{t} (x_{0}, ϵ), x_{0}) - μ_{θ} (x_{t} (x_{0}, ϵ), t) \|^{2}] \end{matrix}$

This is the final form that can be implemented in practice as suggested by Ho, Jain, and Abbeel (2020).
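The closed form of $q (x_{t} | x_{0})$ can be checked numerically by pushing the mean and variance through Equation 3 one step at a time and comparing with $\sqrt{α_{t}}$ and $1 - α_{t}$ . The following sketch does that and then draws one $x_{t}$ directly via Equation 6. The schedule is a hypothetical linear one and the helper names are mine.

```python
import math
import random

def marginal_by_recursion(betas):
    """Push the moments of q(x_t | x_0) through Equation 3 one step at a time.

    Returns lists (m, v) with x_t | x_0 ~ N(m[t] * x_0, v[t]); index 0 is the data.
    """
    m, v = [1.0], [0.0]
    for beta in betas:
        m.append(math.sqrt(1.0 - beta) * m[-1])   # mean is scaled by sqrt(1 - beta_t)
        v.append((1.0 - beta) * v[-1] + beta)     # old variance shrinks, beta_t is added
    return m, v

def sample_x_t(x_0, t, alphas, rng):
    """Equation 6: draw x_t directly from x_0 with a single noise sample."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alphas[t]) * x_0 + math.sqrt(1.0 - alphas[t]) * eps
    return x_t, eps

betas = [1e-3 + (0.1 - 1e-3) * t / 199 for t in range(200)]  # hypothetical schedule
alphas = [1.0]
for beta in betas:
    alphas.append(alphas[-1] * (1.0 - beta))                  # alpha_t = prod (1 - beta_s)

m, v = marginal_by_recursion(betas)
max_gap = max(
    max(abs(m[t] - math.sqrt(alphas[t])), abs(v[t] - (1.0 - alphas[t])))
    for t in range(len(alphas))
)
print("largest gap between recursion and closed form:", max_gap)

x_t, eps = sample_x_t(x_0=2.0, t=50, alphas=alphas, rng=random.Random(0))
print("x_50 sampled in one shot:", x_t)
```

The gap should be at the level of floating-point round-off, which is the numerical counterpart of the statement that $x_{t}$ can be sampled without touching the intermediate steps.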

Connection to Score-based models (SBM)

Ho, Jain, and Abbeel (2020) uncovered a link between Equation 7 and a particular Score-based model known as Noise Conditioned Score Network (NCSN) (Song and Ermon 2019). With the help of the reparameterized form of $x_{t}$ (Equation 6) and the functional form of ${\tilde{μ}}_{t}$ , one can easily (with a few simplification steps) reduce Equation 7 to

$L \leq E_{x_{0}, ϵ, t} [\frac{1}{2 \tilde{β_{t}}} \| \frac{1}{\sqrt{1 - β_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - α_{t}}} ϵ) - μ_{θ} (x_{t}, t) \|^{2}]$

The above equation is a simple regression with $μ_{θ}$ being the parametric model (neural net in practice); the first term inside the norm, $\frac{1}{\sqrt{1 - β_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - α_{t}}} ϵ)$ , is its regression target. Without loss of generality, we can slightly modify the definition of the parametric model to be $μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{1 - β_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - α_{t}}} ϵ_{θ} (x_{t}, t))$ . The only “moving part” in the new parameterization is $ϵ_{θ} (\cdot)$ ; the rest (i.e. $x_{t}$ and $t$ ) are explicitly available to the model. This leads to the following form

$L \leq E_{x_{0}, ϵ, t} [\frac{β_{t}^{2}}{2 \tilde{β_{t}} (1 - β_{t}) (1 - α_{t})} \| ϵ - ϵ_{θ} (\underbrace{\sqrt{α_{t}} \cdot x_{0} + \sqrt{1 - α_{t}} \cdot ϵ}_{x_{t}}, t) \|^{2}]$ $\approx E_{t} [E_{x_{0}, ϵ} [\| ϵ - ϵ_{θ} (x_{t}, t) \|^{2}]] \approx \frac{1}{T} \sum_{t = 1}^{T} p (t) E_{x_{0}, ϵ} [\| ϵ - ϵ_{θ} (x_{t}, t) \|^{2}]$

The weighting factor in front of the squared norm (the fraction involving $β_{t}$ , $\tilde{β_{t}}$ and $α_{t}$ ) can be discarded without hurting performance (suggested by Ho, Jain, and Abbeel (2020)). I have further approximated the expectation over time-steps with a sample average. If you look at the final form, you may notice a surprising resemblance with Noise Conditioned Score Network (NCSN) (Song and Ermon 2019). Please refer to $J_{ncsn}$ in my blog on score models. Below I pin-point the specifics:

The time-steps $t = 1, 2, \dots T$ resemble the increasing “noise-scales” in NCSN. Recall that the noise increases as the forward diffusion approaches the end.
The expectation $E_{x_{0}, ϵ}$ (for each scale) holistically matches that of the denoising score matching objective, i.e. $E_{x, \tilde{x}}$ . In case of Diffusion Models, the noisy sample can be readily computed using the noise vector $ϵ$ (refer to Equation 6).
Just like NCSN, the regression target is the noise vector $ϵ$ for each time-step (or scale).
Just like NCSN, the learnable model depends on the noisy sample and the time-step (or scale).
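The simplified objective and the corresponding sampler can be sketched end to end on a toy problem where the ideal noise predictor is known in closed form. The assumptions here are mine: the data distribution is a single point mass at `C`, so $ϵ_{θ}$ can be obtained by inverting Equation 6, the schedule is a hypothetical linear one, and the reverse variance is fixed to $\tilde{β_{t}}$ as in the design choice above. A real model would replace `ideal_eps_model` with a trained network.

```python
import math
import random

C = 1.5  # toy data distribution: every x_0 equals C (an assumption for illustration)

def make_alphas(betas):
    alphas = [1.0]
    for beta in betas:
        alphas.append(alphas[-1] * (1.0 - beta))   # alpha_t = prod (1 - beta_s)
    return alphas

def ideal_eps_model(x_t, t, alphas):
    """Noise predictor that is exact when all data sits at C (invert Equation 6)."""
    return (x_t - math.sqrt(alphas[t]) * C) / math.sqrt(1.0 - alphas[t])

def simple_loss(eps_model, betas, alphas, rng, n=2000):
    """Monte Carlo estimate of E_{t, x_0, eps} ||eps - eps_theta(x_t, t)||^2."""
    total = 0.0
    for _ in range(n):
        t = rng.randint(1, len(betas))             # t ~ U[1, T], both ends included
        eps = rng.gauss(0.0, 1.0)
        x_t = math.sqrt(alphas[t]) * C + math.sqrt(1.0 - alphas[t]) * eps
        total += (eps - eps_model(x_t, t, alphas)) ** 2
    return total / n

def sample(eps_model, betas, alphas, rng):
    """Ancestral sampling with p_theta(x_{t-1} | x_t) = N(mu_theta, beta_tilde_t)."""
    x = rng.gauss(0.0, 1.0)                        # x_T ~ N(0, 1)
    for t in range(len(betas), 0, -1):
        beta_t = betas[t - 1]
        eps_hat = eps_model(x, t, alphas)
        mu = (x - beta_t / math.sqrt(1.0 - alphas[t]) * eps_hat) / math.sqrt(1.0 - beta_t)
        beta_tilde = (1.0 - alphas[t - 1]) / (1.0 - alphas[t]) * beta_t
        x = mu + math.sqrt(beta_tilde) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
betas = [1e-3 + (0.1 - 1e-3) * t / 199 for t in range(200)]  # hypothetical schedule
alphas = make_alphas(betas)

loss_ideal = simple_loss(ideal_eps_model, betas, alphas, rng)
loss_zero = simple_loss(lambda x_t, t, a: 0.0, betas, alphas, rng)
x0_generated = sample(ideal_eps_model, betas, alphas, rng)
print("loss (ideal predictor):", loss_ideal)
print("loss (always predicts 0):", loss_zero)
print("generated x_0:", x0_generated)
```

The ideal predictor drives the loss to (numerically) zero and the sampler lands on `C`, while a predictor that always outputs zero pays roughly $E [ϵ^{2}] = 1$ .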

Infinitely many noise scales & continuous-time analogue

Inspired by the connection between Diffusion Model and Score-based models, Song, Kingma, et al. (2021) proposed to use infinitely many noise scales (equivalently time-steps). At first, it might look like a trivial increase in number of steps/scales, but there happened to be a principled way to achieve it, namely Stochastic Differential Equations or SDEs. Song, Kingma, et al. (2021) reworked the whole formulation considering continuous SDE as forward diffusion. Interestingly, it turned out that the reverse process is also an SDE that depends on the score function.

Quite simply, finite time-steps/scales (i.e. $t = 0, 1, \dots T$ ) are replaced by infinitely many segments (of length $Δ t \to 0$ ) within time-range $[0, T]$ . Instead of $x_{t}$ at every discrete time-step/scale, we define a continuous random process $x (t)$ indexed by continuous time $t$ . We also replace the discrete-time conditionals in Equation 3 with continuous analogues. But this time, we define the “increments” in each step rather than absolute values, i.e. the transition kernel specifies what to add to the previous value. Specifically, we define a general form of continuous forward diffusion with

$\begin{matrix} (8) & q (x (t + Δ t) - x (t) | x (t)) = N (f (x (t), t) Δ t, g^{2} (t) Δ t \cdot I) \end{matrix}$

$\text{or, } x (t + Δ t) - x (t) = f (x (t), t) Δ t + g (t) \cdot \underbrace{\sqrt{Δ t} \cdot ϵ}_{Δ ω}, \text{ with } ϵ \sim N (0, I)$

If you have ever studied SDEs, you might recognize that Equation 8 resembles the Euler–Maruyama numerical solver for SDEs. Considering $f (x (t), t)$ to be the “Drift function”, $g (t)$ to be the “Diffusion function” and $Δ ω \sim N (0, Δ t)$ to be the discrete differential of the Wiener Process $ω (t)$ , in the limit of $Δ t \to 0$ , the following SDE can be recovered

$d x (t) = f (x (t), t) \cdot d t + g (t) \cdot d ω (t), \text{ with } d ω (t) \sim N (0, d t)$

A visualization of the continuous forward diffusion in 1D is given in Figure 3 for a set of samples (different colors).

Song, Kingma, et al. (2021) (section 3.4) proposed a few different choices of $\{f, g\}$ named Variance Exploding (VE), Variance Preserving (VP) and sub-VP. The one that resembles Equation 3 (discrete forward diffusion) in continuous time and ensures proper diffusion, i.e. $x (0) \sim p_{data} ⟹ x (T) \sim N (0, I)$ is $f (x (t), t) = - \frac{1}{2} β (t) x (t), g (t) = \sqrt{β (t)}$ . This particular SDE is termed the “Variance Preserving (VP) SDE” since the variance of $x (t)$ is finite as long as the variance of $x (0)$ is finite (Appendix B of Song, Kingma, et al. (2021)). We can enforce the covariance of $x (0)$ to be $I$ simply by standardizing our dataset.
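A minimal Euler–Maruyama simulation of the VP SDE in 1D is sketched below. The linear $β (t)$ on $[0, 1]$ and its end points are assumptions made for illustration, as are the step and sample counts. Each step adds the drift times $Δ t$ and a Wiener increment with variance $Δ t$ .

```python
import math
import random

def beta_fn(t):
    """Hypothetical linear beta(t) on [0, 1]; the end points are illustrative."""
    return 0.1 + (20.0 - 0.1) * t

def euler_maruyama_vp(x0, n_steps, rng, t_end=1.0):
    """Simulate dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw from t = 0 to t_end."""
    dt = t_end / n_steps
    x = x0
    for i in range(n_steps):
        t = i * dt
        drift = -0.5 * beta_fn(t) * x
        diffusion = math.sqrt(beta_fn(t))
        # Wiener increment: dw ~ N(0, dt), i.e. sqrt(dt) times a standard normal
        x += drift * dt + diffusion * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
x_end = [euler_maruyama_vp(x0=3.0, n_steps=500, rng=rng) for _ in range(2000)]
mean = sum(x_end) / len(x_end)
var = sum((x - mean) ** 2 for x in x_end) / len(x_end)
print(f"x(1): mean {mean:.3f}, variance {var:.3f}")
```

Even though every path starts at `x0 = 3.0`, the samples at the final time should have mean close to 0 and variance close to 1.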

An old (but remarkable) result due to Anderson (1982) shows that the above forward diffusion can be reversed even in closed form, thanks to the following SDE

$\begin{matrix} (9) & d x (t) = [f (x (t), t) - g^{2} (t) \underbrace{\nabla_{x} \log p (x (t))}_{\text{score } s (x (t), t)}] d t + g (t) d ω (t) \end{matrix}$
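As a rough illustration of Equation 9, the sketch below integrates it backwards in time with Euler–Maruyama steps for the VP choice $f = - \frac{1}{2} β (t) x$ , $g = \sqrt{β (t)}$ . To keep it self-contained I assume a toy 1D data distribution $N (M, S^{2})$ , for which the marginal of $x (t)$ stays Gaussian and the score is available analytically; in practice that analytic score would be replaced by a learned model. The linear $β (t)$ is again a hypothetical choice.

```python
import math
import random

M, S = 2.0, 0.5   # toy data distribution N(M, S^2); chosen for illustration

def beta_fn(t):
    return 0.1 + 19.9 * t          # hypothetical linear beta(t) on [0, 1]

def int_beta(t):
    return 0.1 * t + 9.95 * t * t  # closed-form integral of beta_fn from 0 to t

def true_score(x, t):
    """Score of p_t when p_0 = N(M, S^2): the marginal stays Gaussian under the VP SDE."""
    decay = math.exp(-int_beta(t))
    mean_t = M * math.sqrt(decay)
    var_t = S * S * decay + (1.0 - decay)
    return -(x - mean_t) / var_t

def reverse_sde_sample(score_fn, n_steps, rng, t_end=1.0):
    """Integrate Equation 9 backwards in time from x(t_end) ~ N(0, 1) to t = 0."""
    dt = t_end / n_steps
    x = rng.gauss(0.0, 1.0)
    for i in range(n_steps, 0, -1):
        t = i * dt
        f = -0.5 * beta_fn(t) * x
        g2 = beta_fn(t)
        drift = f - g2 * score_fn(x, t)
        # stepping from t to t - dt flips the sign of the drift term
        x = x - drift * dt + math.sqrt(g2 * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [reverse_sde_sample(true_score, 400, rng) for _ in range(1000)]
mean0 = sum(samples) / len(samples)
std0 = math.sqrt(sum((x - mean0) ** 2 for x in samples) / len(samples))
print(f"generated x(0): mean {mean0:.3f}, std {std0:.3f}")
```

The generated samples should have mean and standard deviation close to `M` and `S`, up to discretization and sampling error.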

Hence, the reverse diffusion is simply solving the above SDE in reverse time with initial state $x (T) \sim N (0, I)$ , leading to $x (0) \sim p_{data}$ . The only missing part is the score, i.e. $s (x (t), t) ≜ \nabla_{x} \log p (x (t))$ . Thankfully, we have already seen how score estimation works and that is pretty much what we do here. There are two ways, as explained in my blog on score models. I briefly go over them below in the context of continuous SDEs:

1. Implicit Score Matching (ISM)

The easier way is to use the Hutchinson trace-estimator based score matching proposed by Song et al. (2020) called “Sliced Score Matching”.

$J_{I} (θ) = E_{v \sim N (0, I)} E_{x (0) \sim p_{data}} E_{x (0 < t \leq T) \sim q (\cdot | x (0))} [\frac{1}{2} \| s_{θ} (x (t), t) \|^{2} + v^{T} \nabla_{x} s_{θ} (x (t), t) v]$

Very similar to NCSN, we define a parametric score network $s_{θ} (x (t), t)$ dependent on continuous time/scale $t$ . Starting from data samples $x (0) \sim p_{data}$ , we can generate the rest of the forward chain $x (0 < t \leq T)$ simply by executing a solver (refer to Equation 8) on the SDE at any required precision (discretization).

2. Denoising Score Matching (DSM)

There is the other “Denoising score matching (DSM)” way of training $s_{θ}$ , which is slightly more complicated. At its core, the DSM objective for continuous diffusion is a continuous analogue of the discrete DSM objective.

$J_{D} (θ) = \frac{1}{T} \sum_{t = 1}^{T} p (t) E_{x (0), x (t)} [\| s_{θ} (x (t), t) - \nabla_{x (t)} \log p (x (t) | x (0)) \|^{2}]$

Remember that in case of continuous diffusion, we never explicitly wrote down the conditionals $p (x (t) | x (0))$ ; the diffusion was defined rather implicitly through its SDE. Hence, the regression target $\nabla_{x (t)} \log p (x (t) | x (0))$ , unlike its discrete counterpart, isn’t very easy to compute in general. However, due to Särkkä and Solin (2019) there is an easy closed form for it when the drift function $f$ is affine in nature. Thankfully, our specific choice of $f (x (t), t)$ is indeed affine.

$p (x (t) | x (0)) = N (x (t); x (0) e^{- \frac{1}{2} \int_{0}^{t} β (s) d s}, I - I e^{- \int_{0}^{t} β (s) d s})$

Since the conditionals are gaussian (again), it’s pretty easy to derive the expression for $\nabla_{x (t)} \log p (\cdot)$ . I leave it for the readers to try.
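For readers who want to check their derivation, here is a scalar sketch. It assumes the Gaussian kernel of the VP SDE with mean $x (0) e^{- \frac{1}{2} \int_{0}^{t} β (s) d s}$ and variance $1 - e^{- \int_{0}^{t} β (s) d s}$ (as in Song, Kingma, et al. (2021)) and a hypothetical linear $β (s)$ . The analytic gradient is compared against a finite difference of the log-density, and `dsm_term` shows one sample of the DSM integrand.

```python
import math

def int_beta(t):
    """Integral of a hypothetical linear beta(s) = 0.1 + 19.9 * s from 0 to t."""
    return 0.1 * t + 9.95 * t * t

def kernel_moments(x0, t):
    """Mean and variance of the Gaussian kernel p(x(t) | x(0)) for the VP SDE."""
    decay = math.exp(-int_beta(t))
    return x0 * math.sqrt(decay), 1.0 - decay

def kernel_log_density(xt, x0, t):
    mean, var = kernel_moments(x0, t)
    return -0.5 * math.log(2.0 * math.pi * var) - (xt - mean) ** 2 / (2.0 * var)

def kernel_score(xt, x0, t):
    """d/dx(t) of log p(x(t) | x(0)): the DSM regression target."""
    mean, var = kernel_moments(x0, t)
    return -(xt - mean) / var

def dsm_term(score_model, x0, t, eps):
    """One sample of || s_theta(x(t), t) - grad log p(x(t) | x(0)) ||^2 (scalar case)."""
    mean, var = kernel_moments(x0, t)
    xt = mean + math.sqrt(var) * eps          # reparameterized sample of x(t)
    return (score_model(xt, t) - kernel_score(xt, x0, t)) ** 2

xt, x0, t, h = 0.3, 1.2, 0.4, 1e-5
numeric = (kernel_log_density(xt + h, x0, t) - kernel_log_density(xt - h, x0, t)) / (2 * h)
analytic = kernel_score(xt, x0, t)
print("analytic score:", analytic, "finite difference:", numeric)
print("DSM term for a perfect model:",
      dsm_term(lambda x, t_: kernel_score(x, x0, t_), x0, t, eps=0.7))
```

Writing $x (t)$ as mean plus standard deviation times $ϵ$ shows that the target equals $- ϵ$ divided by the standard deviation, which mirrors the noise-prediction target of the discrete case.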

Computing log-likelihoods

One of the core reasons score models exist is that they bypass the need to train with explicit log-likelihoods, which are difficult to compute for a large range of powerful models. It turns out that in case of continuous diffusion models, there is an indirect way to evaluate the log-likelihood. Let’s focus on the “generative process” of continuous diffusion models, i.e. the reverse diffusion in Equation 9. What we would like to compute is $p (x (0))$ when $x (0)$ is generated by solving the SDE in Equation 9 backwards with $x (T) \sim N (0, I)$ . Even though it is hard to compute marginal likelihoods $p (x (t))$ for any $t$ , it turns out there exists a deterministic ODE (Ordinary Differential Equation) corresponding to the SDE in Equation 9 whose marginal likelihoods match those of the SDE for every $t$

$\frac{d x (t)}{d t} = [f (x (t), t) - \frac{1}{2} g^{2} (t) \underbrace{\nabla_{x} \log p (x (t))}_{\approx s_{θ} (x (t), t)}] ≜ F (x (t), t)$

Note that the above ODE (called the “probability flow ODE” by Song, Kingma, et al. (2021)) is essentially the reverse SDE without the source of randomness, with the score term halved to compensate for the missing noise. After learning the score (as usual), we simply drop-in replace the SDE with the above ODE. Thanks to Chen et al. (2018), this problem has already been solved. It is known as a Continuous Normalizing Flow (CNF), whereby given $\log p (x (T))$ , we can calculate $\log p (x (0))$ by solving the following ODE with any numerical solver for $t = T \to 0$

$\frac{\partial}{\partial t} \log p (x (t)) = - tr (\frac{d}{d x (t)} F (x (t), t))$

Please remember that this way of computing log-likelihood is merely a utility and cannot be used to train the model. A more recent paper (Song, Durkan, et al. 2021), however, shows a way to train SDE based continuous diffusion models by directly optimizing (a bound on) log-likelihood under some conditions, which may be the topic for another article. I encourage readers to explore it themselves.
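To see the likelihood computation work on a case with a known answer, the sketch below uses the probability flow ODE of Song, Kingma, et al. (2021), with drift $F = f - \frac{1}{2} g^{2} \nabla_{x} \log p$ , on a toy 1D data distribution $N (M, S^{2})$ where the score and the trace term are analytic. These choices, the linear $β (t)$ and the plain Euler integrator are assumptions for illustration. I integrate forward from $t = 0$ to $T$ and add the accumulated trace to the prior log-density, which is equivalent to solving the ODE above from $T$ to $0$ .

```python
import math

M, S = 2.0, 0.5                      # toy data distribution N(M, S^2), for illustration

def beta_fn(t):
    return 0.1 + 19.9 * t            # hypothetical linear beta(t) on [0, 1]

def marginal(t):
    """Mean and variance of p_t under the VP SDE when p_0 = N(M, S^2)."""
    decay = math.exp(-(0.1 * t + 9.95 * t * t))
    return M * math.sqrt(decay), S * S * decay + (1.0 - decay)

def ode_drift_and_divergence(x, t):
    """F(x, t) = f - 0.5 * g^2 * score, and dF/dx, using the analytic score."""
    mean_t, var_t = marginal(t)
    score = -(x - mean_t) / var_t
    beta = beta_fn(t)
    drift = -0.5 * beta * x - 0.5 * beta * score
    divergence = -0.5 * beta + 0.5 * beta / var_t
    return drift, divergence

def log_likelihood(x0, n_steps=2000, t_end=1.0):
    """Euler-integrate the ODE from t = 0 to t_end, accumulating tr(dF/dx)."""
    dt = t_end / n_steps
    x, trace_integral = x0, 0.0
    for i in range(n_steps):
        drift, divergence = ode_drift_and_divergence(x, i * dt)
        x += drift * dt
        trace_integral += divergence * dt
    log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x   # log N(x(T); 0, 1)
    return log_prior + trace_integral

x0 = 2.5
estimate = log_likelihood(x0)
exact = -0.5 * math.log(2.0 * math.pi * S * S) - (x0 - M) ** 2 / (2.0 * S * S)
print(f"ODE estimate: {estimate:.4f}, exact: {exact:.4f}")
```

The two numbers should agree up to a small error, which comes from the Euler discretization and from approximating $p (x (T))$ by $N (0, 1)$ .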

References

Anderson, Brian DO. 1982. “Reverse-Time Diffusion Equation Models.” Stochastic Processes and Their Applications 12 (3): 313–26.

Chen, Tian Qi, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. 2018. “Neural Ordinary Differential Equations.” In NeurIPS.

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” In NeurIPS.

Särkkä, Simo, and Arno Solin. 2019. Applied Stochastic Differential Equations. Vol. 10. Cambridge University Press.

Sohl-Dickstein, Jascha, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.” In ICML.

Song, Yang, Conor Durkan, Iain Murray, and Stefano Ermon. 2021. “Maximum Likelihood Training of Score-Based Diffusion Models.” In Advances in Neural Information Processing Systems, edited by A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan.

Song, Yang, and Stefano Ermon. 2019. “Generative Modeling by Estimating Gradients of the Data Distribution.” In Advances in Neural Information Processing Systems, 11895–907.

Song, Yang, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. 2020. “Sliced Score Matching: A Scalable Approach to Density and Score Estimation.” In Uncertainty in Artificial Intelligence, 574–84. PMLR.

Song, Yang, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. “Score-Based Generative Modeling Through Stochastic Differential Equations.” In ICLR.

Citation

BibTeX citation:

@online{das2021,
  author = {Das, Ayan},
  title = {An Introduction to {Diffusion} {Probabilistic} {Models}},
  date = {2021-12-04},
  url = {https://ayandas.me/blogs/2021-12-04-diffusion-prob-models.html},
  langid = {en}
}

For attribution, please cite this work as:

Das, Ayan. 2021. “An Introduction to Diffusion Probabilistic Models.” December 4, 2021. https://ayandas.me/blogs/2021-12-04-diffusion-prob-models.html.