Ayan Das: <b>Ph.D. Student</b> @ <a href="https://www.surrey.ac.uk/">University of Surrey</a>; Senior DL Researcher @ <a href="https://www.mtkresearch.com/en">MediaTek Research</a> (a.das@surrey.ac.uk)

Building Diffusion Model's theory from ground up (2024-02-15, https://ayandas.me/blog-tut/2024/02/15/diffusion-theory-from-scratch)
Blog (self hosted)

Diffusion Models, a new generative model family, have taken the world by storm after the seminal paper by Ho et al. [2020]. While diffusion models are often described as probabilistic Markov Chains, their fundamental principle lies in the decades-old theory of Stochastic Differential Equations (SDEs), as shown later by Song et al. [2021]. In this article, we will go back and revisit the 'fundamental ingredients' behind the SDE formulation, and show how the idea can be 'shaped' to get to the modern form of Score-based Diffusion Models. We'll start from the very definition of 'score', see how it was used in the context of generative modelling, how we achieve the necessary theoretical guarantees, how the design choices were made, and finally arrive at the more 'principled' framework of Score-based Diffusion. Throughout the article, we provide several intuitive illustrations for ease of understanding.

@inproceedings{das2024buildingdiffusionmodels,
  author    = {Das, Ayan},
  title     = {Building Diffusion Model's theory from ground up},
  booktitle = {ICLR Blogposts 2024},
  year      = {2024},
  date      = {May 7, 2024},
  url       = {https://ayandas.me/2024/blog/diffusion-theory-from-scratch/}
}

Ayan Das, <a href="https://www.mtkresearch.com/en/" target="_blank">MediaTek Research UK</a>

Score Normalization for a Faster Diffusion Exponential Integrator Sampler (2023-11-01, https://ayandas.me/pub-13)
Paper (with Suppl.)

Recently, Zhang et al. have proposed the Diffusion Exponential Integrator Sampler (DEIS) for fast generation of samples from Diffusion Models. It leverages the semi-linear nature of the probability flow ordinary differential equation (ODE) in order to greatly reduce integration error and improve generation quality at low numbers of function evaluations (NFEs). Key to this approach is the score function reparameterisation, which reduces the integration error incurred from using a fixed score function estimate over each integration step. The original authors use the default parameterisation used by models trained for noise prediction -- multiply the score by the standard deviation of the conditional forward noising distribution. We find that although the mean absolute value of this score parameterisation is close to constant for a large portion of the reverse sampling process, it changes rapidly at the end of sampling. As a simple fix, we propose to instead reparameterise the score (at inference) by dividing it by the average absolute value of previous score estimates at that time step collected from offline high NFE generations. We find that our score normalisation (DEIS-SN) consistently improves FID compared to vanilla DEIS, showing an FID improvement from 6.44 to 5.57 at 10 NFEs for our CIFAR-10 experiments.

@inproceedings{xia2023score,
  title     = {Score Normalization for a Faster Diffusion Exponential Integrator Sampler},
  author    = {Guoxuan Xia and Duolikun Danier and Ayan Das and Stathi Fotiadis and
               Farhang Nabiei and Ushnish Sengupta and Alberto Bernacchia},
  booktitle = {NeurIPS 2023 Workshop on Diffusion Models},
  year      = {2023},
  url       = {https://openreview.net/forum?id=AQvPfN33g9}
}

Guoxuan Xia

Image generation with Shortest Path Diffusion (2023-04-25, https://ayandas.me/pub-12)
Paper (with Suppl.)

The field of image generation has made significant progress thanks to the introduction of Diffusion Models, which learn to progressively reverse a given image corruption. Recently, a few studies introduced alternative ways of corrupting images in Diffusion Models, with an emphasis on blurring. However, these studies are purely empirical and it remains unclear what the optimal procedure for corrupting an image is. In this work, we hypothesize that the optimal procedure minimizes the length of the path taken when corrupting an image towards a given final state. We propose the Fisher metric for the path length, measured in the space of probability distributions. We compute the shortest path according to this metric, and we show that it corresponds to a combination of image sharpening, rather than blurring, and noise deblurring. While the corruption was chosen arbitrarily in previous work, our Shortest Path Diffusion (SPD) uniquely determines the entire spatiotemporal structure of the corruption. We show that SPD improves on strong baselines without any hyperparameter tuning, and outperforms all previous Diffusion Models based on image blurring. Furthermore, any small deviation from the shortest path leads to worse performance, suggesting that SPD provides the optimal procedure to corrupt images. Our work sheds new light on observations made in recent works, and provides a new approach to improve diffusion models on images and other types of data.

Slides for our
ICML 2023 talk

PS: Reusing any of these slides would require permission from the author.

@inproceedings{das2023spdiffusion,
  title     = {Image generation with shortest path diffusion},
  author    = {Ayan Das and Stathi Fotiadis and Anil Batra and Farhang Nabiei and
               FengTing Liao and Sattar Vakili and Da-shan Shiu and Alberto Bernacchia},
  booktitle = {International Conference on Machine Learning},
  year      = {2023}
}

Ayan Das

ChiroDiff: Modelling chirographic data with Diffusion Models (2023-01-21, https://ayandas.me/pub-11)
Paper (with Suppl.)

Generative modelling over continuous-time geometric constructs, a.k.a. chirographic data such as handwriting, sketches, drawings etc., has so far been accomplished through autoregressive distributions. Such strictly-ordered discrete factorization, however, falls short of capturing key properties of chirographic data -- it fails to build a holistic understanding of the temporal concept due to one-way visibility (causality). Consequently, temporal data has been modelled as discrete token sequences at a fixed sampling rate instead of capturing the true underlying concept. In this paper, we introduce a powerful model-class, namely Denoising Diffusion Probabilistic Models or DDPMs, for chirographic data that specifically addresses these flaws. Our model, named ChiroDiff, being non-autoregressive, learns to capture holistic concepts and therefore remains resilient to higher temporal sampling rates to a large extent. Moreover, we show that many important downstream utilities (e.g. conditional sampling, creative mixing) can be flexibly implemented using ChiroDiff. We further show that some unique use-cases like stochastic vectorization, de-noising/healing and abstraction are also possible with this model-class. We perform quantitative and qualitative evaluation of our framework on relevant datasets and find it to be better than or on par with competing approaches.

Slides for my
ICLR 2023 talk

PS: Reusing any of these slides would require permission from the author.

@inproceedings{das2023chirodiff,
  title     = {ChiroDiff: Modelling chirographic data with Diffusion Models},
  author    = {Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://openreview.net/forum?id=1ROAstc9jv}
}

Ayan Das

SketchODE: Learning neural sketch representation in continuous time (2022-01-21, https://ayandas.me/pub-10)
Paper (with Suppl.)

Learning meaningful representations for chirographic drawing data such as sketches, handwriting, and flowcharts is a gateway for understanding and emulating human creative expression. Despite being inherently continuous-time data, existing works have treated these as discrete-time sequences, disregarding their true nature. In this work, we model such data as continuous-time functions and learn compact representations by virtue of Neural Ordinary Differential Equations. To this end, we introduce the first continuous-time Seq2Seq model and demonstrate some remarkable properties that set it apart from traditional discrete-time analogues. We also provide solutions for some practical challenges for such models, including introducing a family of parameterized ODE dynamics & continuous-time data augmentation particularly suitable for the task. Our models are validated on several datasets including VectorMNIST, DiDi and Quick, Draw!.

Slides for my
ICLR 2022 talk

PS: Reusing any of these slides would require permission from the author.

@inproceedings{das2022sketchode,
  title     = {Sketch{ODE}: Learning neural sketch representation in continuous time},
  author    = {Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
  booktitle = {International Conference on Learning Representations},
  year      = {2022},
  url       = {https://openreview.net/forum?id=c-4HSDAWua5}
}

Ayan Das

An introduction to Diffusion Probabilistic Models (2021-12-04, https://ayandas.me/blog-tut/2021/12/04/diffusion-prob-models)

Generative modelling is one of the seminal tasks for understanding the distribution of natural data. The VAE, GAN and Flow families of models have dominated the field for the last few years due to their practical performance. Despite commercial success, their theoretical and design shortcomings (intractable likelihood computation, restrictive architectures, unstable training dynamics etc.) have led to the development of a new class of generative models named "Diffusion Probabilistic Models" or DPMs. Diffusion Models, first proposed by Sohl-Dickstein et al., 2015, draw inspiration from the thermodynamic diffusion process and learn a noise-to-data mapping in discrete steps, very similar to Flow models. Lately, DPMs have been shown to have some intriguing connections to Score Based Models (SBMs) and Stochastic Differential Equations (SDEs). These connections have further been leveraged to create their continuous-time analogues. In this article, I will describe the general framework of DPMs, their recent advancements, and explore the connections to other frameworks. For the sake of the readers, I will avoid gory details and rigorous mathematical derivations, and use subtle simplifications in order to maintain focus on the core ideas.

In case you haven’t checked the first part of this two-part blog, please read Score Based Models (SBMs)

What exactly do we mean by “Diffusion” ?

In thermodynamics, "Diffusion" refers to the flow of particles from high-density regions towards low-density regions. In the context of statistics, the meaning of Diffusion is quite similar: it is the process of transforming a complex distribution \(p_{\mathrm{complex}}\) on \(\mathbb{R}^d\) into a simple (predefined) distribution \(p_{\mathrm{prior}}\) on the same domain. Succinctly, a transformation \(\mathcal{T}: \mathbb{R}^d \rightarrow \mathbb{R}^d\) such that

where the symbol \(\implies\) means "implies". There is a formal way to come up with a specific \(\{ \mathcal{T}, p_{\mathrm{prior}} \}\) pair that satisfies Eq. 1 for any distribution \(p_{\mathrm{complex}}\). In simple terms, we can take any distribution and transform it into a known (simple) density by means of a known transformation \(\mathcal{T}\). By "formal way", I was referring to a Markov Chain and its stationary distribution: repeated application of a transition kernel \(q(\mathbf{x} \vert \mathbf{x}')\) to the samples of any distribution leads to samples from \(p_{\mathrm{prior}}(\mathbf{x})\) if the following holds

We can relate our original diffusion process in Eq. 1 to a Markov chain by defining \(\mathcal{T}\) to be the repeated application of the transition kernel \(q(\mathbf{x} \vert \mathbf{x}')\) over discrete time \(t\)

From the properties of stationary distribution, we have \(\mathbf{x}_{\infty} \sim p_{\mathrm{prior}}\). In practice, we can keep the iterations to a sufficiently large finite number \(t = T\).

So far, we have confirmed that there is indeed an iterative way (refer to Eq. 2) to convert samples from a complex distribution into samples from a known (simple) prior. Even though we talked only in terms of generic densities, there is one very attractive choice of the \(\{ q, p_{\mathrm{prior}} \}\) pair (shown in Sohl-Dickstein et al., 2015) due to its simplicity and tractability

For obvious reasons, it is known as Gaussian Diffusion. I purposefully changed the notations of the random variables to make it more explicit. \(\beta_t \in \mathbb{R}\) is a predefined schedule proposed by Sohl-Dickstein et al., 2015. A pictorial depiction of the diffusion process is shown in the diagram below.
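This Gaussian diffusion chain is straightforward to simulate; below is a minimal numpy sketch where the linear \(\beta_t\) schedule, the starting distribution and all sample sizes are illustrative choices (not taken from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Repeatedly apply q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
# to samples from some "complex" starting distribution (here: uniform).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # illustrative linear schedule

x = rng.uniform(-2, 2, size=50000)       # samples from p_complex
for beta_t in betas:
    eps = rng.normal(size=x.shape)
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * eps

print(x.mean(), x.std())                 # ≈ 0 and ≈ 1: the chain lands on N(0, 1)
```

Whatever distribution `x` starts from, the chain forgets it and ends up at the standard normal prior, which is exactly the stationarity property described above.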

Generative modelling by undoing the diffusion

We proved the existence of a stochastic transform \(\mathcal{T}\) that guarantees the diffusion process in Eq. 1. Please realize that the diffusion process does not depend on the initial density \(p_{\mathrm{complex}}\) (as \(t \rightarrow \infty\)) and the only requirement is to be able to sample from it. This is the core idea behind Diffusion Models: we use any data distribution (let's say \(p_{\mathrm{data}}\)) of our choice as the complex initial density. This leads to the forward diffusion process

This process is responsible for "destructuring" the data and turning it into an isotropic Gaussian (almost structureless). Please refer to the figure below (red part) for a visual demonstration.

However, this isn't very useful by itself. What would be useful is doing the opposite, i.e. starting from isotropic Gaussian noise and turning it into \(p_{\mathrm{data}}\) -- that is generative modelling (blue part of the figure above). Since the forward process is fixed (non-parametric) and guaranteed to exist, it is very much possible to invert it. Once inverted, we can use it as a generative model as follows

Fortunately, the theory of Markov chains guarantees that for Gaussian diffusion, there indeed exists a reverse diffusion process \(\mathcal{T}^{-1}\). The original paper from Sohl-Dickstein et al., 2015 shows how a parametric model of diffusion \(\mathcal{T}^{-1}_{\theta}\) can be learned from data itself.

Graphical model and training

The stochastic “forward diffusion” and “reverse diffusion” processes described so far can be well expressed in terms of Probabilistic Graphical Models (PGMs). A series of \(T\) random variables define each of them; with the forward process being fully described by Eq. 3. The reverse process is expressed by a parametric graphical model very similar to that of the forward process, but in reverse

Each of the reverse conditionals \(p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\) is structurally Gaussian and responsible for learning to revert the corresponding step of the forward process, i.e. \(q(\mathbf{x}_t \vert \mathbf{x}_{t-1})\). The means and covariances of these reverse conditionals are neural networks with parameters \(\theta\) shared over timesteps. Just like in any other probabilistic model, we wish to minimize the negative log-likelihood of the model distribution under the expectation of the data distribution

which isn't quite computable in practice due to its dependence on \((T-1)\) more random variables. With a fair bit of mathematical manipulation, Sohl-Dickstein et al., 2015 (section 2.3) showed \(\mathcal{L}\) to be a lower-bound of another quantity

which is easy to compute and optimize. The expectation is over the joint distribution of the entire forward process. Getting a sample \(\mathbf{x}_{1:T} \sim q(\cdot \vert \mathbf{x}_0)\) boils down to executing the forward diffusion on one sample \(\mathbf{x}_0 \sim p_{\mathrm{data}}\). All quantities inside the expectation are tractable and available to us in closed form.

Further simplification: Variance Reduction

Even though we can train the model with the lower-bound shown above, a few more simplifications are possible. The first one is due to Sohl-Dickstein et al., 2015, in an attempt to reduce variance. They showed that the lower-bound can be further simplified and re-written as the following

There is a subtle approximation involved (the edge case of \(t=1\) in the summation) in the above expression, which is due to Ho et al., 2020 (sections 3.3 and 3.4). The noticeable change in this version is the fact that all conditionals \(q(\cdot \vert \cdot)\) of the forward process are now additionally conditioned on \(\mathbf{x}_0\). Earlier, the corresponding quantities had high uncertainty/variance due to the different possible choices of the starting point \(\mathbf{x}_0\), which are now suppressed by the additional knowledge of \(\mathbf{x}_0\). Moreover, it turns out that \(q(\mathbf{x}_{t-1}\vert \mathbf{x}_t, \mathbf{x}_0)\) has a closed form

The exact form (refer to Eq. 7 of Ho et al., 2020) is not important for a holistic understanding of the algorithm. The only thing to note is that \(\mathbf{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)\) additionally contains \(\beta_t\) (fixed numbers) and \(\mathbf{\tilde{\beta_t}}\) is a function of \(\beta_t\) only. Moving on, we do the following on the last expression for \(\mathcal{L}\)

Use the closed form of \(p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\) in Eq. 4 with \(\mathbf{\Sigma}_{\theta}(\mathbf{x}_t, t) = \mathbf{\tilde{\beta_t}}\mathrm{I}\) (design choice for making things simple)

Expand the KL divergence formula

Convert \(\sum_{t=1}^T\) into expectation (over \(t \sim \mathcal{U}[1, T]\)) by scaling with a constant \(1/T\)
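With \(\mathbf{\Sigma}_{\theta}\) pinned to \(\mathbf{\tilde{\beta_t}}\mathrm{I}\), both Gaussians inside each KL term share the same variance, so the KL collapses to a scaled squared distance between their means. A quick 1-D numerical check of that collapse (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

var = 0.3                     # shared variance (the role of beta_tilde_t)
mu_q, mu_p = 1.2, 0.7         # the two means being matched

# closed form: KL(N(mu_q, var) || N(mu_p, var)) = (mu_q - mu_p)^2 / (2 var)
kl_closed = (mu_q - mu_p) ** 2 / (2 * var)

# Monte-Carlo check of the same quantity: E_q[log q(x) - log p(x)]
x = rng.normal(mu_q, np.sqrt(var), size=1_000_000)
log_q = -0.5 * np.log(2 * np.pi * var) - (x - mu_q) ** 2 / (2 * var)
log_p = -0.5 * np.log(2 * np.pi * var) - (x - mu_p) ** 2 / (2 * var)
print(kl_closed, np.mean(log_q - log_p))   # the two agree
```

This is precisely why the loss ends up being a regression between \(\mathbf{\tilde{\mu}}_t\) and \(\mathbf{\mu}_{\theta}\): all the log-determinant terms cancel when the variances match.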

Further simplification: Forward re-parameterization

For the second simplification, we look at the forward process in a bit more detail. There is an amazing property of forward diffusion with Gaussian noise: the distribution of the noisy sample \(\mathbf{x}_t\) can be readily calculated given the real data \(\mathbf{x}_0\), without touching any other steps.

This is a consequence of the forward process being completely known and having a well-defined probabilistic structure (Gaussian noise). By means of (Gaussian) reparameterization, we can also derive an easy way of sampling any \(\mathbf{x}_t\) from just a standard Gaussian noise vector \(\epsilon \sim \mathcal{N}(0, \mathrm{I})\)

As a result, \(\mathbf{x}_{1:T}\) need not be sampled with ancestral sampling (refer to Eq. 2 & 3); we only require computing Eq. 6 for all \(t\), in any order. This further simplifies the expectation in Eq. 5 to (changes highlighted in blue)
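This shortcut is easy to verify numerically. A sketch (schedule, timestep and sample sizes are illustrative) comparing the direct sample of Eq. 6 against ancestral simulation of the chain, using the \(\bar\alpha_t = \prod_{s \le t}(1-\beta_s)\) notation of Ho et al., 2020:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)           # \bar\alpha_t = prod of (1 - beta_s)

x0 = rng.uniform(-2, 2, size=50000)      # "data" samples
t = 300                                  # an arbitrary timestep

# direct (reparameterized) sample: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
eps = rng.normal(size=x0.shape)
xt_direct = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps

# ancestral sample: run the forward chain step by step up to t
xt_chain = x0.copy()
for beta_s in betas[: t + 1]:
    xt_chain = np.sqrt(1.0 - beta_s) * xt_chain + np.sqrt(beta_s) * rng.normal(size=x0.shape)

print(xt_direct.std(), xt_chain.std())   # the two marginals agree
```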

This is the final form that can be implemented in practice, as suggested by Ho et al., 2020.
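In code, this final objective is just a noise-prediction regression. A minimal sketch where `eps_theta` is a placeholder for the trained network (here the zero function, so the loss sits near its data-independent value of 1; a real model would be a time-conditioned network such as a U-Net):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # placeholder for the neural network eps_theta(x_t, t); with the zero
    # function the expected loss equals E[eps^2] = 1
    return np.zeros_like(x_t)

def ddpm_loss(x0):
    t = rng.integers(0, T)                            # t ~ U[1, T]
    eps = rng.normal(size=x0.shape)                   # eps ~ N(0, I)
    x_t = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps   # Eq. 6
    return np.mean((eps - eps_theta(x_t, t)) ** 2)    # || eps - eps_theta ||^2

x0 = rng.normal(size=512)                             # a toy "data" batch
print(ddpm_loss(x0))                                  # ≈ 1 for the zero predictor
```

Training amounts to minimizing this quantity over \(\theta\) with stochastic gradient descent, sampling a fresh \((t, \epsilon)\) pair for every batch.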

Connection to Score-based models (SBM)

Ho et al., 2020 uncovered a link between Eq. 7 and a particular score-based model known as the Noise Conditioned Score Network (NCSN). With the help of the reparameterized form of \(\mathbf{x}_t\) (Eq. 6) and the functional form of \(\mathbf{\tilde{\mu}}_t\), one can easily (with a few simplification steps) reduce Eq. 7 to

The above equation is a simple regression with \(\mathbf{\mu}_{\theta}\) being the parametric model (neural net in practice) and the quantity in blue is its regression target. Without loss of generality, we can slightly modify the definition of the parametric model to be \(\mathbf{\mu}_{\theta}(\mathbf{x}_t, t) = \frac{1}{\sqrt{1-\beta_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\alpha_t}} \epsilon_{\theta}(\mathbf{x}_t, t) \right)\). The only “moving part” in the new parameterization is \(\epsilon_{\theta}(\cdot)\); the rest (i.e. \(\mathbf{x}_t\) and \(t\)) are explicitly available to the model. This leads to the following form

The expression in red can be discarded without any effect on performance (as suggested by Ho et al., 2020). I have further approximated the expectation over time-steps with a sample average. If you look at the final form, you may notice a surprising resemblance with the Noise Conditioned Score Network (NCSN). Please refer to \(J_{\mathrm{ncsn}}\) in my blog on score models. Below I pin-point the specifics:

The time-steps \(t=1, 2, \cdots T\) resemble the increasing “noise-scales” in NCSN. Recall that the noise increases as the forward diffusion approaches the end.

The expectation \(\mathbb{E}_{\mathbf{x}_0,\ \mathbf{\epsilon}}\) (for each scale) holistically matches that of the denoising score matching objective, i.e. \(\mathbb{E}_{\mathbf{x}, \mathbf{\tilde{x}}}\). In the case of Diffusion Models, the noisy sample can be readily computed using the noise vector \(\epsilon\) (refer to Eq. 6).

Just like NCSN, the regression target is the noise vector \(\epsilon\) for each time-step (or scale).

Just like NCSN, the learnable model depends on the noisy sample and the time-step (or scale).

Infinitely many noise scales & continuous-time analogue

Inspired by the connection between Diffusion Models and Score-based models, Song et al., 2021 proposed to use infinitely many noise scales (equivalently, time-steps). At first, it might look like a trivial increase in the number of steps/scales, but there happens to be a principled way to achieve it, namely Stochastic Differential Equations or SDEs. Song et al., 2021 reworked the whole formulation considering a continuous SDE as the forward diffusion. Interestingly, it turned out that the reverse process is also an SDE that depends on the score function.

Quite simply, the finite time-steps/scales (i.e. \(t = 0, 1, \cdots T\)) are replaced by infinitely many segments (of length \(\Delta t \rightarrow 0\)) within the time-range \([0, T]\). Instead of \(\mathbf{x}_t\) at every discrete time-step/scale, we define a continuous random process \(\mathbf{x}(t)\) indexed by continuous time \(t\). We also replace the discrete-time conditionals in Eq. 3 with continuous analogues. But this time, we define the "increments" at each step rather than absolute values, i.e. the transition kernel specifies what to add to the previous value. Specifically, we define a general form of continuous forward diffusion with

If you have ever studied SDEs, you might recognize that Eq. 9 resembles the Euler–Maruyama numerical solver for SDEs. Considering \(f(\mathbf{x}(t), t)\) to be the "Drift function", \(g(t)\) to be the "Diffusion function" and \(\Delta \omega \sim \mathcal{N}(0, \Delta t)\) to be the discrete differential of the Wiener Process \(\omega(t)\), in the limit of \(\Delta t \rightarrow 0\), the following SDE can be recovered

A visualization of the continuous forward diffusion in 1D is given below for a set of samples (different colors).

Song et al., 2021 (section 3.4) proposed a few different choices of \(\{f, g\}\), named Variance Exploding (VE), Variance Preserving (VP) and sub-VP. The one that resembles Eq. 3 (discrete forward diffusion) in continuous time and ensures proper diffusion, i.e. \(\mathbf{x}(0) \sim p_{\mathrm{data}} \implies \mathbf{x}(T) \sim \mathcal{N}(0, \mathrm{I})\), is \(f(\mathbf{x}(t), t) = -\frac{1}{2}\beta(t)\mathbf{x}(t),\ g(t) = \sqrt{\beta(t)}\). This particular SDE is termed the "Variance Preserving (VP) SDE" since the variance of \(\mathbf{x}(t)\) remains finite as long as the variance of \(\mathbf{x}(0)\) is finite (Appendix B of Song et al., 2021). We can enforce the covariance of \(\mathbf{x}(0)\) to be \(\mathrm{I}\) simply by standardizing our dataset.
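The VP-SDE is simple to simulate with the Euler–Maruyama scheme of Eq. 9. In the sketch below (constant \(\beta(t)=1\) and a standardized bimodal "dataset", both purely illustrative choices), the variance stays close to 1 while the bimodal structure dissolves into \(\mathcal{N}(0, 1)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Euler–Maruyama for dx = -1/2 beta x dt + sqrt(beta) dW, with beta(t) = 1
beta, T, steps, n = 1.0, 5.0, 1000, 20000
dt = T / steps

# standardized bimodal "data": unit variance, but far from Gaussian
x = rng.choice([-0.96, 0.96], size=n) + rng.normal(0.0, 0.28, size=n)

for _ in range(steps):
    dw = rng.normal(0.0, np.sqrt(dt), size=n)   # Wiener increment ~ N(0, dt)
    x += -0.5 * beta * x * dt + np.sqrt(beta) * dw

print(x.mean(), x.std())    # ≈ 0, ≈ 1: x(T) has diffused to N(0, 1)
```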

An old (but remarkable) result due to Anderson, 1982 shows that the above forward diffusion can be reversed even in closed form, thanks to the following SDE

Hence, reverse diffusion simply amounts to solving the above SDE in reverse time with initial state \(\mathbf{x}(T) \sim \mathcal{N}(0, \mathrm{I})\), leading to \(\mathbf{x}(0) \sim p_{\mathrm{data}}\). The only missing piece is the score, i.e. \(\mathbf{s}(\mathbf{x}(t), t) \triangleq \nabla_{\mathbf{x}} \log p(\mathbf{x}(t))\). Thankfully, we have already seen how score estimation works, and that is pretty much what we do here. There are two ways, as explained in my blog on score models. I briefly go over them below in the context of continuous SDEs:
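As a concrete sketch, here is reverse-time Euler–Maruyama for the reverse SDE on a 1-D toy problem where the data is \(\mathcal{N}(\mu, \sigma^2)\), so the VP marginal score is available in closed form (constant \(\beta(t)=1\); all numbers illustrative). In practice, the closed-form `score` below would be replaced by the learned \(\mathbf{s}_{\theta}\):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 0.5                 # toy data distribution N(mu, sigma^2)
T, steps, n = 8.0, 2000, 20000
dt = T / steps

def score(x, t):
    # exact score of the VP marginal p_t = N(m(t), v(t)) for beta(t) = 1
    m = mu * np.exp(-0.5 * t)
    v = sigma**2 * np.exp(-t) + 1.0 - np.exp(-t)
    return -(x - m) / v

x = rng.normal(0.0, 1.0, size=n)     # initial state x(T) ~ N(0, 1)
for i in range(steps):
    t = T - i * dt
    drift = -0.5 * x - score(x, t)   # [f(x,t) - g(t)^2 score], with f = -x/2, g = 1
    x += -drift * dt + rng.normal(0.0, np.sqrt(dt), size=n)   # reverse-time step

print(x.mean(), x.std())             # ≈ mu, ≈ sigma: samples from p_data
```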

1. Implicit Score Matching (ISM)

The easier way is to use the Hutchinson trace-estimator based score matching proposed by Song et al., 2019 called “Sliced Score Matching”.

Very similar to NCSN, we define a parametric score network \(\mathbf{s}_{\theta}(\mathbf{x}(t), t)\) dependent on continuous time/scale \(t\). Starting from data samples \(\mathbf{x}(0)\sim p_{\mathrm{data}}\), we can generate the rest of the forward chain \(\mathbf{x}(0\lt t \leq T)\) simply by executing a solver (refer to Eq. 9) on the SDE at any required precision (discretization).

2. Denoising Score Matching (DSM)

There is the other “Denoising score matching (DSM)” way of training \(\mathbf{s}_{\theta}\), which is slightly more complicated. At its core, the DSM objective for continuous diffusion is a continuous analogue of the discrete DSM objective.

Remember that in the case of continuous diffusion, we never explicitly modelled the conditionals \(p(\mathbf{x}(t)\vert \mathbf{x}(0))\); the forward process was specified only through its increments (Eq. 9), and the reverse diffusion was defined rather implicitly (Eq. 10). Hence, the quantity in blue, unlike its discrete counterpart, isn't very easy to compute in general. However, due to Särkkä and Solin, there is an easy closed form for it when the drift function \(f\) is affine in nature. Thankfully, our specific choice of \(f(\mathbf{x}(t), t)\) is indeed affine.
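For the VP drift with a constant \(\beta(t) = 1\) (an illustrative choice), that closed form is \(p(\mathbf{x}(t) \vert \mathbf{x}(0)) = \mathcal{N}\left(\mathbf{x}(0) e^{-t/2},\ (1 - e^{-t})\mathrm{I}\right)\), which a direct Euler–Maruyama simulation of the forward SDE from a fixed \(\mathbf{x}(0)\) reproduces:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate x(t) | x(0) with Euler–Maruyama and compare to the closed form
x0, t_eval, steps, n = 1.5, 2.0, 2000, 50000
dt = t_eval / steps

x = np.full(n, x0)
for _ in range(steps):
    x += -0.5 * x * dt + rng.normal(0.0, np.sqrt(dt), size=n)

m_closed = x0 * np.exp(-0.5 * t_eval)    # mean of p(x(t) | x(0))
v_closed = 1.0 - np.exp(-t_eval)         # variance of p(x(t) | x(0))
print(x.mean(), m_closed, x.var(), v_closed)   # empirical vs closed form
```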

Since the conditionals are Gaussian (again), it's pretty easy to derive the expression for \(\nabla_{\mathbf{x}(t)} \log p(\cdot)\). I leave it for the readers to try.

Computing log-likelihoods

One of the core reasons score models exist is that they bypass the need for training explicit log-likelihoods, which are difficult to compute for a large range of powerful models. It turns out that in the case of continuous diffusion models, there is an indirect way to evaluate that very log-likelihood. Let's focus on the "generative process" of continuous diffusion models, i.e. the reverse diffusion in Eq. 10. What we would like to compute is \(p(\mathbf{x}(0))\) when \(\mathbf{x}(0)\) is generated by solving the SDE in Eq. 10 backwards with \(\mathbf{x}(T)\sim\mathcal{N}(0, \mathrm{I})\). Even though it is hard to compute the marginal likelihoods \(p(\mathbf{x}(t))\) for any \(t\), it turns out there exists a deterministic ODE (Ordinary Differential Equation) corresponding to the SDE in Eq. 10 whose marginal likelihoods match those of the SDE for every \(t\)

Note that the above ODE is essentially the same SDE but without the source of randomness. After learning the score (as usual), we simply drop-in replace the SDE with the above ODE. Now, thanks to Chen et al., 2018, this problem has already been solved. It is known as a Continuous Normalizing Flow (CNF), whereby given \(\log p(\mathbf{x}(T))\), we can calculate \(\log p(\mathbf{x}(0))\) by solving the following ODE with any numerical solver for \(t = T \rightarrow 0\)
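A 1-D sketch of this likelihood computation, using an analytic Gaussian score (constant \(\beta(t)=1\), data \(\mathcal{N}(\mu, \sigma^2)\); all numbers illustrative). We integrate the probability-flow ODE forward from a data point while accumulating the divergence of the drift; for a clean check, the terminal density is evaluated under the exact marginal \(\mathcal{N}(m(T), v(T))\), which for large \(T\) nearly coincides with the \(\mathcal{N}(0, 1)\) prior used in practice:

```python
import numpy as np

# 1-D toy: data N(mu, sigma^2) under the VP-SDE with beta(t) = 1, so the
# marginal p_t = N(m(t), v(t)) and its score are known exactly.
mu, sigma, T, steps = 2.0, 0.5, 6.0, 2000
dt = T / steps

def m(t):
    return mu * np.exp(-0.5 * t)

def v(t):
    return sigma**2 * np.exp(-t) + 1.0 - np.exp(-t)

def log_normal(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def ode_drift(x, t):
    # probability-flow drift: f - 1/2 g^2 score = -x/2 + (x - m(t)) / (2 v(t))
    return -0.5 * x + 0.5 * (x - m(t)) / v(t)

x0 = 1.3                        # the data point whose likelihood we want
x, div_int = x0, 0.0
for i in range(steps):
    t = i * dt
    div_int += (-0.5 + 0.5 / v(t)) * dt   # divergence of ode_drift (x-free here)
    x += ode_drift(x, t) * dt             # forward Euler step of the ODE

# instantaneous change of variables: log p_0(x0) = log p_T(x(T)) + int_0^T div dt
log_p0 = log_normal(x, m(T), v(T)) + div_int
print(log_p0, log_normal(x0, mu, sigma**2))   # the two agree
```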

Please remember that this way of computing the log-likelihood is merely a utility and cannot be used to train the model. A more recent paper, however, shows a way to train SDE-based continuous diffusion models by directly optimizing (a bound on) the log-likelihood under some conditions, which may be the topic of another article. I encourage readers to explore it themselves.

That’s all for today, see you. Stay tuned by subscribing to the RSS Feed. Thank you.

Ayan Das

Generative modelling with Score Functions (2021-07-14, https://ayandas.me/blog-tut/2021/07/14/generative-model-score-function)

Generative Models are of immense interest in fundamental research due to their ability to model the "all important" data distribution. A large class of generative models falls into the category of Probabilistic Graphical Models or PGMs. PGMs (e.g. VAE) usually train a parametric distribution (encoded in the form of a graph structure) by maximizing log-likelihood, and sample from it by virtue of ancestral sampling. GANs, another class of popular generative models, take a different approach to training as well as sampling. Both classes of models, however, suffer from several drawbacks, e.g. difficult log-likelihood computation, unstable training etc. Recently, efforts have been made to craft generative models that inherit all the good qualities of the existing ones. One of the rising classes of generative models is called "Score based Models". Rather than explicitly maximizing the log-likelihood of a parametric density, such a model creates a map to navigate the data space. Once learned, sampling is done by Langevin Dynamics, an MCMC based method that navigates the data space using the map and lands on regions with high probability under the empirical data distribution (i.e. real data regions). In this article, we will describe the fundamentals of Score based models along with a few of their variants.

Traditional log-likelihood based approaches define a parametric generative process in terms of graphical model and maximize the joint density \(p_{\theta}(\mathbf{x})\) w.r.t its parameters \(\theta\)

The joint density is often quite complex and sometimes intractable. For intractable cases, we maximize a surrogate objective based on e.g. Variational Inference. We achieve the above in practice by moving the parameters in the direction where the expected log-likelihood increases the most at every step \(t\). The expectation is computed empirically at points sampled from our dataset, i.e. the unknown data distribution \(p_{\mathrm{data}}(\mathbf{x})\)

There is one annoying requirement in both (1) and (2): the parametric model \(p_{\theta}(\mathbf{x})\) must be a valid density. We ensure this requirement by building the model only out of careful combinations of known densities like Gaussian, Bernoulli, Dirichlet etc. Even though they are largely sufficient in terms of expressiveness, it might feel a bit too restrictive from a system designer's perspective.

Score based models (SBMs)

A new and emerging class of generative models, namely "Score based models (SBM)", entirely sidesteps log-likelihood modelling and approaches the problem in a different way. Specifically, SBMs attempt to learn a navigation map on the data space which guides any point on that space to reach a region highly probable under the data distribution \(p_{\mathrm{data}}(\mathbf{x})\). A little but careful thought on this would lead us to something formally known as the Score function. The "Score" of an arbitrary point \(\mathbf{x} \in \mathbb{R}^d\) on the data space is essentially the gradient of the true data log-likelihood at that point

Please be careful and notice that the quantity on the right hand side of (3), i.e. \(\nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x})\), is not the same as \(\nabla_{\theta} \log p_{\theta}(\mathbf{x})\), the quantity we encountered earlier (in the MLE setup), even though they look structurally similar.

Given any point on the data space, the score tells us in which direction to navigate if we would like to see a region with higher likelihood. Unsurprisingly, if we take a little step in the direction suggested by the score, we get a point \((\mathbf{x} + \alpha \cdot \mathbf{s}(\mathbf{x}))\) that is slightly more likely under \(p_{\mathrm{data}}(\mathbf{x})\). This is why I termed it a "navigation map", as in a guiding document that tells us the direction of the "treasure" (i.e. real data samples). All an SBM does is try to approximate the true score function via a parametric proxy

This is known as Score Matching. Once trained, we simply keep moving in the direction suggested by \(\mathbf{s}_{\theta^*}(\mathbf{x})\), starting from any random \(\mathbf{x}\), over a finite time horizon \(T\). In practice, we move with a little bit of stochasticity; the formal procedure is known as Langevin Dynamics.

\(\mathbf{z} \sim \mathcal{N}(0, I)\) is the injected Gaussian noise. If \(\alpha_t \rightarrow 0\) as \(t \rightarrow \infty\), this process guarantees that \(\mathbf{x}_t\) is a true sample from \(p_{\mathrm{data}}(\mathbf{x})\). In practice, we run the process for a finite number of steps \(T\) and assign \(\alpha_t\) according to a decaying schedule. Please refer to the original paper for a detailed discussion.
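Here is a small numerical sketch of Langevin Dynamics (Eq. 4), again using the analytic score of a 1-D Gaussian in place of a learned \(\mathbf{s}_{\theta^*}\), and a constant step size \(\alpha\) for simplicity (the function name and constants are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0
score = lambda x: -(x - mu) / sigma**2   # analytic score of N(mu, sigma^2)

def langevin_sample(n_samples=5000, T=500, alpha=1e-2):
    # Start from arbitrary random points, then follow Eq. 4:
    #   x_{t+1} = x_t + alpha * s(x_t) + sqrt(2 * alpha) * z,   z ~ N(0, 1)
    x = rng.standard_normal(n_samples) * 5.0
    for _ in range(T):
        z = rng.standard_normal(n_samples)
        x = x + alpha * score(x) + np.sqrt(2 * alpha) * z
    return x

samples = langevin_sample()
# empirical mean and std of the samples approach mu = 2 and sigma = 1
```

Despite starting far from the target distribution, the chain's empirical mean and standard deviation land close to \(\mu\) and \(\sigma\); shrinking \(\alpha\) reduces the residual discretization bias.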

Looks all good. But, there are two problems with optimizing \(J(\theta)\).

Problem 1: The very obvious one; we don’t have access to the true scores \(\mathbf{s}(\mathbf{x}) \triangleq \nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x})\). No one knows the exact form of \(p_{\mathrm{data}}(\mathbf{x})\).

Problem 2: The not-so-obvious one; the expectation \(\mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})}\) is a bit problematic. Ideally, the objective should encourage learning the score all over the data space (i.e. for every \(\mathbf{x} \in \mathbb{R}^d\)). But this isn’t possible with an expectation over the data distribution alone. Regions of the data space that are unlikely under \(p_{\mathrm{data}}(\mathbf{x})\) do not receive enough supervisory signal.

Implicit Score Matching (ISM)

Aapo Hyvärinen, 2005 solved the first problem quite elegantly, proposing the Implicit Score Matching objective \(J_{\mathrm{I}}(\theta)\) and showing it to be equivalent to \(J(\theta)\) under some mild regularity conditions. The following remarkable result was shown in the original paper

The reason it’s known to be “remarkable” is that \(J_{\mathrm{I}}(\theta)\) does not require the true target scores \(\mathbf{s}(\mathbf{x})\) at all. All we need is to compute an expectation w.r.t. the data distribution, which can be implemented using finite samples from our dataset. One practical problem with this objective is the amount of computation involved in the Jacobian \(\nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x})\). Later, Song et al., 2019 proposed to use Hutchinson’s trace estimator, a stochastic estimator of the trace of a matrix, which simplified the objective further

where \(\mathbf{v} \sim \mathcal{N}(0, \mathbf{I}) \in \mathbb{R}^d\) is a standard multivariate Gaussian RV. This objective is computationally advantageous when used in conjunction with automatic differentiation frameworks (e.g. PyTorch), which can efficiently compute the vector-Jacobian product (VJP) \(\mathbf{v}^T \nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x})\).
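To see Hutchinson's estimator in isolation (not the full sliced score matching objective), here is a sketch where a fixed matrix \(W\) stands in for the Jacobian of a hypothetical linear score model \(\mathbf{s}_{\theta}(\mathbf{x}) = W\mathbf{x}\), so the true trace is known and the Monte-Carlo estimate \(\mathbb{E}_{\mathbf{v}}[\mathbf{v}^T W \mathbf{v}]\) can be checked against it:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))   # stand-in for the Jacobian of a linear score s(x) = W x

# Hutchinson's estimator:  tr(W) = E_v[ v^T W v ]  for v ~ N(0, I)
n_probes = 100_000
v = rng.standard_normal((n_probes, d))
est = np.einsum('nd,de,ne->n', v, W, v).mean()   # one quadratic form per probe, then average

# est agrees with np.trace(W) up to Monte-Carlo error
```

In a real model the quadratic form \(\mathbf{v}^T \nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x}) \mathbf{v}\) is obtained with one VJP per probe vector rather than by materializing the \(d \times d\) Jacobian, which is where the computational saving comes from.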

Denoising Score Matching (DSM)

In a different approach, Pascal Vincent, 2011 investigated the “unsuspected link” between Score Matching and Denoising Autoencoders. This work led to a very efficient and practical objective that is used even in cutting-edge score based models. Termed “Denoising Score Matching (DSM)”, this approach mitigates both problems 1 & 2 described above, and does so quite elegantly.

To get rid of problem 2, DSM proposes to simply use a noise-perturbed version of the dataset, i.e. replace \(p_{\mathrm{data}}(\mathbf{x})\) with \(p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}})\) where

The above equation basically tells us to create a perturbed/corrupted version of the original dataset by adding simple isotropic Gaussian noise whose strength is controlled by \(\sigma\), the standard deviation of the Gaussian. Since the Gaussian distribution spans the entire space \(\mathbb{R}^d\), corrupted data samples populate much more of the space and help the parameterized score function learn in regions that were originally unreachable under \(p_{\mathrm{data}}(\mathbf{x})\). The denoising objective \(J_{\mathrm{D}}(\theta)\) simply becomes

With a crucial proof shown in the appendix of the original paper, we can have an equivalent (changes shown in magenta) version of \(J_{\mathrm{D}}(\theta)\) as

Note that we now need corrupt-original data pairs \((\mathbf{\tilde{x}}, \mathbf{x})\) in order to compute the expectation, which are quite trivial to obtain. Also realize that the term \(\nabla_{\mathbf{\tilde{x}}} \log p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} \vert \mathbf{x})\) is not the data score but relates only to the pre-specified noise model, with quite an easy analytic form

The score function we learn this way isn’t actually that of our original data distribution \(p_{\mathrm{data}}(\mathbf{x})\), but rather of the corrupted data distribution \(p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}})\). The strength \(\sigma\) decides how well it aligns with the original distribution. If \(\sigma\) is large, we end up learning a heavily corrupted version of the data distribution; if \(\sigma\) is small, we no longer get the nice property out of the noise perturbation - so there is a trade-off. Recently, this trade-off has been utilized for learning robust score based models.

Moreover, Eq. 5 has a very intuitive interpretation, and this is where Pascal Vincent, 2011 uncovered the link between DSM and Denoising Autoencoders. A closer look at Eq. 5 reveals that \(\mathbf{s}_{\theta}(\mathbf{\tilde{x}})\) has a learning target of \(- \frac{1}{\sigma^2} (\mathbf{\tilde{x}} - \mathbf{x})\), which can be interpreted as a (scaled) vector pointing from the corrupted sample towards the real sample. Succinctly, the score function is trying to learn how to “de-noise” a corrupted sample - precisely what Denoising Autoencoders do.
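The whole DSM pipeline fits in a few lines. Below is a sketch (the helper name `dsm_loss` and the zero-function "model" are purely illustrative, not a trained network) showing the corruption step and the analytic regression target \(-(\mathbf{\tilde{x}} - \mathbf{x})/\sigma^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

x = rng.standard_normal((128, 2))                    # a batch of "clean" data points
x_tilde = x + sigma * rng.standard_normal(x.shape)   # corrupted samples x~ ~ N(x, sigma^2 I)

# Analytic score of the noise model:
#   grad_{x~} log p_N^sigma(x~ | x) = -(x~ - x) / sigma^2
target = -(x_tilde - x) / sigma**2

def dsm_loss(s_theta, x_tilde, target):
    # J_D(theta) ~ E || s_theta(x~) - target ||^2 over the batch
    return np.mean(np.sum((s_theta(x_tilde) - target) ** 2, axis=-1))

# a hypothetical untrained score model (the zero function) incurs a large loss;
# a model that outputs the target exactly would drive the loss to zero
loss = dsm_loss(lambda xt: np.zeros_like(xt), x_tilde, target)
```

In a real setup `s_theta` would be a neural network and this loss would be minimized by gradient descent over \(\theta\); no true data score is ever needed.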

Noise Conditioned Score Network (NCSN)

The idea presented in Song et al., 2020 is to have \(L\) different noise-perturbed data distributions (with different \(\sigma\)) and one score function for each of them. The noise strengths are chosen to be \(\sigma_1 > \sigma_2 > \cdots > \sigma_L\), so that \(p_{\mathrm{data}}^{\sigma_1}(\mathbf{\tilde{x}})\) is the most corrupted distribution and \(p_{\mathrm{data}}^{\sigma_L}(\mathbf{\tilde{x}})\) is the least. Also, instead of having \(L\) separate score functions, we use one shared score function conditioned on \(\sigma\), i.e. \(\mathbf{s}_{\theta}(\mathbf{\tilde{x}}; \sigma)\).

We finally learn the shared score function from the ensemble of \(L\) distributions

In order to sample, Song et al., 2020 proposed a modified version of Langevin Dynamics termed “Annealed Langevin Dynamics”. The idea is simple: we start from a random sample and run Langevin Dynamics (Eq. 4) using \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma_1)\) for \(T\) steps. We use the final sample as the initial starting point for the next run of Langevin Dynamics with \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma_2)\). We repeat this process until we get the final sample from \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma_L)\). The intuition here is to sample at a coarse level first and gradually refine it to get high quality samples. The exact algorithm is depicted in Algorithm 1 of Song et al., 2020.
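A toy sketch of annealed Langevin Dynamics follows. Instead of a trained \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma)\), it uses the analytic score of a 1-D Gaussian dataset corrupted at each noise level (corrupting \(\mathcal{N}(\mu, s^2)\) with noise \(\sigma\) gives \(\mathcal{N}(\mu, s^2 + \sigma^2)\)); the per-level step-size scaling \(\alpha_i \propto \sigma_i^2/\sigma_L^2\) mirrors the schedule used in the paper's Algorithm 1, and all constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, data_std = 2.0, 0.1
sigmas = [2.0, 1.0, 0.5, 0.1]          # sigma_1 > sigma_2 > ... > sigma_L

def score(x, sigma):
    # analytic score of the sigma-perturbed data distribution N(mu, data_std^2 + sigma^2)
    return -(x - mu) / (data_std**2 + sigma**2)

def annealed_langevin(n=2000, T=100, eps=0.002):
    x = rng.standard_normal(n) * 5.0            # broad random initialization
    for sigma in sigmas:                        # coarse -> fine
        alpha = eps * sigma**2 / sigmas[-1]**2  # step size shrinks with the noise level
        for _ in range(T):
            z = rng.standard_normal(n)
            x = x + alpha * score(x, sigma) + np.sqrt(2 * alpha) * z
    return x

samples = annealed_langevin()
# the samples concentrate around mu = 2 with a small spread
```

The early, high-\(\sigma\) stages move the samples quickly into the right region; the later, low-\(\sigma\) stages tighten them around the data.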

Connection to Stochastic Differential Equations

Recently, Song et al., 2021 established a surprising connection between Score Models, Diffusion Models and Stochastic Differential Equations (SDEs). Diffusion Models are another rising class of generative models, fundamentally similar to score based models but with some notable differences. Since we did not discuss Diffusion Models in this article, we cannot fully explain the connection and how to properly utilize it. However, I would like to show a brief preview of where exactly SDEs show up within the material discussed in this article.

Stochastic Differential Equations (SDEs) are stochastic dynamical systems with state \(\mathbf{x}(t)\), characterized by a drift function \(f(\mathbf{x}, t)\) and a diffusion function \(g(\mathbf{x}, t)\)

To find a connection now, it is only a matter of comparing Eq. 6 with Eq. 4. The sampling process defined by Langevin Dynamics is essentially an SDE discretized in time with
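Assuming the standard Langevin update of Eq. 4, the comparison can be sketched as follows (a term-by-term match against the Euler–Maruyama discretization of the SDE, with time step \(\Delta t = \alpha_t\)):

```latex
% Euler-Maruyama discretization of the SDE (Eq. 6), with \Delta t = \alpha_t:
%   x_{t+1} = x_t + f(x_t, t)\,\alpha_t + g(x_t, t)\sqrt{\alpha_t}\, z_t
% Comparing against the Langevin step (Eq. 4):
%   x_{t+1} = x_t + \alpha_t\, s(x_t) + \sqrt{2 \alpha_t}\, z_t
% suggests the identification
f(\mathbf{x}, t) = \mathbf{s}(\mathbf{x}), \qquad g(\mathbf{x}, t) = \sqrt{2}.
```

That is, Langevin Dynamics with the learned score as drift is a time-discretized SDE with constant diffusion; the full Song et al., 2021 framework generalizes this by letting the noise level vary continuously with \(t\).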

anyx: Build vector animations from programmatic descriptions (2021-05-01, https://ayandas.me/projs/2021/05/01/anyx)

anyx (pronounced “anix”) is a python library designed for producing high-quality vector-graphics animations with ease. Although anyx is built with no assumption about its downstream area of application, it is mostly targeted at the scientific community for creating beautiful scientific/technical illustrations. anyx was created as a programmatic alternative to heavyweight and (sometimes) proprietary graphics software. Unlike low-level libraries such as pygame, anyx allows users to simply write a description of a target scene and compile it down to the required modality (video, animated GIF, etc.). The development of anyx is motivated largely by a similar project called manim.

I work on this project only in my spare time and it’s not done yet. Read a brief description by clicking on the GitHub icon.

Cloud2Curve: Generation and Vectorization of Parametric Sketches (2021-03-01, https://ayandas.me/pubs/2021/03/01/pub-9)
Paper

Analysis of human sketches in deep learning has advanced immensely through the use of waypoint-sequences rather than raster-graphic representations. We further aim to model sketches as a sequence of low-dimensional parametric curves. To this end, we propose an inverse graphics framework capable of approximating a raster or waypoint based stroke encoded as a point-cloud with a variable-degree Bézier curve. Building on this module, we present Cloud2Curve, a generative model for scalable high-resolution vector sketches that can be trained end-to-end using point-cloud data alone. As a consequence, our model is also capable of deterministic vectorization which can map novel raster or waypoint based sketches to their corresponding high-resolution scalable Bézier equivalent. We evaluate the generation and vectorization capabilities of our model on Quick, Draw! and K-MNIST datasets.

Slides for my CVPR '21 talk

PS: Reusing any of these slides would require permission from the author.

Full talk at CVPR 2021

Want to cite this paper?

@misc{das2021cloud2curve,
title={Cloud2Curve: Generation and Vectorization of Parametric Sketches},
author={Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
year={2021},
eprint={2103.15536},
archivePrefix={arXiv},
primaryClass={cs.CV}
}