Jekyll2023-03-23T17:38:42+00:00https://ayandas.me/feed.xmlAyan Das<b>Ph.D. Student</b> @ <a href="https://www.surrey.ac.uk/">University of Surrey</a>; Senior DL Researcher @ <a href="https://www.mtkresearch.com/en">MediaTek Research</a>Ayan Dasa.das@surrey.ac.ukChiroDiff: Modelling chirographic data with Diffusion Models2023-01-21T00:00:00+00:002023-01-21T00:00:00+00:00https://ayandas.me/pub-11<center>
<a target="_blank" class="pubicon" href="https://openreview.net/pdf?id=1ROAstc9jv">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper (with Suppl.)
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="3">Code/Data</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/11.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">Generative modelling over continuous-time geometric constructs, a.k.a chirographic data such as handwriting, sketches, drawings etc., have been accomplished through autoregressive distributions. Such strictly-ordered discrete factorization however falls short of capturing key properties of chirographic data -- it fails to build holistic understanding of the temporal concept due to one-way visibility (causality). Consequently, temporal data has been modelled as discrete token sequences of fixed sampling rate instead of capturing the true underlying concept. In this paper, we introduce a powerful model-class namely Denoising Diffusion Probabilistic Models or DDPMs for chirographic data that specifically addresses these flaws. Our model named ChiroDiff, being non-autoregressive, learns to capture holistic concepts and therefore remains resilient to higher temporal sampling rate up to a good extent. Moreover, we show that many important downstream utilities (e.g. conditional sampling, creative mixing) can be flexibly implemented using ChiroDiff. We further show some unique use-cases like stochastic vectorization, de-noising/healing, abstraction are also possible with this model-class. We perform quantitative and qualitative evaluation of our framework on relevant datasets and found it to be better or on par with competing approaches.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
Coming soon. Check later.
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<!-- <h2>5 min talk at ICLR 2022</h2>
<br>
<h4>Watch in <a href="https://recorder-v3.slideslive.com/?share=62339&s=64c1baa0-77c8-40b2-866a-ad8df30ad950" target="_blank">SlidesLive</a></h4> -->
Coming soon. Check later.
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="3">
<div class="accordion-item__container">
Coming soon. Check later.
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{
das2023chirodiff,
title={ChiroDiff: Modelling chirographic data with Diffusion Models},
author={Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=1ROAstc9jv}
}
</code></pre></div></div>Ayan DasPaper (with Suppl.)SketchODE: Learning neural sketch representation in continuous time2022-01-21T00:00:00+00:002022-01-21T00:00:00+00:00https://ayandas.me/pub-10<center>
<a target="_blank" class="pubicon" href="https://openreview.net/pdf?id=c-4HSDAWua5">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper (with Suppl.)
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="3">Code/Data</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/10.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">Learning meaningful representations for chirographic drawing data such as sketches, handwriting, and flowcharts is a gateway for understanding and emulating human creative expression. Despite being inherently continuous-time data, existing works have treated these as discrete-time sequences, disregarding their true nature. In this work, we model such data as continuous-time functions and learn compact representations by virtue of Neural Ordinary Differential Equations. To this end, we introduce the first continuous-time Seq2Seq model and demonstrate some remarkable properties that set it apart from traditional discrete-time analogues. We also provide solutions for some practical challenges for such models, including introducing a family of parameterized ODE dynamics & continuous-time data augmentation particularly suitable for the task. Our models are validated on several datasets including VectorMNIST, DiDi and Quick, Draw!.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
<center>
<!-- Visit this to create the oEmbed link -->
<!-- https://iframely.com/domains/speakerdeck -->
<h2>Slides for my
ICLR 2022 talk</h2>
<div style="left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.1972%;">
<iframe src="https://speakerdeck.com/player/49148ad5176542e58f6a5090ffda8b1b" style="top: 0; left: 5%; width: 90%; height: 90%; position: absolute; border: 0;" allowfullscreen="" scrolling="no" allow="encrypted-media">
</iframe>
</div>
<p>PS: Reusing any of these slides would require permission from the author.</p>
</center>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<h2>5 min talk at ICLR 2022</h2>
<br />
<h4>Watch in <a href="https://recorder-v3.slideslive.com/?share=62339&s=64c1baa0-77c8-40b2-866a-ad8df30ad950" target="_blank">SlidesLive</a></h4>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="3">
<div class="accordion-item__container">
<h2>Code repository</h2>
<center>
<a target="_blank" class="pubicon" href="https://github.com/dasayan05/sketchode">
<i class="fa fa-file-code-o fa-3x"></i>SketchODE repo
</a>
<br />
<h2>The "VectorMNIST" dataset</h2>
<img width="60%" src="/public/pub_res/10_2.png" alt="VectorMNIST" />
<a target="_blank" class="pubicon" href="https://drive.google.com/file/d/1wpyIA9AkJ5oVR7T0P4Jpd1U4NGsPf2DX/view?usp=sharing">
<i class="fa fa-database fa-3x"></i>VectorMNIST
</a>
<h4>(The dataset will be updated further)</h4>
</center>
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{
das2022sketchode,
title={Sketch{ODE}: Learning neural sketch representation in continuous time},
author={Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=c-4HSDAWua5}
}
</code></pre></div></div>Ayan DasPaper (with Suppl.)An introduction to Diffusion Probabilistic Models2021-12-04T00:00:00+00:002021-12-04T00:00:00+00:00https://ayandas.me/blog-tut/2021/12/04/diffusion-prob-models<p>Generative modelling is one of the seminal tasks for understanding the distribution of natural data. The VAE, GAN and Flow families of models have dominated the field for the last few years due to their practical performance. Despite commercial success, their theoretical and design shortcomings (intractable likelihood computation, restrictive architectures, unstable training dynamics etc.) have led to the development of a new class of generative models named “Diffusion Probabilistic Models” or DPMs. Diffusion Models, first proposed by <a href="http://proceedings.mlr.press/v37/sohl-dickstein15.html">Sohl-Dickstein et al., 2015</a>, draw inspiration from the thermodynamic diffusion process and learn a noise-to-data mapping in discrete steps, very similar to Flow models. Lately, DPMs have been shown to have some intriguing connections to <a href="/blog-tut/2021/07/14/generative-model-score-function.html">Score Based Models (SBMs)</a> and <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">Stochastic Differential Equations (SDEs)</a>. These connections have further been leveraged to create their continuous-time analogues. In this article, I will describe the general framework of DPMs and their recent advancements, and explore their connections to other frameworks. For the sake of readers, I will avoid gory details and rigorous mathematical derivations, and use subtle simplifications in order to keep the focus on the core idea.</p>
<blockquote>
<p>In case you haven’t checked the first part of this two-part blog, please read <a href="/blog-tut/2021/07/14/generative-model-score-function.html">Score Based Models (SBMs)</a></p>
</blockquote>
<h2 id="what-exactly-do-we-mean-by-diffusion-">What exactly do we mean by “Diffusion” ?</h2>
<p>In thermodynamics, “Diffusion” refers to the flow of particles from high-density regions towards low-density regions. In the context of statistics, the meaning of Diffusion is quite similar, i.e. the process of transforming a complex distribution \(p_{\mathrm{complex}}\) on \(\mathbb{R}^d\) into a simple (predefined) distribution \(p_{\mathrm{prior}}\) on the same domain. Succinctly, it is a transformation \(\mathcal{T}: \mathbb{R}^d \rightarrow \mathbb{R}^d\) such that</p>
<p>\[\tag{1}
\mathbf{x}_0 \sim p_{\mathrm{complex}} \implies \mathcal{T}(\mathbf{x}_0) \sim p_{\mathrm{prior}}
\]</p>
<p>where the symbol \(\implies\) means “implies”. There is a formal way to come up with a specific \(\{ \mathcal{T}, p_{\mathrm{prior}} \}\) pair that satisfies Eq. 1 for <em>any</em> distribution \(p_{\mathrm{complex}}\). In simple terms, we can take <em>any</em> distribution and transform it into a known (simple) density by means of a known transformation \(\mathcal{T}\). By “formal way”, I was referring to a <a href="https://en.wikipedia.org/wiki/Markov_chain">Markov Chain</a> and its <a href="https://brilliant.org/wiki/stationary-distributions/">stationary distribution</a>: <strong>repeated application</strong> of a transition kernel \(q(\mathbf{x} \vert \mathbf{x}')\) to the samples of <em>any</em> distribution leads to samples from \(p_{\mathrm{prior}}(\mathbf{x})\) if the following holds</p>
<p>\[
p_{\mathrm{prior}}(\mathbf{x}) = \int q(\mathbf{x} | \mathbf{x}’) p_{\mathrm{prior}}(\mathbf{x}’) d\mathbf{x}’
\]</p>
<p>We can relate our original diffusion process in Eq. 1 to a Markov chain by defining \(\mathcal{T}\) to be the repeated application of the transition kernel \(q(\mathbf{x} \vert \mathbf{x}')\) over discrete time \(t\)</p>
<p>\[\tag{2}
\mathbf{x}_{t} \sim q(\mathbf{x} \vert \mathbf{x}’ = \mathbf{x}_{t-1}),\ \forall t > 0
\]</p>
<p>From the properties of the <a href="https://brilliant.org/wiki/stationary-distributions/">stationary distribution</a>, we have \(\mathbf{x}_{\infty} \sim p_{\mathrm{prior}}\). In practice, we can cap the iterations at a sufficiently large finite number \(t = T\).</p>
<p>So far, we have confirmed that there is indeed an iterative way (refer to Eq. 2) to convert samples from a complex distribution into a known (simple) prior. Even though we talked only in terms of generic densities, there is one very attractive choice of the \(\{ q, p_{\mathrm{prior}} \}\) pair (shown in <a href="http://proceedings.mlr.press/v37/sohl-dickstein15.html">Sohl-Dickstein et al., 2015</a>) due to its simplicity and tractability</p>
\[\tag{3}
q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathrm{I}) \\
q(\mathbf{x}_T) = p_{\mathrm{prior}}(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathrm{I})\]
<p>For obvious reasons, it is known as <strong>Gaussian Diffusion</strong>. I purposefully changed the notation of the random variables to make it more explicit. \(\beta_t \in \mathbb{R}\) is a predefined decaying schedule proposed by <a href="http://proceedings.mlr.press/v37/sohl-dickstein15.html">Sohl-Dickstein et al., 2015</a>. A pictorial depiction of the diffusion process is shown in the diagram below.</p>
<center>
<figure>
<img width="50%" style="padding-top: 20px;" src="/public/posts_res/20/diffusion_over_time.png" />
<!-- <img width="57%" style="padding-top: 20px;" src ="/public/posts_res/20/fwddiff.gif" /> -->
</figure>
</center>
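<p>The forward chain of Eq. 2 &amp; 3 is easy to simulate. Below is a minimal NumPy sketch (the linear \(\beta_t\) schedule and the toy bimodal "data" are my own illustrative assumptions, not the exact setup of the paper): repeated application of the gaussian transition kernel drives samples from a complex distribution towards \(\mathcal{N}(\mathbf{0}, \mathrm{I})\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear beta_t schedule

# a "complex" 2-D starting distribution: a mixture of gaussians around +-4
n = 5000
x = rng.normal(loc=np.where(rng.random((n, 2)) < 0.5, -4.0, 4.0), scale=0.5)

# repeatedly apply the gaussian transition kernel of Eq. 3
for beta in betas:
    eps = rng.normal(size=x.shape)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps

# after T steps the samples are (approximately) standard normal
print(x.mean(), x.var())
```

Any starting distribution works here, which is exactly the point: the stationary distribution does not depend on \(p_{\mathrm{complex}}\).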
<h2 id="generative-modlling-by-undoing-the-diffusion">Generative modelling by undoing the diffusion</h2>
<p>We proved the existence of a stochastic transform \(\mathcal{T}\) that guarantees the diffusion process in Eq. 1. Please realize that the diffusion process does not depend on the initial density \(p_{\mathrm{complex}}\) (as \(t \rightarrow \infty\)); the only requirement is being able to sample from it. This is the core idea behind Diffusion Models: we use any data distribution (let’s say \(p_{\mathrm{data}}\)) of our choice as the complex initial density. This leads to the <strong>forward diffusion</strong> process</p>
<p>\[
\mathbf{x}_0 \sim p_{\mathrm{data}} \implies \mathbf{x}_T = \mathcal{T}(\mathbf{x}_0) \sim \mathcal{N}(\mathbf{0}, \mathrm{I})
\]</p>
<p>This process is responsible for “destructuring” the data and turning it into an isotropic gaussian (almost structureless). Please refer to the figure below (red part) for a visual demonstration.</p>
<center>
<figure>
<img width="46%" style="padding-top: 20px; float: left;" src="/public/posts_res/20/fwddiff.gif" />
<img width="46%" style="padding-top: 20px;" src="/public/posts_res/20/revdiff.gif" />
</figure>
</center>
<p>However, this isn’t very useful by itself. What would be useful is the opposite, i.e. starting from isotropic gaussian noise and turning it into \(p_{\mathrm{data}}\): that is generative modelling (blue part of the figure above). Since the forward process is fixed (non-parametric) and guaranteed to exist, it is very much possible to invert it. Once inverted, we can use it as a generative model as follows</p>
<p>\[
\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathrm{I}) \implies \mathcal{T}^{-1}(\mathbf{x}_T) \sim p_{\mathrm{data}}
\]</p>
<p>Fortunately, the theory of Markov chains guarantees that for gaussian diffusion, there indeed exists a <strong>reverse diffusion</strong> process \(\mathcal{T}^{-1}\). The original paper from <a href="http://proceedings.mlr.press/v37/sohl-dickstein15.html">Sohl-Dickstein et al., 2015</a> shows how a parametric model of the reverse diffusion \(\mathcal{T}^{-1}_{\theta}\) can be learned from data itself.</p>
<h2 id="graphical-model-and-training">Graphical model and training</h2>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/20/diffusion_pgm.png" />
</figure>
</center>
<p>The stochastic “forward diffusion” and “reverse diffusion” processes described so far can be well expressed in terms of <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Probabilistic Graphical Models (PGMs)</a>. A series of \(T\) random variables defines each of them, with the forward process fully described by Eq. 3. The reverse process is expressed by a parametric graphical model very similar to that of the forward process, but in reverse</p>
\[\tag{4}
p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathrm{I}) \\
p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mathbf{\mu}_{\theta}(\mathbf{x}_t, t), \mathbf{\Sigma}_{\theta}(\mathbf{x}_t, t))\]
<p>Each of the reverse conditionals \(p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\) is structurally gaussian and is responsible for learning to revert the corresponding step of the forward process, i.e. \(q(\mathbf{x}_t \vert \mathbf{x}_{t-1})\). The means and covariances of these reverse conditionals are neural networks with parameters \(\theta\), shared over timesteps. Just like with any other probabilistic model, we wish to minimize the negative log-likelihood of the model distribution under the expectation of the data distribution</p>
<p>\[
\mathcal{L} = \mathbb{E}_{\mathbf{x}_0 \sim p_{\mathrm{data}}}\big[ - \log p_{\theta}(\mathbf{x}_0) \big]
\]</p>
<p>which isn’t quite computable in practice due to its dependence on \((T-1)\) more random variables. With a fair bit of mathematical manipulation, <a href="http://proceedings.mlr.press/v37/sohl-dickstein15.html">Sohl-Dickstein et al., 2015</a> (section 2.3) showed \(\mathcal{L}\) to be a lower bound of another easily computable quantity</p>
<p>\[
\mathcal{L} \leq \mathbb{E}_{\mathbf{x}_0 \sim p_{\mathrm{data}},\ \mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \bigg[ - \log p(\mathbf{x}_T) - \sum_{t\geq 1} \log \frac{p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)}{q(\mathbf{x}_t \vert \mathbf{x}_{t-1})} \bigg]
\]</p>
<p>which is easy to compute and optimize. The expectation is over the joint distribution of the entire forward process. Getting a sample \(\mathbf{x}_{1:T} \sim q(\cdot \vert \mathbf{x}_0)\) boils down to executing the forward diffusion on one sample \(\mathbf{x}_0 \sim p_{\mathrm{data}}\). All quantities inside the expectation are tractable and available to us in closed form.</p>
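<p>Once the reverse conditionals of Eq. 4 are learned, generating data is just ancestral sampling in reverse time. A minimal sketch follows; note that <code>mu_theta</code> here is a hypothetical stand-in (a simple shrinkage rule) for the real learned network, and the fixed-variance choice is one of several possible design choices.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule

def mu_theta(x_t, t):
    # hypothetical stand-in for the learned mean network of Eq. 4;
    # a trained neural net would go here. Shrinking x_t is only a
    # rough imitation of what the true reverse mean does.
    return np.sqrt(1.0 - betas[t]) * x_t

def sigma_t(t):
    # design choice: fixed (non-learned) standard deviation
    return np.sqrt(betas[t])

# ancestral sampling: x_T ~ N(0, I), then x_{t-1} ~ p_theta(x_{t-1} | x_t)
x = rng.normal(size=(16, 2))             # x_T
for t in reversed(range(1, T)):
    z = rng.normal(size=x.shape) if t > 1 else 0.0   # no noise at the last step
    x = mu_theta(x, t) + sigma_t(t) * z              # one reverse step of Eq. 4
# x now holds (would-be) samples from p_theta(x_0)
```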
<h2 id="further-simplification-variance-reduction">Further simplification: Variance Reduction</h2>
<p>Even though we can train the model with the lower bound shown above, a few more simplifications are possible. The first one, due to <a href="http://proceedings.mlr.press/v37/sohl-dickstein15.html">Sohl-Dickstein et al., 2015</a>, is an attempt to reduce variance: they showed that the lower bound can be further simplified and re-written as the following</p>
<p>\[
\mathcal{L} \leq \mathbb{E}_{\mathbf{x}_0,\ \mathbf{x}_{1:T}} \bigg[
\underbrace{\color{red}{D_{\mathrm{KL}}\big[ q(\mathbf{x}_T\vert \mathbf{x}_0) \| p(\mathbf{x}_T) \big]}}_{\text{Independent of }\theta} +
\sum_{t=1}^T D_{\mathrm{KL}}\big[ q(\mathbf{x}_{t-1}\vert \mathbf{x}_t, \mathbf{x}_0) \| p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t )\big] \bigg]
\]</p>
<p>There is a subtle approximation involved in the above expression (the edge case of \(t=1\) in the summation), which is due to <a href="https://proceedings.neurips.cc//paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf">Ho et al., 2020</a> (sections 3.3 and 3.4). The noticeable change in this version is that all conditionals \(q(\cdot \vert \cdot)\) of the forward process are now additionally conditioned on \(\mathbf{x}_0\). Earlier, the corresponding quantities had high uncertainty/variance due to the different possible choices of the starting point \(\mathbf{x}_0\), which is now suppressed by the additional knowledge of \(\mathbf{x}_0\). Moreover, it turns out that \(q(\mathbf{x}_{t-1}\vert \mathbf{x}_t, \mathbf{x}_0)\) has a closed form</p>
<p>\[
q(\mathbf{x}_{t-1}\vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \mathbf{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \mathbf{\tilde{\beta_t}}\mathrm{I})
\]</p>
<p>The exact form (refer to Eq. 7 of <a href="https://proceedings.neurips.cc//paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf">Ho et al., 2020</a>) is not important for a holistic understanding of the algorithm. The only thing to note is that \(\mathbf{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)\) additionally contains \(\beta_t\) (fixed numbers) and that \(\mathbf{\tilde{\beta_t}}\) is a function of \(\beta_t\) only. Moving on, we do the following to the last expression for \(\mathcal{L}\)</p>
<ol>
<li>Use the closed form of \(p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\) in Eq. 4 with \(\mathbf{\Sigma}_{\theta}(\mathbf{x}_t, t) = \mathbf{\tilde{\beta_t}}\mathrm{I}\) (design choice for making things simple)</li>
<li>Expand the KL divergence formula</li>
<li>Convert \(\sum_{t=1}^T\) into expectation (over \(t \sim \mathcal{U}[1, T]\)) by scaling with a constant \(1/T\)</li>
</ol>
<p>.. and arrive at a simpler form</p>
<p>\[\tag{5}
\mathcal{L} \leq \mathbb{E}_{\mathbf{x}_0,\ \mathbf{x}_{1:T},\ t} \bigg[ \frac{1}{2\mathbf{\tilde{\beta_t}}} \vert\vert \mathbf{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \mathbf{\mu}_{\theta}(\mathbf{x}_t, t) \vert\vert^2 \bigg]
\]</p>
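<p>For concreteness, the closed-form posterior \(q(\mathbf{x}_{t-1}\vert \mathbf{x}_t, \mathbf{x}_0)\) can be written out in a few lines. The coefficients below are my transcription of Eq. 7 of Ho et al., 2020 into this blog's notation (where \(\alpha_t\) denotes the cumulative product \(\prod_{s=1}^t (1-\beta_s)\)), so treat it as a sketch to be checked against the paper.</p>

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alphas = np.cumprod(1.0 - betas)        # alpha_t in the blog's notation

def posterior_params(x_t, x0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0),
    transcribed from Eq. 7 of Ho et al., 2020."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    beta_t = betas[t]
    mu = (np.sqrt(a_prev) * beta_t / (1 - a_t)) * x0 \
       + (np.sqrt(1 - beta_t) * (1 - a_prev) / (1 - a_t)) * x_t
    var = (1 - a_prev) / (1 - a_t) * beta_t      # beta_tilde_t, a scalar
    return mu, var
```

Note that <code>var</code> is always smaller than \(\beta_t\): conditioning on \(\mathbf{x}_0\) shrinks the uncertainty, which is exactly the variance-reduction argument above.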
<h2 id="further-simplification-forward-re-parameterization">Further simplification: Forward re-parameterization</h2>
<p>For the second simplification, we look at the forward process in a bit more detail. There is an amazing property of forward diffusion with gaussian noise: the distribution of the noisy sample \(\mathbf{x}_t\) given real data \(\mathbf{x}_0\) can be readily calculated without touching any of the intermediate steps.</p>
\[q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\underbrace{\textstyle{\prod}_{s=1}^t (1-\beta_s)}_{\alpha_t}} \cdot \mathbf{x}_0, (1-\underbrace{\textstyle{\prod}_{s=1}^t (1-\beta_s)}_{\alpha_t}) \cdot \mathrm{I})\]
<p>This is a consequence of the forward process being completely known and having a well-defined probabilistic structure (gaussian noise). By means of (gaussian) reparameterization, we can also derive an easy way of sampling any \(\mathbf{x}_t\) using only a standard gaussian noise vector \(\epsilon \sim \mathcal{N}(0, \mathrm{I})\)</p>
<!-- As a result, sampling a forward diffusion sequence $$\mathbf{x}_{1:T} \sim q(\cdot \vert \mathbf{x}_0)$$ no longer requires ancestral sampling like in Eq. 2, but only require $$\mathbf{x}_t \sim q(\mathbf{x}_t \vert \mathbf{x}_0)$$ for *any $$t$$ in any order*. -->
\[\tag{6}
\mathbf{x}_t(\mathbf{x}_0, \epsilon) = \sqrt{\alpha_t} \cdot \mathbf{x}_0 + \sqrt{1-\alpha_t} \cdot \mathbf{\epsilon}\]
<!-- \text{Or, } \mathbf{x}_0 = \frac{1}{k^{\mu}_t} (\mathbf{x}_t - \sqrt{k^{\sigma}_t} \cdot \mathbf{\epsilon}) -->
<!-- That is, sampling from any $$q(\mathbf{x}_t \vert \mathbf{x}_0)$$ would only require computing the above equation with a *single* standard gaussian noise vector $$\mathbf{\epsilon}$$. -->
<p>As a result, \(\mathbf{x}_{1:T}\) need not be sampled with ancestral sampling (refer to Eq. 2 & 3); we only need to compute Eq. 6 for all \(t\), in <strong>any order</strong>. This further simplifies the expectation in Eq. 5 to (changes highlighted in blue)</p>
<p>\[\tag{7}
\mathcal{L} \leq \mathbb{E}_{\mathbf{x}_0,\ \color{blue}{\mathbf{\epsilon}},\ t} \bigg[ \frac{1}{2\mathbf{\tilde{\beta_t}}} \vert\vert \mathbf{\tilde{\mu}}_t(\color{blue}{\mathbf{x}_t(\mathbf{x}_0, \epsilon)}, \mathbf{x}_0) - \mathbf{\mu}_{\theta}(\color{blue}{\mathbf{x}_t(\mathbf{x}_0, \epsilon)}, t) \vert\vert^2 \bigg]
\]</p>
<!-- \color{blue}{\frac{1}{k^{\mu}_t} (\mathbf{x}_t - \sqrt{k^{\sigma}_t} \cdot \mathbf{\epsilon})} -->
<p>This is the final form that can be implemented in practice, as suggested by <a href="https://proceedings.neurips.cc//paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf">Ho et al., 2020</a>.</p>
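<p>A quick NumPy sanity check of Eq. 6 (toy data and schedule are my own assumptions): the one-shot sample of \(\mathbf{x}_t\) should be statistically indistinguishable from running the chain of Eq. 3 for \(t\) steps.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = np.cumprod(1.0 - betas)     # alpha_t = prod_{s<=t} (1 - beta_s)

x0 = rng.normal(loc=3.0, size=(5000, 2))   # toy "data" with non-zero mean

t = 300
# one-shot sampling of x_t via Eq. 6 -- no ancestral loop needed
eps = rng.normal(size=x0.shape)
x_t_direct = np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps

# the slow way: run the chain of Eq. 3 for the same number of steps
x_t_chain = x0.copy()
for s in range(t + 1):
    x_t_chain = np.sqrt(1 - betas[s]) * x_t_chain \
              + np.sqrt(betas[s]) * rng.normal(size=x0.shape)

# both should have (approximately) the same mean and variance
print(x_t_direct.mean(), x_t_chain.mean())
```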
<h2 id="connection-to-score-based-models-sbm">Connection to Score-based models (SBM)</h2>
<p><a href="https://proceedings.neurips.cc//paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf">Ho et al., 2020</a> uncovered a link between Eq. 7 and a particular <a href="/blog-tut/2021/07/14/generative-model-score-function.html">score-based model</a> known as the <a href="https://openreview.net/pdf?id=B1lcYrBgLH">Noise Conditioned Score Network (NCSN)</a>. With the help of the reparameterized form of \(\mathbf{x}_t\) (Eq. 6) and the functional form of \(\mathbf{\tilde{\mu}}_t\), one can easily (with a few simplification steps) reduce Eq. 7 to</p>
<p>\[
\mathcal{L} \leq \mathbb{E}_{\mathbf{x}_0,\ \mathbf{\epsilon},\ t} \bigg[ \frac{1}{2\mathbf{\tilde{\beta_t}}} \left\vert\left\vert \color{blue}{\frac{1}{\sqrt{1-\beta_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\alpha_t}} \epsilon \right)} - \mathbf{\mu}_{\theta}(\mathbf{x}_t, t) \right\vert\right\vert^2 \bigg]
\]</p>
<p>The above equation is a simple regression with \(\mathbf{\mu}_{\theta}\) being the parametric model (neural net in practice) and the quantity in blue is its regression target. Without loss of generality, we can slightly modify the definition of the parametric model to be \(\mathbf{\mu}_{\theta}(\mathbf{x}_t, t) = \frac{1}{\sqrt{1-\beta_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\alpha_t}} \epsilon_{\theta}(\mathbf{x}_t, t) \right)\). The only “moving part” in the new parameterization is \(\epsilon_{\theta}(\cdot)\); the rest (i.e. \(\mathbf{x}_t\) and \(t\)) are explicitly available to the model. This leads to the following form</p>
\[\tag{8}
\mathcal{L} \leq \mathbb{E}_{\mathbf{x}_0,\ \mathbf{\epsilon},\ t} \bigg[ \color{red}{\frac{\beta_t^2}{2\mathbf{\tilde{\beta_t}}(1-\beta_t)(1-\alpha_t)}} \vert\vert \epsilon - \mathbf{\epsilon}_{\theta}(\underbrace{\sqrt{\alpha_t} \cdot \mathbf{x}_0 + \sqrt{1-\alpha_t} \cdot \mathbf{\epsilon}}_{\mathbf{x}_t}, t) \vert\vert^2 \bigg] \\
\approx \mathbb{E}_t \bigg[ \mathbb{E}_{\mathbf{x}_0,\ \mathbf{\epsilon}} \left[ \vert\vert \epsilon - \mathbf{\epsilon}_{\theta}(\mathbf{x}_t, t) \vert\vert^2 \right] \bigg] \approx \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{\mathbf{x}_0,\ \mathbf{\epsilon}} \left[ \vert\vert \epsilon - \mathbf{\epsilon}_{\theta}(\mathbf{x}_t, t) \vert\vert^2 \right]\]
<p>The expression in red can be discarded without any effect on performance (as suggested by <a href="https://proceedings.neurips.cc//paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf">Ho et al., 2020</a>). I have further approximated the expectation over time-steps with a sample average. If you look at the final form, you may notice a surprising resemblance to the <a href="https://openreview.net/pdf?id=B1lcYrBgLH">Noise Conditioned Score Network (NCSN)</a>. Please refer to \(J_{\mathrm{ncsn}}\) in <a href="/blog-tut/2021/07/14/generative-model-score-function.html">my blog on score models</a>. Below I pin-point the specifics:</p>
<ol>
<li>The time-steps \(t=1, 2, \cdots T\) resemble the increasing “noise-scales” in NCSN. Recall that the noise increases as the forward diffusion approaches the end.</li>
<li>The expectation \(\mathbb{E}_{\mathbf{x}_0,\ \mathbf{\epsilon}}\) (for each scale) holistically matches that of the denoising score matching objective, i.e. \(\mathbb{E}_{\mathbf{x}, \mathbf{\tilde{x}}}\). In the case of Diffusion Models, the noisy sample can be readily computed using the noise vector \(\epsilon\) (refer to Eq. 6).</li>
<li>Just like NCSN, the regression target is the noise vector \(\epsilon\) for each time-step (or scale).</li>
<li>Just like NCSN, the learnable model depends on the noisy sample and the time-step (or scale).</li>
</ol>
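<p>The final objective of Eq. 8 reduces to remarkably few lines of code. In the sketch below, <code>eps_theta</code> is a hypothetical zero-output stand-in for the real noise-prediction network (a U-Net for images); a real implementation would backpropagate through the returned loss.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # hypothetical stand-in for the noise-prediction network
    # epsilon_theta(x_t, t); in practice this is a trained neural net.
    return np.zeros_like(x_t)

def simple_loss(x0):
    """One Monte-Carlo sample of the 'simple' objective in Eq. 8
    (the red weighting term discarded)."""
    t = rng.integers(1, T)                        # t sampled uniformly
    eps = rng.normal(size=x0.shape)               # epsilon ~ N(0, I)
    x_t = np.sqrt(alphas[t]) * x0 \
        + np.sqrt(1 - alphas[t]) * eps            # Eq. 6: one-shot x_t
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

x0 = rng.normal(size=(64, 2))                     # a toy minibatch
print(simple_loss(x0))
```

With the dummy zero predictor, the loss is just the mean squared norm of \(\epsilon\), i.e. close to 1; a trained \(\epsilon_{\theta}\) would drive it lower.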
<h2 id="infinitely-many-noise-scales--continuous-time-analogue">Infinitely many noise scales & continuous-time analogue</h2>
<p>Inspired by the connection between Diffusion Models and Score-based models, <a href="https://arxiv.org/pdf/2011.13456.pdf">Song et al., 2021</a> proposed to use infinitely many noise scales (equivalently, time-steps). At first, this might look like a trivial increase in the number of steps/scales, but there happens to be a principled way to achieve it, namely <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">Stochastic Differential Equations</a> or SDEs. <a href="https://arxiv.org/pdf/2011.13456.pdf">Song et al., 2021</a> reworked the whole formulation considering a continuous SDE as the forward diffusion. Interestingly, it turns out that the reverse process is also an SDE that depends on the score function.</p>
<p>Quite simply, the finite time-steps/scales (i.e. \(t = 0, 1, \cdots T\)) are replaced by infinitely many segments (of length \(\Delta t \rightarrow 0\)) within the time-range \([0, T]\). Instead of \(\mathbf{x}_t\) at every discrete time-step/scale, we define a continuous random process \(\mathbf{x}(t)\) indexed by continuous time \(t\). We also replace the discrete-time conditionals in Eq. 3 with continuous analogues. But this time, we define the “increments” at each step rather than absolute values, i.e. the transition kernel specifies <em>what to add</em> to the previous value. Specifically, we define a general form of <strong>continuous forward diffusion</strong> with</p>
\[q(\mathbf{x}(t+\Delta t) - \mathbf{x}(t) \vert \mathbf{x}(t)) = \mathcal{N}(f(\mathbf{x}(t), t) \Delta t, g^2(t) \Delta t) \\
\tag{9}
\text{Or, }\mathbf{x}(t+\Delta t) - \mathbf{x}(t) = f(\mathbf{x}(t), t)\Delta t + g(t) \cdot \underbrace{\sqrt{\Delta t} \cdot \epsilon}_{\Delta \omega}\text{, with }\epsilon \sim \mathcal{N}(0, \mathrm{I})\]
<p>If you have ever studied SDEs, you might recognize that Eq. 9 resembles <a href="https://en.wikipedia.org/wiki/Euler%E2%80%93Maruyama_method">Euler–Maruyama</a> numerical solver for SDEs. Considering \(f(\mathbf{x}(t), t)\) to be the “Drift function”, \(g(t)\) be the “Diffusion function” and \(\Delta \omega \sim \mathcal{N}(0, \Delta t)\) being the discrete differential of <a href="https://en.wikipedia.org/wiki/Wiener_process">Wiener Process</a> \(\omega(t)\), in the limit of \(\Delta t \rightarrow 0\), the following SDE can be recovered</p>
\[d\mathbf{x}(t) = f(\mathbf{x}(t), t)\cdot dt + g(t)\cdot d\omega(t)\text{, with }d\omega(t) \sim \mathcal{N}(0, dt)\]
<p>A visualization of the continuous forward diffusion in 1D is given below for a set of samples (different colors).</p>
<center>
<figure>
<img width="70%" style="padding-top: 20px;" src="/public/posts_res/20/sde.png" />
</figure>
</center>
<p><a href="https://arxiv.org/pdf/2011.13456.pdf">Song et al., 2021</a> (section 3.4) proposed a few different choices of \(\{f, g\}\), named Variance Exploding (VE), Variance Preserving (VP) and sub-VP. The one that resembles Eq. 3 (discrete forward diffusion) in continuous time and ensures proper diffusion, i.e. \(\mathbf{x}(0) \sim p_{\mathrm{data}} \implies \mathbf{x}(T) \sim \mathcal{N}(0, \mathrm{I})\), is \(f(\mathbf{x}(t), t) = -\frac{1}{2}\beta(t)\mathbf{x}(t),\ g(t) = \sqrt{\beta(t)}\). This particular SDE is termed the “Variance Preserving (VP) SDE” since the variance of \(\mathbf{x}(t)\) remains finite as long as the variance of \(\mathbf{x}(0)\) is finite (Appendix B of <a href="https://arxiv.org/pdf/2011.13456.pdf">Song et al., 2021</a>). We can enforce the covariance of \(\mathbf{x}(0)\) to be \(\mathrm{I}\) simply by standardizing our dataset.</p>
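<p>The VP SDE can be simulated with the Euler–Maruyama scheme of Eq. 9. The sketch below uses a linear \(\beta(t)\) schedule (the particular endpoints 0.1 and 20 are an assumption here, used only for illustration); with standardized data, the per-sample variance indeed stays close to 1 throughout.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def beta(t):
    # assumed continuous noise schedule, linear in t
    return 0.1 + 19.9 * t        # beta(0) = 0.1, beta(1) = 20

# Euler-Maruyama (Eq. 9) for the VP SDE:
#   f(x, t) = -0.5 * beta(t) * x,   g(t) = sqrt(beta(t))
n_steps = 1000
dt = 1.0 / n_steps
x = rng.normal(size=(5000, 1))   # x(0): standardized "data", unit variance
for i in range(n_steps):
    t = i * dt
    drift = -0.5 * beta(t) * x
    noise = rng.normal(size=x.shape)
    x = x + drift * dt + np.sqrt(beta(t)) * np.sqrt(dt) * noise

# "variance preserving": var(x(t)) stays ~1 since var(x(0)) = 1
print(x.var())
```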
<p>An old (but remarkable) result due to <a href="https://www.sciencedirect.com/science/article/pii/0304414982900515">Anderson, 1982</a> shows that the above forward diffusion can be reversed even in closed form, thanks to the following SDE</p>
\[\tag{10}
d\mathbf{x}(t) = \bigg[ f(\mathbf{x}(t), t) - g^2(t) \underbrace{\nabla_{\mathbf{x}} \log p(\mathbf{x}(t))}_{\text{score }\mathbf{s}(\mathbf{x}(t), t)} \bigg] dt + g(t) d\omega(t)\]
<p>Hence, the <strong>reverse diffusion</strong> is simply solving the above SDE in reverse time with initial state \(\mathbf{x}(T) \sim \mathcal{N}(0, \mathrm{I})\), leading to \(\mathbf{x}(0) \sim p_{\mathrm{data}}\). The only missing part is the score, i.e. \(\mathbf{s}(\mathbf{x}(t), t) \triangleq \nabla_{\mathbf{x}} \log p(\mathbf{x}(t))\). Thankfully, we have already seen <a href="/blog-tut/2021/07/14/generative-model-score-function.html">how score estimation works</a> and that is pretty much what we do here. There are two ways, as explained in my <a href="/blog-tut/2021/07/14/generative-model-score-function.html">blog on score models</a>. I briefly go over them below in the context of continuous SDEs:</p>
<h4 id="1-implicit-score-matching-ism">1. Implicit Score Matching (ISM)</h4>
<p>The <em>easier</em> way is to use the Hutchinson trace-estimator based score matching proposed by <a href="http://auai.org/uai2019/proceedings/papers/204.pdf">Song et al., 2019</a> called “Sliced Score Matching”.</p>
<p>\[
J_{\mathrm{I}}(\theta) = \mathbb{E}_{\mathbf{v}\sim\mathcal{N}(0, \mathrm{I})} \mathbb{E}_{\mathbf{x}(0)\sim p_{\mathrm{data}}} \mathbb{E}_{\mathbf{x}(0 \lt t \leq T)\sim q(\cdot\vert \mathbf{x}(0))} \bigg[ \frac{1}{2} \left|\left| \mathbf{s}_{\theta}(\mathbf{x}(t), t) \right|\right|^2 + \mathbf{v}^T \nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x}(t), t) \mathbf{v} \bigg]
\]</p>
<p>Very similar to NCSN, we define a parametric score network \(\mathbf{s}_{\theta}(\mathbf{x}(t), t)\) dependent on continuous time/scale \(t\). Starting from data samples \(\mathbf{x}(0)\sim p_{\mathrm{data}}\), we can generate the rest of the forward chain \(\mathbf{x}(0\lt t \leq T)\) simply by executing a solver (refer to Eq. 9) on the SDE at any required precision (discretization).</p>
<h4 id="2-denoising-score-matching-dsm">2. Denoising Score Matching (DSM)</h4>
<p>There is the other “Denoising score matching (DSM)” way of training \(\mathbf{s}_{\theta}\), which is slightly more complicated. At its core, the DSM objective for continuous diffusion is a continuous analogue of the discrete DSM objective.</p>
\[J_{\mathrm{D}}(\theta) = \mathbb{E}_{t\sim\mathcal{U}(0, T]}\ \mathbb{E}_{\mathbf{x}(0),\ \mathbf{x}(t)} \left[ \vert\vert \mathbf{s}_{\theta}(\mathbf{x}(t), t) - \color{blue}{\nabla_{\mathbf{x}(t)} \log p(\mathbf{x}(t)\vert \mathbf{x}(0))} \vert\vert^2 \right]\]
<p>Remember that in the case of continuous diffusion, we never explicitly modelled the transition conditionals \(p(\mathbf{x}(t)\vert \mathbf{x}(0))\). The reverse diffusion was defined rather implicitly (Eq. 10). Hence, the quantity in blue, unlike its discrete counterpart, isn’t very easy to compute <em>in general</em>. However, thanks to <a href="https://users.aalto.fi/~asolin/sde-book/sde-book.pdf">Särkkä and Solin</a>, there is an easy closed form for it when the drift function \(f\) is <strong>affine</strong> in nature. Thankfully, our specific choice of \(f(\mathbf{x}(t), t)\) is indeed affine.</p>
\[p(\mathbf{x}(t)\vert \mathbf{x}(0)) = \mathcal{N}(\mathbf{x}(t);\ \mathbf{x}(0)e^{-0.5\int_0^t \beta(s)ds},\ \mathrm{I}-\mathrm{I}e^{-\int_0^t \beta(s)ds})\]
<p>Since the conditionals are Gaussian (again), it’s pretty easy to derive the expression for \(\nabla_{\mathbf{x}(t)} \log p(\cdot)\). I leave it for the readers to try.</p>
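<p>For readers who want to verify their answer: writing \(\mu_t\) for the mean and \(\sigma_t^2\) for the (scalar) variance of the Gaussian above, the quadratic log-density immediately gives</p>

\[\nabla_{\mathbf{x}(t)} \log p(\mathbf{x}(t)\vert \mathbf{x}(0)) = \nabla_{\mathbf{x}(t)} \left[ -\frac{\left\vert\left\vert \mathbf{x}(t) - \mu_t \right\vert\right\vert^2}{2\sigma_t^2} + \mathrm{const.} \right] = -\frac{\mathbf{x}(t) - \mu_t}{\sigma_t^2}\]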
<h2 id="computing-log-likelihoods">Computing log-likelihoods</h2>
<p>One of the core reasons score models exist is that they bypass the need to train explicit log-likelihoods, which are difficult to compute for a large range of powerful models. It turns out that for continuous diffusion models, there is an indirect way to evaluate this very log-likelihood. Let’s focus on the “generative process” of continuous diffusion models, i.e. the <strong>reverse diffusion</strong> in Eq. 10. What we would like to compute is \(p(\mathbf{x}(0))\) when \(\mathbf{x}(0)\) is generated by solving the SDE in Eq. 10 backwards with \(\mathbf{x}(T)\sim\mathcal{N}(0, \mathrm{I})\). Even though it is hard to compute the marginal likelihoods \(p(\mathbf{x}(t))\) for any \(t\), there exists a <strong>deterministic ODE (Ordinary Differential Equation)</strong> corresponding to the SDE in Eq. 10 whose marginal likelihoods <em>match</em> those of the SDE for every \(t\)</p>
\[\frac{d\mathbf{x}(t)}{dt} = \bigg[ f(\mathbf{x}(t), t) - \frac{1}{2} g^2(t) \underbrace{\nabla_{\mathbf{x}} \log p(\mathbf{x}(t))}_{\approx\ \mathbf{s}_{\theta}(\mathbf{x}(t), t)} \bigg] \triangleq F(\mathbf{x}(t), t)\]
<p>Note that the above ODE is fully deterministic: the stochastic term of the SDE is gone. After learning the score (as usual), we simply replace the SDE with the above ODE. Now, thanks to <a href="https://proceedings.neurips.cc/paper/2018/file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf">Chen et al., 2018</a>, this problem has already been solved. It is known as a Continuous Normalizing Flow (CNF), whereby given \(\log p(\mathbf{x}(T))\), we can calculate \(\log p(\mathbf{x}(0))\) by solving the following ODE with any numerical solver for \(t = T \rightarrow 0\)</p>
\[\frac{\partial}{\partial t} \log p(\mathbf{x}(t)) = - \mathrm{tr}\left( \frac{d}{d\mathbf{x}(t)} F(\mathbf{x}(t), t) \right)\]
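<p>Here is a deliberately trivial numpy sanity check (my own toy, assuming a constant \(\beta\) and the probability-flow drift \(F = f - \frac{1}{2}g^2\mathbf{s}\)): with \(p_{\mathrm{data}} = \mathcal{N}(0, 1)\) the VP marginals are stationary and the true score is \(-x\), so the ODE-computed \(\log p(\mathbf{x}(0))\) must match the standard normal log-density:</p>

```python
import numpy as np

BETA = 1.0  # constant noise schedule (an illustrative assumption)

def score(x, t):
    # True score of p_t when p_data = N(0, 1): the VP SDE leaves it stationary.
    return -x

def F(x, t):
    # Probability-flow ODE drift: f(x, t) - 0.5 * g(t)^2 * score(x, t)
    return -0.5 * BETA * x - 0.5 * BETA * score(x, t)

def ode_log_likelihood(x0, T=1.0, steps=1000, eps=1e-4):
    """Euler-integrate x(t) and the divergence from t=0 to t=T, then use
    log p(x(0)) = log p(x(T)) + integral of tr(dF/dx) dt."""
    x, t, dt, div_int = x0, 0.0, T / steps, 0.0
    for _ in range(steps):
        div_int += (F(x + eps, t) - F(x - eps, t)) / (2 * eps) * dt  # 1-D trace
        x += F(x, t) * dt
        t += dt
    log_pT = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # prior N(0, 1) at t = T
    return log_pT + div_int
```

<p>In this stationary toy the flow drift vanishes, so the check is exact; for a real model the trace term is nonzero and is typically estimated with Hutchinson’s trick.</p>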
<p>Please remember that this way of computing the log-likelihood is merely a utility and cannot be used to train the model. A <a href="https://arxiv.org/pdf/2101.09258.pdf">more recent paper</a>, however, shows a way to train SDE based continuous diffusion models by directly optimizing (a bound on) the log-likelihood under certain conditions, which may be the topic for another article. I encourage readers to explore it themselves.</p>
<hr />
<p>That’s all for today, see you. Stay tuned by subscribing to the <a href="/feed.xml">RSS Feed</a>. Thank you.</p>Ayan DasGenerative modelling is one of the seminal tasks for understanding the distribution of natural data. VAE, GAN and Flow family of models have dominated the field for the last few years due to their practical performance. Despite commercial success, their theoretical and design shortcomings (intractable likelihood computation, restrictive architecture, unstable training dynamics etc.) have led to the development of a new class of generative models named “Diffusion Probabilistic Models” or DPMs. Diffusion Models, first proposed by Sohl-Dickstein et al., 2015, draw inspiration from the thermodynamic diffusion process and learn a noise-to-data mapping in discrete steps, very similar to Flow models. Lately, DPMs have been shown to have some intriguing connections to Score Based Models (SBMs) and Stochastic Differential Equations (SDE). These connections have further been leveraged to create their continuous-time analogues. In this article, I will describe both the general framework of DPMs, their recent advancements and explore the connections to other frameworks. For the sake of readers, I will avoid gory details, rigorous mathematical derivations and use subtle simplifications in order to maintain focus on the core idea.Generative modelling with Score Functions2021-07-14T00:00:00+00:002021-07-14T00:00:00+00:00https://ayandas.me/blog-tut/2021/07/14/generative-model-score-function<p>Generative Models are of immense interest in fundamental research due to their ability to model the “all important” data distribution. A large class of generative models falls into the category of <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Probabilistic Graphical Models</a> or PGMs. PGMs (e.g. 
<a href="/blog-tut/2020/01/01/variational-autoencoder.html">VAE</a>) usually train a parametric distribution (encoded in the form of graph structure) by maximizing log-likelihood, and sample from it by virtue of ancestral sampling. GANs, another class of popular generative model, take a different approach for training as well as sampling. Both classes of models, however, suffer from several drawbacks, e.g. difficult log-likelihood computation, unstable training etc. Recently, efforts have been made to craft generative models that inherit all good qualities from the existing ones. One of the rising classes of generative models is called “<em>Score based Models</em>”. Rather than explicitly maximizing the log-likelihood of a parametric density, it creates <em>a map to navigate the data space</em>. Once learned, sampling is done by <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_Langevin_dynamics">Langevin Dynamics</a>, an MCMC based method that actually navigates the data space using the map and lands on regions with high probability under the empirical data distribution (i.e. real data regions). In this article, we will describe the fundamentals of Score based models along with a few of its variants.</p>
<blockquote>
<p>The next part of this two-part blog is <a href="/blog-tut/2021/12/04/diffusion-prob-models.html">Diffusion Probabilistic Models</a></p>
</blockquote>
<h2 id="traditional-maximum-likelihood-mle">Traditional maximum-likelihood (MLE)</h2>
<p>Traditional log-likelihood based approaches define a parametric generative process in terms of <a href="/blog-tut/2019/11/20/inference-in-pgm.html">graphical model</a> and maximize the joint density \(p_{\theta}(\mathbf{x})\) w.r.t its parameters \(\theta\)</p>
<p>\[\tag{1}
\theta^* = \arg\max_{\theta} \big[ \log p_{\theta}(\mathbf{x}) \big]
\]</p>
<p>The joint density is often quite complex and sometimes intractable. For intractable cases, we maximize a surrogate objective based on e.g. <a href="/blog-tut/2020/01/01/variational-autoencoder.html">Variational Inference</a>. We achieve the above in practice by moving the parameters in the direction where the expected log-likelihood increases the most at every step \(t\). The expectation is computed empirically at points sampled from our dataset, i.e. the unknown data distribution \(p_{\mathrm{data}}(\mathbf{x})\)</p>
<p>\[
\theta_{t+1} = \theta_t + \alpha \cdot \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}(\mathbf{x})} \big[ \nabla_{\theta} \log p_{\theta}(\mathbf{x}) \big]
\]</p>
<p>With a trained set of parameters \(\theta^*\), we sample from the model with ancestral sampling by exploiting its graphical structure</p>
<p>\[\tag{2}
\mathbf{x}_{\mathrm{sample}} \sim p_{\theta^*}(\mathbf{x})
\]</p>
<p>There is one annoying requirement in both (1) and (2): the parametric model \(p_{\theta}(\mathbf{x})\) must be a valid density. We satisfy this requirement by building the model from careful combinations of known densities like Gaussian, Bernoulli, Dirichlet etc. Even though these are largely sufficient in terms of expressiveness, it might feel a bit too restrictive from a system designer’s perspective.</p>
<h2 id="score-based-models-sbms">Score based models (SBMs)</h2>
<p>A new and emerging class of generative model, namely “Score based models (SBM)”, entirely sidesteps log-likelihood modelling and approaches the problem in a different way. Specifically, SBMs attempt to learn a <em>navigation map</em> on the data space which guides any point on that space to reach a region highly probable under the data distribution \(p_{\mathrm{data}}(\mathbf{x})\). A little but careful thought on this would lead us to something formally known as the <em>Score function</em>. The “Score” of an arbitrary point \(\mathbf{x} \in \mathbb{R}^d\) on the data space is essentially the gradient of the <em>true</em> data log-likelihood at that point</p>
<p>\[\tag{3}
\mathbf{s}(\mathbf{x}) \triangleq \nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x}) \in \mathbb{R}^d
\]</p>
<p>Please be careful and notice that the quantity on the right hand side of (3), i.e. \(\nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x})\) is <strong>not</strong> same as \(\nabla_{\theta} \log p_{\theta}(\mathbf{x})\), the quantity we encountered earlier (in MLE setup), even though they look structurally similar.</p>
<p>Given <em>any</em> point on the data space, the score tells us which direction to navigate if we would like to see a region with higher likelihood. Unsurprisingly, if we take a little step in the direction suggested by the score, we get a point \((\mathbf{x} + \alpha \cdot \mathbf{s}(\mathbf{x}))\) that is slightly more likely under \(p_{\mathrm{data}}(\mathbf{x})\). This is why I termed it a “navigation map”, as in a guiding document that tells us the direction of the “treasure” (i.e. real data samples). All an SBM does is try to approximate the true score function via a parametric proxy</p>
<p>\[
\mathbf{s}_{\theta}(\mathbf{x}) \approx \mathbf{s}(\mathbf{x}) \triangleq \nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x})
\]</p>
<p>As simple as it might sound, we construct a regression problem with \(\mathbf{s}(\mathbf{x})\) as regression targets. We minimize the following loss</p>
<p>\[
J(\theta) = \mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{x}) - \mathbf{s}(\mathbf{x}) \right|\right|_2^2 \bigg] = \mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{x}) - \nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x}) \right|\right|_2^2 \bigg]
\]</p>
<p>This is known as <em>Score Matching</em>. Once trained, we simply keep moving in the direction suggested by \(\mathbf{s}_{\theta^*}(\mathbf{x})\), starting from any random \(\mathbf{x}\), over a finite time horizon \(T\). In practice, we move with a little bit of stochasticity - the formal procedure is known as <em>Langevin Dynamics</em>.</p>
<p>\[\tag{4}
\mathbf{x}_{t+1} = \mathbf{x}_{t} + \alpha_t \cdot \mathbf{s}_{\theta^*}(\mathbf{x}_t) + \sqrt{2 \alpha_t} \cdot \mathbf{z}
\]</p>
<p>\(\mathbf{z} \sim \mathcal{N}(0, I)\) is the injected gaussian noise. If \(\alpha_t \rightarrow 0\) as \(t \rightarrow \infty\), this process guarantees \(\mathbf{x}_t\) to be a true sample from \(p_{\mathrm{data}}(\mathbf{x})\). In practice, we run this process for a finite number of steps \(T\) and assign \(\alpha_t\) according to a decaying schedule. Please refer to <a href="https://www.stats.ox.ac.uk/~teh/research/compstats/WelTeh2011a.pdf">the original paper</a> for a detailed discussion.</p>
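<p>As a toy illustration (my own sketch, not from the paper), here is Eq. 4 in plain numpy, with the analytic score of a standard normal target standing in for a trained \(\mathbf{s}_{\theta^*}\) and a simple, illustrative decaying schedule for \(\alpha_t\):</p>

```python
import numpy as np

def score(x):
    # Analytic score of a standard normal target: grad log N(x; 0, I) = -x.
    return -x

def langevin_sample(n=2000, d=2, T=500, alpha0=0.1, seed=0):
    """Run Eq. 4 for T steps from a random initialization."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-4, 4, size=(n, d))        # arbitrary starting points
    for t in range(T):
        alpha = alpha0 / (1 + 0.01 * t)        # illustrative decaying schedule
        z = rng.standard_normal((n, d))
        x = x + alpha * score(x) + np.sqrt(2 * alpha) * z
    return x
```

<p>The chain forgets its initialization and the samples end up approximately standard normal, i.e. distributed as the target whose score we supplied.</p>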
<p>Looks all good. But, there are two problems with optimizing \(J(\theta)\).</p>
<ul>
<li>
<p><strong>Problem 1:</strong> The very obvious one; we don’t have access to the true scores \(\mathbf{s}(\mathbf{x}) \triangleq \nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x})\). No one knows the exact form of \(p_{\mathrm{data}}(\mathbf{x})\).</p>
</li>
<li>
<p><strong>Problem 2:</strong> The not-so-obvious one; the expectation \(\mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})}\) is a bit problematic. Ideally, the objective must encourage learning the scores all over the data space (i.e. for every \(\mathbf{x} \in \mathbb{R}^d\)). But this isn’t possible with an expectation over only the data distribution. The regions of the data space which are unlikely under \(p_{\mathrm{data}}(\mathbf{x})\) do not get enough supervisory signal.</p>
</li>
</ul>
<h2 id="implicit-score-matching-ism">Implicit Score Matching (ISM)</h2>
<p><a href="https://jmlr.org/papers/volume6/hyvarinen05a/old.pdf">Aapo Hyvärinen, 2005</a> solved the first problem quite elegantly, proposing the <em>Implicit Score Matching</em> objective \(J_{\mathrm{I}}(\theta)\) and showing it to be equivalent to \(J(\theta)\) under some mild regularity conditions. The following remarkable result was shown in the original paper</p>
<p>\[
J_{\mathrm{I}}(\theta) = \mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \bigg[ \frac{1}{2} \left|\left| \mathbf{s}_{\theta}(\mathbf{x}) \right|\right|^2 + \mathrm{tr}(\nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x})) \bigg]
\]</p>
<p>The reason it’s known to be “remarkable” is the fact that \(J_{\mathrm{I}}(\theta)\) does not require the true target scores \(\mathbf{s}(\mathbf{x})\) <em>at all</em>. All we need is to compute an expectation w.r.t the data distribution which can be implemented using finite samples from our dataset. One practical problem with this objective is the amount of computation involved in the jacobian \(\nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x})\). Later, <a href="http://auai.org/uai2019/proceedings/papers/204.pdf">Song et al., 2019</a> proposed to use the <a href="https://www.tandfonline.com/doi/abs/10.1080/03610919008812866">Hutchinson’s trace estimator</a>, a stochastic estimator for computing the trace of a matrix, which simplified the objective further</p>
<p>\[
J_{\mathrm{I}}(\theta) = \mathbb{E}_{p_{\mathbf{v}}} \mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \bigg[ \frac{1}{2} \left|\left| \mathbf{s}_{\theta}(\mathbf{x}) \right|\right|^2 + \mathbf{v}^T \nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x}) \mathbf{v} \bigg]
\]</p>
<p>where \(\mathbf{v} \sim \mathcal{N}(0, \mathbf{I}) \in \mathbb{R}^d\) is a standard multivariate gaussian RV. This objective is computationally advantageous when used in conjunction with automatic differentiation frameworks (e.g. PyTorch) which can efficiently compute the vector-jacobian product (VJP), namely \(\mathbf{v}^T \nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x})\).</p>
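<p>A small numpy illustration of this objective (my own sketch; the Jacobian-vector product is taken by finite differences here, whereas an autodiff framework would provide it exactly via a single VJP call):</p>

```python
import numpy as np

def ssm_loss(score_fn, x, rng, n_proj=8, eps=1e-4):
    """Sliced score matching: E_v E_x [ 0.5 ||s(x)||^2 + v^T (ds/dx) v ].
    (ds/dx) v is approximated by a central finite difference."""
    n, d = x.shape
    s = score_fn(x)
    loss = 0.5 * (s ** 2).sum(axis=1)
    for _ in range(n_proj):
        v = rng.standard_normal((n, d))
        # directional derivative (ds/dx) v, then dot with v for v^T (ds/dx) v
        jvp = (score_fn(x + eps * v) - score_fn(x - eps * v)) / (2 * eps)
        loss = loss + (v * jvp).sum(axis=1) / n_proj
    return loss.mean()
```

<p>For the analytic score \(\mathbf{s}(\mathbf{x}) = -\mathbf{x}\) of a standard normal, the Jacobian is \(-\mathrm{I}\), so the loss settles near \(\mathbb{E}\left[\frac{1}{2}\lVert\mathbf{x}\rVert^2\right] - d\), which is a handy sanity check.</p>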
<h2 id="denoising-score-matching-dsm">Denoising Score Matching (DSM)</h2>
<p>In a different approach, <a href="https://www.iro.umontreal.ca/~vincentp/Publications/DenoisingScoreMatching_NeuralComp2011.pdf">Pascal Vincent, 2011</a> investigated the “unsuspected link” between Score Matching and <a href="https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf">Denoising Autoencoders</a>. This work led to a very efficient and practical objective that is used even in the cutting-edge Score based models. Termed as “Denoising Score Matching (DSM)”, this approach mitigates both problem 1 & 2 described above and does so quite elegantly.</p>
<p>To get rid of problem 2, DSM proposes to simply use a noise-perturbed version of the dataset, i.e. replace \(p_{\mathrm{data}}(\mathbf{x})\) with \(p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}})\) where</p>
<p>\[
p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}) = \int p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}, \mathbf{x}) d\mathbf{x} \text{, with } p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}, \mathbf{x}) = p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} | \mathbf{x}) \cdot p_{\mathrm{data}}(\mathbf{x})
\]</p>
<p>The above equation basically tells us to create a perturbed/corrupted version of the original dataset by adding simple isotropic gaussian noise whose strength is controlled by \(\sigma\), the standard deviation of the gaussian. Since the gaussian distribution is supported over the entire space \(\mathbb{R}^d\), corrupted data samples populate a much larger region of the space and help the parameterized score function learn in regions which were originally unreachable under \(p_{\mathrm{data}}(\mathbf{x})\). The denoising objective \(J_{\mathrm{D}}(\theta)\) simply becomes</p>
<p>\[
J_{\mathrm{D}}(\theta) = \mathbb{E}_{p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{\tilde{x}}) - \nabla_{\mathbf{\tilde{x}}} \log p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}) \right|\right|_2^2 \bigg]
\]</p>
<p>With a crucial proof shown in the appendix of the <a href="https://www.iro.umontreal.ca/~vincentp/Publications/DenoisingScoreMatching_NeuralComp2011.pdf">original paper</a>, we can have an equivalent (changes shown in magenta) version of \(J_{\mathrm{D}}(\theta)\) as</p>
<p>\[\tag{5}
J_{\mathrm{D}}(\theta) = \mathbb{E}_{p_{\mathrm{data}}^{\sigma}(\color{magenta}{\mathbf{\tilde{x}}, \mathbf{x}})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{\tilde{x}}) - \nabla_{\mathbf{\tilde{x}}} \log \color{magenta}{p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} | \mathbf{x})} \right|\right|_2^2 \bigg]
\]</p>
<p>Note that we now need original-corrupt data pairs \((\mathbf{\tilde{x}}, \mathbf{x})\) in order to compute the expectation, which is quite trivial to do. Also realize that the term \(\nabla_{\mathbf{\tilde{x}}} \log p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} \vert \mathbf{x})\) is not the data score but related only to the pre-specified noise model with quite an easy analytic form</p>
<p>\[
\nabla_{\mathbf{\tilde{x}}} \log p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} \vert \mathbf{x}) = - \frac{1}{\sigma^2} (\mathbf{\tilde{x}} - \mathbf{x})
\]</p>
<p>The score function we learn this way isn’t actually for our original data distribution \(p_{\mathrm{data}}(\mathbf{x})\), but rather for the corrupted data distribution \(p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}})\). The strength \(\sigma\) decides how well it aligns with the original distribution. If \(\sigma\) is large, we end up learning an overly corrupted version of the data distribution; if \(\sigma\) is small, we no longer get the nice property out of the noise perturbation - so there is a trade-off. Recently, this trade-off has been utilized for learning robust score based models.</p>
<p>Moreover, Eq. 5 has a very intuitive interpretation and this is where <a href="https://www.iro.umontreal.ca/~vincentp/Publications/DenoisingScoreMatching_NeuralComp2011.pdf">Pascal Vincent, 2011</a> uncovered the link between DSM and Denoising Autoencoders. A closer look at Eq. 5 reveals that \(\mathbf{s}_{\theta}(\mathbf{\tilde{x}})\) has a learning target of \(- \frac{1}{\sigma^2} (\mathbf{\tilde{x}} - \mathbf{x})\), which can be interpreted as a (scaled) vector pointing from the corrupted sample towards the real sample. Succinctly, the score function is trying to learn how to “de-noise” a corrupted sample - that’s precisely what Denoising Autoencoders do.</p>
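<p>As a tiny self-contained demonstration (my own toy, not from the paper): take 1-D data \(x \sim \mathcal{N}(0, 1)\) and a linear score model \(\mathbf{s}_{\theta}(\tilde{x}) = \theta\tilde{x}\). Minimizing Eq. 5 by least squares recovers \(\theta \approx -1/(1+\sigma^2)\), exactly the score of the perturbed distribution \(\mathcal{N}(0, 1+\sigma^2)\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal(200_000)                      # clean data, p_data = N(0, 1)
x_tilde = x + sigma * rng.standard_normal(x.shape)    # corrupted samples

# DSM regression target: grad log p_N(x~ | x) = -(x~ - x) / sigma^2
target = -(x_tilde - x) / sigma**2

# Closed-form least-squares fit of s_theta(x~) = theta * x~ against the target
theta = (x_tilde * target).sum() / (x_tilde**2).sum()
print(theta)   # close to -1 / (1 + sigma^2) = -0.8
```

<p>So even though each individual target is noisy, in expectation the regression lands on the score of the smoothed distribution, just as the theory promises.</p>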
<h2 id="noise-conditioned-score-network-ncsn">Noise Conditioned Score Network (NCSN)</h2>
<p>The idea presented in <a href="https://openreview.net/pdf?id=B1lcYrBgLH">Song et al., 2020</a> is to have \(L\) different noise-perturbed data distributions (with different \(\sigma\)) and one score function for each of them. The noise strengths are chosen to be \(\sigma_1 > \sigma_2 > \cdots > \sigma_L\), so that \(p_{\mathrm{data}}^{\sigma_1}(\mathbf{\tilde{x}})\) is the most corrupted distribution and \(p_{\mathrm{data}}^{\sigma_L}(\mathbf{\tilde{x}})\) is the least. Also, instead of having \(L\) separate score functions, we use one shared score function conditioned on \(\sigma\), i.e. \(\mathbf{s}_{\theta}(\mathbf{\tilde{x}}; \sigma)\).</p>
<p>We finally learn the shared score function from the ensemble of \(L\) distributions</p>
<p>\[
J_{\mathrm{ncsn}}(\theta) = \frac{1}{L} \sum_{l=1}^L \sigma_l^2 \cdot J_{\mathrm{D}}^{\sigma_l}(\theta)
\]</p>
<p>where \(J_{\mathrm{D}}^{\sigma}(\theta)\) is the same as Eq. 5 but uses the shared score network conditioned on \(\sigma\)</p>
<p>\[
J_{\mathrm{D}}^{\sigma}(\theta) = \mathbb{E}_{p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}, \mathbf{x})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{\tilde{x}}; \sigma) - \nabla_{\mathbf{\tilde{x}}} \log p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} | \mathbf{x}) \right|\right|_2^2 \bigg]
\]</p>
<p>In order to sample, <a href="https://openreview.net/pdf?id=B1lcYrBgLH">Song et al., 2020</a> proposed a modified version of Langevin Dynamics termed “Annealed Langevin Dynamics”. The idea is simple: we start from a random sample and run Langevin Dynamics (Eq. 4) using \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma_1)\) for \(T\) steps. We use the final sample as the initial starting point for the next Langevin Dynamics with \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma_2)\). We repeat this process till we get the final sample from \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma_L)\). The intuition here is to sample at a coarse level first and gradually fine-tune it to get high quality samples. The exact algorithm is depicted in Algorithm 1 of <a href="https://openreview.net/pdf?id=B1lcYrBgLH">Song et al., 2020</a>.</p>
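<p>The annealing loop can be sketched as follows (a toy numpy version of my own: the analytic score of Gaussian-smoothed 1-D data \(\mathcal{N}(0, 1)\) stands in for a trained \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma)\), and the step-size rule \(\alpha_l = \epsilon \cdot \sigma_l^2 / \sigma_L^2\) follows the paper):</p>

```python
import numpy as np

def smoothed_score(x, sigma, s0=1.0):
    # Analytic score of N(0, s0^2) data convolved with N(0, sigma^2) noise.
    return -x / (s0**2 + sigma**2)

def annealed_langevin(n=2000, L=10, T=100, eps=2e-5, seed=0):
    """Annealed Langevin Dynamics: run Eq. 4 for T steps at each noise
    level sigma_1 > ... > sigma_L, warm-starting each level from the last."""
    rng = np.random.default_rng(seed)
    sigmas = np.geomspace(1.0, 0.01, L)         # sigma_1 > ... > sigma_L
    x = rng.uniform(-4, 4, size=n)              # arbitrary initialization
    for sigma in sigmas:
        alpha = eps * sigma**2 / sigmas[-1]**2  # step-size rule from the paper
        for _ in range(T):
            z = rng.standard_normal(n)
            x = x + alpha * smoothed_score(x, sigma) + np.sqrt(2 * alpha) * z
    return x
```

<p>Since the smallest noise level is tiny, the final samples are distributed (approximately) as the clean \(\mathcal{N}(0, 1)\) data.</p>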
<h2 id="connection-to-stochastic-differential-equations">Connection to Stochastic Differential Equations</h2>
<p>Recently, <a href="https://arxiv.org/pdf/2011.13456.pdf">Song et al., 2021</a> have established a surprising connection between Score Models, <a href="https://arxiv.org/abs/1503.03585">Diffusion Models</a> and Stochastic Differential Equation (SDEs). Diffusion Models are another rising class of generative models fundamentally similar to score based models but with some notable differences. Since we did not discuss Diffusion Models in this article, we cannot fully explain the connection and how to properly utilize it. However, I would like to show a brief preview of where exactly SDEs show up within the material discussed in this article.</p>
<p><a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">Stochastic Differential Equations (SDEs)</a> are stochastic dynamical systems with state \(\mathbf{x}(t)\), characterized by a <em>Drift function</em> \(f(\mathbf{x}, t)\) and a <em>Diffusion function</em> \(g(\mathbf{x}, t)\)</p>
<p>\[
d \mathbf{x}(t) = f(\mathbf{x}, t) dt + g(\mathbf{x}, t) d\omega(t)
\]</p>
<p>where \(\omega(t)\) denotes the <a href="https://en.wikipedia.org/wiki/Wiener_process">Wiener Process</a> and \(d\omega(t) \sim \mathcal{N}(0, dt)\). Discretizing the above equation in time yields</p>
<p>\[\tag{6}
\mathbf{x}_{t+\Delta t} - \mathbf{x}_t = f(\mathbf{x}_t, t) \Delta t + g(\mathbf{x}_t, t) \Delta \omega\text{, with }\Delta \omega \sim \mathcal{N}(0, \Delta t)
\]</p>
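<p>Eq. 6 is exactly the Euler–Maruyama scheme and takes only a few lines to implement. As a sanity check (a toy of my own, not from the article), simulating the SDE with \(f = -\frac{1}{2}\beta\mathbf{x},\ g = \sqrt{\beta}\) from a point mass at \(x = 5\) should yield \(\mathbf{x}(T) \approx \mathcal{N}(5e^{-\beta T/2},\ 1 - e^{-\beta T})\):</p>

```python
import numpy as np

def euler_maruyama(f, g, x0, T=4.0, steps=400, seed=0):
    """Simulate Eq. 6 forward: x <- x + f(x, t) dt + g(x, t) dw, dw ~ N(0, dt)."""
    rng = np.random.default_rng(seed)
    x, dt = x0.copy(), T / steps
    for i in range(steps):
        t = i * dt
        dw = np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + f(x, t) * dt + g(x, t) * dw
    return x

beta = 1.0
x0 = np.full(5000, 5.0)                          # a point mass at x = 5
xT = euler_maruyama(lambda x, t: -0.5 * beta * x,  # drift
                    lambda x, t: np.sqrt(beta),    # diffusion
                    x0)
```

<p>The empirical mean and standard deviation of <code>xT</code> land close to the analytic values, which is a quick way to convince yourself the discretization is sound.</p>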
<p>To find a connection now, it is only a matter of comparing Eq. 6 with Eq. 4. The sampling process defined by Langevin Dynamics is essentially an SDE discretized in time with</p>
\[\Delta t = 1 \\
f(\mathbf{x}_t, t) = \alpha_t \cdot \mathbf{s}_{\theta^*}(\mathbf{x}_t) \\
g(\mathbf{x}_t, t) = \sqrt{2 \alpha_t} \\
\Delta \omega \equiv \mathbf{z}\]
<hr />
<p>In a follow-up article, <a href="/blog-tut/2021/12/04/diffusion-prob-models.html">An introduction to Diffusion Probabilistic Models</a>, we explore Diffusion Models along with their connection to SDEs.</p>Ayan DasGenerative Models are of immense interest in fundamental research due to their ability to model the “all important” data distribution. A large class of generative models falls into the category of Probabilistic Graphical Models or PGMs. PGMs (e.g. VAE) usually train a parametric distribution (encoded in the form of graph structure) by maximizing log-likelihood, and sample from it by virtue of ancestral sampling. GANs, another class of popular generative model, take a different approach for training as well as sampling. Both classes of models, however, suffer from several drawbacks, e.g. difficult log-likelihood computation, unstable training etc. Recently, efforts have been made to craft generative models that inherit all good qualities from the existing ones. One of the rising classes of generative models is called “Score based Models”. Rather than explicitly maximizing the log-likelihood of a parametric density, it creates a map to navigate the data space. Once learned, sampling is done by Langevin Dynamics, an MCMC based method that actually navigates the data space using the map and lands on regions with high probability under the empirical data distribution (i.e. real data regions). In this article, we will describe the fundamentals of Score based models along with a few of its variants.anyx: Build vector animations from programmatic descriptions2021-05-01T00:00:00+00:002021-05-01T00:00:00+00:00https://ayandas.me/projs/2021/05/01/anyx<p>Project <code class="language-plaintext highlighter-rouge">anyx</code> (pronounced as “anix”) is a python library for producing high-quality (vector) graphics animations with ease. 
Although <code class="language-plaintext highlighter-rouge">anyx</code> is built with no assumption about its downstream area of application, it is mostly targeted towards the scientific community for creating beautiful scientific/technical illustrations. <code class="language-plaintext highlighter-rouge">anyx</code> is created as a programmatic alternative to heavyweight and (sometimes) proprietary graphical software. Unlike low-level libraries like <a href="https://www.pygame.org/news">pygame</a>, <code class="language-plaintext highlighter-rouge">anyx</code> allows users to simply write a <em>description</em> of a target scene and compile it down to the required modality (Video, Animated GIFs etc). The development of <code class="language-plaintext highlighter-rouge">anyx</code> is motivated largely by a similar project called <a href="https://github.com/3b1b/manim">manim</a>.</p>
<p><a href="https://ayandas.me/anyx" target="_blank" class="fa fa-github fa-3x" style="float: left; margin-right: 20px;"></a></p>
<h2 id="i-work-on-this-project-only-in-my-spare-time-and-its-not-done-yet-read-a-brief-description-by-clicking-on-the-github-icon">I work on this project only in my spare time and it’s not done yet. Read a brief description by clicking on the GitHub icon.</h2>Ayan DasProject anyx (pronounced as “anix”) is a python library for producing high-quality (vector) graphics animations with ease. Although anyx is built with no assumption about its downstream area of application, it is mostly targeted towards the scientific community for creating beautiful scientific/technical illustrations. anyx is created as a programmatic alternative to heavyweight and (sometimes) proprietary graphical software. Unlike low-level libraries like pygame, anyx allows users to simply write a description of a target scene and compile it down to the required modality (Video, Animated GIFs etc). The development of anyx is motivated largely by a similar project called manim.
<a target="_blank" class="pubicon" href="https://arxiv.org/pdf/2103.15536.pdf">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/9.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">Analysis of human sketches in deep learning has advanced immensely through the use of waypoint-sequences rather than raster-graphic representations. We further aim to model sketches as a sequence of low-dimensional parametric curves. To this end, we propose an inverse graphics framework capable of approximating a raster or waypoint based stroke encoded as a point-cloud with a variable-degree Bézier curve. Building on this module, we present Cloud2Curve, a generative model for scalable high-resolution vector sketches that can be trained end-to-end using point-cloud data alone. As a consequence, our model is also capable of deterministic vectorization which can map novel raster or waypoint based sketches to their corresponding high-resolution scalable Bézier equivalent. We evaluate the generation and vectorization capabilities of our model on Quick, Draw! and K-MNIST datasets.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
<center>
<!-- Visit this to create the oEmbed link -->
<!-- https://iframely.com/domains/speakerdeck -->
<h2>Slides for my CVPR '21 talk</h2>
<div style="left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.1972%;">
<iframe src="https://speakerdeck.com/player/ff2a87e58efe4d72a32f008e53826776" style="top: 0; left: 5%; width: 90%; height: 90%; position: absolute; border: 0;" allowfullscreen="" scrolling="no" allow="encrypted-media">
</iframe>
</div>
<p>PS: Reusing any of these slides would require permission from the author.</p>
</center>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<h2>Full talk at CVPR 2021</h2>
<iframe width="800" height="450" src="https://www.youtube-nocookie.com/embed/H8-ejwYk7PY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{das2021cloud2curve,
title={Cloud2Curve: Generation and Vectorization of Parametric Sketches},
author={Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
year={2021},
eprint={2103.15536},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
</code></pre></div></div>Ayan DasPaperDifferentiable Programming: Computing source-code derivatives2020-09-08T00:00:00+00:002020-09-08T00:00:00+00:00https://ayandas.me/blog-tut/2020/09/08/differentiable-programming<p>If you are following the recent developments in the field of Deep Learning, you might recognize the new buzz-word, “Differentiable Programming”, doing rounds on social media (including posts by prominent researchers like <a href="https://www.facebook.com/yann.lecun/posts/10155003011462143">Yann LeCun</a> and <a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">Andrej Karpathy</a>) for a year or two. Differentiable Programming (let’s shorten it as “DiffProg” for the rest of this article) is essentially a system proposed as an alternative to tape-based Backpropagation, which runs a <em>recorder</em> (often called a “Tape”) that builds a computation graph <em>at runtime</em> and propagates the error signal from the output towards the leaf nodes (typically weights and biases). DiffProg is very different from an <em>implementation perspective</em> - it doesn’t really “propagate” anything. It consumes a “program” in the form of <em>source code</em> and produces the “derivative program” (also source code) w.r.t. its inputs without <em>ever actually running</em> either of them. Additionally, DiffProg gives users the flexibility to write <em>arbitrary programs</em> without constraining them to any “guidelines”.
In this article, I will describe the difference between the two methods in theoretical as well as practical terms. We’ll look into one successful DiffProg implementation (named “<a href="https://fluxml.ai/Zygote.jl/latest/">Zygote</a>”, written in <a href="https://julialang.org/">Julia</a>) that is gaining popularity in the Deep Learning community.</p>
<h2 id="why-need-derivatives-in-dl-">Why need Derivatives in DL ?</h2>
<p>This is easy to answer but just for the sake of completeness - we are interested in computing derivatives of a function because of its requirement in the update rule of Gradient Descent (or any of its successor):</p>
<p>\[
\Theta := \Theta - \alpha \frac{\partial F(\Theta; \mathcal{D})}{\partial \Theta}
\]</p>
<p>Where \(\Theta\) is the set of all parameters, \(\mathcal{D}\) is the data and \(F(\Theta)\) is the function (typically loss) we want to differentiate. Our ultimate goal is to compute \(\displaystyle{ \frac{\partial F(\Theta; \mathcal{D})}{\partial \Theta} }\) given the <em>structural form</em> of \(F\). The standard way of doing this is to use “Automatic Differentiation” (AutoDiff or AD), or rather, a special case of it called “Backpropagation”. It is called Backpropagation only when the function \(F(\cdot)\) is scalar, which is mostly true in cases we care about.</p>
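<p>As a minimal illustration of the update rule (my own toy example, not taken from any framework), here is plain-Python gradient descent on \(F(\theta) = (\theta - 3)^2\), whose derivative \(2(\theta - 3)\) we can still write by hand:</p>

```python
# Toy gradient descent on F(theta) = (theta - 3)^2.
# The hand-derived gradient is dF/dtheta = 2 * (theta - 3);
# the update rule repeatedly moves theta against the gradient.
def grad_F(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0            # initial parameter
alpha = 0.1            # learning rate
for _ in range(200):   # theta := theta - alpha * dF/dtheta
    theta = theta - alpha * grad_F(theta)

print(theta)           # converges to the minimiser, theta ~= 3
```

<p>Automatic Differentiation exists precisely so that <code class="language-plaintext highlighter-rouge">grad_F</code> does not have to be written by hand for an arbitrary \(F\).</p>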
<h2 id="pullback-functions--backpropagation">“Pullback” functions & Backpropagation</h2>
<p>We will now see how gradients of a complex function (given its full specification) can be computed as a sequence of primitive operations. Let’s explain this with an example for simplicity: We have two inputs \(a, b\) (just symbols) and a description of the <em>scalar</em> function we want to differentiate:</p>
<p>\[
\displaystyle{F(a, b) = \frac{a}{1+b^2}}
\]</p>
<p>We can think of \(F(a, b)\) as a series of smaller computations with intermediate results, like this</p>
\[\begin{align}
y_1 &← pow(b, 2) \\
y_2 &← add(1, y_1) \\
y_3 &← div(a, y_2)
\end{align}\]
<p>I changed the pure math notations to more programmatic ones, but the meaning remains the same. In order to compute gradients, we <em>augment</em> these computations and create something called a “pullback” function as an additional by-product.</p>
<p>Mathematically, the actual computation and pullback creation can be written together symbolically as:</p>
\[\tag{1}
\begin{align}
y_1, \mathcal{B}_1 &← \mathcal{J}(pow, b, 2) \\
y_2, \mathcal{B}_2 &← \mathcal{J}(add, 1, y_1) \\
y_3, \mathcal{B}_3 &← \mathcal{J}(div, a, y_2)
\end{align}\]
<p>You can think of the <em>functional</em> \(\mathcal{J}\) as the internals of the Backpropagation framework which augments every computation unit to produce an extra entity. A pullback function (\(\mathcal{B}_i\)) is a function that takes as input the gradient w.r.t the output of the corresponding function and returns the gradients w.r.t the inputs of that function:</p>
<p>\[
\mathcal{B}_i : \overline{y}_i \rightarrow \overline{input_1}, \overline{input_2}, \cdots
\]</p>
<p>It is really nothing but a different view of the chain-rule of differentiation:</p>
\[\begin{align}
\frac{\partial F}{\partial b} &\leftarrow \mathcal{B}_1(\frac{\partial F}{\partial y_1}) \triangleq \frac{\partial F}{\partial y_1} \cdot \frac{\partial y_1}{\partial b} \\
\overline{b} &\leftarrow \mathcal{B}_1( \overline{y}_1 ) \triangleq \overline{y}_1 \cdot \frac{\partial y_1}{\partial b}\left[ \text{Denoting } \frac{\partial F}{\partial x}\text{ as }\overline{x} \right]
\end{align}\]
<p>We must also realize that computing \(\mathcal{B}_i\) may require values from the forward pass. For example, computing \(\overline{b}\) may need evaluating \(\displaystyle{ \frac{\partial y_1}{\partial b} }\) at the given value of \(b\). After getting access to \(\mathcal{B}_i\), we can compute the derivatives of \(F\) w.r.t \(a, b\) by invoking the pullback functions in proper (reverse) order</p>
\[\begin{align}
\overline{a}, \overline{y_2} &\leftarrow \mathcal{B}_3(\overline{y}_3) \\
\overline{y_1} &\leftarrow \mathcal{B}_2(\overline{y}_2) \\
\overline{b} &\leftarrow \mathcal{B}_1(\overline{y}_1)
\end{align}\]
<p>Please note that \(y_3\) is actually \(F\) and hence \(\overline{y}_3 ≜ \displaystyle{ \frac{\partial F}{\partial y_3} = 1 }\).</p>
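<p>The whole scheme above can be sketched in a few lines of plain Python (a toy illustration of Eq.1, not the internals of any real framework): each primitive returns its output together with a pullback closure, and the pullbacks are then invoked in reverse order.</p>

```python
# Toy version of the J functional from Eq.1: each primitive returns
# its output together with a pullback closure B mapping y_bar to the
# gradients of its inputs.
def J_pow(b, n):
    y = b ** n
    return y, lambda y_bar: y_bar * n * b ** (n - 1)   # d(b^n)/db

def J_add(c, y1):
    y = c + y1
    return y, lambda y_bar: y_bar                      # d(c + y1)/dy1

def J_div(a, y2):
    y = a / y2
    return y, lambda y_bar: (y_bar / y2, -y_bar * a / y2 ** 2)

# Forward pass for F(a, b) = a / (1 + b^2) at (a, b) = (1, 2)
a, b = 1.0, 2.0
y1, B1 = J_pow(b, 2)
y2, B2 = J_add(1.0, y1)
y3, B3 = J_div(a, y2)

# Backward pass: invoke the pullbacks in reverse, seeded with 1
a_bar, y2_bar = B3(1.0)
y1_bar = B2(y2_bar)
b_bar = B1(y1_bar)

print(y3, a_bar, b_bar)   # 0.2 0.2 -0.16
```

<p>Note how each pullback closes over the forward-pass values (e.g. \(b\), \(y_2\)) it needs, exactly as discussed above.</p>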
<h2 id="1-tape-based-implementation">1. Tape-based implementation</h2>
<p>There are a couple of different ways of implementing the theory described above. The de-facto way of doing it (as of this day) is something known as the “tape-based” implementation. <code class="language-plaintext highlighter-rouge">PyTorch</code> and <code class="language-plaintext highlighter-rouge">Tensorflow Eager Execution</code> are probably the most popular examples of this type.</p>
<p>In tape-based systems, the function \(F(..)\) is specified by its full structural form. Moreover, it requires <em>runtime execution</em> in order to compute anything (be it outputs or derivatives). Such a system keeps track of every computation via a recorder or “tape” (hence the name) and builds an internal computation graph. Later, when requested, it stops recording and works its way backwards through the recorded tape to compute derivatives.</p>
<h3 id="the-specification-of-ftheta">The specification of \(F(\Theta)\)</h3>
<p>A tape-based system requires users to provide the function \(F\) as a description of its computations following certain guidelines. These guidelines are provided by the specific AutoDiff framework we use. Take <code class="language-plaintext highlighter-rouge">PyTorch</code> for example - we write the series of computations using the API provided by <code class="language-plaintext highlighter-rouge">PyTorch</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="k">class</span> <span class="nc">Network</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">b0</span> <span class="o">=</span> <span class="p">...</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">a</span><span class="p">):</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">pow</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">b0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">y2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">y1</span><span class="p">)</span>
<span class="n">y3</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">div</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">y3</span>
</code></pre></div></div>
<p>Think of the framework as an entity which is solely responsible for doing all the derivative computations. You just can’t carelessly use <code class="language-plaintext highlighter-rouge">math.pow()</code> (or anything else) instead of <code class="language-plaintext highlighter-rouge">torch.pow()</code>, or omit the base class <code class="language-plaintext highlighter-rouge">torch.nn.Module</code>. You have to stick to the guidelines <code class="language-plaintext highlighter-rouge">PyTorch</code> laid out to be able to make use of it. When done with the definition, we can run the forward and backward passes using actual data \((a_0, b_0)\)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>model = Network(...)
F = model(a0)
F.backward()
# 'model.b0.grad' & 'a0.grad' available
</code></pre></div></div>
<p>This will cause the framework to trigger the following sequence of computations one after another</p>
\[\tag{2}
\begin{align}
y_1, \mathcal{B}_1 &← \mathcal{J}(\mathrm{torch.pow}, \mathbf{b_0}, 2) \\
y_2, \mathcal{B}_2 &← \mathcal{J}(\mathrm{torch.add}, 1, y_1) \\
y_3, \mathcal{B}_3 &← \mathcal{J}(\mathrm{torch.div}, \mathbf{a_0}, y_2) \\
\left[ \overline{a}\right]_{a=\mathbf{a_0}}, \overline{y_2} &\leftarrow \mathcal{B}_3(1) \\
\overline{y_1} &\leftarrow \mathcal{B}_2(\overline{y}_2) \\
\left[ \overline{b}\right]_{b=\mathbf{b_0}} &\leftarrow \mathcal{B}_1(\overline{y}_1)
\end{align}\]
<p>The first and last 3 lines of computation are the “forward pass” and the “backward pass” of the model respectively. Frameworks like <code class="language-plaintext highlighter-rouge">PyTorch</code> and <code class="language-plaintext highlighter-rouge">Tensorflow</code> typically work in this way when <code class="language-plaintext highlighter-rouge">.forward()</code> and <code class="language-plaintext highlighter-rouge">.backward()</code> calls are made in succession. Note that since we are explicitly executing a forward pass, the framework will cache the values required for executing the pullbacks in the backward pass. An overall diagram is shown below for clarification.</p>
<center>
<figure>
<img width="50%" style="padding-top: 20px;" src="/public/posts_res/18/tape_based.png" />
<figcaption>Fig.1: Overall pipeline of tape-based backpropagation. Green arrows indicate pullback creation by the framework and magenta arrows denote the runtime execution flow. </figcaption>
</figure>
</center>
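<p>For intuition, the tape mechanism can be mimicked in plain Python (a toy sketch of my own; real frameworks record tensor operations, not scalar floats): every operation appends its pullback to a global tape at runtime, and the backward pass simply walks that tape in reverse.</p>

```python
# Toy tape: every operation records its pullback at runtime, and
# backward() replays the recordings in reverse order, as in Eq.2.
tape = []  # the "tape": pullbacks in execution order

class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

def vpow(x, n):                      # y = x^n, n a plain constant
    out = Var(x.value ** n)
    def pullback():
        x.grad += out.grad * n * x.value ** (n - 1)
    tape.append(pullback)
    return out

def vadd(c, x):                      # y = c + x, c a plain constant
    out = Var(c + x.value)
    def pullback():
        x.grad += out.grad
    tape.append(pullback)
    return out

def vdiv(x, y):                      # y = x / y
    out = Var(x.value / y.value)
    def pullback():
        x.grad += out.grad / y.value
        y.grad += -out.grad * x.value / y.value ** 2
    tape.append(pullback)
    return out

def backward(out):
    out.grad = 1.0
    for pullback in reversed(tape):  # walk the tape backwards
        pullback()

a, b = Var(1.0), Var(2.0)
F = vdiv(a, vadd(1.0, vpow(b, 2)))  # F(a, b) = a / (1 + b^2)
backward(F)
print(F.value, a.grad, b.grad)       # 0.2 0.2 -0.16
```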
<h3 id="whats-the-problem-">What’s the problem ?</h3>
<p>As of now, it might not seem that big of a problem for a regular PyTorch user (me included). The problem intensifies when you have a non-ML code base with a complicated physics model (for example) like this</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">math</span>
<span class="kn">from</span> <span class="nn">other_part_of_my_model</span> <span class="kn">import</span> <span class="n">sub_part</span>
<span class="k">def</span> <span class="nf">helper_function</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">if</span> <span class="n">something</span><span class="p">:</span>
<span class="k">return</span> <span class="n">helper_function</span><span class="p">(</span><span class="n">sub_part</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="c1"># recursive call
</span> <span class="p">...</span>
<span class="k">def</span> <span class="nf">complex_physics_model</span><span class="p">(</span><span class="n">parameters</span><span class="p">,</span> <span class="nb">input</span><span class="p">):</span>
<span class="n">math</span><span class="p">.</span><span class="n">do_awesome_thing</span><span class="p">(</span><span class="n">parameters</span><span class="p">,</span> <span class="n">helper_function</span><span class="p">(</span><span class="nb">input</span><span class="p">))</span>
<span class="p">...</span>
<span class="k">return</span> <span class="n">output</span>
</code></pre></div></div>
<p>.. and you want to use it within your <code class="language-plaintext highlighter-rouge">PyTorch</code> model and differentiate it. There is no way you can do this easily without first spending time <code class="language-plaintext highlighter-rouge">PyTorch</code>-ifying it.</p>
<p>There is another serious problem with this approach: the framework cannot “<em>see</em>” any computation ahead of time. For example, when the execution thread reaches the <code class="language-plaintext highlighter-rouge">torch.add()</code> function, it has no idea that it is about to encounter <code class="language-plaintext highlighter-rouge">torch.div()</code>. This is important because the framework has no way of optimizing the computation - it <em>has to</em> execute the exact sequence of computations verbatim. For example, if the function description is given as \(\displaystyle{ F(a, b) = \frac{a + ab}{1 + b} }\), this type of framework will waste resources executing lots of operations (both in the forward and backward direction) which ultimately yield something trivial (the expression reduces to just \(a\)).</p>
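<p>A quick numerical check (central finite differences, purely for illustration) confirms how redundant that computation graph is: since \(F(a, b) = (a + ab)/(1 + b) = a\), the gradients are simply \(\partial F/\partial a = 1\) and \(\partial F/\partial b = 0\), yet a tape-based system would still differentiate every recorded operation.</p>

```python
# F(a, b) = (a + a*b) / (1 + b) algebraically reduces to just 'a',
# so dF/da = 1 and dF/db = 0 -- but a tape-based system still
# records and differentiates every operation verbatim.
def F(a, b):
    return (a + a * b) / (1 + b)

def finite_diff(f, a, b, eps=1e-6):
    # central differences w.r.t. each argument
    dfa = (f(a + eps, b) - f(a - eps, b)) / (2 * eps)
    dfb = (f(a, b + eps) - f(a, b - eps)) / (2 * eps)
    return dfa, dfb

dfa, dfb = finite_diff(F, 3.0, 7.0)
print(dfa, dfb)  # approximately 1.0 and 0.0, for any (a, b)
```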
<h2 id="2-differentiable-programming">2. Differentiable Programming</h2>
<p>Differentiable Programming (DiffProg) offers a very elegant solution to both of the problems described in the previous section. <strong>DiffProg allows you to write arbitrary code <em>without following any guidelines</em> and still be able to differentiate it.</strong> At the current state of DiffProg, the majority of successful systems use something called “<em>source code transformation</em>” in order to achieve this.</p>
<p>Source code transformation is a technique used extensively in the field of compiler design. It takes a piece of code written in some high-level language (like C++, Python etc.) and emits a <em>compiled</em> version of it, typically in a relatively lower-level language (like Assembly, Bytecode, IRs etc.). Specifically, the input to a DiffProg system is a description of \(y ← F(\Theta)\) as <em>source code</em> written in some language with defined inputs/outputs. The output of the system is the source code of the derivative of \(F(\Theta)\) w.r.t its inputs (i.e., \(\overline{\Theta} ← F'(\overline{y})\)). The input program has full liberty to use the native primitives of the programming language like built-in functions, conditional statements, recursion, <code class="language-plaintext highlighter-rouge">struct</code>-like facilities, memory read/write constructs and pretty much anything else that the language offers.</p>
<p>Using our generic notation, we can write down such a system as</p>
\[y, \mathcal{B} \leftarrow \mathcal{J}(F, \Theta)\]
<p>where \(F\) and \(\mathcal{B}:\overline{y}\rightarrow \overline{\Theta}\) are the given function and its derivative function in the form of <em>source code</em> (bear with me if it doesn’t make sense at this point). Just like before, the <em>source code</em> for the pullback \(\mathcal{B}\) may require some intermediate variables from that of \(y\). For a concrete example, the following would be a (hypothetical) valid DiffProg system:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="c1"># following string contains the source code of F(.)
</span><span class="o">>>></span> <span class="n">input_prog</span> <span class="o">=</span> <span class="s">"""
def F(a, b):
y1 = b ** 2
y2 = 1 + y1
return a / y2
"""</span>
<span class="o">>>></span> <span class="n">y</span><span class="p">,</span> <span class="n">B</span> <span class="o">=</span> <span class="n">diff_prog</span><span class="p">(</span><span class="n">input_prog</span><span class="p">,</span> <span class="n">a</span><span class="o">=</span><span class="mf">1.</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="mf">2.</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="mf">0.2</span>
<span class="o">>>></span> <span class="k">exec</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="c1"># get the derivative function as a live object in current session
</span><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">dF</span><span class="p">(</span><span class="mf">1.</span><span class="p">))</span> <span class="c1"># 'df()' is defined in source code 'B'
</span><span class="p">(</span><span class="mf">0.2</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.16</span><span class="p">)</span>
</code></pre></div></div>
<p>Please pay attention to the fact that both our problems discussed in tape-based system are effectively solved now:</p>
<ol>
<li>
<p>We no longer need to be under the umbrella of a framework as we can directly work with native code. In the above example, the source code of the given function is simply written in native Python. The example shows the overall pullback source-code (i.e., <code class="language-plaintext highlighter-rouge">B</code>) and also its explicitly compiled form (i.e., <code class="language-plaintext highlighter-rouge">dF</code>). Optionally, a DiffProg system can produce a readily compiled derivative function.</p>
</li>
<li>
<p>The DiffProg system can “see” the whole source-code at once, giving it the opportunity to run various optimizations. As a result, both the given program and the derivative program can be much faster than with standard tape-based approaches.</p>
</li>
</ol>
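<p>To make the idea tangible, here is a toy source-code transformation system for straight-line programs over the three primitives above (entirely my own sketch; a real DiffProg system consumes compiler IR, as we will see shortly, not op lists or strings). It emits the <em>source code</em> of the derivative function, which is then compiled with <code class="language-plaintext highlighter-rouge">exec</code>:</p>

```python
# Toy source-code transformation for straight-line programs over
# pow/add/div (a hypothetical sketch, not a real DiffProg system).
# diff_prog() emits the *source code* of the derivative function.
FWD = {"pow": "{} ** {}", "add": "{} + {}", "div": "{} / {}"}

# gradient expression templates: (op, argument index) -> template
RULES = {
    ("pow", 0): "{g} * {a1} * {a0} ** ({a1} - 1)",
    ("add", 0): "{g}",
    ("add", 1): "{g}",
    ("div", 0): "{g} / {a1}",
    ("div", 1): "-{g} * {a0} / {a1} ** 2",
}

def diff_prog(inputs, program, output):
    lines = ["def dF({}):".format(", ".join(inputs))]
    for y, op, (a0, a1) in program:            # forward pass, verbatim
        lines.append("    {} = {}".format(y, FWD[op].format(a0, a1)))
    for n in inputs + [y for y, _, _ in program]:
        if n != output:                        # zero-init accumulators
            lines.append("    d_{} = 0.0".format(n))
    lines.append("    d_{} = 1.0".format(output))
    for y, op, (a0, a1) in reversed(program):  # backward pass
        for i, arg in enumerate((a0, a1)):
            if isinstance(arg, str):           # literals get no gradient
                rhs = RULES[(op, i)].format(g="d_" + y, a0=a0, a1=a1)
                lines.append("    d_{} += {}".format(arg, rhs))
    lines.append("    return {}, ({})".format(
        output, ", ".join("d_" + n for n in inputs)))
    return "\n".join(lines)

# F(a, b) = a / (1 + b^2), written as the op sequence from Eq.1
program = [("y1", "pow", ("b", 2)),
           ("y2", "add", (1, "y1")),
           ("y3", "div", ("a", "y2"))]
src = diff_prog(["a", "b"], program, "y3")
exec(src)                    # compile the emitted derivative source
print(dF(1.0, 2.0))          # (0.2, (0.2, -0.16))
```

<p>The emitted <code class="language-plaintext highlighter-rouge">dF</code> contains both the forward pass and the backward pass as plain source code, which an optimizer could now freely rewrite.</p>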
<p>Although I showed the examples in Python for ease of understanding, it doesn’t really have to be Python. The theory of DiffProg is very general and can be adapted to any language. In fact, Python is NOT the language of choice for some of the first successful DiffProg systems. The one we are going to talk about is written in a relatively new language called <a href="http://julialang.org/">Julia</a>. The Julia language and its compiler provide excellent support for meta-programming, i.e. manipulating/analysing/constructing Julia programs within Julia itself. This allows Julia to offer a DiffProg system that is much more flexible than naively parsing strings (like the toy example shown above). We will look into the specifics of the Julia language and its DiffProg framework called “<a href="https://fluxml.ai/Zygote.jl/latest/">Zygote</a>” later in this article. But before that, we will look at a few details of the general compiler machinery that is required to implement DiffProg systems. Since this article is mostly targeted towards people from an ML/DL background, I will keep the compiler-design details to a reasonable minimum.</p>
<h3 id="static-single-assignment-ssa-form">Static Single Assignment (SSA) form</h3>
<p>A compiler (or compiler-like system) analyses a given source code by parsing it as a string. Then, it creates a large and complex data structure (known as an AST) capturing control flow, conditionals and every fundamental language construct. Such a structure is further compiled down to a relatively low-level representation comprising the core flow of the source program. This low-level code is known as the “Intermediate Representation (IR)”.
One of its fundamental purposes is to replace all variable names with unique IDs. Given a source code like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function F(a, b)
y1 = b ^ 2
y1 = 1 + y1
return a / y1
</code></pre></div></div>
<p>a compiler can turn it into a (hypothetical) IR like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function F(%1, %2)
%3 = %2 ^ 2
%3 = 1 + %3
return %1 / %3
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">%N</code> is a unique placeholder for a variable. However, this particular form is a little inconvenient to analyse in practice due to the possibility of a symbol redefinition (e.g. the variable <code class="language-plaintext highlighter-rouge">y1</code> in above example). Modern compilers (including Julia) use a little improved IR, called “<em>SSA (Static Single Assignment) form IR</em>” which assigns one variable only once and often introduces extra unique symbols in order to achieve that.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function F(%1, %2)
%3 = %2 ^ 2
%4 = 1 + %3
return %1 / %4
</code></pre></div></div>
<p>Please note how it used an extra unique ID (i.e. <code class="language-plaintext highlighter-rouge">%4</code>) in order to avoid re-assignment (of <code class="language-plaintext highlighter-rouge">%3</code>).
It has been shown that such an SSA-form IR (rather than the direct source code) can be differentiated, and a corresponding “derivative IR” can be retrieved. The obvious way of crafting the derivative IR of \(F\) is to use the derivative IRs of its constituent operations, similar to what is done in the tape-based method. The biggest difference is that everything is now expressed as source code (or rather IR, to be precise). The compiler could craft the derivative program like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function dF(%1, %2)
# IR for forward pass
%3, B1 = J(pow, %2, 2)
%4, B2 = J(add, 1, %3)
_, B3 = J(div, %1, %4)
# IR for backward pass
%5, %6 = B3(1)
%7 = B2(%6)
%8 = B1(%7)
return %5, %8
</code></pre></div></div>
<p>The structure of the above code may resemble the sequence of computations in Eq.2, but it is very different in terms of implementation (refer to Fig.2 below). The code (IR) is constructed at compile time by a compiler-like framework (the DiffProg system). The derivative IR is then passed on to an IR optimizer which can enable various optimizations like dead-code elimination, operation fusion and more advanced ones, before finally being compiled down to machine code.</p>
<center>
<figure>
<img width="90%" style="padding-top: 20px;" src="/public/posts_res/18/diff_prog.png" />
<figcaption>Fig.2: Overall pipeline of a DiffProg system with source-code transformation. Green arrows indicate creation of pullback codes by the framework and magenta arrows denote composition of source code blocks. After compiler optimization, the whole source code is squeezed into highly efficient object code. </figcaption>
</figure>
</center>
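<p>The SSA renaming described earlier is easy to sketch for straight-line code (again a toy of my own, nothing like a production compiler): every assignment receives a fresh <code class="language-plaintext highlighter-rouge">%N</code>, so a re-assignment such as the earlier <code class="language-plaintext highlighter-rouge">y1</code> simply consumes a new ID.</p>

```python
# Toy SSA conversion for straight-line code: every assignment gets a
# fresh %N, so a re-assignment (y1 below) consumes a new ID instead
# of overwriting the old one.
def to_ssa(args, body):
    env = {}        # current source-name -> %N mapping
    counter = 0
    def fresh(name):
        nonlocal counter
        counter += 1
        env[name] = "%{}".format(counter)
        return env[name]
    for a in args:  # function arguments get the first IDs
        fresh(a)
    out = []
    for target, op, lhs, rhs in body:
        # resolve the operands *before* renaming the target
        expr = " ".join(env.get(str(x), str(x)) for x in (lhs, op, rhs))
        out.append("{} = {}".format(fresh(target), expr))
    return out

# the earlier example, including the re-assignment of y1
body = [("y1", "^", "b", 2),
        ("y1", "+", 1, "y1"),
        ("ret", "/", "a", "y1")]
for line in to_ssa(["a", "b"], body):
    print(line)   # %3 = %2 ^ 2 / %4 = 1 + %3 / %5 = %1 / %4
```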
<h2 id="zygote-a-diffprog-framework-for-julia"><code class="language-plaintext highlighter-rouge">Zygote</code>: A DiffProg framework for Julia</h2>
<p>Julia is a particularly interesting language when it comes to implementing a DiffProg framework. There are solid reasons why <code class="language-plaintext highlighter-rouge">Zygote</code>, one of the most successful DiffProg frameworks, is written in Julia. I will try to articulate a few of them below:</p>
<ol>
<li>
<p><strong>Just-In-Time (JIT) compiler:</strong> Julia’s efficient Just-in-time (JIT) compiler compiles one statement at a time and runs it immediately before moving on to the next one, achieving a striking balance between interpreted and compiled languages.</p>
</li>
<li>
<p><strong>Type specialization:</strong> Julia allows writing generic/optional/loosely-typed functions that can later be instantiated using concrete types. High-density computations specifically benefit from it by casting every computation in terms of <code class="language-plaintext highlighter-rouge">Float32/Float64</code> which can in turn produce specialized instructions (e.g. <code class="language-plaintext highlighter-rouge">AVX</code>, <code class="language-plaintext highlighter-rouge">MMX</code>, <code class="language-plaintext highlighter-rouge">AVX2</code>) for modern CPUs.</p>
</li>
<li>
<p><strong>Pre-compilation:</strong> The peculiar feature that benefits <code class="language-plaintext highlighter-rouge">Zygote</code> the most is that Julia keeps track of the compilations that have already been done in the current session and does NOT do them again. Since DL/ML is all about computing gradients over and over again, <code class="language-plaintext highlighter-rouge">Zygote</code> compiles and optimizes the derivative program (IR) just once and runs it repeatedly (which is blazingly fast) with different values of the parameters.</p>
</li>
<li>
<p><strong>LLVM IR:</strong> Julia uses <a href="https://llvm.org/">LLVM compiler infrastructure</a> as its backend and hence emits the LLVM IR known to be highly performant and used by many other prominent languages.</p>
</li>
</ol>
<p>Now, let’s see <code class="language-plaintext highlighter-rouge">Zygote</code>’s primary API, which is surprisingly simple. The central API of <code class="language-plaintext highlighter-rouge">Zygote</code> is the function <code class="language-plaintext highlighter-rouge">Zygote.gradient(..)</code>, whose first argument is the function to be differentiated (written in native Julia), followed by the arguments at which the gradient is to be computed.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="k">using</span> <span class="n">Zygote</span>
<span class="n">julia</span><span class="o">></span> <span class="k">function</span><span class="nf"> F</span><span class="x">(</span><span class="n">x</span><span class="x">)</span>
<span class="k">return</span> <span class="mi">3</span><span class="n">x</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">2</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="n">julia</span><span class="o">></span> <span class="n">gradient</span><span class="x">(</span><span class="n">F</span><span class="x">,</span> <span class="mi">5</span><span class="x">)</span>
<span class="x">(</span><span class="mi">32</span><span class="x">,)</span>
</code></pre></div></div>
<p>That is basically computing \(\displaystyle{ \left[ \frac{\partial F}{\partial x} \right]_{x=5} }\) for \(F(x) = 3x^2 + 2x + 1\).</p>
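<p>Zygote does this with reverse-mode source transformation, but for a scalar function we can reproduce the same number with a short forward-mode sketch using dual numbers (my own illustration in Python; this is NOT how Zygote works internally):</p>

```python
# Toy forward-mode AD with dual numbers: each value carries its
# derivative ("dot") alongside, so F'(x) falls out of one forward pass.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.val * o.dot + self.dot * o.val)  # product rule
    __rmul__ = __mul__
    def __pow__(self, n):  # n a constant integer exponent
        return Dual(self.val ** n, n * self.val ** (n - 1) * self.dot)

def gradient(f, x):
    return f(Dual(x, 1.0)).dot  # seed dx/dx = 1

F = lambda x: 3 * x ** 2 + 2 * x + 1
print(gradient(F, 5.0))  # 32.0
```

<p>Forward-mode carries derivatives alongside values; reverse-mode (what Zygote actually emits) is preferable when there are many parameters and a single scalar output, as in Deep Learning.</p>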
<p>For debugging purposes, we can see the <em>actual</em> IR code for a given function and its pullback. The actual IR is a bit more complex than the hypothetical one I showed, but similar in high-level structure. We can peek into the IR of the above function</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">Zygote</span><span class="o">.</span><span class="nd">@code_ir</span> <span class="n">F</span><span class="x">(</span><span class="mi">5</span><span class="x">)</span>
<span class="mi">1</span><span class="o">:</span> <span class="x">(</span><span class="o">%</span><span class="mi">1</span><span class="x">,</span> <span class="o">%</span><span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">3</span> <span class="o">=</span> <span class="n">Core</span><span class="o">.</span><span class="n">apply_type</span><span class="x">(</span><span class="n">Base</span><span class="o">.</span><span class="kt">Val</span><span class="x">,</span> <span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">4</span> <span class="o">=</span> <span class="x">(</span><span class="o">%</span><span class="mi">3</span><span class="x">)()</span>
<span class="o">%</span><span class="mi">5</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">literal_pow</span><span class="x">(</span><span class="n">Main</span><span class="o">.:^</span><span class="x">,</span> <span class="o">%</span><span class="mi">2</span><span class="x">,</span> <span class="o">%</span><span class="mi">4</span><span class="x">)</span>
<span class="o">%</span><span class="mi">6</span> <span class="o">=</span> <span class="mi">3</span> <span class="o">*</span> <span class="o">%</span><span class="mi">5</span>
<span class="o">%</span><span class="mi">7</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="o">%</span><span class="mi">2</span>
<span class="o">%</span><span class="mi">8</span> <span class="o">=</span> <span class="o">%</span><span class="mi">6</span> <span class="o">+</span> <span class="o">%</span><span class="mi">7</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="o">%</span><span class="mi">8</span>
</code></pre></div></div>
<p>.. and also its “Adjoint”. The adjoint in Zygote is basically the mathematical functional \(\mathcal{J}(\cdot)\) that we’ve been seeing all along.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">Zygote</span><span class="o">.</span><span class="nd">@code_adjoint</span> <span class="n">F</span><span class="x">(</span><span class="mi">5</span><span class="x">)</span>
<span class="n">Zygote</span><span class="o">.</span><span class="kt">Adjoint</span><span class="x">(</span><span class="mi">1</span><span class="o">:</span> <span class="x">(</span><span class="o">%</span><span class="mi">3</span><span class="x">,</span> <span class="o">%</span><span class="mi">4</span> <span class="o">::</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">Context</span><span class="x">,</span> <span class="o">%</span><span class="mi">1</span><span class="x">,</span> <span class="o">%</span><span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">5</span> <span class="o">=</span> <span class="n">Core</span><span class="o">.</span><span class="n">apply_type</span><span class="x">(</span><span class="n">Base</span><span class="o">.</span><span class="kt">Val</span><span class="x">,</span> <span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">6</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">_pullback</span><span class="x">(</span><span class="o">%</span><span class="mi">4</span><span class="x">,</span> <span class="o">%</span><span class="mi">5</span><span class="x">)</span>
<span class="o">...</span>
<span class="c"># please run yourself to see the full code</span>
<span class="o">...</span>
<span class="o">%</span><span class="mi">13</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">gradindex</span><span class="x">(</span><span class="o">%</span><span class="mi">12</span><span class="x">,</span> <span class="mi">1</span><span class="x">)</span>
<span class="o">%</span><span class="mi">14</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">accum</span><span class="x">(</span><span class="o">%</span><span class="mi">6</span><span class="x">,</span> <span class="o">%</span><span class="mi">10</span><span class="x">)</span>
<span class="o">%</span><span class="mi">15</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">tuple</span><span class="x">(</span><span class="nb">nothing</span><span class="x">,</span> <span class="o">%</span><span class="mi">14</span><span class="x">)</span>
<span class="k">return</span> <span class="o">%</span><span class="mi">15</span><span class="x">)</span>
</code></pre></div></div>
<p>I have established throughout this article that the function \(F(x)\) can literally be any arbitrary program written in native Julia using standard language features.
Let’s see another toy (but meaningful) program using more flexible Julia code.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span><span class="nc"> Point</span>
<span class="n">x</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">y</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">Point</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Real</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Real</span><span class="x">)</span> <span class="o">=</span> <span class="n">new</span><span class="x">(</span><span class="n">convert</span><span class="x">(</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">x</span><span class="x">),</span> <span class="n">convert</span><span class="x">(</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">y</span><span class="x">))</span>
<span class="k">end</span>
<span class="c"># Define operator overloads for '+', '*', etc.</span>
<span class="k">function</span><span class="nf"> distance</span><span class="x">(</span><span class="n">p₁</span><span class="o">::</span><span class="n">Point</span><span class="x">,</span> <span class="n">p₂</span><span class="o">::</span><span class="n">Point</span><span class="x">)</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">δp</span> <span class="o">=</span> <span class="n">p₁</span> <span class="o">-</span> <span class="n">p₂</span>
<span class="n">norm</span><span class="x">([</span><span class="n">δp</span><span class="o">.</span><span class="n">x</span><span class="x">,</span> <span class="n">δp</span><span class="o">.</span><span class="n">y</span><span class="x">])</span>
<span class="k">end</span>
<span class="n">p₁</span><span class="x">,</span> <span class="n">p₂</span> <span class="o">=</span> <span class="n">Point</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span> <span class="mf">3.0</span><span class="x">),</span> <span class="n">Point</span><span class="x">(</span><span class="o">-</span><span class="mf">2.</span><span class="x">,</span> <span class="mi">0</span><span class="x">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">Point</span><span class="x">(</span><span class="o">-</span><span class="mi">1</span><span class="o">//</span><span class="mi">3</span><span class="x">,</span> <span class="mf">1.0</span><span class="x">)</span>
<span class="c"># initial parameters</span>
<span class="n">K₁</span><span class="x">,</span> <span class="n">K₂</span> <span class="o">=</span> <span class="mf">0.1</span><span class="x">,</span> <span class="mf">0.1</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="mi">1000</span> <span class="c"># no. of epochs</span>
<span class="c"># compute gradients</span>
<span class="nd">@time</span> <span class="n">δK₁</span><span class="x">,</span> <span class="n">δK₂</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">gradient</span><span class="x">(</span><span class="n">K₁</span><span class="x">,</span> <span class="n">K₂</span><span class="x">)</span> <span class="k">do</span> <span class="n">k₁</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">k₂</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">p̂</span> <span class="o">=</span> <span class="n">p₁</span> <span class="o">*</span> <span class="n">k₁</span> <span class="o">+</span> <span class="n">p₂</span> <span class="o">*</span> <span class="n">k₂</span>
<span class="n">distance</span><span class="x">(</span><span class="n">p̂</span><span class="x">,</span> <span class="n">p</span><span class="x">)</span> <span class="c"># scalar output of the function</span>
<span class="k">end</span>
<span class="c"># update parameters</span>
<span class="n">K₁</span> <span class="o">-=</span> <span class="mf">1e-3</span> <span class="o">*</span> <span class="n">δK₁</span>
<span class="n">K₂</span> <span class="o">-=</span> <span class="mf">1e-3</span> <span class="o">*</span> <span class="n">δK₂</span>
<span class="k">end</span>
<span class="nd">@show</span> <span class="n">K₁</span><span class="x">,</span> <span class="n">K₂</span>
<span class="c"># shows "(K₁, K₂) = (0.33427804653861276, 0.4996408206795386)"</span>
</code></pre></div></div>
<p>The above program is basically solving the following (pretty simple) problem</p>
\[\begin{align}
\arg\min_{K_1,K_2} &\vert\vert \widehat{p}(K_1,K_2) - p \vert\vert_2 \\
\text{with }&\widehat{p}(K_1,K_2) ≜ p_1 \cdot K_1 + p_2 \cdot K_2
\end{align}\]
<p>where \(p=[-1/3, 1]^T, p_1=[2,3]^T\) and \(p_2=[-2,0]^T\). By choosing these specific numbers, I guaranteed that there is a solution for \(K_1,K_2\).</p>
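<p>As a quick sanity check (a derivation, not part of the original program), the optimum can be found in closed form by solving \(\widehat{p}(K_1,K_2) = p\) component-wise:</p>
\[\begin{align}
3K_1 &= 1 &&\Rightarrow K_1 = \tfrac{1}{3} \approx 0.3333 \\
2K_1 - 2K_2 &= -\tfrac{1}{3} &&\Rightarrow K_2 = \tfrac{1}{2} = 0.5
\end{align}\]
<p>which agrees with the values \((0.3343, 0.4996)\) printed by the program, up to optimization accuracy.</p>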
<p>If you glance at the program, you would notice that it is almost entirely written in native Julia using structures (<code class="language-plaintext highlighter-rouge">struct Point</code>), built-in functions (<code class="language-plaintext highlighter-rouge">norm()</code>, <code class="language-plaintext highlighter-rouge">convert()</code>), member-access constructs (<code class="language-plaintext highlighter-rouge">δp.x</code>, <code class="language-plaintext highlighter-rouge">δp.y</code>) etc. The only usage of Zygote is that single <code class="language-plaintext highlighter-rouge">Zygote.gradient()</code> call at the heart of the loop. BTW, I omitted the operator overloading functions due to space restrictions.</p>
<p>I am not showing the IR codes for this one; you are free to execute <code class="language-plaintext highlighter-rouge">@code_ir</code> and <code class="language-plaintext highlighter-rouge">@code_adjoint</code> on the function implicitly defined by the <code class="language-plaintext highlighter-rouge">do .. end</code> block. One thing I would like to show is the execution speed, supporting my earlier argument about “precompilation”. The time-measuring macro (<code class="language-plaintext highlighter-rouge">@time</code>) shows this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 11.764279 seconds (26.50 M allocations: 1.342 GiB, 4.58% gc time)
0.000025 seconds (44 allocations: 2.062 KiB)
0.000026 seconds (44 allocations: 2.062 KiB)
0.000007 seconds (44 allocations: 2.062 KiB)
0.000006 seconds (44 allocations: 2.062 KiB)
0.000005 seconds (44 allocations: 2.062 KiB)
0.000005 seconds (44 allocations: 2.062 KiB)
</code></pre></div></div>
<p>Did you see how the execution time dropped by an astonishingly high margin? That’s Julia’s precompilation at work - it compiles the derivative program only once (on its first encounter) and produces highly efficient code that is reused later. It might not be as big a surprise if you already know Julia, but it is definitely a huge advantage for a DiffProg framework.</p>
<hr />
<p>Okay, that’s about it for today. See you next time. The following references were used in preparing this article:</p>
<ol>
<li>“Don’t Unroll Adjoint: Differentiating SSA-Form Programs”, Michael Innes, <a href="https://arxiv.org/abs/1810.07951">arXiv/1810.07951</a>.</li>
<li><a href="https://www.youtube.com/watch?v=LjWzgTPFu14">Talk</a> by Michael Innes @ the Julia London User Group meetup.</li>
<li><a href="https://www.youtube.com/watch?v=Sv3d0k7wWHk">Talk</a> by Elliot Saba & Viral Shah @ Microsoft Research.</li>
<li><a href="https://fluxml.ai/Zygote.jl/latest/">Zygote’s documentation</a> & <a href="https://docs.julialang.org/en/v1/">Julia’s documentation</a>.</li>
</ol>Ayan DasIf you are following the recent developments in the field of Deep Learning, you might recognize this new buzz-word, “Differentiable Programming”, doing rounds on social media (including prominent researchers like Yann LeCun, Andrej Karpathy) for a year or two. Differentiable Programming (let’s shorten it as “DiffProg” for the rest of this article) is essentially a system proposed as an alternative to tape-based Backpropagation, which runs a recorder (often called a “Tape”) that builds a computation graph at runtime and propagates the error signal from the end towards the leaf nodes (typically weights and biases). DiffProg is very different from an implementation perspective - it doesn’t really “propagate” anything. It consumes a “program” in the form of source code and produces the “Derivative program” (also source code) w.r.t. its inputs without ever actually running either of them. Additionally, DiffProg allows users the flexibility to write arbitrary programs without constraining them to any “guidelines”. In this article, I will describe the difference between the two methods in theoretical as well as practical terms. We’ll look into one successful DiffProg implementation (named “Zygote”, written in Julia) gaining popularity in the Deep Learning community.Energy Based Models (EBMs): A comprehensive introduction2020-08-13T00:00:00+00:002020-08-13T00:00:00+00:00https://ayandas.me/blog-tut/2020/08/13/energy-based-models-one<p>We talked extensively about <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGMs</a> in my earlier article and also described <a href="/blog-tut/2020/01/01/variational-autoencoder.html">one particular model</a> following the principles of Variational Inference (VI). There exists another class of models, conveniently represented by <em>Undirected</em> Graphical Models, which are practiced relatively less in the research community than modern methods of Deep Learning (DL). 
They are also characterized as <strong>Energy Based Models (EBM)</strong> because, as we shall see, they rely on something called <em>Energy Functions</em>. In the early days of this Deep Learning <em>renaissance</em>, we discovered a few extremely powerful models of this kind which helped DL gain momentum. The class of models we are going to discuss has far more theoretical support than modern-day Deep Learning, which, as we know, largely relied on intuition and trial-and-error. In this article, I will introduce you to the general concept of Energy Based Models (EBMs), their difficulties and how we can get over them. We will also look at a specific family of EBMs known as <strong>Boltzmann Machines (BM)</strong>, which are very well known in the literature.</p>
<h2 id="undirected-graphical-models">Undirected Graphical Models</h2>
<p>The story starts when we try to model a number of Random Variables (RVs) in a graph but we only have weak knowledge of which variables are related, not the direction of influence. Direction is a necessary requirement for <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGMs</a>. For example, let’s consider a lattice of atoms (Fig.1(a)) where only neighbouring atoms influence the spins, but it is unclear what the directions of the influences are. For simplicity, we will use a simpler model (Fig.1(b)) for demonstration purposes.</p>
<center>
<figure>
<img width="65%" style="padding-top: 20px;" src="/public/posts_res/17/undir_models.png" />
<figcaption>Fig.1: (a) An atom lattice model. (b) An arbitrary undirected model.</figcaption>
</figure>
</center>
<p>We model a set of random variables \(\mathbf{X}\) (in our example, \(\{ A,B,C,D \}\)) whose connections are defined by graph \(\mathcal{G}\) and have <em>“potential functions”</em> defined on each of its maximal <a href="https://en.wikipedia.org/wiki/Clique_(graph_theory)">cliques</a> \(\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})\). The total potential of the graph is defined as</p>
<p>\[
\Phi(\mathbf{x}) = \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \phi_{\mathcal{Q}}(q)
\]</p>
<p>\(q\) is an arbitrary instantiation of the set of RVs denoted by \(\mathcal{Q}\). The potential functions \(\phi_{\mathcal{Q}}(q)\in\mathbb{R}_{>0}\) are basically “affinity” functions on the state space of the cliques, e.g. given a state \(q\) of a clique \(\mathcal{Q}\), the corresponding potential function \(\phi_{\mathcal{Q}}(q)\) returns the <em>viability of that state</em> OR how likely that state is. Potential functions are somewhat analogous to conditional densities from Directed PGMs, except for the fact that potentials are <em>arbitrary non-negative values</em>. They don’t necessarily sum to one. For a concrete example, the example graph in Fig.1(b) can thus be factorized as \(\displaystyle{ \Phi(a,b,c,d) = \phi_{\{A,B,C\}}(a,b,c)\cdot \phi_{\{A,D\}}(a,d) }\). If we assume the variables \(\{ A,D \}\) are binary RVs, the potential function corresponding to that clique, at its simplest, could be a table like this:</p>
\[\phi_{\{A,D\}}(a=0,d=0) = +4.00 \\
\phi_{\{A,D\}}(a=0,d=1) = +0.23 \\
\phi_{\{A,D\}}(a=1,d=0) = +5.00 \\
\phi_{\{A,D\}}(a=1,d=1) = +9.45\]
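<p>To make the factorization concrete, here is a tiny Python sketch: it uses the \(\phi_{\{A,D\}}\) table above, together with made-up positive values for \(\phi_{\{A,B,C\}}\), and multiplies one factor per maximal clique into a total potential:</p>

```python
import itertools

# phi_{A,D} is the table from the text; phi_{A,B,C} values are made-up positives.
phi_AD = {(0, 0): 4.00, (0, 1): 0.23, (1, 0): 5.00, (1, 1): 9.45}
phi_ABC = {q: 1.0 + sum(q) for q in itertools.product([0, 1], repeat=3)}

def total_potential(a, b, c, d):
    """Phi(a,b,c,d) = phi_ABC(a,b,c) * phi_AD(a,d): one factor per maximal clique."""
    return phi_ABC[(a, b, c)] * phi_AD[(a, d)]

# Potentials are arbitrary positive numbers - they need not sum to one.
print(total_potential(1, 0, 1, 1))   # phi_ABC = 3.0 times phi_AD = 9.45
```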
<p>Just like every other model in machine learning, the potential functions can be parameterized, leading to</p>
<p>\[ \tag{1}
\Phi(\mathbf{x}; \Theta) = \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \phi_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})
\]</p>
<p>Semantically, a potential denotes how likely a given state is: the higher the potential, the more likely the state.</p>
<h2 id="reparameterizing-in-terms-of-energy">Reparameterizing in terms of Energy</h2>
<p>When we are defining a model, however, it is inconvenient to choose a constrained (non-negative valued) parameterized function. We can easily reparameterize each potential function in terms of <strong>energy</strong> functions \(E_{\mathcal{Q}}\) where</p>
<p>\[\tag{2}
\phi_{\mathcal{Q}}(q, \Theta_{\mathcal{Q}}) = \exp(-E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}))
\]</p>
<p>The \(\exp(\cdot)\) enforces the potentials to be always positive and thus we are free to choose an <em>unconstrained</em> energy function. One question you might ask - “why the negative sign?”. Frankly, there is no functional purpose to that negative sign. All it does is <em>revert the semantic meaning</em> of the parameterized function. When we were dealing in terms of potentials, a state which is more likely had a higher potential. Now it’s the opposite - states that are more likely have lower energy. Does this semantic sound familiar? It actually comes from Physics, where we deal with “energies” (potential, kinetic etc.) which are <em>bad</em>, i.e. lower energy means a more stable system.</p>
<p>Such reparameterization affects the semantics of Eq.1 as well. Putting Eq.2 into Eq.1 yields</p>
\[\begin{align}
\Phi(\mathbf{x}; \Theta) &= \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \exp(-E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})) \\
\tag{3}
&= \exp\left(-\sum_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})\right) =
\exp(-E_{\mathcal{G}}(\mathbf{x}; \Theta))
\end{align}\]
<p>Here we defined \({ E_{\mathcal{G}}(\mathbf{x}; \Theta) \triangleq \sum_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}) }\) to be the energy of the whole model. Please note that the reparameterization helped us convert the relationship between the individual cliques and the whole graph <em>from multiplicative (Eq.1) to additive (Eq.3)</em>. This implies that when we design energy functions for such undirected models, we design an energy function for each individual clique and just add them.</p>
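<p>The multiplicative-to-additive switch is easy to verify numerically; a tiny sketch with made-up clique energies:</p>

```python
import math

# Hypothetical clique energies for the graph of Fig.1(b); any real values are
# allowed, since exp(-E) makes the corresponding potentials positive (Eq.2).
E_ABC = -1.3   # energy of clique {A,B,C} at some fixed state (a,b,c)
E_AD = 0.7     # energy of clique {A,D} at some fixed state (a,d)

# Eq.1: potentials multiply ...
phi_product = math.exp(-E_ABC) * math.exp(-E_AD)
# ... Eq.3: which is the same as exponentiating the *sum* of the energies.
E_total = E_ABC + E_AD
assert math.isclose(phi_product, math.exp(-E_total))
```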
<p>All this is fine .. well .. unless we need to do things like <em>sampling</em>, <em>computing the log-likelihood</em> etc. Then our energy-based parameterization fails, because it’s not easy to incorporate an un-normalized function into probabilistic frameworks. So we need a way to get back to probabilities.</p>
<h2 id="back-to-probabilities">Back to Probabilities</h2>
<p>The obvious way to convert the un-normalized potentials of the model into a normalized distribution is to explicitly normalize Eq.3 over its domain</p>
\[\begin{align}
p(\mathbf{x}; \Theta) &= \frac{\Phi(\mathbf{x}; \Theta)}{
\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \Phi(\mathbf{x}'; \Theta)
} \\ \\
\tag{4}
&= \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}; \Theta)/\tau)}{\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \exp(-E_{\mathcal{G}}(\mathbf{x}'; \Theta)/\tau)}\text{ (using Eq.3)}
\end{align}\]
<p>This is the probabilistic model implicitly defined by the energy functions over the whole state-space. [We will discuss \(\tau\) shortly. Consider it to be 1 for now]. If the reader is familiar with Statistical Mechanics at all, they might find it extremely similar to the <code class="language-plaintext highlighter-rouge">Boltzmann Distribution</code>. Here’s what <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Wikipedia</a> says:</p>
<blockquote>
<p>In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution) is a probability distribution or probability measure that gives the probability that a system will be in a certain state as a function of that state’s energy …</p>
</blockquote>
<p>From now on, Eq.4 will be the sole connection between energy-space and probability-space. We can now forget about potential functions. A 1-D example of an energy function and the corresponding probability distribution is shown below:</p>
<center>
<figure>
<img width="75%" style="padding-top: 20px;" src="/public/posts_res/17/energy_prob.png" />
<figcaption>Fig.2: An energy function and its corresponding probability distribution.</figcaption>
</figure>
</center>
<p>The denominator of Eq.4 is often known as the “Partition Function” (denoted as \(Z\)). Whatever the name, it is quite difficult to compute in general, because the number of terms in the summation grows exponentially with the dimensionality of \(\mathbf{X}\).</p>
<p>A hyper-parameter called “temperature” (denoted as \(\tau\)) is often introduced in Eq.4 which also has its roots in the original <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Boltzmann Distribution from Physics</a>. A decrease in temperature gathers the probability mass near the lowest energy regions. If not specified, consider \(\tau=1\) for the rest of the article.</p>
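<p>The effect of \(\tau\) is easy to see numerically. A small Python sketch, with an arbitrary double-well energy on a 1-D grid (in the spirit of Fig.2):</p>

```python
import numpy as np

# A 1-D energy function on a discrete grid; the double-well shape is an
# arbitrary choice for illustration.
x = np.linspace(-3, 3, 601)
E = x**4 - 4 * x**2 + x

def boltzmann(E, tau=1.0):
    """Eq.4 on a finite grid: p = exp(-E/tau) / Z, with Z the partition function."""
    w = np.exp(-E / tau)
    return w / w.sum()

p_hot = boltzmann(E, tau=5.0)
p_cold = boltzmann(E, tau=0.1)

# Lowering the temperature gathers probability mass near the lowest-energy state.
i_min = int(np.argmin(E))
print(p_cold[i_min] > p_hot[i_min])   # -> True
```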
<h2 id="a-general-learning-algorithm">A general learning algorithm</h2>
<p>The question now is - how do we learn the model given a dataset? Let’s say my dataset has \(N\) samples: \(\mathcal{D} = \{ x^{(i)} \}_{i=1}^N\). An obvious way to derive a learning algorithm is to minimize the Negative Log-Likelihood (NLL) loss of the model over our dataset</p>
\[\begin{align}
\mathcal{L}(\Theta; \mathcal{D}) = - \log \prod_{i=1}^N p(x^{(i)}; \Theta) &= \sum_{i=1}^N -\log p(x^{(i)}; \Theta) \\
&= \underbrace{\frac{1}{N}\sum_{i=1}^N}_{\text{expectation}} \left[ E_{\mathcal{G}}(x^{(i)}; \Theta) \right] + \log Z\\
&\text{(putting Eq.4 followed by trivial calculations, and}\\
&\text{ dividing loss by constant N doesn't affect optima)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\bigl[ E_{\mathcal{G}}(x; \Theta) \bigr] + \log Z
\end{align}\]
<p>Computing gradient w.r.t. parameters yields</p>
\[\begin{align}
\frac{\partial \mathcal{L}}{\partial \Theta} &= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \frac{\partial}{\partial \Theta} \log Z \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \frac{1}{Z} \frac{\partial}{\partial \Theta} \left[ \sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \exp(-E_{\mathcal{G}}) \right]\text{ (using definition of Z)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \underbrace{\frac{1}{Z} \exp(-E_{\mathcal{G}})}_{\text{RHS of Eq.4}} \cdot \frac{\partial (-E_{\mathcal{G}})}{\partial \Theta}\\
&\text{ (Both Z and the partial operator are independent}\\
&\text{ of x and can be pushed inside the summation)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] - \underbrace{\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} p(\mathbf{x}'; \Theta)}_{\text{expectation}} \cdot \frac{\partial E_{\mathcal{G}}}{\partial \Theta}\\
\tag{5}
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] - \mathbb{E}_{x\sim\mathcal{p_{\Theta}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]
\end{align}\]
<p>Take a few minutes to digest Eq.5. That’s a very important result. It would be worth discussing it a bit further. The first term in Eq.5 is often known as the “Positive Phase” and the second term as “Negative Phase”. The only difference between them, as you can see, is in the distributions on which the expectations are taken. The first expectation is on the <em>data distribution</em> - essentially picking up data from our dataset. The second expectation is on the <em>model distribution</em> - sampling from the model with current parameters. To understand their semantic interpretation, we need to see them in isolation. For the sake of explanation, consider both terms separately yielding a parameter update rule</p>
\[\Theta := \Theta - \lambda\cdot\mathbb{E}_{x\sim\mathcal{D}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]\text{, and }
\Theta := \Theta + \lambda\cdot\mathbb{E}_{x\sim\mathcal{p_{\Theta}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]\]
<p>The first update rule tries to change the parameters so as to minimize the energy function at points <em>coming from the data</em>. The second one tries to maximize (notice the difference in sign) the energy function at points <em>coming from the model</em>. The original update rule (combining both of them) has both of these effects working simultaneously. The minimum of the loss landscape occurs when our model discovers the data distribution, i.e. \(p_{\Theta} \approx p_{\mathcal{D}}\). At this point, the positive and negative phases are approximately the same and the gradient becomes zero (i.e., no more progress). Below is a clear picture of the update process. The algorithm <em>pushes the energy down</em> at places where the original data lies; it also <em>pulls the energy up</em> at places where the <em>model thinks</em> the original data lies.</p>
<center>
<figure>
<img width="95%" style="padding-top: 20px;" src="/public/posts_res/17/pos_neg_phase_diagram.png" />
<figcaption>Fig.3: (a) Model is being optimized. The arrows depict the "pulling up" and "pushing down" of energy landscape. (b) Model has converged to an optimum.</figcaption>
</figure>
</center>
<p>Whatever the interpretation, as I mentioned before, the denominator of \(p(\mathbf{x}; \Theta)\) (see Eq.4) is intractable in the general case, so computing the expectation in the negative phase is extremely hard. In fact, that is the only difficulty that makes this algorithm practically challenging.</p>
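<p>Since Eq.5 is central to everything that follows, it is worth verifying it on a toy model where \(Z\) <em>is</em> tractable. The Python sketch below uses a linear energy \(E(x;\theta)=-\theta^T x\) of my own choosing (not from any specific model) on a 3-bit state space, and checks the two-phase gradient against finite differences of the NLL:</p>

```python
import itertools
import numpy as np

# A toy EBM whose 3-bit state space is small enough to enumerate the partition
# function exactly, so Eq.5 can be checked against a finite-difference gradient.
states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)

def nll(theta, data):
    E_data = -(data @ theta)                        # E(x; theta) for every data point
    logZ = np.log(np.exp(states @ theta).sum())     # exact log partition function
    return E_data.mean() + logZ

def grad_eq5(theta, data):
    p = np.exp(states @ theta)
    p = p / p.sum()                                 # exact model distribution (Eq.4)
    pos = -data.mean(axis=0)                        # positive phase: E_{x~D}[dE/dtheta]
    neg = p @ (-states)                             # negative phase: E_{x~p}[dE/dtheta]
    return pos - neg

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
data = states[rng.integers(len(states), size=50)]   # a synthetic "dataset"

# Central finite differences of the NLL agree with the two-phase gradient of Eq.5.
eps = 1e-6
fd = np.array([(nll(theta + eps * e, data) - nll(theta - eps * e, data)) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(grad_eq5(theta, data), fd, atol=1e-5)
```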
<h2 id="gibbs-sampling">Gibbs Sampling</h2>
<p>As we saw in the last section, the only difficulty we have in implementing Eq.5 is not being able to sample from an intractable density (Eq.4). It turns out, however, that the <em>conditional densities</em> of a small subset of variables given the others are indeed tractable in most cases. That is because, for conditionals, the \(Z\) cancels out. The conditional density of one variable (say \(X_j\)) given the others (let’s denote them by \(X_{-j}\)) is:</p>
\[\tag{6}
p(x_j\vert \mathbf{x}_{-j}) = \frac{p(\mathbf{x})}{p(\mathbf{x}_{-j})}
= \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}))}{\sum_{x_j} \exp(-E_{\mathcal{G}}(\mathbf{x}))}
\text{ (using Eq.4)}\]
<p>I excluded the parameter symbols for notational brevity. The summation in the denominator is not as scary as the one in Eq.4 - it runs over only one variable. We take advantage of this and wisely choose a sampling algorithm that uses conditional densities. It’s called <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs Sampling</a>. Well, I am not going to prove it. You either have to take my word for it OR read about it in the link provided. For the sake of this article, just believe that the following works.</p>
<p>To sample \(\mathbf{x}\sim p_{\Theta}(\mathbf{x})\), we iteratively execute the following for \(T\) iterations</p>
<ol>
<li>We have a sample from last iteration \(t-1\) as \(\mathbf{x}^{(t-1)}\)</li>
<li>We then pick one variable \(X_j\) (in some order) at a time and sample from its conditional given the others: \(x_j^{(t)}\sim p(x_j\vert \underbrace{x_1^{(t)}, \cdots, x_{j-1}^{(t)}}_{\text{current iteration}}, \underbrace{x_{j+1}^{(t-1)}, \cdots, x_D^{(t-1)}}_{\text{previous iteration}})\). Please note that once we have sampled a variable, we fix its value to the latest sample; for the variables not yet visited, we keep the values from the previous iteration.</li>
</ol>
<p>We can start this process by setting \(\mathbf{x}^{(0)}\) to anything. If \(T\) is sufficiently large, the samples towards the end are effectively true samples from the density \(p_{\Theta}\). To understand it a bit more rigorously, I <strong>highly recommend</strong> <a href="https://en.wikipedia.org/wiki/Gibbs_sampling#Implementation">going through this</a>.
You might be curious as to why this algorithm is iterative. That’s because Gibbs sampling belongs to the <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">MCMC family of algorithms</a>, which have something called a “burn-in period”.</p>
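<p>To see Gibbs sampling in action, here is a minimal Python sketch on a made-up two-variable binary model (the joint table is hypothetical); after the burn-in period, the empirical distribution approaches the true one:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical unnormalized joint over two binary variables
# (rows index x1, columns index x2); any positive numbers work.
phi = np.array([[4.0, 1.0],
                [2.0, 3.0]])
p_true = phi / phi.sum()

x = np.array([0, 0])              # arbitrary initialization x^(0)
burn_in, n_samples = 1000, 20000
counts = np.zeros((2, 2))
for t in range(burn_in + n_samples):
    # Resample x1 given x2: the conditional needs only a one-variable sum (Eq.6).
    w = phi[:, x[1]]
    x[0] = rng.random() < w[1] / w.sum()
    # Resample x2 given x1.
    w = phi[x[0], :]
    x[1] = rng.random() < w[1] / w.sum()
    if t >= burn_in:              # discard the burn-in period
        counts[x[0], x[1]] += 1

p_est = counts / counts.sum()
print(np.round(p_est, 2))         # close to p_true = [[0.4, 0.1], [0.2, 0.3]]
```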
<p>Now that we have pretty much everything needed, let’s explore some popular models based on the general principles of EBMs.</p>
<h2 id="boltzmann-machine">Boltzmann Machine</h2>
<p>Boltzmann Machine (BM) is one particular model that has been in the literature for a long time. BM is the simplest one in its family and is used for modelling a binary random vector \(\mathbf{X}\in\{0,1\}^D\) with \(D\) components \([ X_1, X_2, \cdots, X_D ]\). All \(D\) RVs are connected to all others by an undirected graph \(\mathcal{G}\).</p>
<center>
<figure>
<img width="30%" style="padding-top: 20px;" src="/public/posts_res/17/bm_diagram.png" />
<figcaption>Fig.4: Undirected graph representing Boltzmann Machine</figcaption>
</figure>
</center>
<p>By design, BM has a fully connected graph and hence only one maximal clique (containing all RVs). The energy function used in BM is possibly the simplest one you can imagine:</p>
\[\tag{7}
E_{\mathcal{G}}(\mathbf{x}; W) = - \frac{1}{2} \mathbf{x}^T W \mathbf{x}\]
<p>Upon expanding the vectorized form (the reader is encouraged to try), we can see each term \(x_i\cdot W_{ij}\cdot x_j\) for all \(i\lt j\) as the contribution of the pair of RVs \((X_i, X_j)\) to the whole energy function, with \(W_{ij}\) the “connection strength” between them. If a pair of RVs \((X_i, X_j)\) turns on together more often, a high value of \(W_{ij}\) would help reduce the total energy. So, through learning, we expect to see \(W_{ij}\) going up if \((X_i, X_j)\) fire together. This phenomenon is the founding idea of a closely related learning strategy called <a href="https://en.wikipedia.org/wiki/Hebbian_theory">Hebbian Learning</a>, proposed by Donald Hebb. Hebbian theory basically says:</p>
<blockquote>
<p>Neurons that fire together, wire together</p>
</blockquote>
<p>How do we learn this model then? We have already seen the general way of computing the gradient. We have \(\displaystyle{ \frac{\partial E_{\mathcal{G}}}{\partial W} = -\mathbf{x}\mathbf{x}^T }\). So let’s use Eq.5 to derive a learning rule:</p>
\[W := W - \lambda \cdot \left( \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{x}^T ] - \mathbb{E}_{\mathbf{x}\sim \mathrm{Gibbs}(p_{W})}[ -\mathbf{x}\mathbf{x}^T ] \right)\]
<p>Equipped with Gibbs sampling, it is pretty easy to implement now. But my description of the Gibbs sampling algorithm was very general; we have to specialize it to implement BM. Remember that conditional density we talked about (Eq.6)? For the specific energy function of BM (Eq.7), it has a very convenient and tractable form:</p>
\[p(x_j = 1\vert \mathbf{x}_{-j}; W) = \sigma\left(W_{-j}^T\cdot \mathbf{x}_{-j}\right)\]
<p>where \(\sigma(\cdot)\) is the Sigmoid function and \(W_{-j}\in\mathbb{R}^{D-1}\) denotes the vector of parameters connecting \(x_j\) with the rest of the variables \(\mathbf{x}_{-j}\in\mathbb{R}^{D-1}\). I am leaving the proof to the readers; it’s not hard, maybe a bit lengthy [Hint: Just put the BM energy function in Eq.6 and keep simplifying]. This particular form makes the nodes behave somewhat like computation units (i.e., neurons), as shown in Fig.5 below:</p>
<center>
<figure>
<img width="25%" style="padding-top: 20px;" src="/public/posts_res/17/bm_conditional.png" />
<figcaption>Fig.5: The computational view of BM showing its dependencies by arrows.</figcaption>
</figure>
</center>
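<p>That conditional is all we need to run Gibbs sampling on a BM. Below is a small Python sketch with a made-up coupling matrix \(W\); \(D\) is kept tiny so the exact marginals can be computed by brute force for comparison:</p>

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
D = 4                                        # small enough to verify by enumeration

# Hypothetical symmetric coupling matrix with zero diagonal.
W = rng.normal(scale=0.5, size=(D, D))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(x):
    """One full sweep using p(x_j = 1 | x_-j) = sigmoid(W_j . x); since
    W[j, j] = 0, x_j itself never contributes to its own conditional."""
    for j in range(D):
        x[j] = rng.random() < sigmoid(W[j] @ x)
    return x

# Exact marginals by brute force - possible only because D is tiny.
states = np.array(list(itertools.product([0, 1], repeat=D)), dtype=float)
w = np.exp(0.5 * np.einsum('si,ij,sj->s', states, W, states))  # exp(-E), with Eq.7
exact_mean = (w / w.sum()) @ states

x = np.zeros(D)
samples = []
for t in range(30000):
    x = gibbs_sweep(x)
    if t >= 1000:                            # discard the burn-in period
        samples.append(x.copy())

print(np.abs(np.mean(samples, axis=0) - exact_mean).max())  # small, e.g. below 0.05
```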
<h2 id="boltzmann-machine-with-latent-variables">Boltzmann Machine with latent variables</h2>
<p>To add more expressiveness to the model, we can introduce latent/hidden variables. They are not observed, but help <em>explain</em> the visible variables (see my <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGM</a> article). However, all variables are still fully connected to each other (shown below in Fig.6(a)).</p>
<p><strong>[ A little disclaimer that as we have already covered a lots of founding ideas, I am going to go over this a bit faster. You may have to look back and find analogies with our previous formulations ]</strong></p>
<center>
<figure>
<img width="70%" style="padding-top: 20px;" src="/public/posts_res/17/hbm_diagram.png" />
<figcaption>Fig.6: (a) Undirected graph of BM with hidden units (shaded ones are visible). (b) Computational view of the model while computing conditionals. </figcaption>
</figure>
</center>
<p>Suppose we have \(K\) hidden units and \(D\) visible ones. The energy function is defined very similarly to that of the normal BM. It now contains separate terms for the visible-hidden (\(W\in\mathbb{R}^{D\times K}\)), visible-visible (\(V\in\mathbb{R}^{D\times D}\)) and hidden-hidden (\(U\in\mathbb{R}^{K\times K}\)) interactions. We compactly represent them as \(\Theta \triangleq \{ W, U, V \}\).</p>
\[E_{\mathcal{G}}(\mathbf{x}, \mathbf{h}; \Theta) = -\mathbf{x}^T W \mathbf{h} - \frac{1}{2} \mathbf{x}^T V \mathbf{x} - \frac{1}{2} \mathbf{h}^T U \mathbf{h}\]
<p>The motivation for such energy function is very similar to original BM. However, our probabilistic form of the model is no longer Eq.4, but the marginalized joint distribution.</p>
\[p(\mathbf{x}; \Theta) = \sum_{\mathbf{h}\in\mathrm{Dom}(\mathbf{H})} p(\mathbf{x}, \mathbf{h}; \Theta)
= \sum_{\mathbf{h}\in\mathrm{Dom}(\mathbf{H})} \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}, \mathbf{h}))}{\sum_{\mathbf{x}',\mathbf{h}'\in\mathrm{Dom}(\mathbf{X}, \mathbf{H})} \exp(-E_{\mathcal{G}}(\mathbf{x}', \mathbf{h}'))}\]
<p>It might look a bit scary, but it’s just the joint marginalized over the hidden state-space. Very surprisingly though, the conditionals have pretty similar forms to those of the original BM:</p>
\[\begin{align}
p(h_k=1\vert \mathbf{x}, \mathbf{h}_{-k}) = \sigma\left( (W^T\mathbf{x})_k + U_{-k}^T\mathbf{h}_{-k} \right) \\
p(x_j=1\vert \mathbf{h}, \mathbf{x}_{-j}) = \sigma\left( (W\mathbf{h})_j + V_{-j}^T\mathbf{x}_{-j} \right)
\end{align}\]
<p>Hopefully the notations are clear. If they are not, try comparing them with the ones we used before. I recommend the reader try proving these as an exercise. The diagram in Fig.6(b) hopefully adds a bit more clarity; it shows a computation graph for the conditionals, similar to the one we saw before in Fig.5.</p>
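<p>If you want to check the exercise numerically rather than on paper, the conditional can be verified against a brute-force computation from the energy function itself. Below is a NumPy sketch with toy parameters of my own choosing (note that the \(V\)-terms and the partition function cancel in the ratio, so the brute-force conditional only needs two energy evaluations):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 3, 2
W = rng.normal(size=(D, K))
V = rng.normal(size=(D, D)); V = (V + V.T) / 2; np.fill_diagonal(V, 0)  # symmetric, no self-loops
U = rng.normal(size=(K, K)); U = (U + U.T) / 2; np.fill_diagonal(U, 0)

def energy(x, h):
    """E(x, h) = -x^T W h - (1/2) x^T V x - (1/2) h^T U h"""
    return -(x @ W @ h) - 0.5 * (x @ V @ x) - 0.5 * (h @ U @ h)

def conditional_brute(x, h, k):
    """p(h_k = 1 | x, h_{-k}) straight from exp(-E); Z and all terms
    not involving h_k cancel in the ratio."""
    h1, h0 = h.copy(), h.copy()
    h1[k], h0[k] = 1.0, 0.0
    e1, e0 = np.exp(-energy(x, h1)), np.exp(-energy(x, h0))
    return e1 / (e1 + e0)

def conditional_sigmoid(x, h, k):
    """Same conditional via the sigmoid form sigma(W[:,k].x + U[k,:].h_{-k});
    since diag(U) = 0, including h[k] in the dot product changes nothing."""
    z = W[:, k] @ x + U[k, :] @ h
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0]); h = np.array([0.0, 1.0])
print(conditional_brute(x, h, 0), conditional_sigmoid(x, h, 0))  # identical values
```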
<p>Coming to the gradients, they also take similar forms to the original BM; the only difference is that now we have more parameters:</p>
\[\begin{align}
W &:= W - \lambda \cdot \left( \mathbb{E}_{\mathbf{x,h}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{h}^T ] - \mathbb{E}_{\mathbf{x,h}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{x}\mathbf{h}^T ] \right)\\
V &:= V - \lambda \cdot \left( \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{x}^T ] - \mathbb{E}_{\mathbf{x}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{x}\mathbf{x}^T ] \right)\\
U &:= U - \lambda \cdot \left( \mathbb{E}_{\mathbf{h}\sim p_{\mathcal{D}}}[ -\mathbf{h}\mathbf{h}^T ] - \mathbb{E}_{\mathbf{h}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{h}\mathbf{h}^T ] \right)
\end{align}\]
<p>If you are paying attention, you might notice something strange: how do we compute the terms \(\mathbb{E}_{\mathbf{h}\sim p_{\mathcal{D}}}\) (in the positive phase)? We don’t have hidden vectors in our dataset, right? Actually, we do have visible vectors \(\mathbf{x}^{(i)}\) in the dataset, and we can get an approximate <em>complete data</em> (visible plus hidden) density as</p>
\[p_{\mathcal{D}}(\mathbf{x}^{(i)}, \mathbf{h}) = p_{\mathcal{D}}(\mathbf{x}^{(i)}) \cdot p_{\Theta}(\mathbf{h}\vert \mathbf{x}^{(i)})\]
<p>Basically, we sample a visible vector from our dataset and use the conditional to sample a hidden vector. We fix the visible vector and then sample the hidden vector one component at a time (using Gibbs sampling).</p>
<p>For jointly sampling a visible and hidden vector from the model (for the negative phase), we also use Gibbs sampling just as before: we sample all the visible and hidden RVs component by component, starting the iteration from random values. <strong>There is a clever hack though.</strong> We can start the Gibbs iteration by fixing the visible vector to a real data point from our dataset (rather than anything random). It turns out this is extremely useful and efficient for getting samples quickly from the model distribution. This algorithm is famously known as “<a href="https://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf">Contrastive Divergence</a>” and has long been used in practical implementations.</p>
<h2 id="restricted-boltzmann-machine-rbm">“Restricted” Boltzmann Machine (RBM)</h2>
<p>Here comes the all-important RBM, which is probably one of the most famous energy based models of all time. But, guess what, I am not going to describe it bit by bit. We have already covered enough that we can quickly build on top of it.</p>
<p>RBM is basically the same as the Boltzmann Machine with hidden units but with <em>one big difference</em> - it doesn’t have visible-visible AND hidden-hidden interactions, i.e.</p>
\[U = \mathbf{0}, V = \mathbf{0}\]
<p>If you do just that, Boooom! You get RBMs (see its graphical diagram in Fig.7(a)). It makes the formulation much simpler. I am leaving it entirely to the reader to do the majority of the math: just get rid of \(U\) and \(V\) from all our formulations in the last section and you are done. Fig.7(b) shows the computational view of RBM while computing conditionals.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/17/rbm_diag_and_cond.png" />
<figcaption>Fig.7: (a) Graphical diagram of RBM. (b) Arrows just show computation deps</figcaption>
</figure>
</center>
<p>Let me point out one nice consequence of this model: the conditional for each visible node is independent of the other visible nodes, and the same is true for the hidden nodes.</p>
\[\begin{align}
p(h_k\vert \mathbf{x}) = \sigma( W_{[:,k]}\cdot\mathbf{x} )\\
p(x_j\vert \mathbf{h}) = \sigma( W_{[j,:]}\cdot\mathbf{h} )
\end{align}\]
<p>That means they can be computed in parallel:</p>
\[\begin{align}
p(\mathbf{h}\vert \mathbf{x}) = \prod_{k=1}^K p(h_k\vert \mathbf{x}) = \sigma( W^T\cdot\mathbf{x} )\\
p(\mathbf{x}\vert \mathbf{h}) = \prod_{j=1}^D p(x_j\vert \mathbf{h}) = \sigma( W\cdot\mathbf{h} )
\end{align}\]
<p>Moreover, the Gibbs sampling steps become super easy to compute. We just have to iterate the following steps:</p>
<ol>
<li>Sample a hidden vector \(\mathbf{h}^{(t)}\sim p(\mathbf{h}\vert \mathbf{x}^{(t-1)})\)</li>
<li>Sample a visible vector \(\mathbf{x}^{(t)}\sim p(\mathbf{x}\vert \mathbf{h}^{(t)})\)</li>
</ol>
<p>This makes RBM an attractive choice for practical implementation.</p>
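<p>To make this concrete, here is a minimal NumPy sketch of an RBM with the two block-Gibbs steps above and a single Contrastive Divergence (CD-1) parameter update. This is a toy illustration with my own variable names and hyperparameters, not a production implementation:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

D, K = 6, 4                         # visible / hidden units
W = 0.01 * rng.normal(size=(D, K))  # small random init

def sample_h_given_x(x):
    p = sigmoid(x @ W)              # p(h_k = 1 | x) for all k, in parallel
    return (rng.random(K) < p).astype(float), p

def sample_x_given_h(h):
    p = sigmoid(W @ h)              # p(x_j = 1 | h) for all j, in parallel
    return (rng.random(D) < p).astype(float), p

def cd1_update(x_data, lr=0.1):
    """One CD-1 step on a single visible vector: positive phase from data,
    negative phase after a single block-Gibbs round trip."""
    global W
    h_data, ph_data = sample_h_given_x(x_data)   # positive phase
    x_model, _ = sample_x_given_h(h_data)        # one Gibbs step
    _, ph_model = sample_h_given_x(x_model)      # negative phase
    # W += lr * ( <x h^T>_data - <x h^T>_model )
    W += lr * (np.outer(x_data, ph_data) - np.outer(x_model, ph_model))

x = (rng.random(D) < 0.5).astype(float)          # a toy "data" vector
for _ in range(100):
    cd1_update(x)
```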
<hr />
<p>Whoahh! That was a heck of an article. I encourage everyone to try working out the RBM math more rigorously by themselves, and also to implement it in a familiar framework. Alright, that’s all for this article.</p>
<h4 id="references">References</h4>
<ol>
<li><a href="https://www.cs.toronto.edu/~hinton/csc321/readings/boltz321.pdf">Boltzmann Machine, by G. Hinton, 2007</a></li>
<li><a href="https://www.crim.ca/perso/patrick.kenny/BMNotes.pdf">Notes on Boltzmann Machine, by Patrick Kenny</a></li>
<li><a href="http://deeplearning.net/tutorial/rbm.html">deeplearning.net documentation</a></li>
<li><a href="https://www.youtube.com/watch?v=2fRnHVVLf1Y&list=PLiPvV5TNogxKKwvKb1RKwkq2hm7ZvpHz0">Hinton’s coursera course</a></li>
<li><a href="https://www.deeplearningbook.org/">Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville</a></li>
</ol>Ayan DasWe talked extensively about Directed PGMs in my earlier article and also described one particular model following the principles of Variational Inference (VI). There exists another class of models conveniently represented by Undirected Graphical Models which are practiced relative less than modern methods of Deep Learning (DL) in the research community. They are also characterized as Energy Based Models (EBM), as we shall see, they rely on something called Energy Functions. In the early days of this Deep Learning renaissance, we discovered few extremely powerful models which helped DL to gain momentum. The class of models we are going to discuss has far more theoretical support than modern day Deep Learning, which as we know, largely relied on intuition and trial-and-error. In this article, I will introduce you to the general concept of Energy Based Models (EBMs), their difficulties and how we can get over them. Also, we will look at a specific family of EBM known as Boltmann Machines (BM) which are very well known in the literature.Pixelor: A Competitive Sketching AI Agent. So you think you can sketch?2020-07-30T00:00:00+00:002020-07-30T00:00:00+00:00https://ayandas.me/pubs/2020/07/30/pub-8<center>
<a target="_blank" class="pubicon" href="https://dl.acm.org/doi/pdf/10.1145/3414685.3417840">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper
</a>
<a target="_blank" class="pubicon" href="https://dl.acm.org/doi/abs/10.1145/3414685.3417840">
<i class="fa fa-files-o fa-3x"></i>Suppl.
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="3">Demo</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="4">Code/Data</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/8.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">We present the first competitive drawing agent Pixelor that exhibits human-level performance at a Pictionary-like sketching game, where the participant whose sketch is recognized first is a winner. Our AI agent can autonomously sketch a given visual concept, and achieve a recognizable rendition as quickly or faster than a human competitor. The key to victory for the agent is to learn the optimal stroke sequencing strategies that generate the most recognizable and distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the optimal stroke order that maximizes early recognizability of human training sketches. Second, this order is used to supervise the training of a sequence-to-sequence stroke generator. Our key technical contributions are a tractable search of the exponential space of orderings using neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an optimal-transport loss to accommodate the multi-modal nature of the optimal stroke distribution. Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game, under both AI and human judging of early recognition. To analyze the impact of human competitors’ strategies, we conducted a further human study with participants being given unlimited thinking time and training in early recognizability by feedback from an AI judge. The study shows that humans do gradually improve their strategies with training, but overall Pixelor still matches human performance.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
<center>
<!-- Visit this to create the oEmbed link -->
<!-- https://iframely.com/domains/speakerdeck -->
<h2>Slides for my SIGGRAPH Asia 2020 talk</h2>
<div style="left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.1972%;">
<iframe src="https://speakerdeck.com/player/159a510c082643ea89a012555fdfcc67" style="top: 0; left: 5%; width: 90%; height: 90%; position: absolute; border: 0;" allowfullscreen="" scrolling="no" allow="encrypted-media">
</iframe>
</div>
<p>PS: Reusing any of these slides would require permission from the author.</p>
</center>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<h2>Full talk (15 mins) at SIGGRAPH Asia 2020</h2>
<iframe width="800" height="450" src="https://www.youtube.com/embed/oSk2x5HuCA8" frameborder="1" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<br />
<h4>Watch a <a href="https://www.youtube.com/watch?v=E_Aclms4g-w" target="_blank">short summary</a> video instead</h4>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="3">
<div class="accordion-item__container">
<h2><a href="http://surrey.ac:9999/">Try out the Demo</a> (screenshot below)</h2>
<figure>
<img width="75%" src="/public/pub_res/8_2.gif" />
</figure>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="4">
<div class="accordion-item__container">
<center>
<a target="_blank" class="pubicon" href="https://github.com/dasayan05/neuralsort-siggraph">
<i class="fa fa-file-code-o fa-3x"></i>NeuralSort repo
</a>
<a target="_blank" class="pubicon" href="https://github.com/AyanKumarBhunia/sketch-transformerMMD">
<i class="fa fa-file-code-o fa-3x"></i>Transformer MMD repo
</a>
<br />
<h2>The "SlowSketch" dataset</h2>
<img border="2px" width="80%" src="/public/pub_res/8_3.png" alt="SlowSketch" />
<a target="_blank" class="pubicon" href="https://drive.google.com/u/0/uc?export=download&confirm=n4LZ&id=1mWEY7vFkOw790DwUtqcTX8fHzNBP_b1J">
<i class="fa fa-database fa-3x"></i>SlowSketch
</a>
</center>
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{pixelor20siga,
author = {Bhunia, Ayan Kumar and Das, Ayan and Muhammad, Umar Riaz and Yang, Yongxin and Hospedales, Timothy M. and Xiang, Tao and Gryaditskaya, Yulia and Song, Yi-Zhe},
title = {Pixelor: A Competitive Sketching AI Agent. so You Think You Can Sketch?},
year = {2020},
publisher = {Association for Computing Machinery},
volume = {39},
number = {6},
journal = {ACM Trans. Graph.},
articleno = {166},
numpages = {15}
}
</code></pre></div></div>Ayan Kumar BhuniaPaper Suppl.rlx: A modular Deep RL library for research2020-06-27T00:00:00+00:002020-06-27T00:00:00+00:00https://ayandas.me/projs/2020/06/27/rlx-deep-rl-library<p><code class="language-plaintext highlighter-rouge">rlx</code> is a Deep RL library written on top of PyTorch & built for <em>educational and research</em> purpose. Majority of the libraries/codebases for Deep RL are geared more towards reproduction of state-of-the-art algorithms on very specific tasks (e.g. Atari games etc.), but <code class="language-plaintext highlighter-rouge">rlx</code> is NOT. It is supposed to be more expressive and modular. Rather than making RL algorithms as black-boxes, <code class="language-plaintext highlighter-rouge">rlx</code> adopts an API that tries to expose more granular operation to the users which makes writing new algorithms easier. It is also useful for implementing task specific engineering into a known algorithm (as we know RL is very sensitive to small implementation engineerings).</p>
<p><a href="https://github.com/dasayan05/rlx" target="_blank" class="fa fa-github fa-3x" style="float: right;"></a></p>
<p>Concisely, <code class="language-plaintext highlighter-rouge">rlx</code> is supposed to</p>
<ol>
<li>Be generic (i.e., can be adopted for any task at hand)</li>
<li>Have modular lower-level components exposed to users</li>
<li>Be easy to implement new algorithms</li>
</ol>
<p>For the sake of completeness, it also provides a few popular algorithms as baselines (more to be added soon). Here’s a basic example of a PPO (with clipping) implementation with <code class="language-plaintext highlighter-rouge">rlx</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">base_rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">episode</span><span class="p">(</span><span class="n">horizon</span><span class="p">)</span> <span class="c1"># sample an episode as a 'Rollout' object
# 'rewards' and 'logprobs' for all timesteps
</span><span class="n">base_rewards</span><span class="p">,</span> <span class="n">base_logprobs</span> <span class="o">=</span> <span class="n">base_rollout</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">base_rollout</span><span class="p">.</span><span class="n">logprobs</span>
<span class="n">base_returns</span> <span class="o">=</span> <span class="n">base_rollout</span><span class="p">.</span><span class="n">mc_returns</span><span class="p">()</span> <span class="c1"># Monte-carlo estimates of 'returns'
</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k_epochs</span><span class="p">):</span>
<span class="c1"># 'evaluate' an episode against a policy and get a new 'Rollout' object
</span> <span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">evaluate</span><span class="p">(</span><span class="n">base_rollout</span><span class="p">)</span>
<span class="n">logprobs</span><span class="p">,</span> <span class="n">entropy</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">logprobs</span><span class="p">,</span> <span class="n">rollout</span><span class="p">.</span><span class="n">entropy</span> <span class="c1"># get 'logprobs' and 'entropy' for all timesteps
</span> <span class="n">values</span><span class="p">,</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">others</span> <span class="c1"># .. also 'value' estimates
</span>
<span class="n">ratios</span> <span class="o">=</span> <span class="p">(</span><span class="n">logprobs</span> <span class="o">-</span> <span class="n">base_logprobs</span><span class="p">.</span><span class="n">detach</span><span class="p">()).</span><span class="n">exp</span><span class="p">()</span>
<span class="n">advantage</span> <span class="o">=</span> <span class="n">base_returns</span> <span class="o">-</span> <span class="n">values</span>
<span class="n">policyloss</span> <span class="o">=</span> <span class="o">-</span> <span class="n">torch</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">ratios</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="n">ratios</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">clip</span><span class="p">,</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">clip</span><span class="p">))</span> <span class="o">*</span> <span class="n">advantage</span><span class="p">.</span><span class="n">detach</span><span class="p">()</span>
<span class="n">valueloss</span> <span class="o">=</span> <span class="n">advantage</span><span class="p">.</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">policyloss</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">valueloss</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">-</span> <span class="n">entropy</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">*</span> <span class="mf">0.01</span>
<span class="n">agent</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">agent</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
</code></pre></div></div>
<p>This is all you have to write to get PPO running.</p>
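<p>For readers unfamiliar with the objective being computed above, here is the clipped surrogate written out in plain NumPy, independent of <code class="language-plaintext highlighter-rouge">rlx</code>. It mirrors the structure of the snippet above (min of the raw and clipped importance ratios, times the advantage); all names are my own:</p>

```python
import numpy as np

def ppo_clip_loss(logprobs, base_logprobs, advantage, clip=0.2):
    """Clipped PPO policy loss: -sum( min(r, clip(r, 1-eps, 1+eps)) * A ),
    where r = exp(logpi_new - logpi_old)."""
    ratios = np.exp(logprobs - base_logprobs)
    clipped = np.clip(ratios, 1.0 - clip, 1.0 + clip)
    return -(np.minimum(ratios, clipped) * advantage).sum()

# Toy numbers: when the new and old policies agree, every ratio is 1
# and the loss reduces to -sum(advantage)
lp = np.array([-1.0, -0.5, -2.0])
adv = np.array([0.5, -0.2, 1.0])
print(ppo_clip_loss(lp, lp, adv))
```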
<h2 id="design-and-api">Design and API</h2>
<p>The user needs to provide a parametric function that defines the computation at <em>each time-step</em> and follows a specific signature (i.e., <code class="language-plaintext highlighter-rouge">rlx.Parametric</code>). <code class="language-plaintext highlighter-rouge">rlx</code> will take care of the rest, e.g., tying the time-steps together to form full rollouts, preserving recurrence (it works seamlessly with recurrent policies), etc.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PolicyValueModule</span><span class="p">(</span><span class="n">rlx</span><span class="p">.</span><span class="n">Parametric</span><span class="p">):</span>
<span class="s">""" Recurrent policy network with state-value (baseline) prediction """</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">states</span><span class="p">):</span>
<span class="c1"># Recurrent state from the last time-step will come in automatically
</span> <span class="n">recur_state</span><span class="p">,</span> <span class="n">state</span> <span class="o">=</span> <span class="n">states</span>
<span class="p">...</span>
<span class="n">action1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">Normal</span><span class="p">(...)</span>
<span class="n">action2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(...)</span>
<span class="n">state_value</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_value_net</span><span class="p">(...)</span>
<span class="k">return</span> <span class="n">next_recur_state</span><span class="p">,</span> <span class="n">rlx</span><span class="p">.</span><span class="n">ActionDistribution</span><span class="p">(</span><span class="n">action1</span><span class="p">,</span> <span class="n">action2</span><span class="p">,</span> <span class="p">...),</span> <span class="n">state_value</span>
<span class="n">network</span> <span class="o">=</span> <span class="n">PolicyValueModule</span><span class="p">(...)</span>
</code></pre></div></div>
<p>While the <code class="language-plaintext highlighter-rouge">next_recur_state</code> and <code class="language-plaintext highlighter-rouge">state_value</code> are optional (i.e., can be <code class="language-plaintext highlighter-rouge">None</code>), a multi-component action distribution must be returned; <code class="language-plaintext highlighter-rouge">rlx</code> will take care of sampling from it and computing log-probabilities. The first two return values are necessary, the rest are optional. You can return any number of quantities after the first two as <em>extras</em> - they will all be tracked.</p>
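<p>The key property of a multi-component action distribution is that, with independent components, the joint log-probability is simply the sum of the per-component log-probabilities. Below is a tiny pure-Python sketch of this idea; it is my own toy stand-in, not <code class="language-plaintext highlighter-rouge">rlx</code>'s actual <code class="language-plaintext highlighter-rouge">ActionDistribution</code> implementation:</p>

```python
import math
import random

class Categorical:
    """Minimal categorical distribution over {0, .., n-1}."""
    def __init__(self, probs):
        self.probs = probs
    def sample(self):
        return random.choices(range(len(self.probs)), self.probs)[0]
    def log_prob(self, a):
        return math.log(self.probs[a])

class MultiActionDistribution:
    """Toy stand-in for a multi-component action distribution: components
    are assumed independent, so the joint log-prob is the sum of parts."""
    def __init__(self, *components):
        self.components = components
    def sample(self):
        return tuple(c.sample() for c in self.components)
    def log_prob(self, actions):
        return sum(c.log_prob(a) for c, a in zip(self.components, actions))

dist = MultiActionDistribution(Categorical([0.5, 0.5]), Categorical([0.25, 0.75]))
action = dist.sample()          # e.g. (0, 1)
lp = dist.log_prob(action)      # log p(a1) + log p(a2)
```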
<hr />
<p>The design is centered around the primary data structure <code class="language-plaintext highlighter-rouge">Rollout</code> which can hold a sequence of experience tuples <code class="language-plaintext highlighter-rouge">(state, action, reward)</code>, action distributions and any arbitrary quantity returned from the <code class="language-plaintext highlighter-rouge">rlx.Parametric.forward()</code>. <code class="language-plaintext highlighter-rouge">Rollout</code> internally keeps track of the computation graph (if necessary/requested). One has to sample a <code class="language-plaintext highlighter-rouge">Rollout</code> instance by running the agent in the environment. The rollout can then provide quantities like log-probs and anything else that was tracked, upon request.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">set_grad_enabled</span><span class="p">(...):</span>
<span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">network</span><span class="p">).</span><span class="n">episode</span><span class="p">(...,</span> <span class="n">dry</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">rollout</span><span class="p">.</span><span class="n">mc_returns</span><span class="p">()</span> <span class="c1"># populate its 'returns' property to naive Monte-Carlo returns
</span> <span class="n">logprobs</span><span class="p">,</span> <span class="n">returns</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">logprobs</span><span class="p">,</span> <span class="n">rollout</span><span class="p">.</span><span class="n">returns</span>
<span class="n">values</span><span class="p">,</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">others</span> <span class="c1"># any 'extra' quantity computed will be available as rollout.others
</span></code></pre></div></div>
<p>We can enable/disable gradients in the usual PyTorch way (i.e., <code class="language-plaintext highlighter-rouge">torch.set_grad_enabled(..)</code> etc.).</p>
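<p><code class="language-plaintext highlighter-rouge">mc_returns()</code> above populates naive Monte-Carlo return estimates; conceptually this is just a backward scan over the episode's rewards. Here is a minimal standalone sketch of that idea (my own code with an assumed discount-factor argument, not <code class="language-plaintext highlighter-rouge">rlx</code>'s actual implementation):</p>

```python
def mc_returns(rewards, gamma=0.99):
    """Naive Monte-Carlo returns: G_t = r_t + gamma * G_{t+1},
    computed by scanning the reward sequence backwards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# With gamma = 0.5, rewards [1, 1, 1] give returns [1.75, 1.5, 1.0]
print(mc_returns([1.0, 1.0, 1.0], gamma=0.5))
```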
<p>The flag <code class="language-plaintext highlighter-rouge">dry=True</code> means the rollout instance will only hold <code class="language-plaintext highlighter-rouge">(state, action, reward)</code> tuples and nothing else. This design allows rollouts to be re-evaluated against another policy, as required by some algorithms (like PPO). Such dry rollouts cannot offer logprobs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 'rollout' is not dry, it has computation graph attached
</span><span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">other_policy</span><span class="p">).</span><span class="n">evaluate</span><span class="p">(</span><span class="n">dry_rollout</span><span class="p">)</span>
</code></pre></div></div>
<p>This API has another benefit. One can sample an episode from a policy in dry-mode, then <code class="language-plaintext highlighter-rouge">.vectorize()</code> it and re-evaluate it against the same policy. This brings in computational benefits.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="n">dry_rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">episode</span><span class="p">(...,</span> <span class="n">dry</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">dry_rollout_vec</span> <span class="o">=</span> <span class="n">dry_rollout</span><span class="p">.</span><span class="n">vectorize</span><span class="p">()</span> <span class="c1"># internally creates a batch dimension for efficient processing
</span><span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">evaluate</span><span class="p">(</span><span class="n">dry_rollout_vec</span><span class="p">)</span>
</code></pre></div></div>
<p>If the rollout is not dry and gradients were enabled, one can directly do a backward pass</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">rollout</span><span class="p">.</span><span class="n">logprobs</span> <span class="o">*</span> <span class="n">rollout</span><span class="p">.</span><span class="n">returns</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
<span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
</code></pre></div></div>
<hr />
<p>As you might have noticed, the network is not a part of the agent. In fact, the agent only holds a copy of the environment and nothing else. One needs to <em>augment</em> the agent with a network in order for it to sample episodes. This design allows us to easily run the agent with a different policy, for example, a “behavior policy” in off-policy RL.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>behaviour_rollout = agent(behavior_policy).episode(...)
behaviour_logprobs = behaviour_rollout.logprobs # record them for computing importance ratio afterwards
</code></pre></div></div>
<hr />
<p><code class="language-plaintext highlighter-rouge">Rollout</code> has a nice API which is useful for writing customized algorithms or implementation tricks. We can:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># shuffle rollouts ..
</span><span class="n">rollout</span><span class="p">.</span><span class="n">shuffle</span><span class="p">()</span>
<span class="c1"># .. index/slice them
</span><span class="n">rollout</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># remove the end-state
</span><span class="n">rollout</span><span class="p">[:</span><span class="mi">100</span><span class="p">]</span> <span class="c1"># recurrent rollouts can be too long (RNNs have long-term memory problems)
</span>
<span class="c1"># .. or even concat them
</span><span class="p">(</span><span class="n">rollout1</span> <span class="o">+</span> <span class="n">rollout2</span><span class="p">).</span><span class="n">vectorize</span><span class="p">()</span>
</code></pre></div></div>
<p>NOTE: I will write more docs if I get time. Follow the algorithm implementations at <code class="language-plaintext highlighter-rouge">rlx/algos/*</code> for more API usage.</p>
<h2 id="installation-and-usage">Installation and usage</h2>
<p>Right now, there is no <code class="language-plaintext highlighter-rouge">pip</code> package; it’s just this repo. You can install it by cloning the repo and running</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install .
</code></pre></div></div>
<p>For example usage, follow the <code class="language-plaintext highlighter-rouge">main.py</code> script. You can test an algorithm by</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python main.py --algo ppo --policytype rnn --batch_size 16 --max_episode 5000 --horizon 200 --env CartPole-v0 --standardize_return
</code></pre></div></div>
<p>The meaning of batch-size is a little different here: it is the number of rollouts over which the gradient is averaged (currently, that’s how it’s done).</p>
<h2 id="experiments">Experiments</h2>
<ul>
<li>Basic environments</li>
</ul>
<p>The “Incomplete”-prefixed environments are examples of POMDPs: their state representations have been masked to create partial observability, so they can only be solved by recurrent policies.</p>
<center>
<img src="/public/proj_res/4/exp.png" />
</center>
<ul>
<li>A little modified (simplified) <code class="language-plaintext highlighter-rouge">SlimeVolleyGym-v0</code> <a href="https://github.com/hardmaru/slimevolleygym">environment by David Ha</a>. An MLP agent trained with PPO learns to play volleyball from self-play experience; the example is provided at <code class="language-plaintext highlighter-rouge">examples/slime.py</code>.</li>
</ul>
<center>
<img width="80%" src="/public/proj_res/4/volley.gif" />
</center>
<hr />
<h2 id="plans">Plans</h2>
<p>Currently <code class="language-plaintext highlighter-rouge">rlx</code> has the following algorithms, but it is <strong>under active development</strong>.</p>
<ol>
<li>Vanilla REINFORCE</li>
<li>REINFORCE with Value-baseline</li>
<li>A2C</li>
<li>PPO with clipping</li>
<li>OffPAC</li>
</ol>
<h4 id="todo">TODO:</h4>
<ol>
<li>More SOTA algorithms (DQN, DDPG, etc.) to be implemented</li>
<li>Create a uniform API/interface to support Q-learning algorithm</li>
<li>Multiprocessing/Parallelization support</li>
</ol>
<h4 id="contributions">Contributions</h4>
<p>You are more than welcome to contribute anything.</p>Ayan Dasrlx is a Deep RL library written on top of PyTorch & built for educational and research purpose. Majority of the libraries/codebases for Deep RL are geared more towards reproduction of state-of-the-art algorithms on very specific tasks (e.g. Atari games etc.), but rlx is NOT. It is supposed to be more expressive and modular. Rather than making RL algorithms as black-boxes, rlx adopts an API that tries to expose more granular operation to the users which makes writing new algorithms easier. It is also useful for implementing task specific engineering into a known algorithm (as we know RL is very sensitive to small implementation engineerings).