Jekyll2021-12-07T08:19:17+00:00https://ayandas.me/feed.xmlAyan Das<b>Deep Learning</b> enthusiast; <b>Ph.D. Student</b> @ <a href="https://www.surrey.ac.uk/">University of Surrey</a>, United KingdomAyan Dasa.das@surrey.ac.ukGenerative modelling with Score Functions2021-07-14T00:00:00+00:002021-07-14T00:00:00+00:00https://ayandas.me/blog-tut/2021/07/14/generative-model-score-function<p>Generative Models are of immense interest in fundamental research due to their ability to model the “all important” data distribution. A large class of generative models fall into the category of <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Probabilistic Graphical Models</a> or PGMs. PGMs (e.g. <a href="/blog-tut/2020/01/01/variational-autoencoder.html">VAE</a>) usually train a parametric distribution (encoded in the form of a graph structure) by maximizing log-likelihood, and sample from it by ancestral sampling. GANs, another popular class of generative models, take a different approach for training as well as sampling. Both classes of models, however, suffer from several drawbacks, e.g. difficult log-likelihood computation, unstable training etc. Recently, efforts have been made to craft generative models that inherit all the good qualities of the existing ones. One of the rising classes of generative models is called “<em>Score based Models</em>”. Rather than explicitly maximizing the log-likelihood of a parametric density, such a model creates <em>a map to navigate the data space</em>. Once learned, sampling is done by <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_Langevin_dynamics">Langevin Dynamics</a>, an MCMC based method that actually navigates the data space using the map and lands on regions with high probability under the empirical data distribution (i.e. real data regions). In this article, we will describe the fundamentals of Score based models along with a few of their variants.</p>
<h2 id="traditional-maximum-likelihood-mle">Traditional maximum-likelihood (MLE)</h2>
<p>Traditional log-likelihood based approaches define a parametric generative process in terms of <a href="/blog-tut/2019/11/20/inference-in-pgm.html">graphical model</a> and maximize the joint density \(p_{\theta}(\mathbf{x})\) w.r.t its parameters \(\theta\)</p>
<p>\[\tag{1}
\theta^* = \arg\max_{\theta} \big[ \log p_{\theta}(\mathbf{x}) \big]
\]</p>
<p>The joint density is often quite complex and sometimes intractable. For intractable cases, we maximize a surrogate objective based on e.g. <a href="/blog-tut/2020/01/01/variational-autoencoder.html">Variational Inference</a>. We achieve the above in practice by moving the parameters in the direction where the expected log-likelihood increases the most at every step \(t\). The expectation is computed empirically at points sampled from our dataset, i.e. the unknown data distribution \(p_{\mathrm{data}}(\mathbf{x})\)</p>
<p>\[
\theta_{t+1} = \theta_t + \alpha \cdot \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}(\mathbf{x})} \big[ \nabla_{\theta} \log p_{\theta}(\mathbf{x}) \big]
\]</p>
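<p>As a minimal sketch of this update rule (not part of any particular framework), consider fitting only the mean of a unit-variance Gaussian. The toy dataset, the step size and the names <code>grad_log_likelihood</code>, <code>mu</code> and <code>alpha</code> are all assumptions made for illustration.</p>

```python
import random

random.seed(0)

# Toy dataset standing in for samples from p_data (assumed here to be N(3, 1)).
data = [random.gauss(3.0, 1.0) for _ in range(1000)]


def grad_log_likelihood(mu, x):
    # For p_theta = N(mu, 1): d/dmu log p_theta(x) = (x - mu)
    return x - mu


mu = 0.0     # initial parameter theta_0
alpha = 0.1  # step size
for t in range(200):
    # Empirical expectation of the gradient over the dataset
    expected_grad = sum(grad_log_likelihood(mu, x) for x in data) / len(data)
    # Gradient ascent step on the expected log-likelihood
    mu = mu + alpha * expected_grad

sample_mean = sum(data) / len(data)
```

<p>For this toy model the iterates converge to the sample mean, which is the maximum-likelihood estimate of the Gaussian mean.</p>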
<p>With a trained set of parameters \(\theta^*\), we sample from the model with ancestral sampling by exploiting its graphical structure</p>
<p>\[\tag{2}
\mathbf{x}_{\mathrm{sample}} \sim p_{\theta^*}(\mathbf{x})
\]</p>
<p>There is one annoying requirement both in (1) and (2): the parametric model \(p_{\theta}(\mathbf{x})\) must be a valid density. We ensure this by building the model only by carefully combining known densities like Gaussian, Bernoulli, Dirichlet etc. Even though they are largely sufficient in terms of expressiveness, it might feel a bit too restrictive from a system designer’s perspective.</p>
<h2 id="score-based-models-sbms">Score based models (SBMs)</h2>
<p>A new and emerging class of generative models, namely “Score based models (SBM)”, entirely sidesteps log-likelihood modelling and approaches the problem in a different way. Specifically, SBMs attempt to learn a <em>navigation map</em> on the data space which guides any point on that space to reach a region highly probable under the data distribution \(p_{\mathrm{data}}(\mathbf{x})\). A little careful thought on this leads us to something formally known as the <em>Score function</em>. The “Score” of an arbitrary point \(\mathbf{x} \in \mathbb{R}^d\) on the data space is essentially the gradient of the <em>true</em> data log-likelihood at that point</p>
<p>\[\tag{3}
\mathbf{s}(\mathbf{x}) \triangleq \nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x}) \in \mathbb{R}^d
\]</p>
<p>Please be careful and notice that the quantity on the right hand side of (3), i.e. \(\nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x})\) is <strong>not</strong> the same as \(\nabla_{\theta} \log p_{\theta}(\mathbf{x})\), the quantity we encountered earlier (in the MLE setup), even though they look structurally similar.</p>
<p>Given <em>any</em> point on the data space, the score tells us which direction to navigate if we would like to see a region with higher likelihood. Unsurprisingly, if we take a small enough step in the direction suggested by the score, we get a point \((\mathbf{x} + \alpha \cdot \mathbf{s}(\mathbf{x}))\) that is slightly more likely under \(p_{\mathrm{data}}(\mathbf{x})\). This is why I call it a “navigation map”, as in a guiding document that tells us the direction of the “treasure” (i.e. real data samples). All an SBM does is try to approximate the true score function via a parametric proxy</p>
<p>\[
\mathbf{s}_{\theta}(\mathbf{x}) \approx \mathbf{s}(\mathbf{x}) \triangleq \nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x})
\]</p>
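<p>To make the “navigation map” concrete, here is a small sketch for a case where the true score is known in closed form. The 1-D Gaussian standing in for \(p_{\mathrm{data}}\), its parameters <code>MU</code> and <code>SIGMA</code>, and the step size are assumptions chosen purely for illustration.</p>

```python
import math

MU, SIGMA = 2.0, 1.5  # assumed parameters of a toy 1-D "data" density


def log_p(x):
    # log N(x; MU, SIGMA^2)
    return -0.5 * ((x - MU) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2 * math.pi))


def score(x):
    # d/dx log N(x; MU, SIGMA^2) = -(x - MU) / SIGMA^2
    return -(x - MU) / SIGMA ** 2


x = -1.0
alpha = 0.1
# One small step in the direction suggested by the score
x_new = x + alpha * score(x)
```

<p>The score points towards the mode at <code>MU</code>, so <code>x_new</code> has a higher log-density than <code>x</code>.</p>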
<p>As simple as it might sound, we construct a regression problem with \(\mathbf{s}(\mathbf{x})\) as regression targets. We minimize the following loss</p>
<p>\[
J(\theta) = \mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{x}) - \mathbf{s}(\mathbf{x}) \right|\right|_2^2 \bigg] = \mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{x}) - \nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x}) \right|\right|_2^2 \bigg]
\]</p>
<p>This is known as <em>Score Matching</em>. Once trained, we simply keep moving in the direction suggested by \(\mathbf{s}_{\theta^*}(\mathbf{x})\), starting from any random \(\mathbf{x}\), over a finite time horizon \(T\). In practice, we move with a little bit of stochasticity - the formal procedure is known as <em>Langevin Dynamics</em>.</p>
<p>\[\tag{4}
\mathbf{x}_{t+1} = \mathbf{x}_{t} + \alpha_t \cdot \mathbf{s}_{\theta^*}(\mathbf{x}_t) + \sqrt{2 \alpha_t} \cdot \mathbf{z}
\]</p>
<p>\(\mathbf{z} \sim \mathcal{N}(0, I)\) is the injected Gaussian noise. If \(\alpha_t \rightarrow 0\) as \(t \rightarrow \infty\), this process guarantees \(\mathbf{x}_t\) to be a true sample from \(p_{\mathrm{data}}(\mathbf{x})\). In practice, we run this process for a finite number of steps \(T\) and assign \(\alpha_t\) according to a decaying schedule. Please refer to <a href="https://www.stats.ox.ac.uk/~teh/research/compstats/WelTeh2011a.pdf">the original paper</a> for a detailed discussion.</p>
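<p>Eq. 4 can be sketched in a few lines when the score is available. In the snippet below, the exact score of a 1-D Gaussian stands in for the learned \(\mathbf{s}_{\theta^*}\), and a constant step size replaces the decaying schedule; both are simplifying assumptions, so the samples are only approximately distributed as the target.</p>

```python
import random

random.seed(0)

MU, SIGMA = 2.0, 0.5  # assumed toy target: N(MU, SIGMA^2)


def score(x):
    # Exact score of the toy target; a stand-in for a trained score network
    return -(x - MU) / SIGMA ** 2


def langevin_sample(x0, alpha=0.01, num_steps=500):
    x = x0
    for _ in range(num_steps):
        z = random.gauss(0.0, 1.0)
        # Eq. 4: drift along the score plus injected Gaussian noise
        x = x + alpha * score(x) + (2 * alpha) ** 0.5 * z
    return x


# Start every chain from an arbitrary point and collect the final states
samples = [langevin_sample(random.uniform(-5.0, 5.0)) for _ in range(1000)]
mean = sum(samples) / len(samples)
std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
```

<p>The empirical mean and standard deviation of <code>samples</code> should land close to <code>MU</code> and <code>SIGMA</code>, regardless of where the chains started.</p>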
<p>Looks all good. But, there are two problems with optimizing \(J(\theta)\).</p>
<ul>
<li>
<p><strong>Problem 1:</strong> The very obvious one; we don’t have access to the true scores \(\mathbf{s}(\mathbf{x}) \triangleq \nabla_{\mathbf{x}} \log p_{\mathrm{data}}(\mathbf{x})\). No one knows the exact form of \(p_{\mathrm{data}}(\mathbf{x})\).</p>
</li>
<li>
<p><strong>Problem 2:</strong> The not-so-obvious one; the expectation \(\mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})}\) is a bit problematic. Ideally, the objective must encourage learning the scores all over the data space (i.e. for every \(\mathbf{x} \in \mathbb{R}^d\)). But this isn’t possible with an expectation over only the data distribution. The regions of the data space which are unlikely under \(p_{\mathrm{data}}(\mathbf{x})\) do not get enough supervisory signals.</p>
</li>
</ul>
<h2 id="implicit-score-matching-ism">Implicit Score Matching (ISM)</h2>
<p><a href="https://jmlr.org/papers/volume6/hyvarinen05a/old.pdf">Aapo Hyvärinen, 2005</a> solved the first problem quite elegantly and proposed the <em>Implicit Score Matching</em> objective \(J_{\mathrm{I}}(\theta)\) and showed it to be equivalent to \(J(\theta)\) under some mild regularity conditions. The following remarkable result was shown in the original paper</p>
<p>\[
J_{\mathrm{I}}(\theta) = \mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \bigg[ \frac{1}{2} \left|\left| \mathbf{s}_{\theta}(\mathbf{x}) \right|\right|^2 + \mathrm{tr}(\nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x})) \bigg]
\]</p>
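<p>A tiny 1-D sketch may help to see this objective at work. Assume (purely for illustration) a score model of the form \(s_{\theta}(x) = -(x - m)/v\), i.e. the score of a Gaussian with mean \(m\) and variance \(v\); its derivative w.r.t. \(x\) is \(-1/v\), which plays the role of the trace term. The dataset and the names <code>ism_loss</code>, <code>m</code>, <code>v</code> are hypothetical.</p>

```python
import random

random.seed(0)

# Toy dataset; its true density is never used by the loss below
data = [random.gauss(1.0, 2.0) for _ in range(5000)]


def ism_loss(m, v, xs):
    total = 0.0
    for x in xs:
        s = -(x - m) / v   # model score s_theta(x)
        ds_dx = -1.0 / v   # derivative of the score (1-D "trace of the Jacobian")
        total += 0.5 * s * s + ds_dx
    return total / len(xs)


mean = sum(data) / len(data)
var = sum((x - mean) ** 2 for x in data) / len(data)
best = ism_loss(mean, var, data)
```

<p>For this model family the loss is minimized exactly at the sample mean and variance, even though the loss never touches the true scores.</p>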
<p>The reason it’s known to be “remarkable” is the fact that \(J_{\mathrm{I}}(\theta)\) does not require the true target scores \(\mathbf{s}(\mathbf{x})\) <em>at all</em>. All we need is to compute an expectation w.r.t the data distribution, which can be implemented using finite samples from our dataset. One practical problem with this objective is the amount of computation involved in the Jacobian \(\nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x})\). Later, <a href="http://auai.org/uai2019/proceedings/papers/204.pdf">Song et al., 2019</a> proposed to use <a href="https://www.tandfonline.com/doi/abs/10.1080/03610919008812866">Hutchinson’s trace estimator</a>, a stochastic estimator for computing the trace of a matrix, which simplified the objective further</p>
<p>\[
J_{\mathrm{I}}(\theta) = \mathbb{E}_{p_{\mathbf{v}}} \mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \bigg[ \frac{1}{2} \left|\left| \mathbf{s}_{\theta}(\mathbf{x}) \right|\right|^2 + \mathbf{v}^T \nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x}) \mathbf{v} \bigg]
\]</p>
<p>where \(\mathbf{v} \sim \mathcal{N}(0, \mathbf{I}) \in \mathbb{R}^d\) is a standard multivariate gaussian RV. This objective is computationally advantageous when used in conjunction with automatic differentiation frameworks (e.g. PyTorch) which can efficiently compute the vector-jacobian product (VJP), namely \(\mathbf{v}^T \nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x})\).</p>
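<p>The trace estimator itself is easy to sketch in isolation. Below, a small fixed matrix <code>A</code> (an arbitrary choice for illustration) stands in for the Jacobian \(\nabla_{\mathbf{x}} \mathbf{s}_{\theta}(\mathbf{x})\), and the quadratic form \(\mathbf{v}^T A \mathbf{v}\) is averaged over Gaussian probes. In a real model one would never build the matrix explicitly; the product would come from automatic differentiation.</p>

```python
import random

random.seed(0)

# Arbitrary matrix standing in for the Jacobian of the score model
A = [[2.0, -1.0, 0.5],
     [0.3, 1.0, 4.0],
     [-2.0, 0.7, 3.0]]


def quad_form(v, M):
    # Computes v^T M v
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))


def hutchinson_trace(M, num_probes=20000):
    n = len(M)
    total = 0.0
    for _ in range(num_probes):
        # Probe vector v ~ N(0, I)
        v = [random.gauss(0.0, 1.0) for _ in range(n)]
        total += quad_form(v, M)
    return total / num_probes


exact = sum(A[i][i] for i in range(len(A)))
estimate = hutchinson_trace(A)
```

<p>The estimate is unbiased because \(\mathbb{E}[\mathbf{v}\mathbf{v}^T] = \mathbf{I}\); its variance shrinks as more probes are averaged.</p>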
<h2 id="denoising-score-matching-dsm">Denoising Score Matching (DSM)</h2>
<p>In a different approach, <a href="https://www.iro.umontreal.ca/~vincentp/Publications/DenoisingScoreMatching_NeuralComp2011.pdf">Pascal Vincent, 2011</a> investigated the “unsuspected link” between Score Matching and <a href="https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf">Denoising Autoencoders</a>. This work led to a very efficient and practical objective that is used even in cutting-edge Score based models. Termed “Denoising Score Matching (DSM)”, this approach mitigates both problems 1 and 2 described above and does so quite elegantly.</p>
<p>To get rid of problem 2, DSM proposes to simply use a noise-perturbed version of the dataset, i.e. replace \(p_{\mathrm{data}}(\mathbf{x})\) with \(p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}})\) where</p>
<p>\[
p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}) = \int p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}, \mathbf{x}) d\mathbf{x} \text{, with } p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}, \mathbf{x}) = p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} | \mathbf{x}) \cdot p_{\mathrm{data}}(\mathbf{x})
\]</p>
<p>The above equation basically tells us to create a perturbed/corrupted version of the original dataset by adding simple isotropic Gaussian noise whose strength is controlled by \(\sigma\), the std deviation of the Gaussian. Since a Gaussian distribution has support over the entire space \(\mathbb{R}^d\), corrupted data samples populate a much larger region of the space and help the parameterized score function learn in regions which were originally unreachable under \(p_{\mathrm{data}}(\mathbf{x})\). The denoising objective \(J_{\mathrm{D}}(\theta)\) simply becomes</p>
<p>\[
J_{\mathrm{D}}(\theta) = \mathbb{E}_{p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{\tilde{x}}) - \nabla_{\mathbf{\tilde{x}}} \log p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}) \right|\right|_2^2 \bigg]
\]</p>
<p>With a crucial proof shown in the appendix of the <a href="https://www.iro.umontreal.ca/~vincentp/Publications/DenoisingScoreMatching_NeuralComp2011.pdf">original paper</a>, we can have an equivalent (changes shown in magenta) version of \(J_{\mathrm{D}}(\theta)\) as</p>
<p>\[\tag{5}
J_{\mathrm{D}}(\theta) = \mathbb{E}_{p_{\mathrm{data}}^{\sigma}(\color{magenta}{\mathbf{\tilde{x}}, \mathbf{x}})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{\tilde{x}}) - \nabla_{\mathbf{\tilde{x}}} \log \color{magenta}{p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} | \mathbf{x})} \right|\right|_2^2 \bigg]
\]</p>
<p>Note that we now need original-corrupt data pairs \((\mathbf{\tilde{x}}, \mathbf{x})\) in order to compute the expectation, which is quite trivial to do. Also realize that the term \(\nabla_{\mathbf{\tilde{x}}} \log p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} \vert \mathbf{x})\) is not the data score but related only to the pre-specified noise model with quite an easy analytic form</p>
<p>\[
\nabla_{\mathbf{\tilde{x}}} \log p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} \vert \mathbf{x}) = - \frac{1}{\sigma^2} (\mathbf{\tilde{x}} - \mathbf{x})
\]</p>
<p>The score function we learn this way isn’t actually for our original data distribution \(p_{\mathrm{data}}(\mathbf{x})\), but rather for the corrupted data distribution \(p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}})\). The strength \(\sigma\) decides how well it aligns with the original distribution. If \(\sigma\) is large, we end up learning an overly corrupted version of the data distribution; if \(\sigma\) is small, we no longer get the nice property out of the noise perturbation - so there is a trade-off. Recently, this trade-off has been utilized for learning robust score based models.</p>
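<p>This can be checked numerically on a toy problem. Assume, for illustration only, that the data is \(\mathcal{N}(0, 1)\) and that the score model is linear, \(s_{\theta}(\tilde{x}) = w \tilde{x}\). The perturbed distribution is then \(\mathcal{N}(0, 1 + \sigma^2)\), whose score has slope \(-1/(1+\sigma^2)\), while the clean score has slope \(-1\). The sketch below minimizes Eq. 5 in closed form (ordinary least squares).</p>

```python
import random

random.seed(0)

SIGMA = 0.5  # noise strength (assumed value)

# Clean samples x and their corrupted versions x_tilde = x + SIGMA * noise
clean = [random.gauss(0.0, 1.0) for _ in range(20000)]
noisy = [x + SIGMA * random.gauss(0.0, 1.0) for x in clean]

# DSM regression targets: gradient of log N(x_tilde; x, SIGMA^2) w.r.t. x_tilde
targets = [-(xt - x) / SIGMA ** 2 for xt, x in zip(noisy, clean)]

# Linear score model s(x_tilde) = w * x_tilde; least-squares minimizer of Eq. 5
w = sum(xt * t for xt, t in zip(noisy, targets)) / sum(xt * xt for xt in noisy)

w_perturbed = -1.0 / (1.0 + SIGMA ** 2)  # score slope of N(0, 1 + SIGMA^2)
w_clean = -1.0                           # score slope of the clean N(0, 1)
```

<p>The fitted slope <code>w</code> lands near <code>w_perturbed</code> rather than <code>w_clean</code>, matching the statement that DSM learns the score of the corrupted distribution.</p>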
<p>Moreover, Eq. 5 has a very intuitive interpretation and this is where <a href="https://www.iro.umontreal.ca/~vincentp/Publications/DenoisingScoreMatching_NeuralComp2011.pdf">Pascal Vincent, 2011</a> uncovered the link between DSM and Denoising Autoencoders. A closer look at Eq. 5 reveals that \(\mathbf{s}_{\theta}(\mathbf{\tilde{x}})\) has a learning target of \(- \frac{1}{\sigma^2} (\mathbf{\tilde{x}} - \mathbf{x})\), which can be interpreted as a (scaled) vector pointing from the corrupted sample towards the real sample. Succinctly, the score function is trying to learn how to “de-noise” a corrupted sample - that’s precisely what Denoising Autoencoders do.</p>
<h2 id="noise-conditioned-score-network-ncsn">Noise Conditioned Score Network (NCSN)</h2>
<p>The idea presented in <a href="https://openreview.net/pdf?id=B1lcYrBgLH">Song et al., 2020</a> is to have \(L\) different noise-perturbed data distributions (with different \(\sigma\)) and one score function for each of them. The noise strengths are chosen to be \(\sigma_1 > \sigma_2 > \cdots > \sigma_L\), so that \(p_{\mathrm{data}}^{\sigma_1}(\mathbf{\tilde{x}})\) is the most corrupted distribution and \(p_{\mathrm{data}}^{\sigma_L}(\mathbf{\tilde{x}})\) is the least. Also, instead of having \(L\) separate score functions, we use one shared score function conditioned on \(\sigma\), i.e. \(\mathbf{s}_{\theta}(\mathbf{\tilde{x}}; \sigma)\).</p>
<p>We finally learn the shared score function from the ensemble of \(L\) distributions</p>
<p>\[
J_{\mathrm{ncsn}}(\theta) = \frac{1}{L} \sum_{l=1}^L \sigma_l^2 \cdot J_{\mathrm{D}}^{\sigma_l}(\theta)
\]</p>
<p>where \(J_{\mathrm{D}}^{\sigma}(\theta)\) is the same as Eq. 5 but uses the shared score network conditioned on \(\sigma\)</p>
<p>\[
J_{\mathrm{D}}^{\sigma}(\theta) = \mathbb{E}_{p_{\mathrm{data}}^{\sigma}(\mathbf{\tilde{x}}, \mathbf{x})} \bigg[ \left|\left| \mathbf{s}_{\theta}(\mathbf{\tilde{x}}; \sigma) - \nabla_{\mathbf{\tilde{x}}} \log p_{\mathcal{N}}^{\sigma}(\mathbf{\tilde{x}} | \mathbf{x}) \right|\right|_2^2 \bigg]
\]</p>
<p>In order to sample, <a href="https://openreview.net/pdf?id=B1lcYrBgLH">Song et al., 2020</a> proposed a modified version of Langevin Dynamics termed as “Annealed Langevin Dynamics”. The idea is simple: we start from a random sample and run the Langevin Dynamics (Eq. 4) using \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma_1)\) for \(T\) steps. We use the final sample as the initial starting point for the next Langevin Dynamics with \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma_2)\). We repeat this process till we get the final sample from \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma_L)\). The intuition here is to sample at coarse level first and gradually fine-tune it to get high quality samples. The exact algorithm is depicted in Algorithm 1 of <a href="https://openreview.net/pdf?id=B1lcYrBgLH">Song et al., 2020</a>.</p>
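<p>The annealing loop can be sketched as follows. The exact scores of the perturbed distributions of a 1-D Gaussian stand in for the trained \(\mathbf{s}_{\theta^*}(\mathbf{\tilde{x}}; \sigma)\), the inner update is Eq. 4, and the step size is scaled proportionally to \(\sigma_l^2\). The noise levels, <code>eps</code> and the number of steps are assumptions of this sketch, so please consult the paper for the exact algorithm.</p>

```python
import random

random.seed(0)

MU, S = 2.0, 0.5              # assumed toy data distribution: N(MU, S^2)
SIGMAS = [3.0, 1.0, 0.3, 0.1]  # noise levels, from most to least corrupted


def perturbed_score(x, sigma):
    # Exact score of N(MU, S^2 + sigma^2); stands in for s_theta(x; sigma)
    return -(x - MU) / (S ** 2 + sigma ** 2)


def annealed_langevin(x0, eps=0.005, steps_per_level=50):
    x = x0
    for sigma in SIGMAS:
        # Larger noise level -> larger step size (assumed scaling rule)
        alpha = eps * sigma ** 2 / SIGMAS[-1] ** 2
        for _ in range(steps_per_level):
            z = random.gauss(0.0, 1.0)
            x = x + alpha * perturbed_score(x, sigma) + (2 * alpha) ** 0.5 * z
        # The final x of this level initializes the next, less noisy level
    return x


samples = [annealed_langevin(random.uniform(-8.0, 8.0)) for _ in range(500)]
mean = sum(samples) / len(samples)
std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
```

<p>The early, noisy levels move the chains quickly towards the bulk of the data, while the last levels refine them; the final samples concentrate around <code>MU</code> with a spread close to <code>S</code>.</p>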
<h2 id="connection-to-stochastic-differential-equations">Connection to Stochastic Differential Equations</h2>
<p>Recently, <a href="https://arxiv.org/pdf/2011.13456.pdf">Song et al., 2021</a> have established a surprising connection between Score Models, <a href="https://arxiv.org/abs/1503.03585">Diffusion Models</a> and Stochastic Differential Equations (SDEs). Diffusion Models are another rising class of generative models fundamentally similar to score based models but with some notable differences. Since we did not discuss Diffusion Models in this article, we cannot fully explain the connection and how to properly utilize it. However, I would like to show a brief preview of where exactly SDEs show up within the material discussed in this article.</p>
<p><a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">Stochastic Differential Equations (SDEs)</a> are stochastic dynamical systems with state \(\mathbf{x}(t)\), characterized by a <em>Drift function</em> \(f(\mathbf{x}, t)\) and a <em>Diffusion function</em> \(g(\mathbf{x}, t)\)</p>
<p>\[
d \mathbf{x}(t) = f(\mathbf{x}, t) dt + g(\mathbf{x}, t) d\omega(t)
\]</p>
<p>where \(\omega(t)\) denotes the <a href="https://en.wikipedia.org/wiki/Wiener_process">Wiener Process</a> and \(d\omega(t) \sim \mathcal{N}(0, dt)\). Discretizing the above equation in time yields</p>
<p>\[\tag{6}
\mathbf{x}_{t+\Delta t} - \mathbf{x}_t = f(\mathbf{x}_t, t) \Delta t + g(\mathbf{x}_t, t) \Delta \omega\text{, with }\Delta \omega \sim \mathcal{N}(0, \Delta t)
\]</p>
<p>To find a connection now, it is only a matter of comparing Eq. 6 with Eq. 4. The sampling process defined by Langevin Dynamics is essentially an SDE discretized in time with</p>
\[\Delta t = 1 \\
f(\mathbf{x}_t, t) = \alpha_t \cdot \mathbf{s}_{\theta^*}(\mathbf{x}_t) \\
g(\mathbf{x}_t, t) = \sqrt{2 \alpha_t} \\
\Delta \omega \equiv \mathbf{z}\]
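<p>The correspondence can be verified mechanically. The sketch below implements one step of Eq. 6 (writing \(\Delta \omega = \sqrt{\Delta t} \cdot z\) with \(z \sim \mathcal{N}(0, 1)\)) and one step of Eq. 4, then plugs in the drift and diffusion listed above. The toy score function and the numeric values are assumptions made for illustration.</p>

```python
import math


def euler_maruyama_step(x, t, f, g, dt, z):
    # Eq. 6 with delta_omega = sqrt(dt) * z, where z ~ N(0, 1)
    return x + f(x, t) * dt + g(x, t) * math.sqrt(dt) * z


def langevin_step(x, alpha, score, z):
    # Eq. 4
    return x + alpha * score(x) + math.sqrt(2 * alpha) * z


MU = 2.0
alpha = 0.05


def score(x):
    # Toy score of N(MU, 1); stands in for the trained score model
    return -(x - MU)


def drift(x, t):
    return alpha * score(x)


def diffusion(x, t):
    return math.sqrt(2 * alpha)


x, z = -1.0, 0.3  # same state and same noise draw for both updates
sde_next = euler_maruyama_step(x, 0, drift, diffusion, 1.0, z)
langevin_next = langevin_step(x, alpha, score, z)
```

<p>With \(\Delta t = 1\) and a shared noise draw, both updates produce the same next state.</p>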
<hr />
<p>In a future article, we will explore Diffusion Models along with their connection to SDEs and how we can utilize it to create better generative models.</p>Ayan Dasanyx: Build vector animations from programmatic descriptions2021-05-01T00:00:00+00:002021-05-01T00:00:00+00:00https://ayandas.me/projs/2021/05/01/anyx<p>Project <code class="language-plaintext highlighter-rouge">anyx</code> (pronounced as “anix”) is a Python library designed for producing high-quality (vector) graphics animations with ease. 
Although <code class="language-plaintext highlighter-rouge">anyx</code> is built with no assumption about its downstream area of application, it is mostly targeted towards the scientific community for creating beautiful scientific/technical illustrations. <code class="language-plaintext highlighter-rouge">anyx</code> is created as a programmatic alternative to heavyweight and (sometimes) proprietary graphical software. Unlike low-level libraries like <a href="https://www.pygame.org/news">pygame</a>, <code class="language-plaintext highlighter-rouge">anyx</code> allows users to simply write a <em>description</em> of a target scene and compile it down to the required modality (Video, Animated GIFs etc). The development of <code class="language-plaintext highlighter-rouge">anyx</code> is motivated largely by a similar project called <a href="https://github.com/3b1b/manim">manim</a>.</p>
<p><a href="https://ayandas.me/anyx" target="_blank" class="fa fa-github fa-3x" style="float: left; margin-right: 20px;"></a></p>
<h2 id="i-work-on-this-project-only-in-my-spare-time-and-its-not-done-yet-read-a-brief-description-by-clicking-on-the-github-icon">I work on this project only in my spare time and it’s not done yet. Read a brief description by clicking on the GitHub icon.</h2>Ayan DasCloud2Curve: Generation and Vectorization of Parametric Sketches2021-03-01T00:00:00+00:002021-03-01T00:00:00+00:00https://ayandas.me/pubs/2021/03/01/pub-9<center>
<a target="_blank" class="pubicon" href="https://arxiv.org/pdf/2103.15536.pdf">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/9.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">Analysis of human sketches in deep learning has advanced immensely through the use of waypoint-sequences rather than raster-graphic representations. We further aim to model sketches as a sequence of low-dimensional parametric curves. To this end, we propose an inverse graphics framework capable of approximating a raster or waypoint based stroke encoded as a point-cloud with a variable-degree Bézier curve. Building on this module, we present Cloud2Curve, a generative model for scalable high-resolution vector sketches that can be trained end-to-end using point-cloud data alone. As a consequence, our model is also capable of deterministic vectorization which can map novel raster or waypoint based sketches to their corresponding high-resolution scalable Bézier equivalent. We evaluate the generation and vectorization capabilities of our model on Quick, Draw! and K-MNIST datasets.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
<center>
<!-- Visit this to create the oEmbed link -->
<!-- https://iframely.com/domains/speakerdeck -->
<h2>Slides for my CVPR '21 talk</h2>
<div style="left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.1972%;">
<iframe src="https://speakerdeck.com/player/ff2a87e58efe4d72a32f008e53826776" style="top: 0; left: 5%; width: 90%; height: 90%; position: absolute; border: 0;" allowfullscreen="" scrolling="no" allow="encrypted-media">
</iframe>
</div>
<p>PS: Reusing any of these slides would require permission from the author.</p>
</center>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<h2>Full talk at CVPR 2021</h2>
<iframe width="800" height="450" src="https://www.youtube-nocookie.com/embed/H8-ejwYk7PY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{das2021cloud2curve,
title={Cloud2Curve: Generation and Vectorization of Parametric Sketches},
author={Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
year={2021},
eprint={2103.15536},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
</code></pre></div></div>Ayan DasPaperDifferentiable Programming: Computing source-code derivatives2020-09-08T00:00:00+00:002020-09-08T00:00:00+00:00https://ayandas.me/blog-tut/2020/09/08/differentiable-programming<p>If you are following the recent developments in the field of Deep Learning, you might recognize this new buzz-word, “Differentiable Programming”, which has been doing the rounds on social media (mentioned by prominent researchers like <a href="https://www.facebook.com/yann.lecun/posts/10155003011462143">Yann LeCun</a>, <a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">Andrej Karpathy</a>) for a year or two. Differentiable Programming (let’s shorten it as “DiffProg” for the rest of this article) is essentially a system proposed as an alternative to tape-based Backpropagation, which runs a <em>recorder</em> (often called “Tape”) that builds a computation graph <em>at runtime</em> and propagates the error signal from the end towards the leaf-nodes (typically weights and biases). DiffProg is very different from an <em>implementation perspective</em> - it doesn’t really “propagate” anything. It consumes a “program” in the form of <em>source code</em> and produces the “Derivative program” (also source code) w.r.t its inputs without <em>ever actually running</em> either of them. Additionally, DiffProg allows users the flexibility to write <em>arbitrary programs</em> without constraining them to any “guidelines”.
In this article, I will describe the difference between the two methods in theoretical as well as practical terms. We’ll look into one successful DiffProg implementation (named “<a href="https://fluxml.ai/Zygote.jl/latest/">Zygote</a>”, written in <a href="https://julialang.org/">Julia</a>) gaining popularity in the Deep Learning community.</p>
<h2 id="why-need-derivatives-in-dl-">Why need Derivatives in DL ?</h2>
<p>This is easy to answer but just for the sake of completeness - we are interested in computing derivatives of a function because they are required by the update rule of Gradient Descent (or any of its successors):</p>
<p>\[
\Theta := \Theta - \alpha \frac{\partial F(\Theta; \mathcal{D})}{\partial \Theta}
\]</p>
<p>Where \(\Theta\) is the set of all parameters, \(\mathcal{D}\) is the data and \(F(\Theta)\) is the function (typically loss) we want to differentiate. Our ultimate goal is to compute \(\displaystyle{ \frac{\partial F(\Theta; \mathcal{D})}{\partial \Theta} }\) given the <em>structural form</em> of \(F\). The standard way of doing this is to use “Automatic Differentiation” (AutoDiff or AD), or rather, a special case of it called “Backpropagation”. It is called Backpropagation only when the function \(F(\cdot)\) is scalar, which is mostly true in cases we care about.</p>
<h2 id="pullback-functions--backpropagation">“Pullback” functions & Backpropagation</h2>
<p>We will now see how gradients of a complex function (given its full specification) can be computed as a sequence of primitive operations. Let’s explain this with an example for simplicity: We have two inputs \(a, b\) (just symbols) and a description of the <em>scalar</em> function we want to differentiate:</p>
<p>\[
\displaystyle{F(a, b) = \frac{a}{1+b^2}}
\]</p>
<p>We can think of \(F(a, b)\) as a series of smaller computations with intermediate results, like this</p>
\[\begin{align}
y_1 &← pow(b, 2) \\
y_2 &← add(1, y_1) \\
y_3 &← div(a, y_2)
\end{align}\]
<p>I changed the pure math notations to more programmatic ones; but the meaning remains same. In order to compute gradients, we <em>augment</em> these computations and create something called a “pullback” function as an additional by-product.</p>
<p>Mathematically, the actual computation and pullback creation can be written together symbolically as:</p>
\[\tag{1}
\begin{align}
y_1, \mathcal{B}_1 &← \mathcal{J}(pow, b, 2) \\
y_2, \mathcal{B}_2 &← \mathcal{J}(add, 1, y_1) \\
y_3, \mathcal{B}_3 &← \mathcal{J}(div, a, y_2)
\end{align}\]
<p>You can think of the <em>functional</em> \(\mathcal{J}\) as the internals of the Backpropagation framework which augments every computation unit to produce an extra entity. A pullback function (\(\mathcal{B}_i\)) is a function that takes as input the gradient w.r.t the output of the corresponding function and returns the gradient w.r.t the inputs of that function:</p>
<p>\[
\mathcal{B}_i : \overline{y}_i \rightarrow \overline{input_1}, \overline{input_2}, \cdots
\]</p>
<p>It is really nothing but a different view of the chain-rule of differentiation:</p>
\[\begin{align}
\frac{\partial F}{\partial b} &\leftarrow \mathcal{B}_1(\frac{\partial F}{\partial y_1}) \triangleq \frac{\partial F}{\partial y_1} \cdot \frac{\partial y_1}{\partial b} \\
\overline{b} &\leftarrow \mathcal{B}_1( \overline{y}_1 ) \triangleq \overline{y}_1 \cdot \frac{\partial y_1}{\partial b}\left[ \text{Denoting } \frac{\partial F}{\partial x}\text{ as }\overline{x} \right]
\end{align}\]
<p>We must also realize that computing \(\mathcal{B}_i\) may require values from the forward pass. For example, computing \(\overline{b}\) may need evaluating \(\displaystyle{ \frac{\partial y_1}{\partial b} }\) at the given value of \(b\). After getting access to \(\mathcal{B}_i\), we can compute the derivatives of \(F\) w.r.t \(a, b\) by invoking the pullback functions in proper (reverse) order</p>
\[\begin{align}
\overline{a}, \overline{y_2} &\leftarrow \mathcal{B}_3(\overline{y}_3) \\
\overline{y_1} &\leftarrow \mathcal{B}_2(\overline{y}_2) \\
\overline{b} &\leftarrow \mathcal{B}_1(\overline{y}_1)
\end{align}\]
<p>Please note that \(y_3\) is actually \(F\) and hence \(\overline{y}_3 ≜ \displaystyle{ \frac{\partial F}{\partial y_3} = 1 }\).</p>
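<p>The whole procedure for \(F(a, b) = a / (1 + b^2)\) fits in a few lines of plain Python. In this sketch, the hypothetical function <code>J</code> plays the role of \(\mathcal{J}\): it runs a primitive and returns its output together with a pullback closure. The string-based dispatch and the use of <code>None</code> for non-differentiated constants are choices made for illustration, not how any real framework does it.</p>

```python
def J(op, x, y):
    """Run a primitive op on (x, y); return (output, pullback)."""
    if op == "pow":
        # The exponent y is treated as a constant, so it receives no gradient
        return x ** y, lambda out_bar: (out_bar * y * x ** (y - 1), None)
    if op == "add":
        return x + y, lambda out_bar: (out_bar, out_bar)
    if op == "div":
        return x / y, lambda out_bar: (out_bar / y, -out_bar * x / y ** 2)
    raise ValueError("unknown primitive: " + op)


def grad_F(a, b):
    # Forward pass (Eq. 1): compute values and collect the pullbacks
    y1, B1 = J("pow", b, 2)
    y2, B2 = J("add", 1, y1)
    y3, B3 = J("div", a, y2)

    # Backward pass: invoke the pullbacks in reverse order
    y3_bar = 1.0                # dF/dy3, since y3 is F itself
    a_bar, y2_bar = B3(y3_bar)
    _, y1_bar = B2(y2_bar)      # the constant 1 needs no gradient
    b_bar, _ = B1(y1_bar)
    return y3, a_bar, b_bar


value, a_bar, b_bar = grad_F(3.0, 2.0)
```

<p>Notice that each pullback closes over values from the forward pass (e.g. <code>x</code> and <code>y</code>), which is exactly why the forward computation has to happen first. The results agree with the analytic derivatives \(1/(1+b^2)\) and \(-2ab/(1+b^2)^2\).</p>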
<h2 id="1-tape-based-implementation">1. Tape-based implementation</h2>
<p>There are a couple of different ways of implementing the theory described above. The de-facto way of doing it (as of this day) is something known as a “tape-based” implementation. <code class="language-plaintext highlighter-rouge">PyTorch</code> and <code class="language-plaintext highlighter-rouge">Tensorflow Eager Execution</code> are probably the most popular examples of this type.</p>
<p>In tape-based systems, the function \(F(..)\) is specified by its full structural form. Moreover, it requires <em>runtime execution</em> in order to compute anything (be it output or derivatives). Such system keeps track of every computation via a recorder or “tape” (that’s why the name) and builds an internal computation graph. Later, when requested, the tape stops recording and works its way backwards through the recorded tape to compute derivatives.</p>
<h3 id="the-specification-of-ftheta">The specification of \(F(\Theta)\)</h3>
<p>A tape-based system requires users to provide the function \(F\) as a description of its computations following certain guidelines. These guidelines are provided by the specific AutoDiff framework we use. Take <code class="language-plaintext highlighter-rouge">PyTorch</code> for example - we write the series of computations using the API provided by <code class="language-plaintext highlighter-rouge">PyTorch</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>

<span class="k">class</span> <span class="nc">Network</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>  <span class="c1"># required when subclassing torch.nn.Module</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">b0</span> <span class="o">=</span> <span class="p">...</span>  <span class="c1"># the parameter b</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">a</span><span class="p">):</span>
        <span class="n">y1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">pow</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">b0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>  <span class="c1"># y1 = b^2</span>
        <span class="n">y2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">y1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>  <span class="c1"># y2 = 1 + y1 (torch.sum would reduce a tensor instead)</span>
        <span class="n">y3</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">div</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y2</span><span class="p">)</span>  <span class="c1"># y3 = a / y2</span>
        <span class="k">return</span> <span class="n">y3</span>
</code></pre></div></div>
<p>Think of the framework as an entity which is solely responsible for doing all the derivative computations. You cannot carelessly use <code class="language-plaintext highlighter-rouge">math.pow()</code> (or anything else) instead of <code class="language-plaintext highlighter-rouge">torch.pow()</code>, or omit the base class <code class="language-plaintext highlighter-rouge">torch.nn.Module</code>. You have to stick to the guidelines <code class="language-plaintext highlighter-rouge">PyTorch</code> laid out to be able to make use of it. When done with the definition, we can run the forward and backward passes using actual data \((a_0, b_0)\):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>model = Network(...)
F = model(a0)
F.backward()
# 'model.b0.grad' & 'a0.grad' available
</code></pre></div></div>
<p>This will cause the framework to trigger the following sequence of computations one after another</p>
\[\tag{2}
\begin{align}
y_1, \mathcal{B}_1 &\leftarrow \mathcal{J}(\mathrm{torch.pow}, \mathbf{b_0}, 2) \\
y_2, \mathcal{B}_2 &\leftarrow \mathcal{J}(\mathrm{torch.add}, y_1, 1) \\
y_3, \mathcal{B}_3 &\leftarrow \mathcal{J}(\mathrm{torch.div}, \mathbf{a_0}, y_2) \\
\left[ \overline{a}\right]_{a=\mathbf{a_0}}, \overline{y}_2 &\leftarrow \mathcal{B}_3(1) \\
\overline{y}_1 &\leftarrow \mathcal{B}_2(\overline{y}_2) \\
\left[ \overline{b}\right]_{b=\mathbf{b_0}} &\leftarrow \mathcal{B}_1(\overline{y}_1)
\end{align}\]
<p>The first three and the last three lines of computation are the “forward pass” and the “backward pass” of the model, respectively. Frameworks like <code class="language-plaintext highlighter-rouge">PyTorch</code> and <code class="language-plaintext highlighter-rouge">Tensorflow</code> typically work in this way when <code class="language-plaintext highlighter-rouge">.forward()</code> and <code class="language-plaintext highlighter-rouge">.backward()</code> calls are made in succession. Note that since we are explicitly executing a forward pass, the framework caches the values required for executing the pullbacks in the backward pass. An overall diagram is shown below for clarification.</p>
<center>
<figure>
<img width="50%" style="padding-top: 20px;" src="/public/posts_res/18/tape_based.png" />
<figcaption>Fig.1: Overall pipeline of tape-based backpropagation. Green arrows indicate pullback creation by the framework and magenta arrows denote the runtime execution flow. </figcaption>
</figure>
</center>
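<p>To make the “tape” idea concrete, here is a deliberately tiny tape-based differentiator in pure Python. It is a sketch under simplifying assumptions (scalars only, three primitives), and the names <code class="language-plaintext highlighter-rouge">Tape</code>, <code class="language-plaintext highlighter-rouge">Var</code>, <code class="language-plaintext highlighter-rouge">t_pow</code>, <code class="language-plaintext highlighter-rouge">t_add_const</code> and <code class="language-plaintext highlighter-rouge">t_div</code> are hypothetical; this is not how <code class="language-plaintext highlighter-rouge">PyTorch</code> is actually implemented.</p>

```python
class Var:
    """A scalar value that can accumulate a gradient."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

class Tape:
    """Records every executed operation as (output, inputs, pullback)."""
    def __init__(self):
        self.records = []

def record(tape, value, inputs, pullback):
    out = Var(value)
    tape.records.append((out, inputs, pullback))
    return out

def t_pow(tape, x, n):
    return record(tape, x.value ** n, (x,),
                  lambda g: (g * n * x.value ** (n - 1),))

def t_add_const(tape, c, x):
    return record(tape, c + x.value, (x,), lambda g: (g,))

def t_div(tape, a, x):
    return record(tape, a.value / x.value, (a, x),
                  lambda g: (g / x.value, -g * a.value / x.value ** 2))

def backward(tape, out):
    """Walk the tape in reverse, feeding each pullback its output gradient."""
    out.grad = 1.0
    for y, inputs, pullback in reversed(tape.records):
        for inp, g in zip(inputs, pullback(y.grad)):
            inp.grad += g

# The operations are only discovered while they execute (runtime recording).
tape = Tape()
a0, b0 = Var(1.0), Var(2.0)
y1 = t_pow(tape, b0, 2)
y2 = t_add_const(tape, 1.0, y1)
F = t_div(tape, a0, y2)
backward(tape, F)
print(F.value, a0.grad, b0.grad)
```

<p>The important design point is that the tape only ever contains what has already been executed, so nothing can be planned or optimized ahead of time.</p>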
<h3 id="whats-the-problem-">What’s the problem ?</h3>
<p>As of now, it might not seem like much of a problem to a regular PyTorch user (me included). The problem intensifies when you have a non-ML code base with a complicated physics model (for example) like this</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">math</span>
<span class="kn">from</span> <span class="nn">other_part_of_my_model</span> <span class="kn">import</span> <span class="n">sub_part</span>
<span class="k">def</span> <span class="nf">helper_function</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">if</span> <span class="n">something</span><span class="p">:</span>
<span class="k">return</span> <span class="n">helper_function</span><span class="p">(</span><span class="n">sub_part</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="c1"># recursive call
</span> <span class="p">...</span>
<span class="k">def</span> <span class="nf">complex_physics_model</span><span class="p">(</span><span class="n">parameters</span><span class="p">,</span> <span class="nb">input</span><span class="p">):</span>
<span class="n">math</span><span class="p">.</span><span class="n">do_awesome_thing</span><span class="p">(</span><span class="n">parameters</span><span class="p">,</span> <span class="n">helper_function</span><span class="p">(</span><span class="nb">input</span><span class="p">))</span>
<span class="p">...</span>
<span class="k">return</span> <span class="n">output</span>
</code></pre></div></div>
<p>.. and you want to use it within your <code class="language-plaintext highlighter-rouge">PyTorch</code> model and differentiate it. There is no easy way to do this without first spending time to <code class="language-plaintext highlighter-rouge">PyTorch</code>-ify it.</p>
<p>There is another serious problem with this approach: the framework cannot “<em>see</em>” any computation ahead of time. For example, when the execution thread reaches the addition step of the model above, it has no idea that it is about to encounter <code class="language-plaintext highlighter-rouge">torch.div()</code>. This matters because the framework has no way of optimizing the computation - it <em>has to</em> execute the exact sequence of computations verbatim. For example, if the function description is given as \(\displaystyle{ F(a, b) = \frac{(a + ab)}{(1 + b)} }\), which simplifies to \(F(a, b) = a\) for \(b \neq -1\), this type of framework will waste its resources executing several operations (both in the forward and backward direction) that ultimately yield something trivial.</p>
<h2 id="2-differentiable-programming">2. Differentiable Programming</h2>
<p>Differentiable Programming (DiffProg) offers a very elegant solution to both the problems described in the previous section. <strong>DiffProg allows you to write arbitrary code <em>without following any guidelines</em> and still be able to differentiate it.</strong> At present, the majority of successful DiffProg systems use something called “<em>source code transformation</em>” to achieve this.</p>
<p>Source code transformation is a technique used extensively in the field of compiler design. It takes a piece of code written in some high-level language (like C++, Python etc.) and emits a <em>compiled</em> version of it, typically in a relatively lower-level language (like Assembly, bytecode, IRs etc.). Specifically, the input to a DiffProg system is a description of \(y ← F(\Theta)\) as <em>source code</em> written in some language with defined input/output. The output of the system is the source code of the derivative of \(F(\Theta)\) w.r.t. its inputs (i.e., \(\overline{\Theta} ← F'(\overline{y})\)). The input program has full liberty to use the native primitives of the programming language like built-in functions, conditional statements, recursion, <code class="language-plaintext highlighter-rouge">struct</code>-like facilities, memory read/write constructs and pretty much anything that the language offers.</p>
<p>Using our generic notation, we can write down such a system as</p>
\[y, \mathcal{B} \leftarrow \mathcal{J}(F, \Theta)\]
<p>where \(F\) and \(\mathcal{B}:\overline{y}\rightarrow \overline{\Theta}\) are the given function and its derivative function in the form of <em>source code</em> (bear with me if it doesn’t make sense at this point). Just like before, the <em>source code</em> for the pullback \(\mathcal{B}\) may require some intermediate variables from that of \(y\). For a concrete example, the following would be a (hypothetical) valid DiffProg system:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="c1"># following string contains the source code of F(.)
</span><span class="o">>>></span> <span class="n">input_prog</span> <span class="o">=</span> <span class="s">"""
def F(a, b):
y1 = b ** 2
y2 = 1 + y1
return a / y2
"""</span>
<span class="o">>>></span> <span class="n">y</span><span class="p">,</span> <span class="n">B</span> <span class="o">=</span> <span class="n">diff_prog</span><span class="p">(</span><span class="n">input_prog</span><span class="p">,</span> <span class="n">a</span><span class="o">=</span><span class="mf">1.</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="mf">2.</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="mf">0.2</span>
<span class="o">>>></span> <span class="k">exec</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="c1"># get the derivative function as a live object in current session
</span><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">dF</span><span class="p">(</span><span class="mf">1.</span><span class="p">))</span> <span class="c1"># 'dF()' is defined in source code 'B'
</span><span class="p">(</span><span class="mf">0.2</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.16</span><span class="p">)</span>
</code></pre></div></div>
<p>Please pay attention to the fact that both our problems discussed in tape-based system are effectively solved now:</p>
<ol>
<li>
<p>We no longer need to be under the umbrella of a framework as we can directly work with native code. In the above example, the source code of the given function is simply written in native Python. The example shows the overall pullback source code (i.e., <code class="language-plaintext highlighter-rouge">B</code>) and also its explicitly compiled form (i.e., <code class="language-plaintext highlighter-rouge">dF</code>). Optionally, a DiffProg system can produce a readily compiled derivative function.</p>
</li>
<li>
<p>The DiffProg system can “see” the whole source code at once, giving it the opportunity to run various optimizations. As a result, both the given program and the derivative program could be much faster than with the standard tape-based approaches.</p>
</li>
</ol>
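<p>For straight-line code, the hypothetical <code class="language-plaintext highlighter-rouge">diff_prog</code> above can be approximated in a few dozen lines with Python’s standard <code class="language-plaintext highlighter-rouge">ast</code> module (Python 3.9+ for <code class="language-plaintext highlighter-rouge">ast.unparse</code>). The sketch below is only an illustration under strong assumptions: every statement is a single binary operation, every variable is assigned once, and exponents are constants. The name <code class="language-plaintext highlighter-rouge">toy_diff_prog</code> and its signature are made up for this example and differ slightly from the hypothetical API shown earlier (it returns source code for <code class="language-plaintext highlighter-rouge">dF(a, b, y_bar)</code>).</p>

```python
import ast

# Adjoint templates per operator: contribution of the output gradient {g}
# to the left ({l}) and right ({r}) operand. None means "not supported".
RULES = {
    ast.Add:  ("{g}", "{g}"),
    ast.Sub:  ("{g}", "-{g}"),
    ast.Mult: ("{g} * {r}", "{g} * {l}"),
    ast.Div:  ("{g} / {r}", "-{g} * {l} / {r} ** 2"),
    ast.Pow:  ("{g} * {r} * {l} ** ({r} - 1)", None),  # constant exponent only
}

def toy_diff_prog(src):
    """Return the source of d<name>(args..., y_bar) for straight-line code."""
    fn = ast.parse(src).body[0]
    params = [p.arg for p in fn.args.args]
    names, forward, backward = list(params), [], []
    for stmt in fn.body:
        is_return = isinstance(stmt, ast.Return)
        target = "_out" if is_return else stmt.targets[0].id
        expr = stmt.value  # assumed to be a single binary operation
        l, r = ast.unparse(expr.left), ast.unparse(expr.right)
        forward.append(f"{target} = {ast.unparse(expr)}")
        if not is_return:
            names.append(target)
        adjoint = []
        for operand, rule in zip((expr.left, expr.right), RULES[type(expr.op)]):
            if isinstance(operand, ast.Name) and rule is not None:
                grad = rule.format(g=f"{target}_bar", l=l, r=r)
                adjoint.append(f"{operand.id}_bar += {grad}")
        backward = adjoint + backward  # pullbacks run in reverse order
    lines = [f"def d{fn.name}({', '.join(params)}, y_bar=1.0):"]
    lines += forward + [f"{n}_bar = 0.0" for n in names] + ["_out_bar = y_bar"]
    lines += backward + [f"return {', '.join(p + '_bar' for p in params)}"]
    return lines[0] + "\n" + "\n".join("    " + s for s in lines[1:])

src = """
def F(a, b):
    y1 = b ** 2
    y2 = 1 + y1
    return a / y2
"""
derivative_src = toy_diff_prog(src)
print(derivative_src)          # the derivative program, as source code

namespace = {}
exec(derivative_src, namespace)  # turn the source into a live function
grads = namespace["dF"](1.0, 2.0)
print(grads)
```

<p>Because the derivative is produced as source code before anything runs, a real system is free to analyse and optimize it as a whole, which a tape cannot do.</p>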
<p>Although I showed the examples in Python for ease of understanding, it doesn’t really have to be Python. The theory of DiffProg is very general and can be adapted to any language. In fact, Python is NOT the language of choice for some of the first successful DiffProg systems. The one we are going to talk about is written in a relatively new language called <a href="http://julialang.org/">Julia</a>. The Julia language and its compiler provide excellent support for meta-programming, i.e. manipulating/analysing/constructing Julia programs within itself. This allows Julia to offer a DiffProg system that is much more flexible than naively parsing strings (like the toy example shown above). We will look into the specifics of the Julia language and its DiffProg framework called “<a href="https://fluxml.ai/Zygote.jl/latest/">Zygote</a>” later in this article. But before that, we will look at a few details about the general compiler machinery that is required to implement DiffProg systems. Since this article is mostly targeted towards people from an ML/DL background, I will try my best to be reasonable about the details of compiler design.</p>
<h3 id="static-single-assignment-ssa-form">Static Single Assignment (SSA) form</h3>
<p>A compiler (or compiler-like system) analyses a given source code by parsing it as a string. Then, it creates a large and complex data structure (known as the Abstract Syntax Tree, or AST) containing control flow, conditionals and every fundamental language construct. Such a structure is further compiled down to a relatively low-level representation comprising the core flow of the source program. This low-level code is known as the “Intermediate Representation (IR)”.
One of its fundamental purposes is to replace every variable name with a unique ID. Given a source code like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function F(a, b)
    y1 = b ^ 2
    y1 = 1 + y1
    return a / y1
end
</code></pre></div></div>
<p>a compiler can turn it into a (hypothetical) IR like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function F(%1, %2)
%3 = %2 ^ 2
%3 = 1 + %3
return %1 / %3
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">%N</code> is a unique placeholder for a variable. However, this particular form is a little inconvenient to analyse in practice due to the possibility of a symbol redefinition (e.g. the variable <code class="language-plaintext highlighter-rouge">y1</code> in the above example). Modern compilers (including Julia’s) use a slightly improved IR, called “<em>SSA (Static Single Assignment) form IR</em>”, which assigns each variable only once and often introduces extra unique symbols in order to achieve that.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function F(%1, %2)
%3 = %2 ^ 2
%4 = 1 + %3
return %1 / %4
</code></pre></div></div>
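<p>The renaming that produces the SSA form above can be sketched in a few lines of Python. This is a toy under simplifying assumptions (straight-line code, one binary operation per statement, no control flow); the function <code class="language-plaintext highlighter-rouge">to_ssa</code> and its tuple-based input format are hypothetical and only meant to illustrate the idea.</p>

```python
def to_ssa(params, body):
    """Rename variables so that every assignment gets a fresh %N id.

    body is a list of (target, lhs, op, rhs) tuples; target is None
    for the final return statement.
    """
    env = {}       # source-level name -> current SSA id
    counter = 0

    def fresh():
        nonlocal counter
        counter += 1
        return f"%{counter}"

    for p in params:
        env[p] = fresh()

    ir = []
    for target, lhs, op, rhs in body:
        # Operands are resolved BEFORE the target is rebound, so a
        # statement like y1 = 1 + y1 reads the old id and writes a new one.
        l = env.get(lhs, lhs)   # literals are not in env and pass through
        r = env.get(rhs, rhs)
        if target is None:
            ir.append(f"return {l} {op} {r}")
        else:
            env[target] = fresh()
            ir.append(f"{env[target]} = {l} {op} {r}")
    return ir

ir = to_ssa(["a", "b"], [
    ("y1", "b", "^", "2"),
    ("y1", "1", "+", "y1"),   # re-assignment of y1 in the source
    (None, "a", "/", "y1"),
])
print("\n".join(ir))
```

<p>Running it on the example reproduces the three IR statements shown above, with the second assignment to <code class="language-plaintext highlighter-rouge">y1</code> receiving its own fresh ID.</p>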
<p>Please note how the SSA form used an extra unique ID (i.e. <code class="language-plaintext highlighter-rouge">%4</code>) in order to avoid re-assignment (of <code class="language-plaintext highlighter-rouge">%3</code>).
It has been shown that such SSA-form IR (rather than direct source code) can be differentiated, and a corresponding “Derivative IR” can be retrieved. The obvious way of crafting the derivative IR of \(F\) is to use the derivative IRs of its constituent operations, similar to what is done in the tape-based method. The biggest difference is that everything is now in terms of source code (or rather IR, to be precise). The compiler could craft the derivative program like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function dF(%1, %2)
# IR for forward pass
%3, B1 = J(pow, %2, 2)
%4, B2 = J(add, 1, %3)
_, B3 = J(div, %1, %4)
# IR for backward pass
%5, %6 = B3(1)
%7 = B2(%6)
%8 = B1(%7)
return %5, %8
</code></pre></div></div>
<p>The structure of the above code may resemble the sequence of computations in Eq.2, but it’s very different in terms of implementation (refer to Fig.2 below). The code (IR) is constructed at compile time by a compiler-like framework (the DiffProg system). The derivative IR is then passed on to an IR optimizer, which can squeeze its guts by applying various optimizations like dead-code elimination, operation fusion and more advanced ones. Finally, the optimized IR is compiled down to machine code.</p>
<center>
<figure>
<img width="90%" style="padding-top: 20px;" src="/public/posts_res/18/diff_prog.png" />
<figcaption>Fig.2: Overall pipeline of a DiffProg system with source-code transformation. Green arrows indicate creation of pullback codes by the framework and magenta arrows denote composition of source code blocks. After compiler optimization, the whole source code is squeezed into highly efficient object code. </figcaption>
</figure>
</center>
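<p>To get a feel for what such optimizations buy us, compare a verbatim derivative program with a hand-simplified version of it. Both functions below are written by hand for illustration, assuming the running example \(F(a, b) = a / (1 + b^2)\); the second one is merely what an optimizer <em>might plausibly</em> reduce the first one to, not the output of any actual compiler.</p>

```python
def dF_verbatim(a, b):
    # forward pass, one primitive at a time (the final division is
    # skipped because only the derivatives are returned: dead code)
    y1 = b ** 2
    y2 = 1.0 + y1
    # backward pass, one pullback at a time
    y3_bar = 1.0
    a_bar = y3_bar / y2
    y2_bar = -y3_bar * a / y2 ** 2
    y1_bar = y2_bar
    b_bar = y1_bar * 2.0 * b
    return a_bar, b_bar

def dF_optimized(a, b):
    # constants folded, trivial copies removed, common subexpression reused
    inv = 1.0 / (1.0 + b * b)
    return inv, -2.0 * a * b * inv * inv

print(dF_verbatim(1.0, 2.0))
print(dF_optimized(1.0, 2.0))
```

<p>Both versions compute the same gradient, but the second needs fewer operations and no intermediate bookkeeping, which is the kind of saving a whole-program view makes possible.</p>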
<h2 id="zygote-a-diffprog-framework-for-julia"><code class="language-plaintext highlighter-rouge">Zygote</code>: A DiffProg framework for Julia</h2>
<p>Julia is a particularly interesting language when it comes to implementing a DiffProg framework. There are solid reasons why <code class="language-plaintext highlighter-rouge">Zygote</code>, one of the most successful DiffProg frameworks, is written in Julia. I will try to articulate a few of them below:</p>
<ol>
<li>
<p><strong>Just-In-Time (JIT) compiler:</strong> Julia’s efficient Just-in-time (JIT) compiler compiles each function the first time it is called with a given combination of argument types and then runs the resulting native code, striking a balance between interpreted and compiled languages.</p>
</li>
<li>
<p><strong>Type specialization:</strong> Julia allows writing generic/optional/loosely-typed functions that can later be instantiated using concrete types. High-density computations specifically benefit from it by casting every computation in terms of <code class="language-plaintext highlighter-rouge">Float32/Float64</code> which can in turn produce specialized instructions (e.g. <code class="language-plaintext highlighter-rouge">AVX</code>, <code class="language-plaintext highlighter-rouge">MMX</code>, <code class="language-plaintext highlighter-rouge">AVX2</code>) for modern CPUs.</p>
</li>
<li>
<p><strong>Pre-compilation:</strong> The peculiar feature that benefits <code class="language-plaintext highlighter-rouge">Zygote</code> the most is that Julia keeps track of the compilations that have already been done in the current session and does NOT repeat them. Since DL/ML is all about computing gradients over and over again, <code class="language-plaintext highlighter-rouge">Zygote</code> compiles and optimizes the derivative program (IR) just once and runs it repeatedly (which is blazingly fast) with different values of the parameters.</p>
</li>
<li>
<p><strong>LLVM IR:</strong> Julia uses <a href="https://llvm.org/">LLVM compiler infrastructure</a> as its backend and hence emits the LLVM IR known to be highly performant and used by many other prominent languages.</p>
</li>
</ol>
<p>Now, let’s see <code class="language-plaintext highlighter-rouge">Zygote</code>’s primary API, which is surprisingly simple. The central API of <code class="language-plaintext highlighter-rouge">Zygote</code> is the function <code class="language-plaintext highlighter-rouge">Zygote.gradient(..)</code>, whose first argument is the function to be differentiated (written in native Julia), followed by the argument(s) at which the gradient is to be computed.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="k">using</span> <span class="n">Zygote</span>
<span class="n">julia</span><span class="o">></span> <span class="k">function</span><span class="nf"> F</span><span class="x">(</span><span class="n">x</span><span class="x">)</span>
<span class="k">return</span> <span class="mi">3</span><span class="n">x</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">2</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="n">julia</span><span class="o">></span> <span class="n">gradient</span><span class="x">(</span><span class="n">F</span><span class="x">,</span> <span class="mi">5</span><span class="x">)</span>
<span class="x">(</span><span class="mi">32</span><span class="x">,)</span>
</code></pre></div></div>
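<p>That is basically computing \(\displaystyle{ \left[ \frac{\partial F}{\partial x} \right]_{x=5} }\) for \(F(x) = 3x^2 + 2x + 1\). Indeed, \(\displaystyle{ \frac{\partial F}{\partial x} = 6x + 2 }\), which evaluates to \(32\) at \(x = 5\).</p>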
<p>For debugging purposes, we can see the <em>actual</em> SSA-form IR for a given function and its pullback (note that <code class="language-plaintext highlighter-rouge">Zygote.@code_ir</code> prints Julia-level IR rather than LLVM IR). The actual IR is a bit more complex than what I showed but similar in high-level structure. We can peek into the IR of the above function</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">Zygote</span><span class="o">.</span><span class="nd">@code_ir</span> <span class="n">F</span><span class="x">(</span><span class="mi">5</span><span class="x">)</span>
<span class="mi">1</span><span class="o">:</span> <span class="x">(</span><span class="o">%</span><span class="mi">1</span><span class="x">,</span> <span class="o">%</span><span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">3</span> <span class="o">=</span> <span class="n">Core</span><span class="o">.</span><span class="n">apply_type</span><span class="x">(</span><span class="n">Base</span><span class="o">.</span><span class="kt">Val</span><span class="x">,</span> <span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">4</span> <span class="o">=</span> <span class="x">(</span><span class="o">%</span><span class="mi">3</span><span class="x">)()</span>
<span class="o">%</span><span class="mi">5</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">literal_pow</span><span class="x">(</span><span class="n">Main</span><span class="o">.:^</span><span class="x">,</span> <span class="o">%</span><span class="mi">2</span><span class="x">,</span> <span class="o">%</span><span class="mi">4</span><span class="x">)</span>
<span class="o">%</span><span class="mi">6</span> <span class="o">=</span> <span class="mi">3</span> <span class="o">*</span> <span class="o">%</span><span class="mi">5</span>
<span class="o">%</span><span class="mi">7</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="o">%</span><span class="mi">2</span>
<span class="o">%</span><span class="mi">8</span> <span class="o">=</span> <span class="o">%</span><span class="mi">6</span> <span class="o">+</span> <span class="o">%</span><span class="mi">7</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="o">%</span><span class="mi">8</span>
</code></pre></div></div>
<p>.. and also its “Adjoint”. The adjoint in Zygote is basically the mathematical functional \(\mathcal{J}(\cdot)\) that we’ve been seeing all along.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">Zygote</span><span class="o">.</span><span class="nd">@code_adjoint</span> <span class="n">F</span><span class="x">(</span><span class="mi">5</span><span class="x">)</span>
<span class="n">Zygote</span><span class="o">.</span><span class="kt">Adjoint</span><span class="x">(</span><span class="mi">1</span><span class="o">:</span> <span class="x">(</span><span class="o">%</span><span class="mi">3</span><span class="x">,</span> <span class="o">%</span><span class="mi">4</span> <span class="o">::</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">Context</span><span class="x">,</span> <span class="o">%</span><span class="mi">1</span><span class="x">,</span> <span class="o">%</span><span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">5</span> <span class="o">=</span> <span class="n">Core</span><span class="o">.</span><span class="n">apply_type</span><span class="x">(</span><span class="n">Base</span><span class="o">.</span><span class="kt">Val</span><span class="x">,</span> <span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">6</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">_pullback</span><span class="x">(</span><span class="o">%</span><span class="mi">4</span><span class="x">,</span> <span class="o">%</span><span class="mi">5</span><span class="x">)</span>
<span class="o">...</span>
<span class="c"># please run yourself to see the full code</span>
<span class="o">...</span>
<span class="o">%</span><span class="mi">13</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">gradindex</span><span class="x">(</span><span class="o">%</span><span class="mi">12</span><span class="x">,</span> <span class="mi">1</span><span class="x">)</span>
<span class="o">%</span><span class="mi">14</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">accum</span><span class="x">(</span><span class="o">%</span><span class="mi">6</span><span class="x">,</span> <span class="o">%</span><span class="mi">10</span><span class="x">)</span>
<span class="o">%</span><span class="mi">15</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">tuple</span><span class="x">(</span><span class="nb">nothing</span><span class="x">,</span> <span class="o">%</span><span class="mi">14</span><span class="x">)</span>
<span class="k">return</span> <span class="o">%</span><span class="mi">15</span><span class="x">)</span>
</code></pre></div></div>
<p>I have established throughout this article that the function \(F(x)\) can literally be any arbitrary program written in native Julia using standard language features.
Let’s see another toy (but meaningful) program using more flexible Julia code.</p>
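<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">LinearAlgebra</span><span class="x">,</span> <span class="n">Zygote</span> <span class="c"># norm() comes from LinearAlgebra</span>
<span class="k">struct</span><span class="nc"> Point</span>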
<span class="n">x</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">y</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">Point</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Real</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Real</span><span class="x">)</span> <span class="o">=</span> <span class="n">new</span><span class="x">(</span><span class="n">convert</span><span class="x">(</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">x</span><span class="x">),</span> <span class="n">convert</span><span class="x">(</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">y</span><span class="x">))</span>
<span class="k">end</span>
<span class="c"># Define operator overloads for '+', '*', etc.</span>
<span class="k">function</span><span class="nf"> distance</span><span class="x">(</span><span class="n">p₁</span><span class="o">::</span><span class="n">Point</span><span class="x">,</span> <span class="n">p₂</span><span class="o">::</span><span class="n">Point</span><span class="x">)</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">δp</span> <span class="o">=</span> <span class="n">p₁</span> <span class="o">-</span> <span class="n">p₂</span>
<span class="n">norm</span><span class="x">([</span><span class="n">δp</span><span class="o">.</span><span class="n">x</span><span class="x">,</span> <span class="n">δp</span><span class="o">.</span><span class="n">y</span><span class="x">])</span>
<span class="k">end</span>
<span class="n">p₁</span><span class="x">,</span> <span class="n">p₂</span> <span class="o">=</span> <span class="n">Point</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span> <span class="mf">3.0</span><span class="x">),</span> <span class="n">Point</span><span class="x">(</span><span class="o">-</span><span class="mf">2.</span><span class="x">,</span> <span class="mi">0</span><span class="x">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">Point</span><span class="x">(</span><span class="o">-</span><span class="mi">1</span><span class="o">//</span><span class="mi">3</span><span class="x">,</span> <span class="mf">1.0</span><span class="x">)</span>
<span class="c"># initial parameters</span>
<span class="n">K₁</span><span class="x">,</span> <span class="n">K₂</span> <span class="o">=</span> <span class="mf">0.1</span><span class="x">,</span> <span class="mf">0.1</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="mi">1000</span> <span class="c"># no. of epochs</span>
<span class="c"># compute gradients</span>
<span class="nd">@time</span> <span class="n">δK₁</span><span class="x">,</span> <span class="n">δK₂</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">gradient</span><span class="x">(</span><span class="n">K₁</span><span class="x">,</span> <span class="n">K₂</span><span class="x">)</span> <span class="k">do</span> <span class="n">k₁</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">k₂</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">p̂</span> <span class="o">=</span> <span class="n">p₁</span> <span class="o">*</span> <span class="n">k₁</span> <span class="o">+</span> <span class="n">p₂</span> <span class="o">*</span> <span class="n">k₂</span>
<span class="n">distance</span><span class="x">(</span><span class="n">p̂</span><span class="x">,</span> <span class="n">p</span><span class="x">)</span> <span class="c"># scalar output of the function</span>
<span class="k">end</span>
<span class="c"># update parameters</span>
<span class="n">K₁</span> <span class="o">-=</span> <span class="mf">1e-3</span> <span class="o">*</span> <span class="n">δK₁</span>
<span class="n">K₂</span> <span class="o">-=</span> <span class="mf">1e-3</span> <span class="o">*</span> <span class="n">δK₂</span>
<span class="k">end</span>
<span class="nd">@show</span> <span class="n">K₁</span><span class="x">,</span> <span class="n">K₂</span>
<span class="c"># shows "(K₁, K₂) = (0.33427804653861276, 0.4996408206795386)"</span>
</code></pre></div></div>
<p>The above program is basically solving the following (pretty simple) problem</p>
\[\begin{align}
\arg\min_{K_1,K_2} &\vert\vert \widehat{p}(K_1,K_2) - p \vert\vert_2 \\
\text{with }&\widehat{p}(K_1,K_2) \triangleq p_1 \cdot K_1 + p_2 \cdot K_2
\end{align}\]
<p>where \(p=[-1/3, 1]^T, p_1=[2,3]^T\) and \(p_2=[-2,0]^T\). By choosing these specific numbers, I guaranteed that an exact solution exists, namely \(K_1 = 1/3, K_2 = 1/2\), which the values printed by the program approach.</p>
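<p>As a sanity check that does not depend on Zygote, the same optimization can be reproduced in plain Python with a hand-derived gradient. This is only a cross-check sketch: it assumes the omitted operator overloads act component-wise, uses \(\partial \vert\vert r \vert\vert_2 / \partial K_i = (r \cdot p_i) / \vert\vert r \vert\vert_2\) with \(r = \widehat{p} - p\), and the function name <code class="language-plaintext highlighter-rouge">loss_and_grad</code> is made up for this example. The exact digits it prints may differ from the Julia run.</p>

```python
import math

p1, p2, p = (2.0, 3.0), (-2.0, 0.0), (-1.0 / 3.0, 1.0)

def loss_and_grad(k1, k2):
    # residual r = p1 * k1 + p2 * k2 - p
    rx = p1[0] * k1 + p2[0] * k2 - p[0]
    ry = p1[1] * k1 + p2[1] * k2 - p[1]
    dist = math.hypot(rx, ry)
    if dist == 0.0:
        # the norm is not differentiable at r = 0; report a zero gradient
        return 0.0, (0.0, 0.0)
    # chain rule through the Euclidean norm
    g1 = (rx * p1[0] + ry * p1[1]) / dist
    g2 = (rx * p2[0] + ry * p2[1]) / dist
    return dist, (g1, g2)

k1, k2 = 0.1, 0.1            # same initial parameters as the Julia program
for _ in range(1000):        # same number of epochs
    _, (g1, g2) = loss_and_grad(k1, k2)
    k1 -= 1e-3 * g1          # same learning rate
    k2 -= 1e-3 * g2

print(k1, k2)
```

<p>Writing the gradient by hand is exactly the step that <code class="language-plaintext highlighter-rouge">Zygote.gradient()</code> removes in the Julia version.</p>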
<p>If you glance at the Julia program above, you will notice that it is almost entirely written in native Julia, using a structure (<code class="language-plaintext highlighter-rouge">struct Point</code>), standard functions (<code class="language-plaintext highlighter-rouge">norm()</code>, <code class="language-plaintext highlighter-rouge">convert()</code>), memory access constructs (<code class="language-plaintext highlighter-rouge">δp.x</code>, <code class="language-plaintext highlighter-rouge">δp.y</code>) etc. The only usage of Zygote is the <code class="language-plaintext highlighter-rouge">Zygote.gradient()</code> call at the heart of the loop. BTW, I omitted the operator overloading functions due to space restrictions.</p>
<p>I am not showing the IR codes for this one; you are free to execute <code class="language-plaintext highlighter-rouge">@code_ir</code> and <code class="language-plaintext highlighter-rouge">@code_adjoint</code> on the function implicitly defined using the <code class="language-plaintext highlighter-rouge">do .. end</code>. One thing I would like to show is the execution speed and my earlier argument about “precompilation”. The time measuring macro (<code class="language-plaintext highlighter-rouge">@time</code>) shows this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 11.764279 seconds (26.50 M allocations: 1.342 GiB, 4.58% gc time)
0.000025 seconds (44 allocations: 2.062 KiB)
0.000026 seconds (44 allocations: 2.062 KiB)
0.000007 seconds (44 allocations: 2.062 KiB)
0.000006 seconds (44 allocations: 2.062 KiB)
0.000005 seconds (44 allocations: 2.062 KiB)
0.000005 seconds (44 allocations: 2.062 KiB)
</code></pre></div></div>
<p>Did you see how the execution time dropped by an astonishingly large margin? That’s Julia’s precompilation at work - it compiled the derivative program only once (on its first encounter) and produced highly efficient code that is reused on every later call. It might not be as big a surprise if you already know Julia, but it is definitely a huge advantage for a DiffProg framework.</p>
<hr />
<p>Okay, that’s about it today. See you next time. The following references have been used for preparing this article:</p>
<ol>
<li>“Don’t Unroll Adjoint: Differentiating SSA-Form Programs”, Michael Innes, <a href="https://arxiv.org/abs/1810.07951">arXiv/1810.07951</a>.</li>
<li><a href="https://www.youtube.com/watch?v=LjWzgTPFu14">Talk</a> by Michael Innes @ Julia london user group meetup.</li>
<li><a href="https://www.youtube.com/watch?v=Sv3d0k7wWHk">Talk</a> by Elliot Saba & Viral Shah @ Microsoft research.</li>
<li><a href="https://fluxml.ai/Zygote.jl/latest/">Zygote’s documentation</a> & <a href="https://docs.julialang.org/en/v1/">Julia’s documentation</a>.</li>
</ol>
Ayan Das
If you are following the recent developments in the field of Deep Learning, you might recognize this new buzz-word, “Differentiable Programming”, doing rounds on social media (including prominent researchers like Yann LeCun, Andrej Karpathy) for a year or two. Differentiable Programming (let’s shorten it as “DiffProg” for the rest of this article) is essentially a system proposed as an alternative to tape-based Backpropagation, which runs a recorder (often called “Tape”) that builds a computation graph at runtime and propagates the error signal from the end towards the leaf-nodes (typically weights and biases). DiffProg is very different from an implementation perspective - it doesn’t really “propagate” anything. It consumes a “program” in the form of source code and produces the “Derivative program” (also source code) w.r.t its inputs without ever actually running either of them. Additionally, DiffProg allows users the flexibility to write arbitrary programs without constraining them to any “guidelines”. In this article, I will describe the difference between the two methods in theoretical as well as practical terms. We’ll look into one successful DiffProg implementation (named “Zygote”, written in Julia) gaining popularity in the Deep Learning community.
Energy Based Models (EBMs): A comprehensive introduction
2020-08-13T00:00:00+00:00
2020-08-13T00:00:00+00:00
https://ayandas.me/blog-tut/2020/08/13/energy-based-models-one
<p>We talked extensively about <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGMs</a> in my earlier article and also described <a href="/blog-tut/2020/01/01/variational-autoencoder.html">one particular model</a> following the principles of Variational Inference (VI). There exists another class of models, conveniently represented by <em>Undirected</em> Graphical Models, which are practiced relatively less than modern methods of Deep Learning (DL) in the research community. 
They are also characterized as <strong>Energy Based Models (EBM)</strong> because, as we shall see, they rely on something called <em>Energy Functions</em>. In the early days of this Deep Learning <em>renaissance</em>, we discovered a few extremely powerful models which helped DL gain momentum. The class of models we are going to discuss has far more theoretical support than modern day Deep Learning which, as we know, relies largely on intuition and trial-and-error. In this article, I will introduce you to the general concept of Energy Based Models (EBMs), their difficulties and how we can get over them. Also, we will look at a specific family of EBM known as <strong>Boltzmann Machines (BM)</strong> which are very well known in the literature.</p>
<h2 id="undirected-graphical-models">Undirected Graphical Models</h2>
<p>The story starts when we try to model a number of Random Variables (RVs) with a graph but only have weak knowledge about them: we know which variables are related, but not the direction of influence. Direction is a necessary requirement for <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGMs</a>. For example, let’s consider a lattice of atoms (Fig.1(a)) where only neighbouring atoms influence each other’s spins, but it is unclear what the directions of the influences are. For simplicity, we will use a simpler model (Fig.1(b)) for demonstration purposes.</p>
<center>
<figure>
<img width="65%" style="padding-top: 20px;" src="/public/posts_res/17/undir_models.png" />
<figcaption>Fig.1: (a) An atom lattice model. (b) An arbitrary undirected model.</figcaption>
</figure>
</center>
<p>We model a set of random variables \(\mathbf{X}\) (in our example, \(\{ A,B,C,D \}\)) whose connections are defined by graph \(\mathcal{G}\) and have <em>“potential functions”</em> defined on each of its maximal <a href="https://en.wikipedia.org/wiki/Clique_(graph_theory)">cliques</a> \(\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})\). The total potential of the graph is defined as</p>
<p>\[
\Phi(\mathbf{x}) = \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \phi_{\mathcal{Q}}(q)
\]</p>
<p>\(q\) is an arbitrary instantiation of the set of RVs denoted by \(\mathcal{Q}\). The potential functions \(\phi_{\mathcal{Q}}(q)\in\mathbb{R}_{>0}\) are basically “affinity” functions on the state space of the cliques, i.e. given a state \(q\) of a clique \(\mathcal{Q}\), the corresponding potential function \(\phi_{\mathcal{Q}}(q)\) returns the <em>viability of that state</em> OR how likely that state is. Potential functions are somewhat analogous to conditional densities from Directed PGMs, except for the fact that potentials are <em>arbitrary positive values</em>. They don’t necessarily sum to one. As a concrete example, the graph in Fig.1(b) can be factorized as \(\displaystyle{ \Phi(a,b,c,d) = \phi_{\{A,B,C\}}(a,b,c)\cdot \phi_{\{A,D\}}(a,d) }\). If we assume the variables \(\{ A,D \}\) are binary RVs, the potential function corresponding to that clique, at its simplest, could be a table like this:</p>
\[\begin{align}
\phi_{\{A,D\}}(a=0,d=0) &= +4.00 \\
\phi_{\{A,D\}}(a=0,d=1) &= +0.23 \\
\phi_{\{A,D\}}(a=1,d=0) &= +5.00 \\
\phi_{\{A,D\}}(a=1,d=1) &= +9.45
\end{align}\]
<p>Just like every other model in machine learning, the potential functions can be parameterized, leading to</p>
<p>\[ \tag{1}
\Phi(\mathbf{x}; \Theta) = \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \phi_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})
\]</p>
<p>Semantically, potentials denote how likely a given state is. So, the higher the potential, the more likely that state is.</p>
<h2 id="reparameterizing-in-terms-of-energy">Reparameterizing in terms of Energy</h2>
<p>When we are defining a model, however, it is inconvenient to choose a constrained (positive-valued) parameterized function. We can easily reparameterize each potential function in terms of <strong>energy</strong> functions \(E_{\mathcal{Q}}\) where</p>
<p>\[\tag{2}
\phi_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}) = \exp(-E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}))
\]</p>
<p>The \(\exp(\cdot)\) forces the potentials to be always positive and thus we are free to choose an <em>unconstrained</em> energy function. One question you might ask - “why the negative sign?”. Frankly, that negative sign serves no functional purpose. All it does is <em>reverse the semantic meaning</em> of the parameterized function. When we were dealing in terms of potentials, a state which is more likely had a higher potential. Now it’s the opposite - states that are more likely have less energy. Does this semantics sound familiar? It actually came from Physics, where we deal with “energies” (potential, kinetic etc.) which are <em>bad</em>, i.e. less energy means a more stable system.</p>
<p>Such reparameterization affects the semantics of Eq.1 as well. Putting Eq.2 into Eq.1 yields</p>
\[\begin{align}
\Phi(\mathbf{x}; \Theta) &= \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \exp(-E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})) \\
\tag{3}
&= \exp\left(-\sum_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})\right) =
\exp(-E_{\mathcal{G}}(\mathbf{x}; \Theta))
\end{align}\]
<p>Here we defined \({ E_{\mathcal{G}}(\mathbf{x}; \Theta) \triangleq \sum_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}) }\) to be the energy of the whole model. Please note that the reparameterization helped us to convert the relationship between individual cliques and the whole graph <em>from multiplicative (Eq.1) to additive (Eq.3)</em>. This implies that when we design energy functions for such undirected models, we design an energy function for each individual clique and just add them.</p>
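<p>The multiplicative-to-additive switch is easy to check numerically. Below is a minimal sketch for the graph of Fig.1(b): the \(\{A,D\}\) table is the one given earlier, while <code class="language-plaintext highlighter-rouge">phi_ABC</code> and all function names are made up purely for illustration (any positive function would do).</p>

```python
import math

# Potential table for clique {A, D}, copied from the example above.
phi_AD = {(0, 0): 4.00, (0, 1): 0.23, (1, 0): 5.00, (1, 1): 9.45}


def phi_ABC(a, b, c):
    """Hypothetical potential for clique {A, B, C}; any positive function works."""
    return 1.0 + a + 2.0 * b * c


def energy_AD(a, d):
    # Eq.2 inverted: E = -log(phi)
    return -math.log(phi_AD[(a, d)])


def energy_ABC(a, b, c):
    return -math.log(phi_ABC(a, b, c))


def total_potential(a, b, c, d):
    # Eq.1: product of the clique potentials
    return phi_ABC(a, b, c) * phi_AD[(a, d)]


def total_energy(a, b, c, d):
    # Eq.3: sum of the clique energies
    return energy_ABC(a, b, c) + energy_AD(a, d)


state = (1, 0, 1, 1)  # (a, b, c, d)
print(total_potential(*state))          # product of potentials
print(math.exp(-total_energy(*state)))  # same value, up to floating point error
```

<p>Both lines print approximately 18.9, i.e. \(2 \times 9.45\), for this particular state.</p>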
<p>All this is fine .. well .. unless we need to do things like <em>sampling</em>, <em>computing log-likelihood</em> etc. Then our energy-based parameterization fails because it’s not easy to incorporate an un-normalized function into probabilistic frameworks. So we need a way to get back to probabilities.</p>
<h2 id="back-to-probabilities">Back to Probabilities</h2>
<p>The obvious way to convert the un-normalized potentials of the model to a normalized distribution is to explicitly normalize Eq.3 over its domain</p>
\[\begin{align}
p(\mathbf{x}; \Theta) &= \frac{\Phi(\mathbf{x}; \Theta)}{
\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \Phi(\mathbf{x}'; \Theta)
} \\ \\
\tag{4}
&= \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}; \Theta)/\tau)}{\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \exp(-E_{\mathcal{G}}(\mathbf{x}'; \Theta)/\tau)}\text{ (using Eq.3)}
\end{align}\]
<p>This is the probabilistic model implicitly defined by the energy functions over the whole state-space. [We will discuss \(\tau\) shortly. Consider it to be 1 for now]. If the reader is familiar with Statistical Mechanics at all, they might find it extremely similar to the <code class="language-plaintext highlighter-rouge">Boltzmann Distribution</code>. Here’s what <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Wikipedia</a> says:</p>
<blockquote>
<p>In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution) is a probability distribution or probability measure that gives the probability that a system will be in a certain state as a function of that state’s energy …</p>
</blockquote>
<p>From now on, Eq.4 will be the sole connection between energy-space and probability-space. We can now forget about potential functions. A 1-D example of an energy function and the corresponding probability distribution is shown below:</p>
<center>
<figure>
<img width="75%" style="padding-top: 20px;" src="/public/posts_res/17/energy_prob.png" />
<figcaption>Fig.2: An energy function and its corresponding probability distribution.</figcaption>
</figure>
</center>
<p>The denominator of Eq.4 is often known as the “Partition Function” (denoted as \(Z\)). Whatever the name, it is quite difficult to compute in general, because the number of terms in the summation grows exponentially with the number of variables in \(\mathbf{X}\).</p>
<p>A hyper-parameter called “temperature” (denoted as \(\tau\)) is often introduced in Eq.4 which also has its roots in the original <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Boltzmann Distribution from Physics</a>. A decrease in temperature gathers the probability mass near the lowest energy regions. If not specified, consider \(\tau=1\) for the rest of the article.</p>
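<p>To make Eq.4 and the role of \(\tau\) concrete, here is a small brute-force sketch over binary variables. The energy function <code class="language-plaintext highlighter-rouge">toy_energy</code> and the helper <code class="language-plaintext highlighter-rouge">boltzmann</code> are my own illustrative choices, and the enumeration is only feasible because the model is tiny.</p>

```python
import itertools
import math


def boltzmann(energy, num_vars, tau=1.0):
    """Brute-force Eq.4 over all binary states; only feasible for tiny models."""
    states = list(itertools.product([0, 1], repeat=num_vars))
    weights = [math.exp(-energy(s) / tau) for s in states]
    Z = sum(weights)  # partition function: 2**num_vars terms
    return {s: w / Z for s, w in zip(states, weights)}


def toy_energy(s):
    """Hypothetical energy: lowest when all bits are on."""
    return -float(sum(s))


for tau in (1.0, 0.1):
    p = boltzmann(toy_energy, num_vars=3, tau=tau)
    best = max(p, key=p.get)
    # Lower temperature concentrates more mass on the lowest-energy state.
    print(tau, best, round(p[best], 3))
```

<p>With three variables the sum for \(Z\) has only \(2^3 = 8\) terms; with \(D\) binary variables it has \(2^D\), which is exactly why this direct approach does not scale.</p>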
<h2 id="a-general-learning-algorithm">A general learning algorithm</h2>
<p>The question now is - how do I learn the model given a dataset ? Let’s say my dataset has \(N\) samples: \(\mathcal{D} = \{ x^{(i)} \}_{i=1}^N\). An obvious way to derive a learning algorithm is to minimize the Negative Log-Likelihood (NLL) loss of the model over our dataset</p>
\[\begin{align}
\mathcal{L}(\Theta; \mathcal{D}) = - \log \prod_{i=1}^N p(x^{(i)}; \Theta) &= \sum_{i=1}^N -\log p(x^{(i)}; \Theta) \\
&= \underbrace{\frac{1}{N}\sum_{i=1}^N}_{\text{expectation}} \left[ E_{\mathcal{G}}(x^{(i)}; \Theta) \right] + \log Z\\
&\text{(putting Eq.4 followed by trivial calculations, and}\\
&\text{ dividing loss by constant N doesn't affect optima)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\bigl[ E_{\mathcal{G}}(x; \Theta) \bigr] + \log Z
\end{align}\]
<p>Computing gradient w.r.t. parameters yields</p>
\[\begin{align}
\frac{\partial \mathcal{L}}{\partial \Theta} &= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \frac{\partial}{\partial \Theta} \log Z \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \frac{1}{Z} \frac{\partial}{\partial \Theta} \left[ \sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \exp(-E_{\mathcal{G}}) \right]\text{ (using definition of Z)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \underbrace{\frac{1}{Z} \exp(-E_{\mathcal{G}})}_{\text{RHS of Eq.4}} \cdot \frac{\partial (-E_{\mathcal{G}})}{\partial \Theta}\\
&\text{ (Both Z and the partial operator are independent}\\
&\text{ of x and can be pushed inside the summation)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] - \underbrace{\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} p(\mathbf{x}'; \Theta)}_{\text{expectation}} \cdot \frac{\partial E_{\mathcal{G}}}{\partial \Theta}\\
\tag{5}
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] - \mathbb{E}_{x\sim p_{\Theta}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]
\end{align}\]
<p>Take a few minutes to digest Eq.5. That’s a very important result. It would be worth discussing it a bit further. The first term in Eq.5 is often known as the “Positive Phase” and the second term as “Negative Phase”. The only difference between them, as you can see, is in the distributions on which the expectations are taken. The first expectation is on the <em>data distribution</em> - essentially picking up data from our dataset. The second expectation is on the <em>model distribution</em> - sampling from the model with current parameters. To understand their semantic interpretation, we need to see them in isolation. For the sake of explanation, consider both terms separately yielding a parameter update rule</p>
\[\Theta := \Theta - \lambda\cdot\mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]\text{, and }
\Theta := \Theta + \lambda\cdot\mathbb{E}_{x\sim p_{\Theta}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]\]
<p>The first update rule basically tries to change the parameters in a way that minimizes the energy function at points <em>coming from data</em>. And the second one tries to maximize (notice the difference in sign) the energy function at points <em>coming from the model</em>. The original update rule (combining both of them) has both of these effects working simultaneously. The minimum of the loss landscape occurs when our model discovers the data distribution, i.e. \(p_{\Theta} \approx p_{\mathcal{D}}\). At this point, the positive and negative phases are approximately the same and the gradient becomes zero (i.e., no more progress). Below is a clear picture of the update process. The algorithm <em>pushes the energy down</em> at places where original data lies; it also <em>pulls the energy up</em> at places where the <em>model thinks</em> original data lies.</p>
<center>
<figure>
<img width="95%" style="padding-top: 20px;" src="/public/posts_res/17/pos_neg_phase_diagram.png" />
<figcaption>Fig.3: (a) Model is being optimized. The arrows depict the "pulling up" and "pushing down" of energy landscape. (b) Model has converged to an optimum.</figcaption>
</figure>
</center>
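<p>Eq.5 can be verified numerically on a model small enough to enumerate. The sketch below assumes a made-up one-parameter energy on two binary variables and a made-up four-sample dataset; it compares the positive-minus-negative-phase gradient with a finite-difference estimate of the (average) NLL, and then runs plain gradient descent.</p>

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=2))


def energy(x, theta):
    """Hypothetical one-parameter energy on two binary variables."""
    return -theta * x[0] * x[1]


def d_energy(x, theta):
    """dE/dtheta for the energy above."""
    return -x[0] * x[1]


def model_probs(theta):
    w = [math.exp(-energy(s, theta)) for s in STATES]
    Z = sum(w)
    return [wi / Z for wi in w]


def nll(theta, data):
    """Average negative log-likelihood: mean energy on data plus log Z."""
    Z = sum(math.exp(-energy(s, theta)) for s in STATES)
    return sum(energy(x, theta) for x in data) / len(data) + math.log(Z)


def grad_eq5(theta, data):
    """Eq.5: positive phase (data) minus negative phase (model)."""
    positive = sum(d_energy(x, theta) for x in data) / len(data)
    negative = sum(p * d_energy(s, theta)
                   for s, p in zip(STATES, model_probs(theta)))
    return positive - negative


data = [(1, 1), (1, 1), (0, 1), (1, 0)]  # toy dataset, made up
theta, eps = 0.3, 1e-6
numeric = (nll(theta + eps, data) - nll(theta - eps, data)) / (2 * eps)
print(grad_eq5(theta, data), numeric)  # the two values should agree closely

# Gradient descent: stops moving once both phases match.
for _ in range(2000):
    theta -= 0.5 * grad_eq5(theta, data)
print(theta, math.log(3))  # optimum is log(3) for this dataset
```

<p>In this dataset the state \((1,1)\) occurs half of the time, so at the optimum the model must also assign it probability \(1/2\), which gives \(e^{\theta}/(3 + e^{\theta}) = 1/2\), i.e. \(\theta = \log 3\). Here the negative phase is computed exactly by enumeration; in realistic models that is precisely the part we cannot do.</p>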
<p>Whatever the interpretation may be, as I mentioned before, the denominator of \(p(\mathbf{x}; \Theta)\) (see Eq.4) is intractable in the general case, so computing the expectation in the negative phase is extremely hard. In fact, that is the only difficulty that makes this algorithm practically challenging.</p>
<h2 id="gibbs-sampling">Gibbs Sampling</h2>
<p>As we saw in the last section, the only difficulty we have in implementing Eq.5 is not being able to sample from an intractable density (Eq.4). It turns out, however, that the <em>conditional densities</em> of a small subset of variables given the others are indeed tractable in most cases. That is because, for conditionals, the \(Z\) cancels out. The conditional density of one variable (say \(X_j\)) given the others (let’s denote them by \(X_{-j}\)) is:</p>
\[\tag{6}
p(x_j\vert \mathbf{x}_{-j}) = \frac{p(\mathbf{x})}{p(\mathbf{x}_{-j})}
= \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}))}{\sum_{x_j} \exp(-E_{\mathcal{G}}(\mathbf{x}))}
\text{ (using Eq.4)}\]
<p>I excluded the parameter symbols for notational brevity. That summation in the denominator is not as scary as the one in Eq.4 - it’s only over one variable. We take advantage of this and wisely choose a sampling algorithm that uses conditional densities. It’s called <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs Sampling</a>. Well, I am not going to prove it. You either have to take my word for it OR read about it in the link provided. For the sake of this article, just believe that the following works.</p>
<p>To sample \(\mathbf{x}\sim p_{\Theta}(\mathbf{x})\), we iteratively execute the following for \(T\) iterations</p>
<ol>
<li>We have a sample from last iteration \(t-1\) as \(\mathbf{x}^{(t-1)}\)</li>
<li>We then pick one variable \(X_j\) (in some order) at a time and sample from its conditional given the others: \(x_j^{(t)}\sim p(x_j\vert \underbrace{x_1^{(t)}, \cdots, x_{j-1}^{(t)}}_{\text{current iteration}}, \underbrace{x_{j+1}^{(t-1)}, \cdots, x_D^{(t-1)}}_{\text{previous iteration}})\). Please note that once we sampled one variable, we fix its value to the latest, otherwise we keep the value from previous iteration.</li>
</ol>
<p>We can start this process by setting \(\mathbf{x}^{(0)}\) to anything. If \(T\) is sufficiently large, the samples towards the end are (approximately) true samples from the density \(p_{\Theta}\). To understand it a bit more rigorously, I <strong>highly recommend</strong> <a href="https://en.wikipedia.org/wiki/Gibbs_sampling#Implementation">going through this</a>.
You might be curious as to why this algorithm has an iterative process. That’s because Gibbs sampling is an <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">MCMC family algorithm</a> which has something called a “Burn-in period”.</p>
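<p>The two steps above can be sketched in a few lines for binary variables. This is only an illustration under my own assumptions: <code class="language-plaintext highlighter-rouge">gibbs_sample</code> and <code class="language-plaintext highlighter-rouge">chain_energy</code> are hypothetical names, the energy is a toy one, and the exact probability is computed by brute force just to have something to compare against.</p>

```python
import itertools
import math
import random


def gibbs_sample(energy, num_vars, num_sweeps, rng):
    """Run num_sweeps full sweeps of Gibbs sampling on binary variables."""
    x = [rng.randint(0, 1) for _ in range(num_vars)]  # arbitrary x^(0)
    for _ in range(num_sweeps):
        for j in range(num_vars):
            # Eq.6: the conditional of x_j only needs the two energies
            # obtained by setting x_j to 0 and to 1; Z cancels out.
            x[j] = 0
            e0 = energy(x)
            x[j] = 1
            e1 = energy(x)
            p_one = 1.0 / (1.0 + math.exp(e1 - e0))
            x[j] = 1 if rng.random() < p_one else 0
    return tuple(x)


def chain_energy(x):
    """Hypothetical toy energy: neighbouring bits prefer to be on together."""
    return -(x[0] * x[1] + x[1] * x[2])


rng = random.Random(0)
samples = [gibbs_sample(chain_energy, 3, 20, rng) for _ in range(2000)]
freq = samples.count((1, 1, 1)) / len(samples)

# Exact probability by brute force (Eq.4), feasible only because D = 3.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-chain_energy(s)) for s in states)
exact = math.exp(-chain_energy((1, 1, 1))) / Z
print(freq, exact)  # the two numbers should be close
```

<p>Note that the sampler never touches \(Z\); it only ever evaluates the energy at two neighbouring states, which is the whole point of working with conditionals.</p>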
<p>Now that we have pretty much everything needed, let’s explore some popular models based on the general principles of EBMs.</p>
<h2 id="boltzmann-machine">Boltzmann Machine</h2>
<p>Boltzmann Machine (BM) is one particular model that has been in the literature for a long time. BM is the simplest one in its family and is used for modelling a binary random vector \(\mathbf{X}\in\{0,1\}^D\) with \(D\) components \([ X_1, X_2, \cdots, X_D ]\). All \(D\) RVs are connected to all others by an undirected graph \(\mathcal{G}\).</p>
<center>
<figure>
<img width="30%" style="padding-top: 20px;" src="/public/posts_res/17/bm_diagram.png" />
<figcaption>Fig.4: Undirected graph representing Boltzmann Machine</figcaption>
</figure>
</center>
<p>By design, BM has a fully connected graph and hence only one maximal clique (containing all RVs). The energy function used in BM is possibly the simplest one you can imagine:</p>
\[\tag{7}
E_{\mathcal{G}}(\mathbf{x}; W) = - \frac{1}{2} \mathbf{x}^T W \mathbf{x}\]
<p>Upon expanding the vectorized form (reader is encouraged to try), and assuming as usual that \(W\) is symmetric with a zero diagonal, we can see each term \(x_i\cdot W_{ij}\cdot x_j\) for all \(i\lt j\) as the contribution of the pair of RVs \((X_i, X_j)\) to the whole energy function. \(W_{ij}\) is the “connection strength” between them. If a pair of RVs \((X_i, X_j)\) turn on together more often, a high value of \(W_{ij}\) would encourage reducing the total energy. So by means of learning, we expect to see \(W_{ij}\) going up if \((X_i, X_j)\) fire together. This phenomenon is the founding idea of a closely related learning strategy called <a href="https://en.wikipedia.org/wiki/Hebbian_theory">Hebbian Learning</a>, proposed by Donald Hebb. Hebbian theory basically says:</p>
<blockquote>
<p>If fire together, then wire together</p>
</blockquote>
<p>How do we learn this model then? We have already seen the general way of computing the gradient. Treating each symmetric pair \(W_{ij} = W_{ji}\) as a single parameter, we have \(\displaystyle{ \frac{\partial E_{\mathcal{G}}}{\partial W} = -\mathbf{x}\mathbf{x}^T }\). So let’s use Eq.5 to derive a learning rule:</p>
\[W := W - \lambda \cdot \left( \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{x}^T ] - \mathbb{E}_{\mathbf{x}\sim \mathrm{Gibbs}(p_{W})}[ -\mathbf{x}\mathbf{x}^T ] \right)\]
<p>Equipped with Gibbs sampling, it is pretty easy to implement now. But my description of the Gibbs sampling algorithm was very general. We have to specialize it for implementing BM. Remember that conditional density we talked about (Eq.6) ? For the specific energy function of BM (Eq.7), that has a very convenient and tractable form:</p>
\[p(x_j = 1\vert \mathbf{x}_{-j}; W) = \sigma\left(W_{-j}^T\cdot \mathbf{x}_{-j}\right)\]
<p>where \(\sigma(\cdot)\) is the Sigmoid function and \(W_{-j}\in\mathbb{R}^{D-1}\) denotes the vector of parameters connecting \(x_j\) with the rest of the variables \(\mathbf{x}_{-j}\in\mathbb{R}^{D-1}\). I am leaving the proof for the readers; it’s not hard, maybe a bit lengthy [Hint: Just put the BM energy function in Eq.6 and keep simplifying]. This particular form makes the nodes behave somewhat like computation units (i.e., neurons) as shown in Fig.5 below:</p>
<center>
<figure>
<img width="25%" style="padding-top: 20px;" src="/public/posts_res/17/bm_conditional.png" />
<figcaption>Fig.5: The computational view of BM showing its dependencies by arrows.</figcaption>
</figure>
</center>
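<p>If you would rather check the sigmoid form numerically before proving it, the following sketch compares Eq.6 (evaluated by brute force on the energy of Eq.7) with the closed form. The weight matrix is made up for illustration and is assumed to be symmetric with a zero diagonal; the function names are mine.</p>

```python
import math


def bm_energy(x, W):
    """Eq.7: E(x) = -1/2 x^T W x."""
    n = len(x)
    return -0.5 * sum(x[i] * W[i][k] * x[k] for i in range(n) for k in range(n))


def conditional_bruteforce(j, x, W):
    """Eq.6 specialised to one binary variable x_j."""
    x0 = list(x)
    x0[j] = 0
    x1 = list(x)
    x1[j] = 1
    w0 = math.exp(-bm_energy(x0, W))
    w1 = math.exp(-bm_energy(x1, W))
    return w1 / (w0 + w1)


def conditional_sigmoid(j, x, W):
    """Closed form: sigmoid of the weighted sum of the other variables."""
    s = sum(W[j][k] * x[k] for k in range(len(x)) if k != j)
    return 1.0 / (1.0 + math.exp(-s))


# Hypothetical symmetric weights with zero diagonal.
W = [[0.0, 0.8, -0.5],
     [0.8, 0.0, 1.2],
     [-0.5, 1.2, 0.0]]
x = [1, 0, 1]
for j in range(3):
    # Both columns should match for every j.
    print(j, conditional_bruteforce(j, x, W), conditional_sigmoid(j, x, W))
```

<p>The agreement relies on the symmetry of \(W\): the factor \(\frac{1}{2}\) in Eq.7 is cancelled by each pair \((i, j)\) appearing twice in the double sum.</p>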
<h2 id="boltzmann-machine-with-latent-variables">Boltzmann Machine with latent variables</h2>
<p>To add more expressiveness to the model, we can introduce latent/hidden variables. They are not observed, but help <em>explain</em> the visible variables (see my <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGM</a> article). However, all variables are still fully connected to each other (shown below in Fig.6(a)).</p>
<p><strong>[ A little disclaimer: as we have already covered a lot of the founding ideas, I am going to go over this a bit faster. You may have to look back and find analogies with our previous formulations ]</strong></p>
<center>
<figure>
<img width="70%" style="padding-top: 20px;" src="/public/posts_res/17/hbm_diagram.png" />
<figcaption>Fig.6: (a) Undirected graph of BM with hidden units (shaded ones are visible). (b) Computational view of the model while computing conditionals. </figcaption>
</figure>
</center>
<p>Suppose we have \(K\) hidden units and \(D\) visible ones. The energy function is defined very similarly to that of the normal BM. Now it contains separate terms for visible-hidden (\(W\in\mathbb{R}^{D\times K}\)), visible-visible (\(V\in\mathbb{R}^{D\times D}\)) and hidden-hidden (\(U\in\mathbb{R}^{K\times K}\)) interactions. We compactly represent them as \(\Theta \triangleq \{ W, U, V \}\).</p>
\[E_{\mathcal{G}}(\mathbf{x}, \mathbf{h}; \Theta) = -\mathbf{x}^T W \mathbf{h} - \frac{1}{2} \mathbf{x}^T V \mathbf{x} - \frac{1}{2} \mathbf{h}^T U \mathbf{h}\]
<p>The motivation for such an energy function is very similar to that of the original BM. However, our probabilistic form of the model is no longer Eq.4, but the marginalized joint distribution.</p>
\[p(\mathbf{x}; \Theta) = \sum_{\mathbf{h}\in\mathrm{Dom}(\mathbf{H})} p(\mathbf{x}, \mathbf{h}; \Theta)
= \sum_{\mathbf{h}\in\mathrm{Dom}(\mathbf{H})} \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}, \mathbf{h}))}{\sum_{\mathbf{x}',\mathbf{h}'\in\mathrm{Dom}(\mathbf{X}, \mathbf{H})} \exp(-E_{\mathcal{G}}(\mathbf{x}', \mathbf{h}'))}\]
<p>It might look a bit scary, but it’s just marginalized over the hidden state-space. Very surprisingly though, the conditionals have forms pretty similar to those of the original BM:</p>
\[\begin{align}
p(h_k = 1\vert \mathbf{x}, \mathbf{h}_{-k}) = \sigma( W_{[:,k]}\cdot\mathbf{x} + U_{-k}\cdot\mathbf{h}_{-k} ) \\
p(x_j = 1\vert \mathbf{h}, \mathbf{x}_{-j}) = \sigma( W_{[j,:]}\cdot\mathbf{h} + V_{-j}\cdot\mathbf{x}_{-j} )
\end{align}\]
<p>Hopefully the notations are clear. If they are not, try comparing with the ones we used before. I recommend the reader to try proving it as an exercise. The diagram in Fig.6(b) hopefully adds a bit more clarity. It shows a similar computation graph for the conditionals we saw before in Fig.5.</p>
<p>Coming to the gradients, they also take forms similar to those of the original BM .. the only difference is that now we have more parameters</p>
\[\begin{align}
W &:= W - \lambda \cdot \left( \mathbb{E}_{\mathbf{x,h}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{h}^T ] - \mathbb{E}_{\mathbf{x,h}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{x}\mathbf{h}^T ] \right)\\
V &:= V - \lambda \cdot \left( \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{x}^T ] - \mathbb{E}_{\mathbf{x}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{x}\mathbf{x}^T ] \right)\\
U &:= U - \lambda \cdot \left( \mathbb{E}_{\mathbf{h}\sim p_{\mathcal{D}}}[ -\mathbf{h}\mathbf{h}^T ] - \mathbb{E}_{\mathbf{h}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{h}\mathbf{h}^T ] \right)
\end{align}\]
<p>If you are paying attention, you might notice something strange .. how do we compute the terms \(\mathbb{E}_{\mathbf{h}\sim p_{\mathcal{D}}}\) (in the positive phase) ? We don’t have hidden vectors in our dataset, right ? Actually, we do have visible vectors \(\mathbf{x}^{(i)}\) in dataset and we can get an approximate <em>complete data</em> (visible plus hidden) density as</p>
\[p_{\mathcal{D}}(\mathbf{x}^{(i)}, \mathbf{h}) = p_{\mathcal{D}}(\mathbf{x}^{(i)}) \cdot p_{\Theta}(\mathbf{h}\vert \mathbf{x}^{(i)})\]
<p>Basically, we sample a visible vector from our dataset and use the conditional to sample a hidden vector. We fix the visible vector and then sample the hidden vector one component at a time (using Gibbs sampling).</p>
<p>For jointly sampling a visible and hidden vector from the model (for the negative phase), we also use Gibbs sampling just as before. We sample all of the visible and hidden RVs component by component, starting the iteration from any random values. <strong>There is a clever hack though</strong>. What we can do is start the Gibbs iteration by fixing the visible vector to a real data point from our dataset (and not anything random). It turns out this is extremely useful and efficient for getting samples quickly from the model distribution. This algorithm is famously known as “<a href="https://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf">Contrastive Divergence</a>” and has long been used in practical implementations.</p>
<h2 id="restricted-boltzmann-machine-rbm">“Restricted” Boltzmann Machine (RBM)</h2>
<p>Here comes the all important RBM, which is probably one of the most famous energy based models of all time. But, guess what, I am not going to describe it bit by bit. We have already covered enough that we can quickly build on top of them.</p>
<p>RBM is basically the same as the Boltzmann Machine with hidden units but with <em>one big difference</em> - it doesn’t have visible-visible AND hidden-hidden interactions, i.e.</p>
\[U = \mathbf{0}, V = \mathbf{0}\]
<p>If you do just that, Boooom ! You get RBMs (see its graphical diagram in Fig.7(a)). It makes the formulation much simpler. I am leaving it entirely for the reader to do the majority of the math. Just get rid of \(U\) and \(V\) from all our formulations in the last section and you are done. Fig.7(b) shows the computational view of RBM while computing conditionals.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/17/rbm_diag_and_cond.png" />
<figcaption>Fig.7: (a) Graphical diagram of RBM. (b) Arrows just show computation deps</figcaption>
</figure>
</center>
<p>Let me point out one nice consequence of this model: given the hidden vector, the conditional of each visible node is independent of the other visible nodes, and the same is true for the hidden nodes given the visible vector.</p>
\[\begin{align}
p(h_k = 1\vert \mathbf{x}) = \sigma( W_{[:,k]}\cdot\mathbf{x} )\\
p(x_j = 1\vert \mathbf{h}) = \sigma( W_{[j,:]}\cdot\mathbf{h} )
\end{align}\]
<p>That means they can be computed in parallel</p>
\[\begin{align}
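p(\mathbf{h}\vert \mathbf{x}) &= \prod_{k=1}^K p(h_k\vert \mathbf{x}),\quad \left[ p(h_k = 1\vert \mathbf{x}) \right]_{k=1}^K = \sigma( W^T\mathbf{x} )\\
p(\mathbf{x}\vert \mathbf{h}) &= \prod_{j=1}^D p(x_j\vert \mathbf{h}),\quad \left[ p(x_j = 1\vert \mathbf{h}) \right]_{j=1}^D = \sigma( W\mathbf{h} )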
\end{align}\]
<p>Moreover, the Gibbs sampling steps become super easy to compute. We just have to iterate the following steps:</p>
<ol>
<li>Sample a hidden vector \(\mathbf{h}^{(t)}\sim p(\mathbf{h}\vert \mathbf{x}^{(t-1)})\)</li>
<li>Sample a visible vector \(\mathbf{x}^{(t)}\sim p(\mathbf{x}\vert \mathbf{h}^{(t)})\)</li>
</ol>
<p>This makes RBM an attractive choice for practical implementation.</p>
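<p>The two-step block Gibbs iteration, together with a Contrastive Divergence update that runs a single Gibbs step (usually written CD-1), can be sketched in plain Python as below. Treat it as a toy under stated assumptions: there are no bias terms (as in the formulation above), the dataset, sizes and learning rate are made up, and the function names are hypothetical.</p>

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_hidden(x, W, rng):
    """Sample h ~ p(h | x); each h_k depends only on column k of W."""
    D, K = len(W), len(W[0])
    probs = [sigmoid(sum(W[j][k] * x[j] for j in range(D))) for k in range(K)]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, W, rng):
    """Sample x ~ p(x | h); each x_j depends only on row j of W."""
    D, K = len(W), len(W[0])
    probs = [sigmoid(sum(W[j][k] * h[k] for k in range(K))) for j in range(D)]
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(x_data, W, lr, rng):
    """One Contrastive Divergence step with a single Gibbs iteration (CD-1)."""
    h_data = sample_hidden(x_data, W, rng)    # positive phase
    x_model = sample_visible(h_data, W, rng)  # chain started at the data
    h_model = sample_hidden(x_model, W, rng)  # negative phase
    for j in range(len(W)):
        for k in range(len(W[0])):
            # W := W - lr * ( (-x h^T)_data - (-x h^T)_model )
            W[j][k] += lr * (x_data[j] * h_data[k] - x_model[j] * h_model[k])
    return W


rng = random.Random(0)
D, K = 4, 2
W = [[rng.gauss(0.0, 0.1) for _ in range(K)] for _ in range(D)]
dataset = [[1, 1, 0, 0], [0, 0, 1, 1]]  # toy data, made up for illustration
for _ in range(200):
    for x in dataset:
        cd1_update(x, W, lr=0.1, rng=rng)
print(W)
```

<p>Each expectation in the update rule is approximated here by a single sample, which is noisy; practical implementations average over mini-batches and typically add bias terms as well.</p>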
<hr />
<p>Whoahh ! That was a heck of an article. I encourage everyone to try working out the RBM math more rigorously by themselves and also implement it in a familiar framework. Alright, that’s all for this article.</p>
<h4 id="references">References</h4>
<ol>
<li><a href="https://www.cs.toronto.edu/~hinton/csc321/readings/boltz321.pdf">Boltzmann Machine, by G. Hinton, 2007</a></li>
<li><a href="https://www.crim.ca/perso/patrick.kenny/BMNotes.pdf">Notes on Boltzmann Machine, by Patrick Kenny</a></li>
<li><a href="http://deeplearning.net/tutorial/rbm.html">deeplearning.net documentation</a></li>
<li><a href="https://www.youtube.com/watch?v=2fRnHVVLf1Y&list=PLiPvV5TNogxKKwvKb1RKwkq2hm7ZvpHz0">Hinton’s coursera course</a></li>
<li><a href="https://www.deeplearningbook.org/">Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville</a></li>
</ol>
Ayan Das
Pixelor: A Competitive Sketching AI Agent. So you think you can sketch?
2020-07-30T00:00:00+00:00
2020-07-30T00:00:00+00:00
https://ayandas.me/pubs/2020/07/30/pub-8
<center>
<a target="_blank" class="pubicon" href="https://dl.acm.org/doi/pdf/10.1145/3414685.3417840">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper
</a>
<a target="_blank" class="pubicon" href="https://dl.acm.org/doi/abs/10.1145/3414685.3417840">
<i class="fa fa-files-o fa-3x"></i>Suppl.
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="3">Demo</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="4">Code/Data</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/8.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">We present the first competitive drawing agent Pixelor that exhibits human-level performance at a Pictionary-like sketching game, where the participant whose sketch is recognized first is a winner. Our AI agent can autonomously sketch a given visual concept, and achieve a recognizable rendition as quickly or faster than a human competitor. The key to victory for the agent is to learn the optimal stroke sequencing strategies that generate the most recognizable and distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the optimal stroke order that maximizes early recognizability of human training sketches. Second, this order is used to supervise the training of a sequence-to-sequence stroke generator. Our key technical contributions are a tractable search of the exponential space of orderings using neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an optimal-transport loss to accommodate the multi-modal nature of the optimal stroke distribution. Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game, under both AI and human judging of early recognition. To analyze the impact of human competitors’ strategies, we conducted a further human study with participants being given unlimited thinking time and training in early recognizability by feedback from an AI judge. The study shows that humans do gradually improve their strategies with training, but overall Pixelor still matches human performance.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
<center>
<!-- Visit this to create the oEmbed link -->
<!-- https://iframely.com/domains/speakerdeck -->
<h2>Slides for my SIGGRAPH Asia 2020 talk</h2>
<div style="left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.1972%;">
<iframe src="https://speakerdeck.com/player/159a510c082643ea89a012555fdfcc67" style="top: 0; left: 5%; width: 90%; height: 90%; position: absolute; border: 0;" allowfullscreen="" scrolling="no" allow="encrypted-media">
</iframe>
</div>
<p>PS: Reusing any of these slides would require permission from the author.</p>
</center>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<h2>Full talk (15 mins) at SIGGRAPH Asia 2020</h2>
<iframe width="800" height="450" src="https://www.youtube.com/embed/oSk2x5HuCA8" frameborder="1" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<br />
<h4>Watch a <a href="https://www.youtube.com/watch?v=E_Aclms4g-w" target="_blank">short summary</a> video instead</h4>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="3">
<div class="accordion-item__container">
<h2><a href="http://surrey.ac:9999/">Try out the Demo</a> (screenshot below)</h2>
<figure>
<img width="75%" src="/public/pub_res/8_2.gif" />
</figure>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="4">
<div class="accordion-item__container">
<center>
<a target="_blank" class="pubicon" href="https://github.com/dasayan05/neuralsort-siggraph">
<i class="fa fa-file-code-o fa-3x"></i>NeuralSort repo
</a>
<a target="_blank" class="pubicon" href="https://github.com/AyanKumarBhunia/sketch-transformerMMD">
<i class="fa fa-file-code-o fa-3x"></i>Transformer MMD repo
</a>
<br />
<h2>The "SlowSketch" dataset</h2>
<img border="2px" width="80%" src="/public/pub_res/8_3.png" alt="SlowSketch" />
<a target="_blank" class="pubicon" href="https://drive.google.com/u/0/uc?export=download&confirm=n4LZ&id=1mWEY7vFkOw790DwUtqcTX8fHzNBP_b1J">
<i class="fa fa-database fa-3x"></i>SlowSketch
</a>
</center>
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{pixelor20siga,
author = {Bhunia, Ayan Kumar and Das, Ayan and Muhammad, Umar Riaz and Yang, Yongxin and Hospedales, Timothy M. and Xiang, Tao and Gryaditskaya, Yulia and Song, Yi-Zhe},
title = {Pixelor: A Competitive Sketching AI Agent. so You Think You Can Sketch?},
year = {2020},
publisher = {Association for Computing Machinery},
volume = {39},
number = {6},
journal = {ACM Trans. Graph.},
articleno = {166},
numpages = {15}
}
</code></pre></div></div>Ayan Kumar BhuniaPaper Suppl.rlx: A modular Deep RL library for research2020-06-27T00:00:00+00:002020-06-27T00:00:00+00:00https://ayandas.me/projs/2020/06/27/rlx-deep-rl-library<p><code class="language-plaintext highlighter-rouge">rlx</code> is a Deep RL library written on top of PyTorch & built for <em>educational and research</em> purposes. The majority of libraries/codebases for Deep RL are geared towards reproducing state-of-the-art algorithms on very specific tasks (e.g. Atari games), but <code class="language-plaintext highlighter-rouge">rlx</code> is NOT. It is meant to be more expressive and modular. Rather than treating RL algorithms as black boxes, <code class="language-plaintext highlighter-rouge">rlx</code> adopts an API that exposes more granular operations to users, which makes writing new algorithms easier. It is also useful for adding task-specific engineering to a known algorithm (as we know, RL is very sensitive to small implementation details).</p>
<p><a href="https://github.com/dasayan05/rlx" target="_blank" class="fa fa-github fa-3x" style="float: right;"></a></p>
<p>Concisely, <code class="language-plaintext highlighter-rouge">rlx</code> is supposed to</p>
<ol>
<li>Be generic (i.e., can be adopted for any task at hand)</li>
<li>Have modular lower-level components exposed to users</li>
<li>Be easy to implement new algorithms</li>
</ol>
<p>For the sake of completeness, it also provides a few popular algorithms as baselines (more to be added soon). Here’s a basic implementation of PPO (with clipping) using <code class="language-plaintext highlighter-rouge">rlx</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">base_rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">episode</span><span class="p">(</span><span class="n">horizon</span><span class="p">)</span> <span class="c1"># sample an episode as a 'Rollout' object
# 'rewards' and 'logprobs' for all timesteps
</span><span class="n">base_rewards</span><span class="p">,</span> <span class="n">base_logprobs</span> <span class="o">=</span> <span class="n">base_rollout</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">base_rollout</span><span class="p">.</span><span class="n">logprobs</span>
<span class="n">base_returns</span> <span class="o">=</span> <span class="n">base_rollout</span><span class="p">.</span><span class="n">mc_returns</span><span class="p">()</span> <span class="c1"># Monte-carlo estimates of 'returns'
</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k_epochs</span><span class="p">):</span>
    <span class="c1"># 'evaluate' an episode against a policy and get a new 'Rollout' object
</span>    <span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">evaluate</span><span class="p">(</span><span class="n">base_rollout</span><span class="p">)</span>
    <span class="n">logprobs</span><span class="p">,</span> <span class="n">entropy</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">logprobs</span><span class="p">,</span> <span class="n">rollout</span><span class="p">.</span><span class="n">entropy</span> <span class="c1"># get 'logprobs' and 'entropy' for all timesteps
</span>    <span class="n">values</span><span class="p">,</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">others</span> <span class="c1"># .. also 'value' estimates
</span>
    <span class="n">ratios</span> <span class="o">=</span> <span class="p">(</span><span class="n">logprobs</span> <span class="o">-</span> <span class="n">base_logprobs</span><span class="p">.</span><span class="n">detach</span><span class="p">()).</span><span class="n">exp</span><span class="p">()</span>
    <span class="n">advantage</span> <span class="o">=</span> <span class="n">base_returns</span> <span class="o">-</span> <span class="n">values</span>
    <span class="n">policyloss</span> <span class="o">=</span> <span class="o">-</span> <span class="n">torch</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">ratios</span> <span class="o">*</span> <span class="n">advantage</span><span class="p">.</span><span class="n">detach</span><span class="p">(),</span> <span class="n">torch</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="n">ratios</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">clip</span><span class="p">,</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">clip</span><span class="p">)</span> <span class="o">*</span> <span class="n">advantage</span><span class="p">.</span><span class="n">detach</span><span class="p">())</span>
    <span class="n">valueloss</span> <span class="o">=</span> <span class="n">advantage</span><span class="p">.</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">policyloss</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">valueloss</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">-</span> <span class="n">entropy</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">*</span> <span class="mf">0.01</span>
    <span class="n">agent</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">agent</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
</code></pre></div></div>
<p>This is all you have to write to get PPO running.</p>
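<p>To make the clipping step concrete, here is a minimal plain-Python sketch of the per-timestep clipped surrogate, independent of <code class="language-plaintext highlighter-rouge">rlx</code> and PyTorch. The function names and the default <code class="language-plaintext highlighter-rouge">clip=0.2</code> are assumptions made for illustration only. Note that the minimum is taken over the two products (ratio times advantage), which matters when the advantage is negative.</p>

```python
def clipped_surrogate(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """Per-timestep PPO clipped objective (a quantity to be maximised)."""
    # Restrict the probability ratio to the interval [1 - clip, 1 + clip]
    clipped_ratio = max(1.0 - clip, min(ratio, 1.0 + clip))
    # Take the minimum over the two *products*, so the estimate stays
    # pessimistic for both positive and negative advantages
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_policy_loss(ratios, advantages, clip: float = 0.2) -> float:
    """Summed policy loss, i.e. the negated clipped objective."""
    return -sum(clipped_surrogate(r, a, clip) for r, a in zip(ratios, advantages))
```

<p>With a positive advantage, a ratio above <code class="language-plaintext highlighter-rouge">1 + clip</code> earns no extra credit; with a negative advantage, the unclipped (more negative) product is kept, so a bad action is never under-penalised.</p>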
<h2 id="design-and-api">Design and API</h2>
<p>The user needs to provide a parametric function that defines the computation at <em>each time-step</em> and follows a specific signature (i.e., <code class="language-plaintext highlighter-rouge">rlx.Parametric</code>). <code class="language-plaintext highlighter-rouge">rlx</code> will take care of the rest, e.g., tying them up to form full rollouts and preserving recurrence (it works seamlessly with recurrent policies).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PolicyValueModule</span><span class="p">(</span><span class="n">rlx</span><span class="p">.</span><span class="n">Parametric</span><span class="p">):</span>
    <span class="s">""" Recurrent policy network with state-value (baseline) prediction """</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">states</span><span class="p">):</span>
        <span class="c1"># Recurrent state from the last time-step will come in automatically
</span>        <span class="n">recur_state</span><span class="p">,</span> <span class="n">state</span> <span class="o">=</span> <span class="n">states</span>
        <span class="p">...</span>
        <span class="n">action1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">Normal</span><span class="p">(...)</span>
        <span class="n">action2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(...)</span>
        <span class="n">state_value</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_value_net</span><span class="p">(...)</span>
        <span class="k">return</span> <span class="n">next_recur_state</span><span class="p">,</span> <span class="n">rlx</span><span class="p">.</span><span class="n">ActionDistribution</span><span class="p">(</span><span class="n">action1</span><span class="p">,</span> <span class="n">action2</span><span class="p">,</span> <span class="p">...),</span> <span class="n">state_value</span>
<span class="n">network</span> <span class="o">=</span> <span class="n">PolicyValueModule</span><span class="p">(...)</span>
</code></pre></div></div>
<p>While the <code class="language-plaintext highlighter-rouge">next_recur_state</code> and <code class="language-plaintext highlighter-rouge">state_value</code> are optional (i.e., can be <code class="language-plaintext highlighter-rouge">None</code>), a multi-component action distribution needs to be returned. <code class="language-plaintext highlighter-rouge">rlx</code> will take care of sampling from it and computing log-probabilities. The first two return values are required; the rest are optional. You can return any number of quantities after the first two return values as <em>extras</em> - they will all be tracked.</p>
<hr />
<p>The design is centered around the primary data structure <code class="language-plaintext highlighter-rouge">Rollout</code> which can hold a sequence of experience tuples <code class="language-plaintext highlighter-rouge">(state, action, reward)</code>, action distributions and any arbitrary quantity returned from the <code class="language-plaintext highlighter-rouge">rlx.Parametric.forward()</code>. <code class="language-plaintext highlighter-rouge">Rollout</code> internally keeps track of the computation graph (if necessary/requested). One has to sample a <code class="language-plaintext highlighter-rouge">Rollout</code> instance by running the agent in the environment. The rollout can then provide quantities like log-probs and anything else that was tracked, upon request.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">set_grad_enabled</span><span class="p">(...):</span>
    <span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">network</span><span class="p">).</span><span class="n">episode</span><span class="p">(...,</span> <span class="n">dry</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">rollout</span><span class="p">.</span><span class="n">mc_returns</span><span class="p">()</span> <span class="c1"># populate its 'returns' property to naive Monte-Carlo returns
</span>    <span class="n">logprobs</span><span class="p">,</span> <span class="n">returns</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">logprobs</span><span class="p">,</span> <span class="n">rollout</span><span class="p">.</span><span class="n">returns</span>
    <span class="n">values</span><span class="p">,</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">others</span> <span class="c1"># any 'extra' quantity computed will be available as rollout.others
</span></code></pre></div></div>
<p>We can enable/disable gradients the PyTorch way (i.e., <code class="language-plaintext highlighter-rouge">torch.set_grad_enabled(..)</code> etc.).</p>
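<p>For readers unfamiliar with “naive Monte-Carlo returns”, the following plain-Python sketch shows the usual computation: the return at each time-step is the (discounted) sum of all rewards from that step onwards. This is only an illustration, not the <code class="language-plaintext highlighter-rouge">rlx</code> implementation; the standalone function and its <code class="language-plaintext highlighter-rouge">gamma</code> parameter are assumptions.</p>

```python
def mc_returns(rewards, gamma=0.99):
    """Discounted Monte-Carlo returns G_t = r_t + gamma * G_{t+1} for one episode."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns
```

<p>Setting <code class="language-plaintext highlighter-rouge">gamma=1.0</code> gives the undiscounted sum of future rewards at every step.</p>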
<p>The flag <code class="language-plaintext highlighter-rouge">dry=True</code> means the rollout instance will only hold <code class="language-plaintext highlighter-rouge">(state, action, reward)</code> tuples and nothing else. This design allows the rollouts to be re-evaluated against another policy - as required by some algorithms (like PPO). Such rollouts cannot offer logprobs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 'rollout' is not dry, it has computation graph attached
</span><span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">other_policy</span><span class="p">).</span><span class="n">evaluate</span><span class="p">(</span><span class="n">dry_rollout</span><span class="p">)</span>
</code></pre></div></div>
<p>This API has another benefit. One can sample an episode from a policy in dry-mode, then <code class="language-plaintext highlighter-rouge">.vectorize()</code> it and re-evaluate it against the same policy. This brings computational benefits.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="n">dry_rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">episode</span><span class="p">(...,</span> <span class="n">dry</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">dry_rollout_vec</span> <span class="o">=</span> <span class="n">dry_rollout</span><span class="p">.</span><span class="n">vectorize</span><span class="p">()</span> <span class="c1"># internally creates a batch dimension for efficient processing
</span><span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">evaluate</span><span class="p">(</span><span class="n">dry_rollout_vec</span><span class="p">)</span>
</code></pre></div></div>
<p>If the rollout is not dry and gradients were enabled, one can directly do a backward pass</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">rollout</span><span class="p">.</span><span class="n">logprobs</span> <span class="o">*</span> <span class="n">rollout</span><span class="p">.</span><span class="n">returns</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
<span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
</code></pre></div></div>
<hr />
<p>As you might have noticed, the network is not a part of the agent. In fact, the agent only has a copy of the environment and nothing else. One needs to <em>augment</em> the agent with a network in order for it to sample episode. This design allows us to easily run the agent using a different policy, for example, a “behavior policy” in off-policy RL</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>behaviour_rollout = agent(behavior_policy).episode(...)
behaviour_logprobs = behaviour_rollout.logprobs # record them for computing importance ratio afterwards
</code></pre></div></div>
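<p>The importance ratio mentioned in the comment is typically the exponentiated difference between the log-probabilities of the same actions under the target policy and under the behaviour policy. A minimal plain-Python sketch follows; the function name and the use of plain lists of floats are assumptions for illustration, and it does not depend on <code class="language-plaintext highlighter-rouge">rlx</code>.</p>

```python
import math


def importance_ratios(target_logprobs, behaviour_logprobs):
    """Per-timestep ratios pi_target(a|s) / pi_behaviour(a|s), computed from log-probs."""
    # exp(log p - log q) == p / q, and is numerically safer than dividing probabilities
    return [math.exp(t - b) for t, b in zip(target_logprobs, behaviour_logprobs)]
```

<p>A ratio of 1 means both policies assign the action the same probability; values far from 1 indicate that the recorded experience is unrepresentative of the target policy.</p>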
<hr />
<p><code class="language-plaintext highlighter-rouge">Rollout</code> has a nice API which is useful for writing customized algorithm or implementation tricks. We can</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># shuffle rollouts ..
</span><span class="n">rollout</span><span class="p">.</span><span class="n">shuffle</span><span class="p">()</span>
<span class="c1"># .. index/slice them
</span><span class="n">rollout</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># remove the end-state
</span><span class="n">rollout</span><span class="p">[:</span><span class="mi">100</span><span class="p">]</span> <span class="c1"># recurrent rollouts can be too long (RNNs have long-term memory problems)
</span>
<span class="c1"># .. or even concat them
</span><span class="p">(</span><span class="n">rollout1</span> <span class="o">+</span> <span class="n">rollout2</span><span class="p">).</span><span class="n">vectorize</span><span class="p">()</span>
</code></pre></div></div>
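<p>To give a feel for how such an API can be built, here is a hypothetical, heavily simplified container written in plain Python. <code class="language-plaintext highlighter-rouge">MiniRollout</code> is an invented name and this is not how <code class="language-plaintext highlighter-rouge">rlx</code> implements <code class="language-plaintext highlighter-rouge">Rollout</code>; it only illustrates slicing, concatenation and shuffling over a sequence of <code class="language-plaintext highlighter-rouge">(state, action, reward)</code> tuples.</p>

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

Transition = Tuple[object, object, float]  # (state, action, reward)


@dataclass
class MiniRollout:
    """Hypothetical stand-in for a rollout: an ordered list of transitions."""
    steps: List[Transition] = field(default_factory=list)

    def __len__(self):
        return len(self.steps)

    def __getitem__(self, index):
        # Slices give back a new MiniRollout; integers give a single transition
        if isinstance(index, slice):
            return MiniRollout(self.steps[index])
        return self.steps[index]

    def __add__(self, other):
        # Concatenation keeps the transitions of both rollouts, in order
        return MiniRollout(self.steps + other.steps)

    def shuffle(self, rng=random):
        # In-place shuffle; only sensible for non-recurrent policies
        rng.shuffle(self.steps)
```

<p>Supporting Python's slicing and <code class="language-plaintext highlighter-rouge">+</code> operators is what makes expressions such as <code class="language-plaintext highlighter-rouge">rollout[:-1]</code> or <code class="language-plaintext highlighter-rouge">rollout1 + rollout2</code> read naturally.</p>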
<p>NOTE: I will write more docs if I get time. Follow the algorithm implementations at <code class="language-plaintext highlighter-rouge">rlx/algos/*</code> for more API usage.</p>
<h2 id="installation-and-usage">Installation and usage</h2>
<p>Right now, there is no <code class="language-plaintext highlighter-rouge">pip</code> package; it’s just this repo. You can install it by cloning it and running</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install .
</code></pre></div></div>
<p>For example usage, follow the <code class="language-plaintext highlighter-rouge">main.py</code> script. You can test an algorithm by</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python main.py --algo ppo --policytype rnn --batch_size 16 --max_episode 5000 --horizon 200 --env CartPole-v0 --standardize_return
</code></pre></div></div>
<p>The meaning of batch size is a little different here: it is the number of rollouts over which the gradient is averaged (currently, that’s how it’s done).</p>
<h2 id="experiments">Experiments</h2>
<ul>
<li>Basic environments</li>
</ul>
<p>The “Incomplete”-prefixed environments are examples of POMDPs. Their state representations have been masked to create partial observability, so they can only be solved by recurrent policies.</p>
<center>
<img src="/public/proj_res/4/exp.png" />
</center>
<ul>
<li>A slightly modified (simplified) <code class="language-plaintext highlighter-rouge">SlimeVolleyGym-v0</code> <a href="https://github.com/hardmaru/slimevolleygym">environment by David Ha</a>. An MLP agent trained with PPO learns to play volleyball from self-play experience; the script is provided at <code class="language-plaintext highlighter-rouge">examples/slime.py</code>.</li>
</ul>
<center>
<img width="80%" src="/public/proj_res/4/volley.gif" />
</center>
<hr />
<h2 id="plans">Plans</h2>
<p>Currently <code class="language-plaintext highlighter-rouge">rlx</code> has the following algorithms, but it is <strong>under active development</strong>.</p>
<ol>
<li>Vanilla REINFORCE</li>
<li>REINFORCE with Value-baseline</li>
<li>A2C</li>
<li>PPO with clipping</li>
<li>OffPAC</li>
</ol>
<h4 id="todo">TODO:</h4>
<ol>
<li>More SOTA algorithms (DQN, DDPG, etc.) to be implemented</li>
<li>Create a uniform API/interface to support Q-learning algorithms</li>
<li>Multiprocessing/Parallelization support</li>
</ol>
<h4 id="contributions">Contributions</h4>
<p>You are more than welcome to contribute anything.</p>Ayan DasBézierSketch: A generative model for scalable vector sketches2020-05-22T00:00:00+00:002020-05-22T00:00:00+00:00https://ayandas.me/pubs/2020/05/22/pub-7<center>
<a target="_blank" class="pubicon" href="https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123710630.pdf">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper
</a>
<a target="_blank" class="pubicon" href="https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123710630-supp.pdf">
<i class="fa fa-files-o fa-3x"></i>Suppl.
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="3">Code/Data</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/7.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">The study of neural generative models of human sketches is a fascinating contemporary modeling problem due to the links between sketch image generation and the human drawing process. The landmark SketchRNN provided breakthrough by sequentially generating sketches as a sequence of waypoints. However this leads to low-resolution image generation, and failure to model long sketches. In this paper we present BézierSketch, a novel generative model for fully vector sketches that are automatically scalable and high-resolution. To this end, we first introduce a novel inverse graphics approach to stroke embedding that trains an encoder to embed each stroke to its best fit Bézier curve. This enables us to treat sketches as short sequences of parameterized strokes and thus train a recurrent sketch generator with greater capacity for longer sketches, while producing scalable high-resolution results. We report qualitative and quantitative results on the Quick, Draw! benchmark.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
<center>
<!-- Visit this to create the oEmbed link -->
<!-- https://iframely.com/domains/speakerdeck -->
<h2>Slides for my ECCV '20 talk</h2>
<div style="left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.1972%;">
<iframe src="https://speakerdeck.com/player/b27372d9cf7f4f5ebb9a90bb2469b36f" style="top: 0; left: 5%; width: 90%; height: 90%; position: absolute; border: 0;" allowfullscreen="" scrolling="no" allow="encrypted-media">
</iframe>
</div>
<p>PS: Reusing any of these slides would require permission from the author.</p>
</center>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<h2>Full talk at ECCV 2020</h2>
<iframe width="800" height="450" src="https://www.youtube-nocookie.com/embed/g2zzaLr2VfQ" frameborder="1" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="3">
<div class="accordion-item__container">
<center>
<a target="_blank" class="pubicon" href="https://github.com/dasayan05/stroke-ae">
<i class="fa fa-file-code-o fa-3x"></i>Github repo
</a>
</center>
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@InProceedings{das2020bziersketch,
title = {BézierSketch: A generative model for scalable vector sketches},
author = {Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
booktitle = {The European Conference on Computer Vision (ECCV)},
year = {2020}
}
</code></pre></div></div>Ayan DasPaper Suppl.Introduction to Probabilistic Programming2020-05-05T00:00:00+00:002020-05-05T00:00:00+00:00https://ayandas.me/blog-tut/2020/05/05/probabilistic-programming<p>Welcome to another tutorial about probabilistic models, after <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">a primer on PGMs</a> and <a href="https://ayandas.me/blog-tut/2020/01/01/variational-autoencoder.html">VAE</a>. However, I am particularly excited to discuss a topic that doesn’t get as much attention as traditional Deep Learning does. The idea of <strong>Probabilistic Programming</strong> has long been there in the ML literature and got enriched over time. Before it creates confusion, let’s declutter it right now - it’s not really writing traditional “programs”, rather it’s building <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">Probabilistic Graphical Models</a> (PGMs), but <em>equipped with imperative programming style</em> (i.e., iterations, branching, recursion etc). Just like Automatic Differentiation allowed us to compute derivative of arbitrary computation graphs (in PyTorch, TensorFlow), Black-box methods have been developed to “solve” probabilistic programs. In this post, I will provide a generic view on why such a language is indeed possible and how such black-box solvers are materialized. At the end, I will also introduce you to one such <em>Universal</em> Probabilistic Programming Language, <a href="http://pyro.ai/">Pyro</a>, that came out of <a href="https://www.uber.com/us/en/uberai/">Uber’s AI lab</a> and started gaining popularity.</p>
<h1 id="overview">Overview</h1>
<p>Before I dive into details, let’s get the bigger picture clear. It is highly advisable to read any good reference about PGMs before you proceed - my <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">previous article</a> for example.</p>
<h3 id="generative-view--execution-trace">Generative view & Execution trace</h3>
<p>Probabilistic Programming is NOT really what we usually think of as <em>programming</em> - i.e., completely deterministic execution of hard-coded instructions which does exactly what it’s told and nothing more.
Rather, it is about building PGMs (must read <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">this</a>) which model our belief about the data generation process. We, as users of such a language, express a model in an imperative form which encodes all our uncertainties in the way we want. Here is a (toy) example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model</span><span class="p">(</span><span class="n">theta</span><span class="p">):</span>
    <span class="n">A</span> <span class="o">=</span> <span class="n">Bernoulli</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">];</span> <span class="n">theta</span><span class="p">)</span>
    <span class="n">P</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">A</span>
    <span class="k">if</span> <span class="n">A</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
        <span class="n">B</span> <span class="o">=</span> <span class="n">Uniform</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">B</span> <span class="o">=</span> <span class="n">Uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">P</span><span class="p">)</span>
    <span class="n">C</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">A</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span>
</code></pre></div></div>
<p>If you assume this to be a valid program (for now), this is what we are talking about here - all our traditional “variables” become “random variables” (RVs) and have uncertainty associated with them in the form of probability distributions. Just to give you a taste of its flexibility, here are the constituent elements we encountered</p>
<ol>
<li>Various distributions are available (e.g., Normal, Bernoulli, Uniform etc.)</li>
<li>We can do deterministic computation (i.e., \(P = 2 * A\))</li>
<li>Condition RVs on other RVs (i.e., \(C\vert B \sim \mathcal{N}(B, 1)\))</li>
<li>Imperative style branching allows dynamic structure of the model …</li>
</ol>
<p>Below is a graphical representation of the model defined by the above program.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/16/example_exectrace.png" />
</figure>
</center>
<p>Just as running a traditional program produces its output, this (probabilistic) program can be executed by means of “ancestral sampling”. I ran the program 5 times and each time I got samples from all my RVs. Each such “forward” run is often called an <em>execution trace</em> of the model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="p">(</span><span class="mf">0.5</span><span class="p">))</span>
<span class="p">(</span><span class="mf">1.000</span><span class="p">,</span> <span class="mf">2.000</span><span class="p">,</span> <span class="mf">0.318</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.069</span><span class="p">)</span>
<span class="p">(</span><span class="o">-</span><span class="mf">1.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.156</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.822</span><span class="p">)</span>
<span class="p">(</span><span class="mf">1.000</span><span class="p">,</span> <span class="mf">2.000</span><span class="p">,</span> <span class="mf">0.594</span><span class="p">,</span> <span class="mf">0.865</span><span class="p">)</span>
<span class="p">(</span><span class="mf">1.000</span><span class="p">,</span> <span class="mf">2.000</span><span class="p">,</span> <span class="mf">1.100</span><span class="p">,</span> <span class="mf">1.079</span><span class="p">)</span>
<span class="p">(</span><span class="o">-</span><span class="mf">1.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.262</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.403</span><span class="p">)</span>
</code></pre></div></div>
<p>This is the so-called “generative view” of a model. We typically use the leaf-nodes of PGMs as our observed data. The rest of the graph forms the “latent factors” of the model, which we either know or want to estimate. In general, a practical PGM can often be encapsulated as a set of latent nodes \(\mathbf{Z} \triangleq \{ Z_1, Z_2, \cdots, Z_H \}\) and visible nodes \(\mathbf{X} \triangleq \{ X_1, X_2, \cdots, X_V \}\) related probabilistically as
<br />
\[
\mathbf{Z} \rightarrow \mathbf{X}
\]</p>
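<p>The toy program above is pseudocode, but ancestral sampling itself is easy to sketch in plain Python. Below is a minimal version using only the standard library. Reading <code class="language-plaintext highlighter-rouge">Bernoulli([-1, 1]; theta)</code> as “return +1 with probability theta, otherwise -1” is my assumption, and so is the extra <code class="language-plaintext highlighter-rouge">rng</code> argument.</p>

```python
import random

def model(theta, rng):
    """One ancestral-sampling pass (one execution trace) of the toy model."""
    # Assumed reading of the pseudocode: A = +1 with probability theta, else -1
    A = 1 if rng.random() < theta else -1
    P = 2 * A                      # deterministic computation
    if A == -1:                    # branching changes which distribution B comes from
        B = rng.uniform(P, 0)
    else:
        B = rng.uniform(0, P)
    C = rng.gauss(B, 1.0)          # C | B ~ Normal(B, 1)
    return A, P, B, C

rng = random.Random(0)
traces = [model(0.5, rng) for _ in range(5)]
for trace in traces:
    print(trace)
```

<p>Every call visits the nodes in parent-before-child order, which is all that “ancestral sampling” means here.</p>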
<h3 id="training-and-inference">Training and Inference</h3>
<p>From now on, we’ll use the general notation rather than the specific example. The model may be parametric. For example, we had the Bernoulli success probability \(\theta\) in our toy example. The full joint probability is given as</p>
<p>\[
\mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) = \mathbb{P}_{\theta}(\mathbf{Z}) \cdot \mathbb{P}_{\theta}(\mathbf{X}\vert \mathbf{Z})
\]</p>
<p>We would like to do two things:</p>
<ol>
<li>Estimate model parameters \(\theta\) from data</li>
<li>Compute the posterior, i.e., infer latent variables given data</li>
</ol>
<p>As discussed in my <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">PGM article</a>, both of them are infeasible due to the fact that</p>
<ol>
<li>Log-likelihood maximization is not possible because of the presence of latent variables, which would have to be marginalized out</li>
<li>For continuous distributions on latent variables, the posterior is intractable</li>
</ol>
<p>The way forward is to use <em>Variational Inference</em> and maximize our very familiar <strong>E</strong>vidence <strong>L</strong>ower <strong>BO</strong>und (ELBO) objective to estimate the model parameters and also a set of variational parameters which help build a proxy for the original posterior \(\mathbb{P}_{\theta}(\mathbf{Z}\vert \mathbf{X})\). Mathematically, we choose a known and tractable family of distributions \(\mathbb{Q}_{\phi}(\mathbf{Z})\) (parameterized by variational parameters \(\phi\)) to approximate the posterior. The learning process is facilitated by maximizing the following</p>
<p>\[
\mathrm{ELBO}(\theta, \phi) \triangleq \mathbb{E}_{\mathbb{Q}_{\phi}} \bigl[\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z}) \bigr]
\]</p>
<p>by estimating gradients w.r.t all its parameters</p>
<p>\[\tag{1}
\nabla_{[\theta, \phi]} \mathrm{ELBO}(\theta, \phi)
\]</p>
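<p>To see why the ELBO is a sensible objective, here is a tiny sketch with a single binary latent variable, where everything can be summed exactly. The joint table is a made-up number pair (hypothetical, not taken from any model in this post). The ELBO never exceeds the log-evidence and touches it exactly when \(\mathbb{Q}\) equals the true posterior.</p>

```python
import math

# Hypothetical joint P(Z=z, X=x_obs) for a binary latent Z at one fixed observation
joint = {0: 0.10, 1: 0.30}

def elbo(q1):
    """ELBO for Q(Z=1) = q1, computed exactly by summing over Z."""
    q = {0: 1.0 - q1, 1: q1}
    return sum(q[z] * (math.log(joint[z]) - math.log(q[z]))
               for z in (0, 1) if q[z] > 0)

log_evidence = math.log(sum(joint.values()))      # log P(X = x_obs)
posterior_q1 = joint[1] / sum(joint.values())     # exact P(Z=1 | X = x_obs)

for q1 in (0.1, 0.5, posterior_q1, 0.9):
    print(f"q1={q1:.2f}  ELBO={elbo(q1):.4f}  log-evidence={log_evidence:.4f}")
```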
<h1 id="black-box-variational-inference">Black-Box Variational Inference</h1>
<p><br />
If you have gone through my <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">PGM article</a>, you might think you’ve seen these before. Actually, you’re right ! There is really nothing new to this. What we really need for establishing a Probabilistic Programming framework is <strong>a unified way to implement the ELBO optimization for ANY given problem</strong>. And by “problem” I mean the following:</p>
<ol>
<li>A model specification \(\mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X})\) written in a probabilistic language (like we saw before)</li>
<li>An optional (parameterized) “Variational Model” \(\mathbb{Q}_{\phi}(\mathbf{Z})\), famously known as a “Guide”</li>
<li>And .. the observed data \(\mathcal{D}\), of course</li>
</ol>
<!-- Very importantly, we CAN NOT make any *assumptions* about the inner structure of either the "model" or the "guide". This motivated the research on a "Black-box" method for solving such probabilistic programs. Please realize that this is exactly how "traditional compilers" (like C, Python) are built - they make no assumption about the symantic meaning/structure of your program .. they just check for syntactic validity. -->
<p>But, how do we compute (1) ? The apparent problem is that the gradient w.r.t. \(\phi\) is required but \(\phi\) appears in the distribution the expectation is taken over. To mitigate this, we make use of the famous “log-derivative” trick (it goes by other names too, e.g. the score-function estimator or REINFORCE). For notational simplicity let’s denote \(f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \triangleq \log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z})\) and continue from (1)</p>
\[\sum_{\mathbf{Z}} \nabla_{[\theta, \phi]} \bigg[ \mathbb{Q}_{\phi}(\mathbf{Z}) \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]
=\sum_{\mathbf{Z}} \bigg[ \nabla_{\phi} \mathbb{Q}_{\phi}(\mathbf{Z}) \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi)
+\mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \nabla_{[\theta, \phi]}f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]\]
\[=\sum_{\mathbf{Z}} \bigg[ \color{red}{\mathbb{Q}_{\phi}(\mathbf{Z})} \cdot \frac{\nabla_{\phi} \mathbb{Q}_{\phi}(\mathbf{Z})}{\color{red}{\mathbb{Q}_{\phi}(\mathbf{Z})}} \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi)
+\mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \nabla_{[\theta, \phi]}f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]\]
\[=\sum_{\mathbf{Z}} \mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \bigg[ \color{red}{\nabla_{\phi} \log\mathbb{Q}_{\phi}(\mathbf{Z})} \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi)
+\nabla_{[\theta, \phi]}f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]\]
\[\tag{2}
= \mathbb{E}_{\mathbb{Q}_{\phi}} \bigg[ \nabla_{[\theta, \phi]} \bigg( \underbrace{\log\mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \overline{f(\mathbf{Z}, \mathbf{X}; \theta, \phi)}
+f(\mathbf{Z}, \mathbf{X}; \theta, \phi)}_\text{Surrogate Objective} \bigg) \bigg]\]
<p>Eq. (2) shows that the trick helped the \(\nabla_{[\theta, \phi]}\) to penetrate the \(\mathbb{E}[\cdot]\), but in the process, it replaced the original \(f\) with a “<a href="https://arxiv.org/abs/1506.05254">surrogate</a> function” \(f_{surr} \triangleq \overline{f}\cdot\log\mathbb{Q}+f\) where the <em>bar</em> protects a quantity from differentiation. Equation (2) is all we need - it provides an insight on how to make the gradient estimation practical. In fact, it can be proven theoretically that a sample-based estimate of this expectation is an unbiased estimate of the true gradient in Equation (1).</p>
<p>Succinctly, we run the Guide \(L\) times to record a set of \(L\) execution-traces (i.e., samples \(\mathbf{\widehat{Z}}\sim\mathbb{Q}_{\phi}\)) and compute the following Monte-Carlo approximation to Equation (2)</p>
<p>\[\tag{3}
\nabla_{[\theta, \phi]} \mathrm{ELBO}(\theta, \phi) \approx \frac{1}{L} \sum_{\mathbf{\widehat{Z}}\sim\mathbb{Q}_{\phi}} \left[ \nabla_{[\theta, \phi]} f_{surr}(\mathbf{\widehat{Z}}, \mathcal{D}) \right]_{\theta=\theta_{old}, \phi=\phi_{old}}
\]</p>
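<p>Here is a minimal sketch of the estimator in Equation (3) for the simplest possible guide, a single Bernoulli latent with \(\mathbb{Q}_{\phi}(Z=1) = \mathrm{sigmoid}(\phi)\). The log-joint table is hypothetical, and the derivatives are written out by hand (no autodiff) so the two terms of the surrogate are visible. Since the latent is binary, the exact gradient is also available for comparison.</p>

```python
import math
import random

log_joint = {0: math.log(0.2), 1: math.log(0.6)}   # hypothetical log P(Z=z, X=x_obs)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def exact_grad(phi):
    """d ELBO / d phi by explicit summation over Z in {0, 1}."""
    q = sigmoid(phi)
    f1 = log_joint[1] - math.log(q)
    f0 = log_joint[0] - math.log(1.0 - q)
    return q * (1.0 - q) * (f1 - f0)

def score_function_grad(phi, L, rng):
    """Monte-Carlo estimate in the style of Eq. (3), using L samples from the guide."""
    q = sigmoid(phi)
    total = 0.0
    for _ in range(L):
        z = 1 if rng.random() < q else 0
        log_q = math.log(q) if z == 1 else math.log(1.0 - q)
        f = log_joint[z] - log_q                 # f at the sample (the "barred" copy)
        score = (1.0 - q) if z == 1 else -q      # d log Q(z) / d phi
        grad_f = -score                          # d f / d phi (only -log Q depends on phi)
        total += score * f + grad_f
    return total / L

rng = random.Random(0)
phi = 0.3
print("exact   :", exact_grad(phi))
print("estimate:", score_function_grad(phi, 100_000, rng))
```

<p>The two numbers agree up to Monte-Carlo noise, which shrinks as \(L\) grows.</p>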
<p>The nice thing about Equation (2) (or equivalently Equation (3)) is that we got the differentiation operator right on top of a deterministic function (i.e., \(f_{surr}\)). It means we can construct \(f_{surr}\) as a computation graph and take advantage of modern-day automatic differentiation engines. Here’s how the computation graph and the graphical model are linked</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/16/gm_cg.png" />
</figure>
</center>
<p>Last but not least, let’s look at the function \(f_{surr}\) which is basically built on the log-density terms \(\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X})\) and \(\log \mathbb{Q}_{\phi}(\mathbf{Z})\). We need a way to compute them flexibly. Please remember that the model and guide are written in a <em>language</em> and hence we have access to their graph-structure. A clever software implementation can harness this structure to estimate the log-densities (and eventually \(f_{surr}\)).</p>
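<p>To make “harnessing the structure” a bit more concrete, here is a hand-written sketch of what such an implementation computes for the toy model from the overview: it walks over the random sites of one execution trace and adds up their log-densities. The function name <code class="language-plaintext highlighter-rouge">trace_log_prob</code> is hypothetical, and a real framework would of course derive this automatically instead of hard-coding it.</p>

```python
import math

def trace_log_prob(theta, A, B, C):
    """log P(A, B, C) for one execution trace of the toy model (P = 2*A is deterministic)."""
    # Site A: +1 with probability theta, -1 otherwise (assumed reading of the pseudocode)
    log_p = math.log(theta) if A == 1 else math.log(1.0 - theta)
    # Site B: uniform on an interval of length 2 whose location depends on A
    low, high = (0.0, 2.0) if A == 1 else (-2.0, 0.0)
    if not (low <= B <= high):
        return -math.inf          # this trace is impossible under the model
    log_p += -math.log(high - low)
    # Site C: Normal(B, 1)
    log_p += -0.5 * math.log(2.0 * math.pi) - 0.5 * (C - B) ** 2
    return log_p

print(trace_log_prob(0.5, 1, 0.318, -0.069))
```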
<p>I claimed before that the gradient estimates are unbiased. However, such a generic way of computing the gradient introduces high variance in the estimate and makes things unstable for complex models. There are a few tricks used widely to get around this. But please note that such tricks always exploit model-specific structure. Three such tricks are presented below.</p>
<h3 id="i-re-parameterization">I. Re-parameterization</h3>
<p>We might get lucky that \(\mathbb{Q}_{\phi}(\mathbf{Z})\) is <a href="https://arxiv.org/abs/1312.6114">re-parameterizable</a>. What that means is the expectation w.r.t \(\mathbb{Q}_{\phi}(\mathbf{Z})\) can be made free of its parameters and by doing so the gradient operator can be pushed inside without going through the log-derivative trick.
So, let’s step back a bit and consider the original ELBO gradient in (1). If \(\mathbb{Q}_{\phi}\) is re-parameterizable, a sample can be written as \(\mathbf{Z} = G_{\phi}(\epsilon)\), a deterministic transformation of parameter-free noise \(\epsilon \sim Q(\epsilon)\), and the following can be done
\[
\nabla_{[\theta, \phi]} \mathbb{E}_{\mathbb{Q}_{\phi}} \bigg[\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z}) \bigg] = \nabla_{[\theta, \phi]} \mathbb{E}_{Q(\mathbf{\epsilon})} \bigg[\log \mathbb{P}_{\theta}(G_{\phi}(\epsilon), \mathbf{X}) - \log \mathbb{Q}_{\phi}(G_{\phi}(\epsilon)) \bigg]
\]
\[
= \mathbb{E}_{Q(\mathbf{\epsilon})} \bigg[\nabla_{[\theta, \phi]} \bigg( \log \mathbb{P}_{\theta}(G_{\phi}(\mathbf{\epsilon}), \mathbf{X}) - \log \mathbb{Q}_{\phi}(G_{\phi}(\epsilon)) \bigg) \bigg]
\]</p>
<p>where \(Q(\mathbf{\epsilon})\) is an independent source of randomness. Computing this expectation with an empirical average (just like Eq. (3)) typically gives us a better (variance reduced) estimate of the true gradient of the ELBO.</p>
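<p>The variance difference is easy to see on a toy expectation. The sketch below is a simplified stand-in, not the full ELBO: it estimates \(\nabla_{\mu} \mathbb{E}_{z\sim\mathcal{N}(\mu, \sigma)}[z^2] = 2\mu\) once with the re-parameterization \(z = \mu + \sigma\epsilon\) and once with the log-derivative trick. The function \(z^2\) and the parameter values are my own choices for illustration.</p>

```python
import random
import statistics

def grad_samples(mu, sigma, n, rng):
    """Per-sample estimates of d/d mu E_{z ~ N(mu, sigma)}[z**2] (true value: 2*mu)."""
    reparam, score = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)          # parameter-free noise
        z = mu + sigma * eps               # z = G_phi(eps)
        reparam.append(2.0 * z)            # d(z**2)/d mu through z = mu + sigma*eps
        score.append(z ** 2 * (z - mu) / sigma ** 2)   # f(z) * d log Q(z) / d mu
    return reparam, score

rng = random.Random(0)
reparam, score = grad_samples(mu=1.5, sigma=1.0, n=50_000, rng=rng)
print("true gradient    :", 2 * 1.5)
print("re-parameterized :", statistics.mean(reparam), "variance", statistics.variance(reparam))
print("log-derivative   :", statistics.mean(score), "variance", statistics.variance(score))
```

<p>Both averages land near the true gradient, but the per-sample variance of the re-parameterized estimator is much smaller in this example.</p>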
<h3 id="ii-rao-blackwellization">II. Rao-Blackwellization</h3>
<p>This is another well-known variance reduction technique. The full treatment is mathematically involved, so I will only sketch the idea. It requires the full variational distribution to have some kind of factorization. A specific case is when we have the mean-field assumption, i.e.</p>
<p>\[
\mathbb{Q}_{\phi}(\mathbf{Z}) = \prod_i Q_{\phi_i}(Z_i)
\]</p>
<p>With a little effort, we can pull out the gradient estimator for each of these \(\phi_i\) parameters from (2). They look something like this</p>
\[\nabla_{\phi_i} \mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{\mathbb{Q}_{\phi}} \bigg[ \nabla_{\phi_i} \log\mathbb{Q}_{\phi_i}(Z_i) \cdot \bigg( \overline{\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z})} \bigg)
+\cdots \bigg]\]
<p>The quantity under the bar still has all the factors because it is immune to the gradient operator. Also, because the expectation is outside the gradient operator, it contains all factors. At this point, Rao-Blackwellization offers a variance-reduced estimate of the above gradient, i.e.,</p>
\[\nabla_{\phi_i} \mathrm{ELBO}(\theta, \phi) \approx \mathbb{E}_{\mathbb{Q}_{\phi}^{(i)}} \bigg[ \nabla_{\phi_i} \log\mathbb{Q}_{\phi_i}(Z_i) \cdot \bigg( \overline{\log \mathbb{P}^{(i)}_{\theta}(\mathbf{Z}^{(i)}, \mathbf{X}) - \log \mathbb{Q}_{\phi_i}(Z_i)} \bigg)
+\cdots \bigg]\]
<p>where \(\mathbf{Z}^{(i)}\) is the set of variables that form the “Markov blanket” of \(Z_i\) w.r.t. the structure of the guide, \(\mathbb{Q}_{\phi}^{(i)}\) is the part of the variational distribution that depends on \(\mathbf{Z}^{(i)}\) and \(\mathbb{P}^{(i)}_{\theta}(\mathbf{Z}^{(i)}, \cdot)\) denotes the factors of the model that involve \(\mathbf{Z}^{(i)}\).</p>
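<p>A stripped-down illustration of the effect, under assumptions of my own: two independent binary latents with a mean-field guide, and a bracketed quantity that splits as \(f_1(Z_1) + f_2(Z_2)\) with made-up values. Dropping the part that does not involve \(Z_1\) leaves the expected gradient w.r.t. \(\phi_1\) unchanged but lowers the variance. Everything is computed exactly by enumerating the four states, so there is no sampling noise in the comparison.</p>

```python
from itertools import product

q1, q2 = 0.3, 0.6                      # Q(Z1=1), Q(Z2=1); mean-field guide
f1 = {0: -1.0, 1: 0.5}                 # hypothetical term depending on Z1 only
f2 = {0: -4.0, 1: 3.0}                 # hypothetical term depending on Z2 only

def prob(z, q):
    return q if z == 1 else 1.0 - q

def score1(z1):
    """d log Q(z1) / d phi_1 when Q(Z1=1) = sigmoid(phi_1)."""
    return (1.0 - q1) if z1 == 1 else -q1

def moments(estimator):
    """Exact mean and variance of a per-sample estimator under Q, by enumeration."""
    mean = second = 0.0
    for z1, z2 in product((0, 1), repeat=2):
        w = prob(z1, q1) * prob(z2, q2)
        g = estimator(z1, z2)
        mean += w * g
        second += w * g * g
    return mean, second - mean ** 2

full = moments(lambda z1, z2: score1(z1) * (f1[z1] + f2[z2]))   # keeps every factor
rb = moments(lambda z1, z2: score1(z1) * f1[z1])                # drops factors without Z1

print("full estimator : mean=%.4f var=%.4f" % full)
print("Rao-Blackwell  : mean=%.4f var=%.4f" % rb)
```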
<h3 id="iii-explicit-enumeration-for-discrete-rvs">III. Explicit enumeration for Discrete RVs</h3>
<p>While exploiting the graph structure of the guide to simplify (1), we might end up with a term like this due to factorization in the guide density</p>
<p>\[
\mathbb{E}_{Z_i\sim\mathbb{Q}_{\phi_i}(Z_i)} \bigl[ f(\cdot) \bigr]
\]</p>
<p>If it happens that the variable \(Z_i\) is discrete with the size of its state space reasonably small (e.g., a \(d=5\) dimensional binary RV has \(2^5 = 32\) states), we can replace sampling-based empirical expectations with the true expectation, which requires evaluating a sum over its entire state-space</p>
<p>\[
\sum_{Z_i} \mathbb{Q}_{\phi_i}(Z_i)\cdot f(\cdot)
\]</p>
<p>So make sure the state-space is reasonable in size. This helps reduce the variance quite a bit.</p>
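<p>As a small sketch of the enumeration idea, take the \(d=5\) binary example. The factorized probabilities and the function \(f\) below are hypothetical placeholders. The expectation is computed by visiting all 32 states once, so the result carries no sampling noise at all.</p>

```python
from itertools import product

q = [0.1, 0.3, 0.5, 0.7, 0.9]          # hypothetical Q(Z_j = 1) for a d=5 binary vector

def f(z):
    return sum(z) ** 2                  # stand-in for any function of the discrete variable

def prob(z):
    """Q(z) for a factorized Bernoulli distribution."""
    p = 1.0
    for zj, qj in zip(z, q):
        p *= qj if zj == 1 else 1.0 - qj
    return p

states = list(product((0, 1), repeat=len(q)))        # all 2**5 = 32 states
exact = sum(prob(z) * f(z) for z in states)          # sum_z Q(z) f(z)

print(len(states), "states, exact expectation =", exact)
```

<p>The cost grows as \(2^d\), which is why this only pays off for small state-spaces.</p>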
<p>Whew ! That’s a lot of maths. But the good thing is, you hardly ever have to think about them in detail because software engineers have put tremendous effort into making these algorithms as easily accessible as possible via libraries. We are going to have a brief look at one of them.</p>
<h1 id="pyro-universal-probabilistic-programming"><code class="language-plaintext highlighter-rouge">Pyro</code>: Universal Probabilistic Programming</h1>
<p><a href="http://pyro.ai/">Pyro</a> is a probabilistic programming framework that allows users to write flexible models in terms of a simple API. Pyro is written in Python and uses the popular PyTorch library for its internal representation of the computation graph and as its automatic differentiation engine. Pyro is quite expressive because it allows the model/guide to have fully imperative flow. Its core API consists of these functionalities</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">pyro.param()</code> for defining learnable parameters.</li>
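<li><code class="language-plaintext highlighter-rouge">pyro.distributions</code> (conventionally imported as <code class="language-plaintext highlighter-rouge">dist</code>) contains a large collection of probability distributions.</li>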
<li><code class="language-plaintext highlighter-rouge">pyro.sample()</code> for sampling from a given distribution.</li>
</ol>
<p>Let’s take a concrete example and work it out.</p>
<h4 id="problem-mixture-of-gaussian">Problem: Mixture of Gaussian</h4>
<p>MoG (Mixture of Gaussians) is a relatively simple but widely studied probabilistic model. It has an important application in soft-clustering. For the sake of simplicity we assume we only have two mixture components. The generative view of the model is basically this: we flip a coin (latent) with bias \(\rho\) and depending on the outcome \(C\in \{ 0, 1 \}\) we sample data from either of the two Gaussians \(\mathcal{N}(\mu_0, \sigma_0)\) and \(\mathcal{N}(\mu_1, \sigma_1)\)</p>
\[C_i \sim Bernoulli(\rho) \\
X_i \sim \mathcal{N}(\mu_{C_i}, \sigma_{C_i})\]
<p>where \(i = 1 \cdots N\) is the data index and \(\theta \triangleq \{ \rho, \mu_0, \sigma_0, \mu_1, \sigma_1 \}\) is the set of model parameters we need to learn. This is how you write that in Pyro:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model</span><span class="p">(</span><span class="n">data</span><span class="p">):</span> <span class="c1"># Take the observation
</span>    <span class="c1"># Define coin bias as parameter. That's what 'pyro.param' does
</span>    <span class="n">rho</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"rho"</span><span class="p">,</span> <span class="c1"># Give it a name for Pyro to track properly
</span>                     <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">0.5</span><span class="p">]),</span> <span class="c1"># Initial value
</span>                     <span class="n">constraint</span><span class="o">=</span><span class="n">dist</span><span class="p">.</span><span class="n">constraints</span><span class="p">.</span><span class="n">unit_interval</span><span class="p">)</span> <span class="c1"># Has to be in [0, 1]
</span>    <span class="c1"># Define both means and stds with fixed initial values
</span>    <span class="n">means</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"M"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.5</span><span class="p">,</span> <span class="mf">3.</span><span class="p">]))</span>
    <span class="n">stds</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"S"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">]),</span>
                      <span class="n">constraint</span><span class="o">=</span><span class="n">dist</span><span class="p">.</span><span class="n">constraints</span><span class="p">.</span><span class="n">positive</span><span class="p">)</span> <span class="c1"># std deviation must be positive
</span>
    <span class="k">with</span> <span class="n">pyro</span><span class="p">.</span><span class="n">plate</span><span class="p">(</span><span class="s">"data"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span> <span class="c1"># Mark conditional independence
</span>        <span class="c1"># construct a Bernoulli and sample from it.
</span>        <span class="n">c</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">dist</span><span class="p">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="n">rho</span><span class="p">))</span> <span class="c1"># c \in {0, 1}
</span>        <span class="n">c</span> <span class="o">=</span> <span class="n">c</span><span class="p">.</span><span class="nb">type</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">)</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">],</span> <span class="n">stds</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="c1"># pick a mean as per 'c'
</span>        <span class="n">pyro</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="s">"x"</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">obs</span><span class="o">=</span><span class="n">data</span><span class="p">)</span> <span class="c1"># sample data (also mark it as observed)
</span></code></pre></div></div>
<p>Due to the discrete and low dimensional nature of the latent variable \(C\), this problem is in general tractable in terms of computing the posterior. But let’s assume it is not. The true posterior \(\mathbb{P}(C_i\vert X_i)\) is the quantity known as “assignment” that reveals the latent factor, i.e., what was the coin toss result when a given \(X_i\) was sampled. We define a guide on \(C\), parameterized by variational parameters \(\phi \triangleq \{ \lambda_i \}_{i=1}^N\)</p>
\[C_i \sim Bernoulli(\lambda_i)\]
<p>In Pyro, we define a guide that encodes this</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">guide</span><span class="p">(</span><span class="n">data</span><span class="p">):</span> <span class="c1"># Guide doesn't require data; just need the value of N
</span>    <span class="k">with</span> <span class="n">pyro</span><span class="p">.</span><span class="n">plate</span><span class="p">(</span><span class="s">"data"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span> <span class="c1"># conditional independence
</span>        <span class="c1"># Define variational parameters \lambda_i (one for every data point)
</span>        <span class="n">lam</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"lam"</span><span class="p">,</span>
                         <span class="n">torch</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)),</span> <span class="c1"># randomly initialized
</span>                         <span class="n">constraint</span><span class="o">=</span><span class="n">dist</span><span class="p">.</span><span class="n">constraints</span><span class="p">.</span><span class="n">unit_interval</span><span class="p">)</span> <span class="c1"># \in [0, 1]
</span>        <span class="n">c</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="c1"># Careful, this name HAS TO match the one in the model
</span>                        <span class="n">dist</span><span class="p">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="n">lam</span><span class="p">))</span>
</code></pre></div></div>
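<p>For reference, the exact posterior that \(\lambda_i\) is trying to approximate can be written down with Bayes’ rule for this two-component mixture. The sketch below uses plain Python rather than Pyro, and the parameter values are hypothetical (chosen only to give well-separated components), so treat it as a sanity check and not as part of the Pyro workflow.</p>

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_c1(x, rho, mu0, sigma0, mu1, sigma1):
    """Exact P(C=1 | X=x) for the two-component mixture, by Bayes' rule."""
    p1 = rho * normal_pdf(x, mu1, sigma1)            # P(C=1) * P(x | C=1)
    p0 = (1.0 - rho) * normal_pdf(x, mu0, sigma0)    # P(C=0) * P(x | C=0)
    return p1 / (p0 + p1)

# Hypothetical parameter values, chosen only for illustration
for x in (-1.0, 0.5, 2.0):
    print(x, posterior_c1(x, rho=0.5, mu0=-1.0, sigma0=0.5, mu1=2.0, sigma1=0.5))
```

<p>After training, each learned \(\lambda_i\) should sit close to this value evaluated at the learned model parameters (possibly with the two component labels swapped).</p>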
<p>We generate some synthetic data from the following simulator to train our model on.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">getdata</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">mean1</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span> <span class="n">mean2</span><span class="o">=-</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">std1</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">std2</span><span class="o">=</span><span class="mf">0.5</span><span class="p">):</span>
    <span class="n">D1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">N</span><span class="o">//</span><span class="mi">2</span><span class="p">,)</span> <span class="o">*</span> <span class="n">std1</span> <span class="o">+</span> <span class="n">mean1</span>
    <span class="n">D2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">N</span><span class="o">//</span><span class="mi">2</span><span class="p">,)</span> <span class="o">*</span> <span class="n">std2</span> <span class="o">+</span> <span class="n">mean2</span>
    <span class="n">D</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">D1</span><span class="p">,</span> <span class="n">D2</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">D</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">D</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">))</span>
</code></pre></div></div>
<p>Finally, Pyro requires a bit of boilerplate to set up the optimization</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">getdata</span><span class="p">(</span><span class="mi">200</span><span class="p">)</span> <span class="c1"># 200 data points
</span><span class="n">pyro</span><span class="p">.</span><span class="n">clear_param_store</span><span class="p">()</span>
<span class="n">optim</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">({})</span>
<span class="n">svi</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">infer</span><span class="p">.</span><span class="n">SVI</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">guide</span><span class="p">,</span> <span class="n">optim</span><span class="p">,</span> <span class="n">pyro</span><span class="p">.</span><span class="n">infer</span><span class="p">.</span><span class="n">Trace_ELBO</span><span class="p">())</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">):</span>
    <span class="n">svi</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div></div>
<p>That’s pretty much all we need. I have plotted the (1) ELBO loss, (2) Variational parameter \(\lambda_i\) for every data point, (3) The two Gaussians in the model and (4) The coin bias as the training progresses.</p>
<center>
<figure>
<img width="100%" style="padding-top: 20px;" src="/public/posts_res/16/example_loss.gif" />
</figure>
</center>
<p>The full code is available in this gist: <a href="https://gist.github.com/dasayan05/aca3352cd00058511e8372912ff685d8">https://gist.github.com/dasayan05/aca3352cd00058511e8372912ff685d8</a>.</p>
<hr />
<p>That’s all for today. Hopefully I was able to convey the bigger picture of probabilistic programming which is quite useful for modelling lots of problems. The following references were my sources of information while writing the article. Interested readers are encouraged to check them out.</p>
<ol>
<li><a href="http://pyro.ai/examples/svi_part_iii.html">Pyro’s VI tutorial</a></li>
<li><a href="https://arxiv.org/abs/1401.0118">Black Box variational inference</a></li>
<li><a href="https://arxiv.org/abs/1506.05254">Gradient Estimation Using Stochastic Computation Graphs</a></li>
<li><a href="https://arxiv.org/abs/1701.03757">Deep Probabilistic Programming</a></li>
<li><a href="https://arxiv.org/abs/1810.09538">Pyro: Deep Universal Probabilistic Programming</a></li>
</ol>Ayan Das
Patterns of Randomness2020-04-15T00:00:00+00:002020-04-15T00:00:00+00:00https://ayandas.me/blog-tut/2020/04/15/patterns-of-randomness<p>Welcome folks ! This is an article I had been planning to write for a long time. I finally managed to get it done while locked at home due to the global COVID-19 situation. So, it’s basically something fun, interesting, attractive and hopefully understandable to most readers. To be specific, my plan is to dive into the world of finding visually appealing patterns in different sections of mathematics. I am gonna introduce you to four distinct mathematical concepts by means of which we can generate artistic patterns that are very soothing to human eyes. Most of these use random numbers as the underlying principle of generation. These are not necessarily very useful in real-life problem solving but are widely loved by artists as a tool for content creation.
They are sometimes referred to as <em>Mathematical Art</em>. I will deliberately keep the fine-grained details out of the way so that it is reachable to a larger audience. In case you want to reproduce the content in this post, here is the <a href="https://github.com/dasayan05/patterns-of-randomness">code</a>. <strong>Warning: This post contains quite large images which may take some time to load in your browser; so be patient</strong>.</p>
<h1 id="-random-walk--brownian-motion-">[ Random Walk & Brownian Motion ]</h1>
<p>Let’s start with something simple. Consider a Random Variable \(\mathbf{R}_t\) (\(t\) being time) with support \(\{ -1, +1\}\) with equal probability on both of its possible values. Think of it as a <em>score</em> you get at time \(t\) which can be either \(+1\) or \(-1\) as a result of an unbiased coin-flip. In terms of probability:</p>
<p>\[
\mathbb{P}\bigl[ \mathbf{R}_t = +1 \bigr] = \mathbb{P}\bigl[ \mathbf{R}_t = -1 \bigr] = \frac{1}{2}
\]</p>
<p>A realization (a sequence of samples) of \(\mathbf{R}_t\) for \(t=0 \rightarrow T (=10)\) would look like
\[
\bigl[ +1, -1, -1, +1, -1, -1, -1, +1, +1, -1, +1 \bigr]
\]</p>
<p>Let us define another Random Variable \(\mathbf{S}_t\) which is nothing but an accumulator of \(\mathbf{R}_t\) till time \(t\). So, by definition</p>
<p>\[
\mathbf{S}_t = \sum_{i=0}^t \mathbf{R}_i
\]
The realization of \(\mathbf{S}_t\) corresponding to the above \(\mathbf{R}_t\) sequence would look like
\[
\bigl[ +1, 0, -1, 0, -1, -2, -3, -2, -1, -2, -1 \bigr]
\]</p>
<p>This is popularly known as the <strong>Random Walk</strong>. With the basics ready, let us have two such random walks namely \(\mathbf{S}^x_t\) and \(\mathbf{S}^y_t\) and treat them as \(X\) and \(Y\) coordinates of a <em>Random Vector</em> namely \(\displaystyle{ \bar{\mathbf{S}}_t \triangleq \begin{bmatrix} \mathbf{S}^x_t \\ \mathbf{S}^y_t \end{bmatrix} }\).</p>
<p>As of now it looks all nice and mathy, right ! Here’s the fun part. Let me keep the time (i.e., \(t\)) running and keep track of the path that the vector \(\bar{\mathbf{S}}_t\) traces on a 2D plane</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/15/2d_disc_brown.gif" />
</figure>
</center>
<p>It will create a cool random checkerboard-like pattern as time goes on. Looking at the tip (the ‘dot’), you might see it as a tiny particle. It so happens that this is a discretized version of a continuous <a href="http://www1.lsbu.ac.uk/water/Brownian.html">phenomenon observed in real microscopic particles in fluid</a>, famously known as <strong>Brownian Motion</strong>.</p>
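<p>The discrete walk described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code from the linked repository: the helper names <code>random_walk</code> and <code>random_walk_2d</code> and the fixed seed are assumptions made for the example.</p>

```python
import random
from itertools import accumulate


def random_walk(T, rng):
    """Return [S_0, ..., S_T], the running sum of T+1 fair +1/-1 scores."""
    scores = [rng.choice((-1, +1)) for _ in range(T + 1)]  # R_0 .. R_T
    return list(accumulate(scores))                        # S_t = R_0 + ... + R_t


def random_walk_2d(T, seed=0):
    """Use two independent walks as the X and Y coordinates of one path."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    xs = random_walk(T, rng)
    ys = random_walk(T, rng)
    return list(zip(xs, ys))


# The accumulator applied to the example scores from the text
example_scores = [+1, -1, -1, +1, -1, -1, -1, +1, +1, -1, +1]
example_walk = list(accumulate(example_scores))

path = random_walk_2d(T=10)  # 11 points of a 2D random walk
```

<p>Since both coordinates move by exactly one unit at every tick, each step of the 2D path is a diagonal move, which is what gives the trace its checkerboard-like look.</p>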
<p>Real Brownian Motion is continuous. Let’s work it out, but very briefly. We divide an arbitrary time interval \([0, T]\) into \(N\) small intervals of length \(\displaystyle{ \Delta t = \frac{T}{N} }\) and have a modified score Random Variable \(\mathbf{R}_t\) with support \(\displaystyle{ \left\{ +\sqrt{\frac{T}{N}}, -\sqrt{\frac{T}{N}} \right\} }\) with equal probability as before. We still have the same definition of \(\mathbf{S}_t = \sum_{i=0}^t \mathbf{R}_i\). It so happens that as we approach the limiting case of</p>
<p>\[
N \rightarrow \infty,\text{ and consequently } \sqrt{\frac{T}{N}} \rightarrow 0\text{ and } \Delta t\rightarrow 0
\]</p>
<p>it gives us the continuous analogue of the random walk, i.e. <strong>Brownian Motion</strong>. Similar to the discrete case, if we trace the path of \(\displaystyle{ \bar{\mathbf{S}}_t \triangleq \begin{bmatrix} \mathbf{S}^x_t \\ \mathbf{S}^y_t \end{bmatrix} }\) with large \(N\) (yes, in practice we cannot go to infinity, sorry), patterns like this will emerge</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/15/brown.gif" />
</figure>
</center>
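<p>A rough sketch of the scaled walk, again in plain Python, could look like the following. The function names, the default \(N\) and the seed are assumptions for illustration only. The reason for the \(\sqrt{T/N}\) step size is that a sum of \(N\) independent steps then has variance \(N \cdot \frac{T}{N} = T\), no matter how large \(N\) gets.</p>

```python
import math
import random
from itertools import accumulate


def scaled_walk(T, N, rng):
    """Running sum of N steps, each +sqrt(T/N) or -sqrt(T/N) with equal probability."""
    step = math.sqrt(T / N)
    return list(accumulate(rng.choice((-step, +step)) for _ in range(N)))


def brownian_path_2d(T=1.0, N=10_000, seed=0):
    """Approximate a 2D Brownian path on [0, T] with two independent scaled walks."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return list(zip(scaled_walk(T, N, rng), scaled_walk(T, N, rng)))


path = brownian_path_2d()  # list of (x, y) points, ready to be plotted as a line
```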
<p>To make it more artistic, I took an even bigger \(N\) and ran the simulation for quite a while and got quite beautiful jittery patterns. Random numbers being at the heart of the phenomenon, we’ll get different patterns in different runs. Here are two such simulation results:</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/brownian_full.png" />
</figure>
</center>
<p><strong>Want to learn more ?</strong></p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Brownian_motion">Wikipedia</a></li>
<li><a href="https://en.wikipedia.org/wiki/Geometric_Brownian_motion">Geometric BM</a></li>
<li><a href="https://en.wikipedia.org/wiki/It%C3%B4_calculus">Stochastic Calculus</a></li>
</ol>
<h1 id="-dynamical-systems--chaos-">[ Dynamical Systems & Chaos ]</h1>
<p>Dynamical Systems are defined by a state space \(\mathbb{R}^n\) and a system dynamics (a function \(\mathbf{F}\)). A state \(\mathbf{x}\in\mathbb{R}^n\) is a specific (abstract) configuration of a system and the dynamics determines how the state “evolves” over time. The dynamics is often represented by a <a href="https://en.wikipedia.org/wiki/Differential_equation">differential equation</a> that specifies the change of state over time. So,</p>
<p>\[
\mathbf{F}(\mathbf{x}, t) \triangleq \frac{d\mathbf{x}}{dt}
\]</p>
<p>The state of the system at any point in time is determined by solving an Initial Value Problem (IVP) starting from an initial state \(\mathbf{x}_0\). Numerically, we compute the consecutive states for \(t\gt 0\) one small step at a time (the forward Euler method) as</p>
<p>\[
\mathbf{x}_{t+\Delta t} = \mathbf{x}_t + \Delta t \cdot \mathbf{F}(\mathbf{x}_t, t)
\]</p>
<p>Having a sufficiently small \(\Delta t\) ensures proper evolution of states.</p>
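<p>The stepping rule above can be sketched as a tiny forward Euler integrator. This is a minimal illustration in plain Python; the name <code>euler_integrate</code> and the toy system \(\frac{dx}{dt} = -x\) used as a sanity check are assumptions for the example, not something taken from the original code.</p>

```python
import math


def euler_integrate(F, x0, dt, n_steps, t0=0.0):
    """Forward Euler: x_{t+dt} = x_t + dt * F(x_t, t). Returns every visited state."""
    x, t = list(x0), t0
    trajectory = [tuple(x)]
    for _ in range(n_steps):
        dx = F(x, t)                                   # dynamics at the current state
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]  # one small step forward
        t += dt
        trajectory.append(tuple(x))
    return trajectory


def decay(x, t):
    """Toy system dx/dt = -x, whose exact solution is x(t) = x0 * exp(-t)."""
    return [-x[0]]


traj = euler_integrate(decay, [1.0], dt=1e-3, n_steps=1000)  # integrate up to t = 1
error = abs(traj[-1][0] - math.exp(-1.0))  # shrinks as dt gets smaller
```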
<p>Now this may seem quite trivial, at least to those who have studied Differential Equations. But, there are specific cases of \(\mathbf{F}\) which lead to an evolution of states whose trajectory is surprisingly beautiful. For reasons that are beyond the scope of this article, this behaviour is called <strong>Chaos</strong>. There is a specific branch of dynamical systems (named “<a href="https://en.wikipedia.org/wiki/Chaos_theory">Chaos Theory</a>”) that deals with the characteristics of such chaotic systems. Below are three such chaotic systems with their trajectories visualized in 3D state space. To be specific, we take each system with an initial state (they are very sensitive to initial states) and compute successive states with a small enough \(\Delta t\) and visualize them as a continuous path in 3D. The corresponding figures depict an animation of the evolution of states over time as well as the whole trajectory all at once.</p>
<h3 id="lorentz-system">Lorenz System</h3>
<p>\[
\frac{d\mathbf{x}}{dt} = \bigl[ \sigma (y-x), x(\rho - z) - y, xy - \beta z \bigr]^T
\]
\[
\text{with }\sigma = 10, \beta = \frac{8}{3}, \rho = 28 \text{, and } \mathbf{x}_0 = \bigl[ 1,1,1 \bigr]
\]</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/lorentz.gif" />
</figure>
</center>
<h3 id="rössler-system">Rössler System</h3>
<p>\[
\frac{d\mathbf{x}}{dt} = \bigl[ -(y+z), x+Ay, B+xz-Cz \bigr]^T
\]
\[
\text{with }A=0.2, B=0.2, C=5.7 \text{, and } \mathbf{x}_0 = \bigl[ 1,1,1 \bigr]
\]</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/roseller.gif" />
</figure>
</center>
<h3 id="halvorsen-system">Halvorsen System</h3>
<p>\[
\frac{d\mathbf{x}}{dt} = \bigl[ -ax-4y-4z-y^2, -ay-4z-4x-z^2, -az-4x-4y-x^2 \bigr]^T
\]
\[
\text{with }a=1.89 \text{, and } \mathbf{x}_0 = \bigl[ -1.48, -1.51, 2.04 \bigr]
\]</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/helvorsen.gif" />
</figure>
</center>
<p><strong>Want to learn more ?</strong></p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Differential_equation">Differential Equation</a>, <a href="https://en.wikipedia.org/wiki/Dynamical_system">Dynamical System</a></li>
<li><a href="https://en.wikipedia.org/wiki/Chaos_theory">Chaos Theory</a></li>
<li><a href="https://en.wikipedia.org/wiki/Attractor">Attractors</a>, <a href="http://www.stsci.edu/~lbradley/seminar/attractors.html">Strange Attractors</a></li>
<li><a href="https://en.wikipedia.org/wiki/Lorenz_system">Lorenz System</a>, <a href="https://en.wikipedia.org/wiki/R%C3%B6ssler_attractor">Rössler System</a>, <a href="https://www.dynamicmath.xyz/calculus/velfields/Halvorsen/">Halvorsen System</a></li>
</ol>
<h1 id="-complex-fourier-series-">[ Complex Fourier Series ]</h1>
<p>We all know about Fourier Series, right ! But I am sure not all of you have seen this artistic side of it. Well, the art isn’t really about Fourier series itself, but Fourier series helps in creating it.</p>
<p>We know the following to be the “synthesis equation” of complex Fourier series</p>
<p>\[
f(t) = \sum_{n=-\infty}^{+\infty} c_n e^{j \frac{2\pi n}{T} t} \in \mathbb{C}
\]</p>
<p>which represents the synthesis of a periodic function \(f(t)\) of period \(T\) from its frequency components \(\mathbf{C} \triangleq \left[ c_{-\infty}, \cdots, c_{-2}, c_{-1}, c_{0}, c_{+1}, c_{+2}, \cdots, c_{+\infty} \right]\). Often, as a practical measure, we crop the infinite summation to a limited range \([ -N, N ]\). Furthermore, let’s consider \(T=1\) without loss of generality. So, we see \(f(t)\) as a function parameterized by the frequency components \(\mathbf{C} \in \mathbb{C}^{2N+1}\)</p>
<p>\[
f(t, \mathbf{C}) \approx \sum_{n=-N}^{+N} c_n e^{j 2\pi n t} \in \mathbb{C}
\]</p>
<p>By doing this, we can make complex valued functions by putting different \(\mathbf{C}\) and running \(t=0\rightarrow 1\). However, not all \(\mathbf{C}\) lead to anything visually appealing. A particular feature of an object that appeals to the human eyes is “Symmetry”. We are gonna exploit this here. A little refresher on Fourier series will make you realize that if the coefficients are real-valued, then \(f(-t, \mathbf{C})\) is the complex conjugate of \(f(t, \mathbf{C})\), so the traced path is symmetric about the real axis. And that’s all we need.</p>
<p>We pick random \(\mathbf{C} \in \mathbb{R}^{2N+1}\) (see, it’s real numbers now) and run the clock \(t=0\rightarrow 1\) and trace the path travelled by the complex point \(f(t, \mathbf{C}) \in \mathbb{C}\) as time progresses. It creates patterns like the ones shown below</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/fourier_6.gif" />
</figure>
</center>
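<p>The truncated synthesis equation is short enough to sketch directly. The snippet below is a plain Python illustration, not the original code: the name <code>fourier_curve</code>, the number of sample points and the choice of drawing coefficients uniformly from \([-1, 1]\) are all assumptions, since the text only says the coefficients are random and real.</p>

```python
import cmath
import random


def fourier_curve(coeffs, n_points=1000):
    """Sample f(t, C) = sum_{n=-N}^{N} c_n * exp(j*2*pi*n*t) at n_points values of t in [0, 1)."""
    N = (len(coeffs) - 1) // 2  # coeffs holds c_{-N}, ..., c_0, ..., c_{+N}
    points = []
    for k in range(n_points):
        t = k / n_points
        points.append(sum(c * cmath.exp(2j * cmath.pi * n * t)
                          for n, c in zip(range(-N, N + 1), coeffs)))
    return points


rng = random.Random(0)                                  # seeded for reproducibility
N = 6
C = [rng.uniform(-1.0, 1.0) for _ in range(2 * N + 1)]  # real-valued coefficients
curve = fourier_curve(C)                                # complex points tracing the pattern
```

<p>Plotting the real part of each point against its imaginary part gives the closed path. With real coefficients, the point at time \(1-t\) is the complex conjugate of the point at time \(t\), which is where the mirror symmetry comes from.</p>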
<p>There is one way to customize these - the value of \(N\). We know that \(c_n\) can be interpreted as the magnitude of the \(n^{th}\) frequency component. A large value of \(N\) implies the introduction of more high frequencies into the time-domain signal. This visually leads to \(f(t)\) having finer details (i.e., more curves and bends). Lowering the value of \(N\) would clear out these fine details and the path will become smoother and smoother. The image below shows decreasing values of \(N = 10 \rightarrow 6\) along columns. You can see the patterns losing details as we go right. And just like before, every run will create different patterns as they are solely controlled by the random coefficients.</p>
<center>
<figure>
<img width="100%" style="padding-top: 20px;" src="/public/posts_res/15/fourier_10_6.png" />
</figure>
</center>
<p><strong>Want to learn more ?</strong></p>
<ol>
<li><a href="http://www.ee.ic.ac.uk/hp/staff/dmb/courses/E1Fourier/00300_ComplexFourier.pdf">Complex Fourier Series</a></li>
<li><a href="http://www.jezzamon.com/fourier/">Fourier patterns</a></li>
<li><a href="https://www.youtube.com/watch?v=ds0cmAV-Yek">Visualizing fourier series</a></li>
<li><a href="https://www.youtube.com/watch?v=r6sGWTCMz2k&t=725s">Amazing Video by 3Blue1Brown</a></li>
</ol>
<h1 id="-mandelbrot--julia-set-">[ Mandelbrot & Julia set ]</h1>
<p>These two sets are very important in the study of “Fractals” - objects with self-repeating patterns. Fractals are extremely popular concepts in certain branches of mathematics but they are mostly famous for having an eye-catching visual appearance. If you ever come across an article about fractals, you are likely to see some of the most artistic patterns you’ve ever seen in the context of mathematics. Diving into the details of fractals and self-repeating patterns will open a vast world of “Mathematical Art”. In this article, though, I can only show you a tiny bit of it - two sets, namely the “Mandelbrot” and “Julia” sets. Let’s start with the <em>all important function</em></p>
<p>\[
f_C(z) = z^2 + C
\]</p>
<p>where \(C, f_C(z), z \in \mathbb{C}\) are complex numbers. This apparently simple complex-valued function is at the heart of these sets. All it does is square its argument and add a complex number that the function is parameterized with. Also, we denote \(f^{(k)}_C(z)\) as \(k\) times repeated application of the function on a given \(z\), i.e.</p>
<p>\[
f^{(k)}_C(z) = f_C(\cdots f_C(f_C(z)))
\]</p>
<h3 id="mandelbrot-set">Mandelbrot Set</h3>
<p>With these basic definitions in hand, the <strong>Mandelbrot set</strong> (named after the mathematician <a href="https://en.wikipedia.org/wiki/Benoit_Mandelbrot">Benoit Mandelbrot</a>) is the set of all \(C\in\mathbb{C}\) for which
\[
\limsup_{k\rightarrow\infty} \vert f^{(k)}_C(0+0j) \vert \lt \infty
\]</p>
<p>Simply put, there is a set of values for \(C\) where if you repeatedly apply \(f_C\) on zero (i.e. \(0+0j\)), the output <em>does not diverge</em>. All such values of \(C\) make up the so-called “Mandelbrot Set”. Any value of \(C\) can further be characterized by how many repeated applications of \(f_C(\cdot)\) it can tolerate before the absolute value of the output goes higher than a predefined “<em>escape radius</em>”, let’s call it \(r\in\mathbb{R}\) (for \(C\) inside the set this count is unbounded, so in practice it is capped at a maximum number of iterations). This creates a loose sense of “strength” of a certain \(C\) that can be written as</p>
<p>\[
\mathbb{K}(C) = \max_{\vert f^{(k)}_C(0+0j) \vert \leq r} k
\]</p>
<p>It might look all strange but if you treat the integer \(\mathbb{K}(C)\) as grayscale intensity value for a grid of points on 2D complex plane (i.e., an image), you will get a picture similar to this (Don’t get confused, the picture is indeed grayscale; I added PyPlot’s <a href="https://matplotlib.org/tutorials/colors/colormaps.html"><code class="language-plaintext highlighter-rouge">plt.cm.twilight_shifted</code></a> colormap for enhancing the visual appeal). The grid is in the range \((-2.5+1.5j) \rightarrow (1.5-1.5j)\) and the escape radius is \(r=2.5\).</p>
<center>
<figure>
<img width="100%" style="padding-top: 20px;" src="/public/posts_res/15/mandelbrot_thumbnail.png" />
</figure>
</center>
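<p>The “strength” \(\mathbb{K}(C)\) is straightforward to compute by brute force. The following is a minimal plain Python sketch, assuming a cap of 100 iterations (the text does not state the cap that was actually used) and hypothetical names <code>escape_count</code> and <code>mandelbrot_grid</code>. The grid corners and the escape radius are the ones quoted above.</p>

```python
def escape_count(C, r=2.5, max_iter=100):
    """K(C): how many applications of f_C(z) = z*z + C, starting from 0, stay within radius r."""
    z = 0j
    for k in range(max_iter):
        z = z * z + C
        if abs(z) > r:
            return k      # the (k+1)-th application escaped, so k were tolerated
    return max_iter       # never escaped within the cap: treated as inside the set


def mandelbrot_grid(width=80, height=60, top_left=-2.5 + 1.5j,
                    bottom_right=1.5 - 1.5j, r=2.5, max_iter=100):
    """Evaluate K(C) on a width x height grid of C values; rows run top to bottom."""
    grid = []
    for row in range(height):
        im = top_left.imag + (bottom_right.imag - top_left.imag) * row / (height - 1)
        grid.append([
            escape_count(
                complex(top_left.real + (bottom_right.real - top_left.real) * col / (width - 1), im),
                r, max_iter)
            for col in range(width)
        ])
    return grid


image = mandelbrot_grid()  # nested list of integers, usable as grayscale intensities
```

<p>For example, \(C = 1\) produces the sequence \(1, 2, 5\), so it tolerates two applications before crossing \(r = 2.5\), whereas \(C = 0\) and \(C = -1\) never escape and simply hit the cap.</p>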
<p>What is so fascinating about this pattern is the fact that it is self-repeating. If you zoom into a small portion of the image, you would see the same pattern again.</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/mandelbrot_zoom.png" />
</figure>
</center>
<h3 id="julia-set">Julia Set</h3>
<p>Another very similar concept exists, called the “Julia Set”, which exhibits a similar visual \(\mathbb{K}\) diagram. Unlike the Mandelbrot set, we consider a \(z\in\mathbb{C}\) to be in the Julia set \(\mathbf{J}_C\) if</p>
<p>\[
\limsup_{k\rightarrow\infty} \vert f^{(k)}_C(z) \vert \lt \infty
\]</p>
<p>Please note that this time the set is parameterized by \(C\) and we are interested in how the <em>argument of the function</em> behaves under repeated application of \(f_C(\cdot)\). Now things from here are similar. We define a similar “strength” for every \(z\in\mathbb{C}\) as</p>
<p>\[
\mathbb{K}_C(z) = \max_{\vert f^{(k)}_C(z) \vert \leq r} k
\]</p>
<p>Please note that as a result of this new definition, the \(\mathbb{K}\) diagram is parameterized by \(C\), i.e., we will get a different image for each \(C\). In principle, we can visualize such images for different \(C\) (they are indeed pretty cool), but let’s go a bit further than that. We will vary \(C\) along a trajectory and produce the \(\mathbb{K}\) diagrams for each \(C\) and see them as an animation. This creates an amazing visual effect. Technically, I varied \(C\) along a circle of radius \(R = 0.75068\), i.e., \(C = R e^{j\theta}\) with \(\theta = 0\rightarrow 2\pi\)</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/julia1.gif" />
</figure>
</center>
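<p>The animation frames can be sketched the same way. In the plain Python snippet below, the radius \(R\) comes from the text, while the grid extent, the grid resolution, the number of frames, the iteration cap and the function names are assumptions made purely for illustration.</p>

```python
import cmath


def julia_count(z, C, r=2.5, max_iter=100):
    """K_C(z): how many applications of f_C(z) = z*z + C stay within radius r (capped)."""
    for k in range(max_iter):
        z = z * z + C
        if abs(z) > r:
            return k
    return max_iter


def circle_of_C(R=0.75068, n_frames=60):
    """Values C = R * exp(j*theta) for theta evenly spaced in [0, 2*pi)."""
    return [R * cmath.exp(2j * cmath.pi * i / n_frames) for i in range(n_frames)]


def julia_frame(C, size=40, extent=1.5):
    """K_C(z) on a size x size grid of z values covering [-extent, extent] on both axes."""
    axis = [-extent + 2 * extent * i / (size - 1) for i in range(size)]
    return [[julia_count(complex(x, -y), C) for x in axis] for y in axis]


# One K diagram per value of C along the circle; shown in order they form the animation
frames = [julia_frame(C, size=20) for C in circle_of_C(n_frames=12)]
```

<p>Because \(f_C(-z) = f_C(z)\), every frame satisfies \(\mathbb{K}_C(-z) = \mathbb{K}_C(z)\), which explains the point symmetry about the origin that shows up in these diagrams.</p>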
<p><strong>Want to know more ?</strong></p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Mandelbrot_set">Mandelbrot set</a></li>
<li><a href="https://en.wikipedia.org/wiki/Julia_set">Julia set</a></li>
<li><a href="https://en.wikipedia.org/wiki/Fractal">Fractals</a></li>
</ol>
<hr />
<p>Alright then ! That is pretty much it. Due to constraints of time, space and scope, it’s not possible to explain everything in detail in one article. There are plenty of resources available online (I have already provided some links) which might be useful in case you are interested. Feel free to explore the details of whatever new you learnt today. If you would like to reproduce the diagrams and images, please use the code here <a href="https://github.com/dasayan05/patterns-of-randomness">https://github.com/dasayan05/patterns-of-randomness</a> (sorry, the code is a bit messy, you’ll have to figure it out).</p>Ayan Das