Jekyll2021-09-02T18:08:43+00:00https://ayandas.me/feed.xmlAyan Das<b>Deep Learning</b> enthusiast; <b>Ph.D. Student</b> @ <a href="https://www.surrey.ac.uk/">University of Surrey</a>, United KingdomAyan Dasa.das@surrey.ac.ukanyx: Build vector animations from programmatic descriptions2021-05-01T00:00:00+00:002021-05-01T00:00:00+00:00https://ayandas.me/projs/2021/05/01/anyx<p>Project <code class="language-plaintext highlighter-rouge">anyx</code> (pronounced as “anix”) is a Python library designed for producing high-quality (vector) graphics animations with ease. Although <code class="language-plaintext highlighter-rouge">anyx</code> is built with no assumption about its downstream area of application, it is mostly targeted at the scientific community for creating beautiful scientific/technical illustrations. <code class="language-plaintext highlighter-rouge">anyx</code> was created as a programmatic alternative to heavyweight and (sometimes) proprietary graphical software. Unlike low-level libraries like <a href="https://www.pygame.org/news">pygame</a>, <code class="language-plaintext highlighter-rouge">anyx</code> lets users simply write a <em>description</em> of a target scene and compile it down to the required modality (video, animated GIFs, etc.). The development of <code class="language-plaintext highlighter-rouge">anyx</code> is motivated largely by a similar project called <a href="https://github.com/3b1b/manim">manim</a>.</p>
<p><a href="https://ayandas.me/anyx" target="_blank" class="fa fa-github fa-3x" style="float: left; margin-right: 20px;"></a></p>
<h2 id="i-work-on-this-project-only-in-my-spare-time-and-its-not-done-yet-read-a-brief-description-by-clicking-on-the-github-icon">I work on this project only in my spare time and it’s not done yet. Read a brief description by clicking on the GitHub icon.</h2>Ayan DasCloud2Curve: Generation and Vectorization of Parametric Sketches2021-03-01T00:00:00+00:002021-03-01T00:00:00+00:00https://ayandas.me/pubs/2021/03/01/pub-9<center>
<a target="_blank" class="pubicon" href="https://arxiv.org/pdf/2103.15536.pdf">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/9.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">Analysis of human sketches in deep learning has advanced immensely through the use of waypoint-sequences rather than raster-graphic representations. We further aim to model sketches as a sequence of low-dimensional parametric curves. To this end, we propose an inverse graphics framework capable of approximating a raster or waypoint based stroke encoded as a point-cloud with a variable-degree Bézier curve. Building on this module, we present Cloud2Curve, a generative model for scalable high-resolution vector sketches that can be trained end-to-end using point-cloud data alone. As a consequence, our model is also capable of deterministic vectorization which can map novel raster or waypoint based sketches to their corresponding high-resolution scalable Bézier equivalent. We evaluate the generation and vectorization capabilities of our model on Quick, Draw! and K-MNIST datasets.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
<center>
<!-- Visit this to create the oEmbed link -->
<!-- https://iframely.com/domains/speakerdeck -->
<h2>Slides for my CVPR '21 talk</h2>
<div style="left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.1972%;">
<iframe src="https://speakerdeck.com/player/ff2a87e58efe4d72a32f008e53826776" style="top: 0; left: 5%; width: 90%; height: 90%; position: absolute; border: 0;" allowfullscreen="" scrolling="no" allow="encrypted-media">
</iframe>
</div>
<p>PS: Reusing any of these slides would require permission from the author.</p>
</center>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<h2>Full talk at CVPR 2021</h2>
<iframe width="800" height="450" src="https://www.youtube-nocookie.com/embed/H8-ejwYk7PY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{das2021cloud2curve,
title={Cloud2Curve: Generation and Vectorization of Parametric Sketches},
author={Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
year={2021},
eprint={2103.15536},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
</code></pre></div></div>Ayan DasPaperDifferentiable Programming: Computing source-code derivatives2020-09-08T00:00:00+00:002020-09-08T00:00:00+00:00https://ayandas.me/blog-tut/2020/09/08/differentiable-programming<p>If you are following the recent developments in the field of Deep Learning, you might recognize this new buzzword, “Differentiable Programming”, which has been doing the rounds on social media (including among prominent researchers like <a href="https://www.facebook.com/yann.lecun/posts/10155003011462143">Yann LeCun</a> and <a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">Andrej Karpathy</a>) for a year or two. Differentiable Programming (let’s shorten it to “DiffProg” for the rest of this article) is essentially a system proposed as an alternative to tape-based Backpropagation, which runs a <em>recorder</em> (often called a “Tape”) that builds a computation graph <em>at runtime</em> and propagates the error signal from the end towards the leaf nodes (typically weights and biases). DiffProg is very different from an <em>implementation perspective</em>: it doesn’t really “propagate” anything. It consumes a “program” in the form of <em>source code</em> and produces the “Derivative program” (also source code) w.r.t. its inputs without <em>ever actually running</em> either of them. Additionally, DiffProg gives users the flexibility to write <em>arbitrary programs</em> without constraining them to any “guidelines”.
In this article, I will describe the difference between the two methods in theoretical as well as practical terms. We’ll look into one successful DiffProg implementation (named “<a href="https://fluxml.ai/Zygote.jl/latest/">Zygote</a>”, written in <a href="https://julialang.org/">Julia</a>) gaining popularity in the Deep Learning community.</p>
<h2 id="why-need-derivatives-in-dl-">Why need Derivatives in DL ?</h2>
<p>This is easy to answer, but for the sake of completeness: we need the derivatives of a function because they appear in the update rule of Gradient Descent (or any of its successors):</p>
<p>\[
\Theta := \Theta - \alpha \frac{\partial F(\Theta; \mathcal{D})}{\partial \Theta}
\]</p>
<p>Where \(\Theta\) is the set of all parameters, \(\mathcal{D}\) is the data and \(F(\Theta)\) is the function (typically loss) we want to differentiate. Our ultimate goal is to compute \(\displaystyle{ \frac{\partial F(\Theta; \mathcal{D})}{\partial \Theta} }\) given the <em>structural form</em> of \(F\). The standard way of doing this is to use “Automatic Differentiation” (AutoDiff or AD), or rather, a special case of it called “Backpropagation”. It is called Backpropagation only when the function \(F(\cdot)\) is scalar, which is mostly true in cases we care about.</p>
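<p>To make the update rule concrete, here is a minimal sketch in plain Python. The loss \(F(\Theta) = (\Theta - 3)^2\), the step size and the helper name <code class="language-plaintext highlighter-rouge">grad_descent</code> are assumptions chosen only for illustration; the derivative is written by hand here, which is exactly the part the rest of this article automates.</p>

```python
def grad_descent(grad_fn, theta, alpha=0.1, steps=100):
    """Repeatedly apply the update theta := theta - alpha * dF/dtheta."""
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta


# Hypothetical loss F(theta) = (theta - 3)^2 with hand-written derivative 2 * (theta - 3).
theta_star = grad_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_star)  # approaches the minimizer theta = 3
```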
<h2 id="pullback-functions--backpropagation">“Pullback” functions & Backpropagation</h2>
<p>We will now see how gradients of a complex function (given its full specification) can be computed as a sequence of primitive operations. Let’s explain this with an example for simplicity: We have two inputs \(a, b\) (just symbols) and a description of the <em>scalar</em> function we want to differentiate:</p>
<p>\[
\displaystyle{F(a, b) = \frac{a}{1+b^2}}
\]</p>
<p>We can think of \(F(a, b)\) as a series of smaller computations with intermediate results, like this</p>
\[\begin{align}
y_1 &← pow(b, 2) \\
y_2 &← add(1, y_1) \\
y_3 &← div(a, y_2)
\end{align}\]
<p>I changed the pure math notation to a more programmatic one, but the meaning remains the same. In order to compute gradients, we <em>augment</em> these computations and create something called a “pullback” function as an additional by-product.</p>
<p>Mathematically, the actual computation and pullback creation can be written together symbolically as:</p>
\[\tag{1}
\begin{align}
y_1, \mathcal{B}_1 &← \mathcal{J}(pow, b, 2) \\
y_2, \mathcal{B}_2 &← \mathcal{J}(add, 1, y_1) \\
y_3, \mathcal{B}_3 &← \mathcal{J}(div, a, y_2)
\end{align}\]
<p>You can think of the <em>functional</em> \(\mathcal{J}\) as the internals of the Backpropagation framework, which augments every computation unit to produce an extra entity. A pullback function (\(\mathcal{B}_i\)) is a function that takes as input the gradient w.r.t. the output of the corresponding function and returns the gradient w.r.t. the inputs of that function:</p>
<p>\[
\mathcal{B}_i : \overline{y}_i \rightarrow \overline{input_1}, \overline{input_2}, \cdots
\]</p>
<p>It is really nothing but a different view of the chain-rule of differentiation:</p>
\[\begin{align}
\frac{\partial F}{\partial b} &\leftarrow \mathcal{B}_1(\frac{\partial F}{\partial y_1}) \triangleq \frac{\partial F}{\partial y_1} \cdot \frac{\partial y_1}{\partial b} \\
\overline{b} &\leftarrow \mathcal{B}_1( \overline{y}_1 ) \triangleq \overline{y}_1 \cdot \frac{\partial y_1}{\partial b}\left[ \text{Denoting } \frac{\partial F}{\partial x}\text{ as }\overline{x} \right]
\end{align}\]
<p>We must also realize that computing \(\mathcal{B}_i\) may require values from the forward pass. For example, computing \(\overline{b}\) may need evaluating \(\displaystyle{ \frac{\partial y_1}{\partial b} }\) at the given value of \(b\). After getting access to \(\mathcal{B}_i\), we can compute the derivatives of \(F\) w.r.t \(a, b\) by invoking the pullback functions in proper (reverse) order</p>
\[\begin{align}
\overline{a}, \overline{y_2} &\leftarrow \mathcal{B}_3(\overline{y}_3) \\
\overline{y_1} &\leftarrow \mathcal{B}_2(\overline{y}_2) \\
\overline{b} &\leftarrow \mathcal{B}_1(\overline{y}_1)
\end{align}\]
<p>Please note that \(y_3\) is actually \(F\) and hence \(\overline{y}_3 ≜ \displaystyle{ \frac{\partial F}{\partial y_3} = 1 }\).</p>
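<p>The forward augmentation and the reverse-order pullback calls above can be sketched in plain Python with closures. This is only an illustration: the function <code class="language-plaintext highlighter-rouge">J</code> below is a hand-written stand-in for the functional \(\mathcal{J}\), it handles just the three primitives of our example, and for <code class="language-plaintext highlighter-rouge">pow</code> it treats the exponent as a constant.</p>

```python
def J(op, x, y):
    """Run primitive `op` on (x, y); return its output and a pullback closure."""
    if op == "pow":
        out = x ** y

        def pullback(out_bar):
            # gradient w.r.t. the base only; the exponent is treated as a constant
            return (out_bar * y * x ** (y - 1),)
    elif op == "add":
        out = x + y

        def pullback(out_bar):
            return (out_bar, out_bar)
    elif op == "div":
        out = x / y

        def pullback(out_bar):
            return (out_bar / y, -out_bar * x / y ** 2)
    else:
        raise ValueError(f"unknown primitive: {op}")
    return out, pullback


def F_and_grad(a, b):
    # forward pass: compute outputs and collect pullbacks (the pullbacks capture a, b, y2)
    y1, B1 = J("pow", b, 2)
    y2, B2 = J("add", 1, y1)
    y3, B3 = J("div", a, y2)
    # backward pass: invoke pullbacks in reverse order, seeding with dF/dy3 = 1
    a_bar, y2_bar = B3(1.0)
    _, y1_bar = B2(y2_bar)
    (b_bar,) = B1(y1_bar)
    return y3, (a_bar, b_bar)


value, (a_bar, b_bar) = F_and_grad(1.0, 2.0)
```

<p>Each pullback closes over the forward-pass values it needs, which is why the forward pass has to come first.</p>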
<h2 id="1-tape-based-implementation">1. Tape-based implementation</h2>
<p>There are a couple of different ways of implementing the theory described above. The de-facto way of doing it (as of this day) is known as the “tape-based” implementation. <code class="language-plaintext highlighter-rouge">PyTorch</code> and <code class="language-plaintext highlighter-rouge">Tensorflow Eager Execution</code> are probably the most popular examples of this type.</p>
<p>In tape-based systems, the function \(F(..)\) is specified by its full structural form. Moreover, it requires <em>runtime execution</em> in order to compute anything (be it output or derivatives). Such a system keeps track of every computation via a recorder or “tape” (hence the name) and builds an internal computation graph. Later, when requested, the system stops recording and works its way backwards through the recorded tape to compute derivatives.</p>
<h3 id="the-specification-of-ftheta">The specification of \(F(\Theta)\)</h3>
<p>A tape-based system requires users to provide the function \(F\) as a description of its computations following certain guidelines. These guidelines are provided by the specific AutoDiff framework we use. Take <code class="language-plaintext highlighter-rouge">PyTorch</code> for example: we write the series of computations using the API provided by <code class="language-plaintext highlighter-rouge">PyTorch</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>

<span class="k">class</span> <span class="nc">Network</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>  <span class="c1"># nn.Module must be initialised before attributes are set</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">b0</span> <span class="o">=</span> <span class="p">...</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">a</span><span class="p">):</span>
        <span class="n">y1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">pow</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">b0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
        <span class="n">y2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">y1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>  <span class="c1"># element-wise 1 + y1</span>
        <span class="n">y3</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">div</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y2</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">y3</span>
</code></pre></div></div>
<p>Think of the framework as an entity which is solely responsible for doing all the derivative computations. You can’t carelessly use <code class="language-plaintext highlighter-rouge">math.pow()</code> (or anything else outside the framework) instead of <code class="language-plaintext highlighter-rouge">torch.pow()</code>, or omit the base class <code class="language-plaintext highlighter-rouge">torch.nn.Module</code>. You have to stick to the guidelines <code class="language-plaintext highlighter-rouge">PyTorch</code> lays out to be able to make use of it. When done with the definition, we can run the forward and backward passes using actual data \((a_0, b_0)\):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>model = Network(...)
F = model(a0)
F.backward()
# 'model.b0.grad' & 'a0.grad' available
</code></pre></div></div>
<p>This will cause the framework to trigger the following sequence of computations one after another</p>
\[\tag{2}
\begin{align}
y_1, \mathcal{B}_1 &← \mathcal{J}(\mathrm{torch.pow}, \mathbf{b_0}, 2) \\
y_2, \mathcal{B}_2 &← \mathcal{J}(\mathrm{torch.add}, 1, y_1) \\
y_3, \mathcal{B}_3 &← \mathcal{J}(\mathrm{torch.div}, \mathbf{a_0}, y_2) \\
\left[ \overline{a}\right]_{a=\mathbf{a_0}}, \overline{y_2} &\leftarrow \mathcal{B}_3(1) \\
\overline{y_1} &\leftarrow \mathcal{B}_2(\overline{y}_2) \\
\left[ \overline{b}\right]_{b=\mathbf{b_0}} &\leftarrow \mathcal{B}_1(\overline{y}_1)
\end{align}\]
<p>The first three and last three lines of computation are the “forward pass” and the “backward pass” of the model, respectively. Frameworks like <code class="language-plaintext highlighter-rouge">PyTorch</code> and <code class="language-plaintext highlighter-rouge">Tensorflow</code> typically work this way when <code class="language-plaintext highlighter-rouge">.forward()</code> and <code class="language-plaintext highlighter-rouge">.backward()</code> calls are made in succession. Note that since we explicitly execute a forward pass, the framework caches the values required for executing the pullbacks in the backward pass. An overall diagram is shown below for clarification.</p>
<center>
<figure>
<img width="50%" style="padding-top: 20px;" src="/public/posts_res/18/tape_based.png" />
<figcaption>Fig.1: Overall pipeline of tape-based backpropagation. Green arrows indicate pullback creation by the framework and magenta arrows denote the runtime execution flow. </figcaption>
</figure>
</center>
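<p>The recording mechanism itself can be sketched in a few lines of pure Python. This is a toy under stated assumptions, not how <code class="language-plaintext highlighter-rouge">PyTorch</code> is implemented: the names <code class="language-plaintext highlighter-rouge">Var</code>, <code class="language-plaintext highlighter-rouge">record</code>, <code class="language-plaintext highlighter-rouge">pow2</code>, <code class="language-plaintext highlighter-rouge">add1</code>, <code class="language-plaintext highlighter-rouge">div</code> and <code class="language-plaintext highlighter-rouge">backward</code> are all hypothetical, and the tape stores local partial derivatives instead of general pullbacks.</p>

```python
class Var:
    """A scalar value that remembers which tape it is recorded on."""

    def __init__(self, value, tape):
        self.value, self.tape, self.grad = value, tape, 0.0


def record(tape, value, parents, local_grads):
    """Append one executed operation (output, inputs, d(output)/d(input)) to the tape."""
    out = Var(value, tape)
    tape.append((out, parents, local_grads))
    return out


# The only operations this toy "framework" can trace.
def pow2(x):
    return record(x.tape, x.value ** 2, (x,), (2 * x.value,))


def add1(x):
    return record(x.tape, 1 + x.value, (x,), (1.0,))


def div(x, y):
    return record(x.tape, x.value / y.value, (x, y),
                  (1 / y.value, -x.value / y.value ** 2))


def backward(out):
    """Walk the tape in reverse, accumulating gradients via the chain rule."""
    out.grad = 1.0
    for node, parents, local_grads in reversed(out.tape):
        for parent, g in zip(parents, local_grads):
            parent.grad += node.grad * g


tape = []
a, b = Var(1.0, tape), Var(2.0, tape)
F = div(a, add1(pow2(b)))   # runtime execution fills the tape
backward(F)                 # a.grad and b.grad are now available
```

<p>Note that nothing can be differentiated here unless it was executed through one of the traced operations, which is the restriction discussed next.</p>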
<h3 id="whats-the-problem-">What’s the problem ?</h3>
<p>As of now, it might not seem like that big of a problem for a regular PyTorch user (me included). The problem intensifies when you have a non-ML code base with a complicated physics model (for example) like this</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">math</span>
<span class="kn">from</span> <span class="nn">other_part_of_my_model</span> <span class="kn">import</span> <span class="n">sub_part</span>

<span class="k">def</span> <span class="nf">helper_function</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">something</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">helper_function</span><span class="p">(</span><span class="n">sub_part</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="c1"># recursive call
</span>    <span class="p">...</span>

<span class="k">def</span> <span class="nf">complex_physics_model</span><span class="p">(</span><span class="n">parameters</span><span class="p">,</span> <span class="nb">input</span><span class="p">):</span>
    <span class="n">math</span><span class="p">.</span><span class="n">do_awesome_thing</span><span class="p">(</span><span class="n">parameters</span><span class="p">,</span> <span class="n">helper_function</span><span class="p">(</span><span class="nb">input</span><span class="p">))</span>
    <span class="p">...</span>
    <span class="k">return</span> <span class="n">output</span>
</code></pre></div></div>
<p>.. and you want to use it within your <code class="language-plaintext highlighter-rouge">PyTorch</code> model and differentiate it. There is no easy way to do this without first spending time to <code class="language-plaintext highlighter-rouge">PyTorch</code>-ify it.</p>
<p>There is another serious problem with this approach: the framework cannot “<em>see</em>” any computation ahead of time. For example, when the execution thread reaches the <code class="language-plaintext highlighter-rouge">torch.pow()</code> call, it has no idea that it will later encounter <code class="language-plaintext highlighter-rouge">torch.div()</code>. This matters because the framework has no way of optimizing the computation: it <em>has to</em> execute the exact sequence of computations verbatim. For example, if the function description is given as \(\displaystyle{ F(a, b) = \frac{(a + ab)}{(1 + b)} }\), which simplifies to just \(a\), this type of framework will waste its resources executing lots of operations (in both the forward and backward directions) that ultimately yield something trivial.</p>
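<p>To see how much work is wasted on \(\displaystyle{ F(a, b) = \frac{(a + ab)}{(1 + b)} }\), it helps to simplify it by hand (assuming \(b \neq -1\)):</p>
\[\begin{align}
F(a, b) &= \frac{a + ab}{1 + b} = \frac{a(1 + b)}{1 + b} = a \\
\frac{\partial F}{\partial a} &= 1, \qquad \frac{\partial F}{\partial b} = 0
\end{align}\]
<p>A tape-based framework still records an addition, a multiplication, another addition and a division in the forward pass, and then calls all four pullbacks in the backward pass, only to arrive at the constants \(1\) and \(0\).</p>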
<h2 id="2-differentiable-programming">2. Differentiable Programming</h2>
<p>Differentiable Programming (DiffProg) offers a very elegant solution to both the problems described in the previous section. <strong>DiffProg allows you to write arbitrary code <em>without following any guidelines</em> and still be able to differentiate it.</strong> At the current state of DiffProg, the majority of the successful systems use something called “<em>source code transformation</em>” in order to achieve their objective.</p>
<p>Source code transformation is a technique used extensively in the field of compiler design. It takes a piece of code written in some high-level language (like C++, Python etc.) and emits a <em>compiled</em> version of it, typically in a relatively lower-level language (like Assembly, Bytecode, IRs etc.). Specifically, the input to a DiffProg system is a description of \(y ← F(\Theta)\) as <em>source code</em> written in some language with defined input/output. The output of the system is the source code of the derivative of \(F(\Theta)\) w.r.t. its inputs (i.e., \(\overline{\Theta} ← F'(\overline{y})\)). The input program has full liberty to use the native primitives of the programming language like built-in functions, conditional statements, recursion, <code class="language-plaintext highlighter-rouge">struct</code>-like facilities, memory read/write constructs and pretty much anything that the language offers.</p>
<p>Using our generic notation, we can write down such a system as</p>
\[y, \mathcal{B} \leftarrow \mathcal{J}(F, \Theta)\]
<p>where \(F\) and \(\mathcal{B}:\overline{y}\rightarrow \overline{\Theta}\) are the given function and its derivative function in the form of <em>source code</em> (bear with me if it doesn’t make sense at this point). Just like before, the <em>source code</em> for the pullback \(\mathcal{B}\) may require some intermediate variables from the computation of \(y\). For a concrete example, the following would be a (hypothetical) valid DiffProg system:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="c1"># following string contains the source code of F(.)
</span><span class="o">>>></span> <span class="n">input_prog</span> <span class="o">=</span> <span class="s">"""
def F(a, b):
    y1 = b ** 2
    y2 = 1 + y1
    return a / y2
"""</span>
<span class="o">>>></span> <span class="n">y</span><span class="p">,</span> <span class="n">B</span> <span class="o">=</span> <span class="n">diff_prog</span><span class="p">(</span><span class="n">input_prog</span><span class="p">,</span> <span class="n">a</span><span class="o">=</span><span class="mf">1.</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="mf">2.</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="mf">0.2</span>
<span class="o">>>></span> <span class="k">exec</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="c1"># get the derivative function as a live object in current session
</span><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">dF</span><span class="p">(</span><span class="mf">1.</span><span class="p">))</span> <span class="c1"># 'dF()' is defined in source code 'B'
</span><span class="p">(</span><span class="mf">0.2</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.16</span><span class="p">)</span>
</code></pre></div></div>
<p>Please pay attention to the fact that both our problems discussed in tape-based system are effectively solved now:</p>
<ol>
<li>
<p>We no longer need to be under the umbrella of a framework as we can directly work with native code. In the above example, the source code of the given function is simply written in native python. The example shows the overall pullback source-code (i.e., <code class="language-plaintext highlighter-rouge">B</code>) and also its explicitly compiled form (i.e., <code class="language-plaintext highlighter-rouge">dF</code>). Optionally, a DiffProg system can produce readily compiled derivative function.</p>
</li>
<li>
<p>The DiffProg system can “see” the whole source code at once, giving it the opportunity to run various optimizations. As a result, both the given program and the derivative program could be much faster than with the standard tape-based approaches.</p>
</li>
</ol>
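<p>To show that the hypothetical <code class="language-plaintext highlighter-rouge">diff_prog</code> above is not magic, here is a deliberately tiny sketch of source-code transformation built on Python’s standard <code class="language-plaintext highlighter-rouge">ast</code> module (it needs Python 3.9+ for <code class="language-plaintext highlighter-rouge">ast.unparse</code>). It rests on strong assumptions: the input is a single straight-line function, every statement is one binary operation on names or constants, every variable is assigned once, and exponents are constants. The <code class="language-plaintext highlighter-rouge">PARTIALS</code> table and the <code class="language-plaintext highlighter-rouge">_bar</code> naming are my own choices. Unlike the hypothetical version, this <code class="language-plaintext highlighter-rouge">diff_prog</code> returns only the derivative source, and the generated <code class="language-plaintext highlighter-rouge">dF</code> takes the inputs \((a, b)\) and seeds the output gradient with 1.</p>

```python
import ast

# Source templates for the partial derivatives of `l op r` w.r.t. l and r.
PARTIALS = {
    ast.Add: ("1.0", "1.0"),
    ast.Sub: ("1.0", "-1.0"),
    ast.Mult: ("{r}", "{l}"),
    ast.Div: ("1.0 / {r}", "-{l} / {r} ** 2"),
    ast.Pow: ("{r} * {l} ** ({r} - 1)", None),  # exponent assumed to be a constant
}


def diff_prog(src):
    """Return the source code of dF(...) for a straight-line function, without running it."""
    fn = ast.parse(src).body[0]
    args = [a.arg for a in fn.args.args]
    forward, backward, names = [], [], list(args)
    for stmt in fn.body:
        target = "_out" if isinstance(stmt, ast.Return) else stmt.targets[0].id
        expr = stmt.value  # assumed to be a single binary operation
        l = f"({ast.unparse(expr.left)})"
        r = f"({ast.unparse(expr.right)})"
        forward.append(f"{target} = {ast.unparse(expr)}")
        names.append(target)
        # emit one gradient-accumulation statement per differentiable operand
        for operand, tmpl in zip((expr.left, expr.right), PARTIALS[type(expr.op)]):
            if isinstance(operand, ast.Name) and tmpl is not None:
                backward.append(
                    f"{operand.id}_bar += {target}_bar * ({tmpl.format(l=l, r=r)})"
                )
    body = forward + [f"{n}_bar = 0.0" for n in names] + ["_out_bar = 1.0"]
    body += backward[::-1]  # the backward pass visits statements in reverse order
    body.append("return " + ", ".join(f"{a}_bar" for a in args))
    return f"def dF({', '.join(args)}):\n" + "\n".join("    " + s for s in body)


src = """
def F(a, b):
    y1 = b ** 2
    y2 = 1 + y1
    return a / y2
"""
derivative_src = diff_prog(src)   # pure text-to-text; F is never executed
namespace = {}
exec(derivative_src, namespace)   # compile the derivative program
grads = namespace["dF"](1.0, 2.0)
```

<p>Printing <code class="language-plaintext highlighter-rouge">derivative_src</code> shows an ordinary Python function: the forward statements, followed by the gradient accumulations in reverse order. A real system performs the same kind of transformation on a compiler IR rather than on strings.</p>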
<p>Although I showed the examples in Python for ease of understanding, it doesn’t really have to be Python. The theory of DiffProg is very general and can be adapted to any language. In fact, Python is NOT the language of choice for some of the first successful DiffProg systems. The one we are gonna talk about is written in a relatively new language called <a href="http://julialang.org/">Julia</a>. The Julia language and its compiler provide excellent support for meta-programming, i.e. manipulating/analysing/constructing Julia programs within itself. This allows Julia to offer a DiffProg system that is much more flexible than naively parsing strings (like the toy example shown above). We will look into the specifics of the Julia language and its DiffProg framework called “<a href="https://fluxml.ai/Zygote.jl/latest/">Zygote</a>” later in this article. But before that, we will look at a few details about the general compiler machinery that is required to implement DiffProg systems. Since this article is mostly targeted towards people from an ML/DL background, I will try my best to keep the compiler-design details reasonable.</p>
<h3 id="static-single-assignment-ssa-form">Static Single Assignment (SSA) form</h3>
<p>A compiler (or compiler-like system) analyses a given source code by parsing it as a string. Then, it creates a large and complex data structure (known as an AST) containing control flow, conditionals and every fundamental language construct. Such a structure is further compiled down to a relatively low-level representation comprising the core flow of the source program. This low-level code is known as the “Intermediate Representation (IR)”.
One of its fundamental purposes is to replace every variable name with a unique ID. Given a source code like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function F(a, b)
    y1 = b ^ 2
    y1 = 1 + y1
    return a / y1
end
</code></pre></div></div>
<p>a compiler can turn it into an IR (hypothetical) like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function F(%1, %2)
%3 = %2 ^ 2
%3 = 1 + %3
return %1 / %3
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">%N</code> is a unique placeholder for a variable. However, this particular form is a little inconvenient to analyse in practice due to the possibility of symbol redefinition (e.g. the variable <code class="language-plaintext highlighter-rouge">y1</code> in the above example). Modern compilers (including Julia’s) use a slightly improved IR, called “<em>SSA (Static Single Assignment) form IR</em>”, which assigns each variable only once and often introduces extra unique symbols in order to achieve that.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function F(%1, %2)
%3 = %2 ^ 2
%4 = 1 + %3
return %1 / %4
</code></pre></div></div>
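<p>The renaming that produces this SSA form can be sketched in a few lines of Python for straight-line code. The representation used here, a list of <code class="language-plaintext highlighter-rouge">(target, tokens)</code> pairs with <code class="language-plaintext highlighter-rouge">None</code> marking the return statement, and the function name <code class="language-plaintext highlighter-rouge">to_ssa</code> are assumptions made for illustration; real compilers also have to deal with branches and loops, which this sketch ignores.</p>

```python
def to_ssa(params, statements):
    """Rename variables so that every assignment defines a fresh %N symbol."""
    current = {}   # source variable name -> its latest SSA symbol
    counter = 0

    def fresh():
        nonlocal counter
        counter += 1
        return f"%{counter}"

    for p in params:
        current[p] = fresh()

    out = []
    for target, tokens in statements:
        # rename the uses first, so `y1 = 1 + y1` reads the *old* y1
        rhs = " ".join(current.get(t, t) for t in tokens)
        if target is None:
            out.append(f"return {rhs}")
        else:
            current[target] = fresh()   # then bind the target to a new symbol
            out.append(f"{current[target]} = {rhs}")
    return out


program = [
    ("y1", ["b", "^", "2"]),
    ("y1", ["1", "+", "y1"]),   # re-assignment of y1 in the source
    (None, ["a", "/", "y1"]),
]
ssa = to_ssa(["a", "b"], program)
# ssa == ["%3 = %2 ^ 2", "%4 = 1 + %3", "return %1 / %4"]
```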
<p>Please note how it used an extra unique ID (i.e. <code class="language-plaintext highlighter-rouge">%4</code>) in order to avoid re-assignment (of <code class="language-plaintext highlighter-rouge">%3</code>).
It has been shown that such SSA-form IR (rather than direct source code) can be differentiated, and a corresponding “Derivative IR” can be retrieved. The obvious way of crafting the derivative IR of \(F\) is to use the derivative IRs of its constituent operations, similar to what is done in the tape-based method. The biggest difference is that everything is now in terms of source code (or rather IR, to be precise). The compiler could craft the derivative program like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function dF(%1, %2)
# IR for forward pass
%3, B1 = J(pow, %2, 2)
%4, B2 = J(add, 1, %3)
_, B3 = J(div, %1, %4)
# IR for backward pass
%5, %6 = B3(1)
%7 = B2(%6)
%8 = B1(%7)
return %5, %8
</code></pre></div></div>
<p>The structure of the above code may resemble the sequence of computations in Eq.2, but it’s very different in terms of implementation (refer to Fig.2 below). The code (IR) is constructed at compile time by a compiler-like framework (the DiffProg system). The derivative IR is then passed on to an IR optimizer, which can squeeze it further through optimizations like dead-code elimination, operation fusion and more advanced ones. Finally, it is compiled down to machine code.</p>
<center>
<figure>
<img width="90%" style="padding-top: 20px;" src="/public/posts_res/18/diff_prog.png" />
<figcaption>Fig.2: Overall pipeline of a DiffProg system with source-code transformation. Green arrows indicate creation of pullback codes by the framework and magenta arrows denote composition of source code blocks. After compiler optimization, the whole source code is squeezed into highly efficient object code. </figcaption>
</figure>
</center>
<h2 id="zygote-a-diffprog-framework-for-julia"><code class="language-plaintext highlighter-rouge">Zygote</code>: A DiffProg framework for Julia</h2>
<p>Julia is a particularly interesting language when it comes to implementing a DiffProg framework. There are solid reasons why <code class="language-plaintext highlighter-rouge">Zygote</code>, one of the most successful DiffProg frameworks, is written in Julia. I will try to articulate a few of them below:</p>
<ol>
<li>
<p><strong>Just-In-Time (JIT) compiler:</strong> Julia’s efficient Just-in-time (JIT) compiler compiles each function to native code the first time it is called and runs it immediately, striking a balance between interpreted and compiled languages.</p>
</li>
<li>
<p><strong>Type specialization:</strong> Julia allows writing generic/optional/loosely-typed functions that can later be instantiated using concrete types. High-density computations specifically benefit from it by casting every computation in terms of <code class="language-plaintext highlighter-rouge">Float32/Float64</code> which can in turn produce specialized instructions (e.g. <code class="language-plaintext highlighter-rouge">AVX</code>, <code class="language-plaintext highlighter-rouge">MMX</code>, <code class="language-plaintext highlighter-rouge">AVX2</code>) for modern CPUs.</p>
</li>
<li>
<p><strong>Pre-compilation:</strong> The peculiar feature that benefits <code class="language-plaintext highlighter-rouge">Zygote</code> the most is that Julia keeps track of the compilations that have already been done in the current session and DOES NOT do them again. Since DL/ML is all about computing gradients over and over again, <code class="language-plaintext highlighter-rouge">Zygote</code> compiles and optimizes the derivative program (IR) just once and runs it repeatedly (which is blazingly fast) with different parameter values.</p>
</li>
<li>
<p><strong>LLVM IR:</strong> Julia uses the <a href="https://llvm.org/">LLVM compiler infrastructure</a> as its backend and hence emits LLVM IR, which is known to be highly performant and is used by many other prominent languages.</p>
</li>
</ol>
<p>Now, let’s see <code class="language-plaintext highlighter-rouge">Zygote</code>’s primary API, which is surprisingly simple. The central API of <code class="language-plaintext highlighter-rouge">Zygote</code> is the function <code class="language-plaintext highlighter-rouge">Zygote.gradient(..)</code>, whose first argument is the function to be differentiated (written in native Julia), followed by the argument(s) at which the gradient is to be computed.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="k">using</span> <span class="n">Zygote</span>
<span class="n">julia</span><span class="o">></span> <span class="k">function</span><span class="nf"> F</span><span class="x">(</span><span class="n">x</span><span class="x">)</span>
<span class="k">return</span> <span class="mi">3</span><span class="n">x</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">2</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="n">julia</span><span class="o">></span> <span class="n">gradient</span><span class="x">(</span><span class="n">F</span><span class="x">,</span> <span class="mi">5</span><span class="x">)</span>
<span class="x">(</span><span class="mi">32</span><span class="x">,)</span>
</code></pre></div></div>
<p>That is basically computing \(\displaystyle{ \left[ \frac{\partial F}{\partial x} \right]_{x=5} }\) for \(F(x) = 3x^2 + 2x + 1\).</p>
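<p>As a sanity check that does not depend on Zygote at all, the value 32 can be reproduced from the hand-derived derivative \(F'(x) = 6x + 2\) and from a central finite difference. This is just a sketch; the step size <code class="language-plaintext highlighter-rouge">h</code> is an arbitrary choice.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>F(x)  = 3x^2 + 2x + 1
dF(x) = 6x + 2                       # derivative worked out by hand

# Central finite difference around x = 5.
h  = 1e-6
fd = (F(5 + h) - F(5 - h)) / (2h)

@show dF(5)                          # exact derivative at x = 5
@show fd                             # numerical approximation of the same
</code></pre></div></div>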
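<p>For debugging purposes, we can inspect the <em>actual</em> IR of a given function and of its pullback. Note that <code class="language-plaintext highlighter-rouge">@code_ir</code> prints Julia’s SSA-form IR (the representation that Zygote transforms), not the LLVM IR generated further down the pipeline. The actual IR is a bit more complex than what I showed earlier, but similar in high-level structure. We can peek into the IR of the above function:</p>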
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">Zygote</span><span class="o">.</span><span class="nd">@code_ir</span> <span class="n">F</span><span class="x">(</span><span class="mi">5</span><span class="x">)</span>
<span class="mi">1</span><span class="o">:</span> <span class="x">(</span><span class="o">%</span><span class="mi">1</span><span class="x">,</span> <span class="o">%</span><span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">3</span> <span class="o">=</span> <span class="n">Core</span><span class="o">.</span><span class="n">apply_type</span><span class="x">(</span><span class="n">Base</span><span class="o">.</span><span class="kt">Val</span><span class="x">,</span> <span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">4</span> <span class="o">=</span> <span class="x">(</span><span class="o">%</span><span class="mi">3</span><span class="x">)()</span>
<span class="o">%</span><span class="mi">5</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">literal_pow</span><span class="x">(</span><span class="n">Main</span><span class="o">.:^</span><span class="x">,</span> <span class="o">%</span><span class="mi">2</span><span class="x">,</span> <span class="o">%</span><span class="mi">4</span><span class="x">)</span>
<span class="o">%</span><span class="mi">6</span> <span class="o">=</span> <span class="mi">3</span> <span class="o">*</span> <span class="o">%</span><span class="mi">5</span>
<span class="o">%</span><span class="mi">7</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="o">%</span><span class="mi">2</span>
<span class="o">%</span><span class="mi">8</span> <span class="o">=</span> <span class="o">%</span><span class="mi">6</span> <span class="o">+</span> <span class="o">%</span><span class="mi">7</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="o">%</span><span class="mi">8</span>
</code></pre></div></div>
<p>.. and also its “Adjoint”. The adjoint in Zygote is basically the mathematical functional \(\mathcal{J}(\cdot)\) that we’ve been seeing all along.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">Zygote</span><span class="o">.</span><span class="nd">@code_adjoint</span> <span class="n">F</span><span class="x">(</span><span class="mi">5</span><span class="x">)</span>
<span class="n">Zygote</span><span class="o">.</span><span class="kt">Adjoint</span><span class="x">(</span><span class="mi">1</span><span class="o">:</span> <span class="x">(</span><span class="o">%</span><span class="mi">3</span><span class="x">,</span> <span class="o">%</span><span class="mi">4</span> <span class="o">::</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">Context</span><span class="x">,</span> <span class="o">%</span><span class="mi">1</span><span class="x">,</span> <span class="o">%</span><span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">5</span> <span class="o">=</span> <span class="n">Core</span><span class="o">.</span><span class="n">apply_type</span><span class="x">(</span><span class="n">Base</span><span class="o">.</span><span class="kt">Val</span><span class="x">,</span> <span class="mi">2</span><span class="x">)</span>
<span class="o">%</span><span class="mi">6</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">_pullback</span><span class="x">(</span><span class="o">%</span><span class="mi">4</span><span class="x">,</span> <span class="o">%</span><span class="mi">5</span><span class="x">)</span>
<span class="o">...</span>
<span class="c"># please run yourself to see the full code</span>
<span class="o">...</span>
<span class="o">%</span><span class="mi">13</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">gradindex</span><span class="x">(</span><span class="o">%</span><span class="mi">12</span><span class="x">,</span> <span class="mi">1</span><span class="x">)</span>
<span class="o">%</span><span class="mi">14</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">accum</span><span class="x">(</span><span class="o">%</span><span class="mi">6</span><span class="x">,</span> <span class="o">%</span><span class="mi">10</span><span class="x">)</span>
<span class="o">%</span><span class="mi">15</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">tuple</span><span class="x">(</span><span class="nb">nothing</span><span class="x">,</span> <span class="o">%</span><span class="mi">14</span><span class="x">)</span>
<span class="k">return</span> <span class="o">%</span><span class="mi">15</span><span class="x">)</span>
</code></pre></div></div>
<p>I have established throughout this article that the function \(F(x)\) can literally be any arbitrary program written in native Julia using standard language features.
Let’s see another toy (but meaningful) program using more flexible Julia code.</p>
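<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">LinearAlgebra</span> <span class="c"># provides norm()</span>
<span class="k">using</span> <span class="n">Zygote</span>
<span class="k">struct</span><span class="nc"> Point</span>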
<span class="n">x</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">y</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">Point</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Real</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Real</span><span class="x">)</span> <span class="o">=</span> <span class="n">new</span><span class="x">(</span><span class="n">convert</span><span class="x">(</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">x</span><span class="x">),</span> <span class="n">convert</span><span class="x">(</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">y</span><span class="x">))</span>
<span class="k">end</span>
<span class="c"># Operator overloads for '+', '-' and '*' on Point go here (omitted for brevity)</span>
<span class="k">function</span><span class="nf"> distance</span><span class="x">(</span><span class="n">p₁</span><span class="o">::</span><span class="n">Point</span><span class="x">,</span> <span class="n">p₂</span><span class="o">::</span><span class="n">Point</span><span class="x">)</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">δp</span> <span class="o">=</span> <span class="n">p₁</span> <span class="o">-</span> <span class="n">p₂</span>
<span class="n">norm</span><span class="x">([</span><span class="n">δp</span><span class="o">.</span><span class="n">x</span><span class="x">,</span> <span class="n">δp</span><span class="o">.</span><span class="n">y</span><span class="x">])</span>
<span class="k">end</span>
<span class="n">p₁</span><span class="x">,</span> <span class="n">p₂</span> <span class="o">=</span> <span class="n">Point</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span> <span class="mf">3.0</span><span class="x">),</span> <span class="n">Point</span><span class="x">(</span><span class="o">-</span><span class="mf">2.</span><span class="x">,</span> <span class="mi">0</span><span class="x">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">Point</span><span class="x">(</span><span class="o">-</span><span class="mi">1</span><span class="o">//</span><span class="mi">3</span><span class="x">,</span> <span class="mf">1.0</span><span class="x">)</span>
<span class="c"># initial parameters</span>
<span class="n">K₁</span><span class="x">,</span> <span class="n">K₂</span> <span class="o">=</span> <span class="mf">0.1</span><span class="x">,</span> <span class="mf">0.1</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="mi">1000</span> <span class="c"># no. of epochs</span>
<span class="c"># compute gradients</span>
<span class="nd">@time</span> <span class="n">δK₁</span><span class="x">,</span> <span class="n">δK₂</span> <span class="o">=</span> <span class="n">Zygote</span><span class="o">.</span><span class="n">gradient</span><span class="x">(</span><span class="n">K₁</span><span class="x">,</span> <span class="n">K₂</span><span class="x">)</span> <span class="k">do</span> <span class="n">k₁</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">k₂</span><span class="o">::</span><span class="kt">Float64</span>
<span class="n">p̂</span> <span class="o">=</span> <span class="n">p₁</span> <span class="o">*</span> <span class="n">k₁</span> <span class="o">+</span> <span class="n">p₂</span> <span class="o">*</span> <span class="n">k₂</span>
<span class="n">distance</span><span class="x">(</span><span class="n">p̂</span><span class="x">,</span> <span class="n">p</span><span class="x">)</span> <span class="c"># scalar output of the function</span>
<span class="k">end</span>
<span class="c"># update parameters</span>
<span class="n">K₁</span> <span class="o">-=</span> <span class="mf">1e-3</span> <span class="o">*</span> <span class="n">δK₁</span>
<span class="n">K₂</span> <span class="o">-=</span> <span class="mf">1e-3</span> <span class="o">*</span> <span class="n">δK₂</span>
<span class="k">end</span>
<span class="nd">@show</span> <span class="n">K₁</span><span class="x">,</span> <span class="n">K₂</span>
<span class="c"># shows "(K₁, K₂) = (0.33427804653861276, 0.4996408206795386)"</span>
</code></pre></div></div>
<p>The above program is basically solving the following (pretty simple) problem</p>
\[\begin{align}
\arg\min_{K_1,K_2} &\vert\vert \widehat{p}(K_1,K_2) - p \vert\vert_2 \\
\text{with }&\widehat{p}(K_1,K_2) ≜ p_1 \cdot K_1 + p_2 \cdot K_2
\end{align}\]
<p>where \(p=[-1/3, 1]^T, p_1=[2,3]^T\) and \(p_2=[-2,0]^T\). By choosing these specific numbers, I guaranteed that an exact solution exists, namely \(K_1 = 1/3, K_2 = 1/2\), which is what the program converges towards.</p>
<p>If you look at the program at a glance, you will notice that it is written almost entirely in native Julia, using structures (<code class="language-plaintext highlighter-rouge">struct Point</code>), ordinary library functions (<code class="language-plaintext highlighter-rouge">convert()</code> from Base, <code class="language-plaintext highlighter-rouge">norm()</code> from the <code class="language-plaintext highlighter-rouge">LinearAlgebra</code> standard library), memory access constructs (<code class="language-plaintext highlighter-rouge">δp.x</code>, <code class="language-plaintext highlighter-rouge">δp.y</code>) etc. The only usage of Zygote is the <code class="language-plaintext highlighter-rouge">Zygote.gradient()</code> call at the heart of the loop. BTW, I omitted the operator overloading functions to save space.</p>
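<p>For readers who want to run the program, here is one plausible set of the omitted overloads. This is an assumption on my part for illustration, not necessarily the definitions used to produce the numbers above; it only needs Base and <code class="language-plaintext highlighter-rouge">LinearAlgebra</code>, and it checks that \(K_1 = 1/3, K_2 = 1/2\) places \(\widehat{p}\) on top of \(p\).</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using LinearAlgebra  # standard library; provides norm()

struct Point
    x::Float64
    y::Float64
    Point(x::Real, y::Real) = new(convert(Float64, x), convert(Float64, y))
end

# Hypothetical overloads: element-wise add/subtract and scaling by a real.
Base.:+(a::Point, b::Point) = Point(a.x + b.x, a.y + b.y)
Base.:-(a::Point, b::Point) = Point(a.x - b.x, a.y - b.y)
Base.:*(a::Point, k::Real)  = Point(a.x * k, a.y * k)

function distance(a::Point, b::Point)::Float64
    δ = a - b
    norm([δ.x, δ.y])
end

p₁, p₂ = Point(2, 3.0), Point(-2.0, 0)
p = Point(-1//3, 1.0)

# Plug in the exact solution; the distance should vanish up to rounding.
p_hat = p₁ * (1/3) + p₂ * (1/2)
@show distance(p_hat, p)
</code></pre></div></div>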
<p>I am not showing the IR code for this one; you are free to execute <code class="language-plaintext highlighter-rouge">@code_ir</code> and <code class="language-plaintext highlighter-rouge">@code_adjoint</code> on the function implicitly defined using the <code class="language-plaintext highlighter-rouge">do .. end</code> block. One thing I would like to show is the execution speed, which backs up my earlier argument about “precompilation”. The time measuring macro (<code class="language-plaintext highlighter-rouge">@time</code>) shows this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 11.764279 seconds (26.50 M allocations: 1.342 GiB, 4.58% gc time)
0.000025 seconds (44 allocations: 2.062 KiB)
0.000026 seconds (44 allocations: 2.062 KiB)
0.000007 seconds (44 allocations: 2.062 KiB)
0.000006 seconds (44 allocations: 2.062 KiB)
0.000005 seconds (44 allocations: 2.062 KiB)
0.000005 seconds (44 allocations: 2.062 KiB)
</code></pre></div></div>
<p>Did you see how the execution time dropped by an astonishingly high margin after the first iteration? That’s Julia’s precompilation at work: it compiled the derivative program only once (on its first encounter) and produced highly efficient code to be reused later. It might not be as big a surprise if you already know Julia, but it is definitely a huge advantage for a DiffProg framework.</p>
<hr />
<p>Okay, that’s about it today. See you next time. The following references have been used for preparing this article:</p>
<ol>
<li>“Don’t Unroll Adjoint: Differentiating SSA-Form Programs”, Michael Innes, <a href="https://arxiv.org/abs/1810.07951">arXiv/1810.07951</a>.</li>
<li><a href="https://www.youtube.com/watch?v=LjWzgTPFu14">Talk</a> by Michael Innes @ Julia London user group meetup.</li>
<li><a href="https://www.youtube.com/watch?v=Sv3d0k7wWHk">Talk</a> by Elliot Saba & Viral Shah @ Microsoft Research.</li>
<li><a href="https://fluxml.ai/Zygote.jl/latest/">Zygote’s documentation</a> & <a href="https://docs.julialang.org/en/v1/">Julia’s documentation</a>.</li>
</ol>
Ayan Das
If you are following the recent developments in the field of Deep Learning, you might recognize this new buzz-word, “Differentiable Programming”, doing rounds on social media (including prominent researchers like Yann LeCun, Andrej Karpathy) for a year or two. Differentiable Programming (let’s shorten it as “DiffProg” for the rest of this article) is essentially a system proposed as an alternative to tape-based Backpropagation which is running a recorder (often called “Tape”) that builds a computation graph at runtime and propagates error signal from end towards the leaf-nodes (typically weights and biases). DiffProg is very different from an implementation perspective - it doesn’t really “propagate” anything. It consumes a “program” in the form of source code and produces the “Derivative program” (also source code) w.r.t its inputs without ever actually running either of them. Additionally, DiffProg allows users the flexibility to write arbitrary programs without constraining it to any “guidelines”. In this article, I will describe the difference between the two methods in theoretical as well as practical terms. We’ll look into one successful DiffProg implementation (named “Zygote”, written in Julia) gaining popularity in the Deep Learning community.
Energy Based Models (EBMs): A comprehensive introduction
2020-08-13T00:00:00+00:00
2020-08-13T00:00:00+00:00
https://ayandas.me/blog-tut/2020/08/13/energy-based-models-one
<p>We talked extensively about <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGMs</a> in my earlier article and also described <a href="/blog-tut/2020/01/01/variational-autoencoder.html">one particular model</a> following the principles of Variational Inference (VI). There exists another class of models, conveniently represented by <em>Undirected</em> Graphical Models, which are practiced relatively less than modern methods of Deep Learning (DL) in the research community.
They are also characterized as <strong>Energy Based Models (EBM)</strong> because, as we shall see, they rely on something called <em>Energy Functions</em>. In the early days of this Deep Learning <em>renaissance</em>, we discovered a few extremely powerful models which helped DL to gain momentum. The class of models we are going to discuss has far more theoretical support than modern day Deep Learning, which, as we know, has largely relied on intuition and trial-and-error. In this article, I will introduce you to the general concept of Energy Based Models (EBMs), their difficulties and how we can get over them. Also, we will look at a specific family of EBM known as <strong>Boltzmann Machines (BM)</strong> which are very well known in the literature.</p>
<h2 id="undirected-graphical-models">Undirected Graphical Models</h2>
<p>The story starts when we try to model a number of Random Variables (RVs) as a graph but have only weak knowledge about them: we know which variables are related, but not the direction of influence. Direction is a necessary requirement for <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGMs</a>. For example, let’s consider a lattice of atoms (Fig.1(a)) where only neighbouring atoms influence each other’s spins but it is unclear what the directions of the influences are. For simplicity, we will use a simpler model (Fig.1(b)) for demonstration purposes.</p>
<center>
<figure>
<img width="65%" style="padding-top: 20px;" src="/public/posts_res/17/undir_models.png" />
<figcaption>Fig.1: (a) An atom lattice model. (b) An arbitrary undirected model.</figcaption>
</figure>
</center>
<p>We model a set of random variables \(\mathbf{X}\) (in our example, \(\{ A,B,C,D \}\)) whose connections are defined by graph \(\mathcal{G}\) and have <em>“potential functions”</em> defined on each of its maximal <a href="https://en.wikipedia.org/wiki/Clique_(graph_theory)">cliques</a> \(\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})\). The total potential of the graph is defined as</p>
<p>\[
\Phi(\mathbf{x}) = \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \phi_{\mathcal{Q}}(q)
\]</p>
<p>\(q\) is an arbitrary instantiation of the set of RVs denoted by \(\mathcal{Q}\). The potential functions \(\phi_{\mathcal{Q}}(q)\in\mathbb{R}_{>0}\) are basically “affinity” functions on the state space of the cliques, i.e. given a state \(q\) of a clique \(\mathcal{Q}\), the corresponding potential function \(\phi_{\mathcal{Q}}(q)\) returns the <em>viability of that state</em>, or how likely that state is. Potential functions are somewhat analogous to conditional densities from Directed PGMs, except that potentials are <em>arbitrary positive values</em>; they don’t necessarily sum to one. For a concrete example, the graph in Fig.1(b) can be factorized as \(\displaystyle{ \Phi(a,b,c,d) = \phi_{\{A,B,C\}}(a,b,c)\cdot \phi_{\{A,D\}}(a,d) }\). If we assume the variables \(\{ A,D \}\) are binary RVs, the potential function corresponding to that clique, at its simplest, could be a table like this:</p>
\[\phi_{\{A,D\}}(a=0,d=0) = +4.00 \\
\phi_{\{A,D\}}(a=0,d=1) = +0.23 \\
\phi_{\{A,D\}}(a=1,d=0) = +5.00 \\
\phi_{\{A,D\}}(a=1,d=1) = +9.45\]
<p>Just like every other model in machine learning, the potential functions can be parameterized, leading to</p>
<p>\[ \tag{1}
\Phi(\mathbf{x}; \Theta) = \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \phi_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})
\]</p>
<p>Semantically, a potential denotes how likely a given state is: the higher the potential, the more likely that state is.</p>
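<p>To make the factorization concrete, here is a small sketch of the total potential for the graph in Fig.1(b). The table for \(\{A,D\}\) is the one given above; the potential for \(\{A,B,C\}\) is not specified in this article, so <code class="language-plaintext highlighter-rouge">ϕ_ABC</code> below is a made-up stand-in.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Potential table for the clique {A, D}, copied from the text; keys are (a, d).
ϕ_AD = Dict((0, 0) => 4.00, (0, 1) => 0.23, (1, 0) => 5.00, (1, 1) => 9.45)

# Hypothetical potential for the clique {A, B, C}: it favours agreeing states.
ϕ_ABC(a, b, c) = (a == b == c) ? 3.0 : 1.0

# Total (un-normalized) potential of the graph: product over maximal cliques.
Φ(a, b, c, d) = ϕ_ABC(a, b, c) * ϕ_AD[(a, d)]

@show Φ(1, 1, 1, 1)   # a high-affinity state
@show Φ(0, 1, 0, 1)   # a low-affinity state

# Summing over all 16 joint states shows that Φ is not a probability.
total = sum(Φ(a, b, c, d) for a in 0:1, b in 0:1, c in 0:1, d in 0:1)
@show total
</code></pre></div></div>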
<h2 id="reparameterizing-in-terms-of-energy">Reparameterizing in terms of Energy</h2>
<p>When we are defining a model, however, it is inconvenient to choose a constrained (non-negative valued) parameterized function. We can easily reparameterize each potential function in terms of <strong>energy</strong> functions \(E_{\mathcal{Q}}\) where</p>
<p>\[\tag{2}
\phi_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}) = \exp(-E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}))
\]</p>
<p>The \(\exp(\cdot)\) forces the potentials to always be positive, and thus we are free to choose an <em>unconstrained</em> energy function. One question you might ask: “why the negative sign?”. Frankly, there is no functional purpose of that negative sign. All it does is <em>reverse the semantic meaning</em> of the parameterized function. When we were dealing in terms of potentials, a state which is more likely had higher potential. Now it’s the opposite: states that are more likely have less energy. Does this semantics sound familiar? It actually came from Physics, where we deal with “energies” (potential, kinetic etc.) which are <em>bad</em>, i.e. less energy means a more stable system.</p>
<p>Such reparameterization affects the semantics of Eq.1 as well. Putting Eq.2 into Eq.1 yields</p>
\[\begin{align}
\Phi(\mathbf{x}; \Theta) &= \prod_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} \exp(-E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})) \\
\tag{3}
&= \exp\left(-\sum_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}})\right) =
\exp(-E_{\mathcal{G}}(\mathbf{x}; \Theta))
\end{align}\]
<p>Here we defined \({ E_{\mathcal{G}}(\mathbf{x}; \Theta) \triangleq \sum_{\mathcal{Q}\in\mathrm{Cliques}(\mathcal{G})} E_{\mathcal{Q}}(q; \Theta_{\mathcal{Q}}) }\) to be the energy of the whole model. Please note that the reparameterization helped us to convert the relationship between individual cliques and the whole graph <em>from multiplicative (Eq.1) to additive (Eq.3)</em>. This implies that when we design energy functions for such undirected models, we design an energy function for each individual clique and just add them.</p>
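<p>A quick numerical check of Eq.2 and Eq.3 may help here. The two clique energies below are hypothetical (any real-valued function would do, which is the whole point of the reparameterization); the sketch confirms that multiplying the potentials and exponentiating the summed energy give the same number.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical parameterized clique energies; any real-valued θ and w are allowed.
E_AD(a, d, θ)     = -(θ[1] * a + θ[2] * d + θ[3] * a * d)
E_ABC(a, b, c, w) = -w * (a * b + b * c + a * c)

θ, w = [-1.5, 0.3, 2.0], 0.7

# Eq.2: potentials recovered from energies are positive by construction.
ϕ_AD(a, d)     = exp(-E_AD(a, d, θ))
ϕ_ABC(a, b, c) = exp(-E_ABC(a, b, c, w))

# Eq.3: the energy of the graph is the sum of the clique energies.
E_G(a, b, c, d) = E_ABC(a, b, c, w) + E_AD(a, d, θ)

a, b, c, d = 1, 0, 1, 1
@show ϕ_ABC(a, b, c) * ϕ_AD(a, d)   # product of potentials (Eq.1)
@show exp(-E_G(a, b, c, d))         # exp of the summed energy (Eq.3)
</code></pre></div></div>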
<p>All this is fine .. well .. unless we need to do things like <em>sampling</em>, <em>computing log-likelihood</em> etc. Then our energy-based parameterization fails because it’s not easy to incorporate an un-normalized function into probabilistic frameworks. So we need a way to get back to probabilities.</p>
<h2 id="back-to-probabilities">Back to Probabilities</h2>
<p>The obvious way to convert the un-normalized potentials of the model to a normalized distribution is to explicitly normalize Eq.3 over its domain</p>
\[\begin{align}
p(\mathbf{x}; \Theta) &= \frac{\Phi(\mathbf{x}; \Theta)}{
\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \Phi(\mathbf{x}'; \Theta)
} \\ \\
\tag{4}
&= \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}; \Theta)/\tau)}{\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \exp(-E_{\mathcal{G}}(\mathbf{x}'; \Theta)/\tau)}\text{ (using Eq.3)}
\end{align}\]
<p>This is the probabilistic model implicitly defined by the energy functions over the whole state-space. [We will discuss \(\tau\) shortly. Consider it to be 1 for now]. If the reader is familiar with Statistical Mechanics at all, they might find it extremely similar to the <code class="language-plaintext highlighter-rouge">Boltzmann Distribution</code>. Here’s what <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Wikipedia</a> says:</p>
<blockquote>
<p>In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution) is a probability distribution or probability measure that gives the probability that a system will be in a certain state as a function of that state’s energy …</p>
</blockquote>
<p>From now on, Eq.4 will be the sole connection between energy-space and probability-space. We can now forget about potential functions. A 1-D example of an energy function and the corresponding probability distribution is shown below:</p>
<center>
<figure>
<img width="75%" style="padding-top: 20px;" src="/public/posts_res/17/energy_prob.png" />
<figcaption>Fig.2: An energy function and its corresponding probability distribution.</figcaption>
</figure>
</center>
<p>The denominator of Eq.4 is often known as the “Partition Function” (denoted as \(Z\)). Whatever the name, it is quite difficult to compute in general, because the number of terms in the summation grows exponentially with the number of variables in \(\mathbf{X}\).</p>
<p>A hyper-parameter called “temperature” (denoted as \(\tau\)) is often introduced in Eq.4 which also has its roots in the original <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Boltzmann Distribution from Physics</a>. A decrease in temperature gathers the probability mass near the lowest energy regions. If not specified, consider \(\tau=1\) for the rest of the article.</p>
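<p>The effect of \(\tau\) is easy to see on a toy example. The sketch below applies Eq.4 to a hypothetical energy defined over just five discrete states, where \(Z\) can be computed by direct summation; the energy values are made up.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical 1-D energy over five discrete states.
energies = [2.0, 0.5, 1.0, 3.0, 0.1]

# Eq.4: normalize exp(-E/τ) by the partition function Z.
function boltzmann(E::Vector{Float64}, τ::Float64)
    unnormalized = exp.(-E ./ τ)
    Z = sum(unnormalized)          # partition function
    return unnormalized ./ Z
end

# From hot (almost uniform) to cold (mass piles up on the lowest energy).
for τ in (5.0, 1.0, 0.1)
    p = boltzmann(energies, τ)
    println("τ = ", τ, "  p = ", round.(p; digits=3))
end
</code></pre></div></div>
<p>With only five states the sum in \(Z\) is trivial; for \(D\) binary variables the same sum would have \(2^D\) terms, which is exactly the difficulty mentioned above.</p>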
<h2 id="a-general-learning-algorithm">A general learning algorithm</h2>
<p>The question now is: how do I learn the model given a dataset? Let’s say my dataset has \(N\) samples: \(\mathcal{D} = \{ x^{(i)} \}_{i=1}^N\). An obvious way to derive a learning algorithm is to minimize the Negative Log-Likelihood (NLL) loss of the model over our dataset</p>
\[\begin{align}
\mathcal{L}(\Theta; \mathcal{D}) = - \log \prod_{i=1}^N p(x^{(i)}; \Theta) &= \sum_{i=1}^N -\log p(x^{(i)}; \Theta) \\
&= \underbrace{\frac{1}{N}\sum_{i=1}^N}_{\text{expectation}} \left[ E_{\mathcal{G}}(x^{(i)}; \Theta) \right] + \log Z\\
&\text{(putting Eq.4 followed by trivial calculations, and}\\
&\text{ dividing loss by constant N doesn't affect optima)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\bigl[ E_{\mathcal{G}}(x; \Theta) \bigr] + \log Z
\end{align}\]
<p>Computing gradient w.r.t. parameters yields</p>
\[\begin{align}
\frac{\partial \mathcal{L}}{\partial \Theta} &= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \frac{\partial}{\partial \Theta} \log Z \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \frac{1}{Z} \frac{\partial}{\partial \Theta} \left[ \sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \exp(-E_{\mathcal{G}}) \right]\text{ (using definition of Z)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] + \sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} \underbrace{\frac{1}{Z} \exp(-E_{\mathcal{G}})}_{\text{RHS of Eq.4}} \cdot \frac{\partial (-E_{\mathcal{G}})}{\partial \Theta}\\
&\text{ (Both Z and the partial operator are independent}\\
&\text{ of x and can be pushed inside the summation)}\\ \\
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] - \underbrace{\sum_{\mathbf{x}'\in\mathrm{Dom}(\mathbf{X})} p(\mathbf{x}'; \Theta)}_{\text{expectation}} \cdot \frac{\partial E_{\mathcal{G}}}{\partial \Theta}\\
\tag{5}
&= \mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right] - \mathbb{E}_{x\sim p_{\Theta}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]
\end{align}\]
<p>Take a few minutes to digest Eq.5. That’s a very important result. It would be worth discussing it a bit further. The first term in Eq.5 is often known as the “Positive Phase” and the second term as the “Negative Phase”. The only difference between them, as you can see, is in the distributions over which the expectations are taken. The first expectation is over the <em>data distribution</em> - essentially picking up data from our dataset. The second expectation is over the <em>model distribution</em> - sampling from the model with current parameters. To understand their semantic interpretation, we need to see them in isolation. For the sake of explanation, consider the parameter update rules obtained from each term separately</p>
\[\Theta := \Theta - \lambda\cdot\mathbb{E}_{x\sim p_{\mathcal{D}}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]\text{, and }
\Theta := \Theta + \lambda\cdot\mathbb{E}_{x\sim p_{\Theta}}\left[ \frac{\partial E_{\mathcal{G}}}{\partial \Theta} \right]\]
<p>The first update rule changes the parameters so as to minimize the energy function at points <em>coming from data</em>. The second one tries to maximize (notice the difference in sign) the energy function at points <em>coming from the model</em>. The original update rule (combining both of them) has both of these effects working simultaneously. The minimum of the loss landscape occurs when our model discovers the data distribution, i.e. \(p_{\Theta} \approx p_{\mathcal{D}}\). At this point, the positive and negative phases are approximately equal and the gradient becomes zero (i.e., no more progress). Below is a clear picture of the update process. The algorithm <em>pushes the energy down</em> at places where the original data lies; it also <em>pulls the energy up</em> at places where the <em>model thinks</em> the original data lies.</p>
<center>
<figure>
<img width="95%" style="padding-top: 20px;" src="/public/posts_res/17/pos_neg_phase_diagram.png" />
<figcaption>Fig.3: (a) Model is being optimized. The arrows depict the "pulling up" and "pushing down" of energy landscape. (b) Model has converged to an optimum.</figcaption>
</figure>
</center>
<p>Whatever the interpretation, as I mentioned before, the denominator of \(p(\mathbf{x}; \Theta)\) (see Eq.4) is intractable in the general case, so computing the expectation in the negative phase is extremely hard. In fact, that is the only difficulty that makes this algorithm practically challenging.</p>
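<p>For a model small enough to enumerate, Eq.5 can be verified numerically. The sketch below uses a made-up one-parameter energy \(E(x; \theta) = -\theta x_1 x_2\) over two binary variables and a made-up dataset, then compares “positive phase minus negative phase” against a finite-difference gradient of the average NLL. None of this is specific to any library; it only illustrates the identity.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Tiny hypothetical EBM over two binary variables: E(x; θ) = -θ·x₁·x₂.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
energy(x, θ) = -θ * x[1] * x[2]
dE_dθ(x, θ)  = -x[1] * x[2]                 # ∂E/∂θ

# Exact model distribution (Eq.4 with τ = 1); feasible only because
# the state space has just four elements.
function model_probs(θ)
    w = [exp(-energy(x, θ)) for x in states]
    return w ./ sum(w)
end

# Average negative log-likelihood of a dataset.
function nll(θ, data)
    logZ = log(sum(exp(-energy(x, θ)) for x in states))
    return sum(energy(x, θ) for x in data) / length(data) + logZ
end

data = [(1, 1), (1, 1), (0, 1), (1, 1), (0, 0)]   # made-up samples
θ = 0.3

# Eq.5: expectation under the data minus expectation under the model.
positive = sum(dE_dθ(x, θ) for x in data) / length(data)
negative = sum(p * dE_dθ(x, θ) for (p, x) in zip(model_probs(θ), states))
grad_eq5 = positive - negative

# Central finite difference of the NLL as an independent check.
h = 1e-6
grad_fd = (nll(θ + h, data) - nll(θ - h, data)) / (2h)

@show grad_eq5 grad_fd
</code></pre></div></div>
<p>The two numbers should agree to several decimal places. In a realistic model the <code class="language-plaintext highlighter-rouge">negative</code> term cannot be computed by enumeration and has to be estimated from samples.</p>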
<h2 id="gibbs-sampling">Gibbs Sampling</h2>
<p>As we saw in the last section, the only difficulty we have in implementing Eq.5 is not being able to sample from an intractable density (Eq.4). It turns out, however, that the <em>conditional densities</em> of a small subset of variables given the others are indeed tractable in most cases. That is because, for conditionals, the \(Z\) cancels out. The conditional density of one variable (say \(X_j\)) given the others (let’s denote them by \(X_{-j}\)) is:</p>
\[\tag{6}
p(x_j\vert \mathbf{x}_{-j}) = \frac{p(\mathbf{x})}{p(\mathbf{x}_{-j})}
= \frac{\exp(-E_{\mathcal{G}}(x_j, \mathbf{x}_{-j}))}{\sum_{x_j'} \exp(-E_{\mathcal{G}}(x_j', \mathbf{x}_{-j}))}
\text{ (using Eq.4)}\]
<p>I excluded the parameter symbols for notational brevity. The summation in the denominator is not as scary as the one in Eq.4 - it’s only over one variable. We take advantage of this and wisely choose a sampling algorithm that uses conditional densities. It’s called <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs Sampling</a>. Well, I am not going to prove it. You either have to take my word for it OR read about it in the link provided. For the sake of this article, just believe that the following works.</p>
<p>To sample \(\mathbf{x}\sim p_{\Theta}(\mathbf{x})\), we iteratively execute the following for \(T\) iterations</p>
<ol>
<li>We have a sample from last iteration \(t-1\) as \(\mathbf{x}^{(t-1)}\)</li>
<li>We then pick one variable \(X_j\) (in some order) at a time and sample from its conditional given the others: \(x_j^{(t)}\sim p(x_j\vert \underbrace{x_1^{(t)}, \cdots, x_{j-1}^{(t)}}_{\text{current iteration}}, \underbrace{x_{j+1}^{(t-1)}, \cdots, x_D^{(t-1)}}_{\text{previous iteration}})\). Please note that once we sampled one variable, we fix its value to the latest, otherwise we keep the value from previous iteration.</li>
</ol>
<p>We can start this process by setting \(\mathbf{x}^{(0)}\) to anything. If \(T\) is sufficiently large, the samples towards the end are (approximately) samples from the density \(p_{\Theta}\). To understand it a bit more rigorously, I <strong>highly recommend</strong> <a href="https://en.wikipedia.org/wiki/Gibbs_sampling#Implementation">going through this</a>.
You might be curious as to why this algorithm has an iterative process. That’s because Gibbs sampling is an <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">MCMC family algorithm</a> which has something called a “Burn-in period”: the early samples still depend on the arbitrary starting point and are discarded.</p>
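<p>The two steps above can be sketched in a few lines. The energy function, the burn-in length and the number of samples below are all arbitrary choices made for illustration; the model is kept tiny (three binary variables) so that the Gibbs estimates can be compared with the exact probabilities obtained by enumeration.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Random

# Hypothetical energy over three binary variables.
energy(x) = -(1.2 * x[1] * x[2] - 0.8 * x[2] * x[3] + 0.5 * x[1])

# Eq.6: conditional probability that x_j = 1 given all other variables.
function conditional_on(x, j)
    x0 = copy(x); x0[j] = 0
    x1 = copy(x); x1[j] = 1
    w0, w1 = exp(-energy(x0)), exp(-energy(x1))
    return w1 / (w0 + w1)            # Z never appears
end

# One Gibbs sweep: resample every variable in turn, always conditioning
# on the most recent values of the others.
function gibbs_sweep!(x, rng)
    for j in eachindex(x)
        x[j] = rand(rng) ≤ conditional_on(x, j) ? 1 : 0
    end
    return x
end

rng = MersenneTwister(0)
x = [0, 0, 0]                        # arbitrary starting point
burn_in, n_samples = 1_000, 50_000
counts = Dict{Vector{Int}, Int}()
for t in 1:(burn_in + n_samples)
    gibbs_sweep!(x, rng)
    if t > burn_in                   # discard the burn-in period
        counts[copy(x)] = get(counts, copy(x), 0) + 1
    end
end

# Compare with the exact probabilities, available here by enumeration.
all_states = [[a, b, c] for a in 0:1 for b in 0:1 for c in 0:1]
Z = sum(exp(-energy(s)) for s in all_states)
for s in all_states
    exact = exp(-energy(s)) / Z
    empirical = get(counts, s, 0) / n_samples
    println(s, "  exact = ", round(exact; digits=3), "  gibbs = ", round(empirical; digits=3))
end
</code></pre></div></div>
<p>Note that <code class="language-plaintext highlighter-rouge">conditional_on</code> only ever evaluates the energy at two states, no matter how many variables the model has.</p>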
<p>Now that we have pretty much everything needed, let’s explore some popular models based on the general principles of EBMs.</p>
<h2 id="boltzmann-machine">Boltzmann Machine</h2>
<p>Boltzmann Machine (BM) is one particular model that has been in the literature for a long time. BM is the simplest one in its family and is used for modelling a binary random vector \(\mathbf{X}\in\{0,1\}^D\) with \(D\) components \([ X_1, X_2, \cdots, X_D ]\). All \(D\) RVs are connected to all others by an undirected graph \(\mathcal{G}\).</p>
<center>
<figure>
<img width="30%" style="padding-top: 20px;" src="/public/posts_res/17/bm_diagram.png" />
<figcaption>Fig.4: Undirected graph representing Boltzmann Machine</figcaption>
</figure>
</center>
<p>By design, BM has a fully connected graph and hence only one maximal clique (containing all RVs). The energy function used in BM is possibly the simplest one you can imagine:</p>
\[\tag{7}
E_{\mathcal{G}}(\mathbf{x}; W) = - \frac{1}{2} \mathbf{x}^T W \mathbf{x}\]
<p>Upon expanding the vectorized form (the reader is encouraged to try), we can see each term \(x_i\cdot W_{ij}\cdot x_j\) for all \(i\lt j\) as the contribution of the pair of RVs \((X_i, X_j)\) to the whole energy function. \(W_{ij}\) is the “connection strength” between them. If a pair of RVs \((X_i, X_j)\) turn on together more often, a high value of \(W_{ij}\) would encourage reducing the total energy. So by means of learning, we expect to see \(W_{ij}\) going up if \((X_i, X_j)\) fire together. This phenomenon is the founding idea of a closely related learning strategy called <a href="https://en.wikipedia.org/wiki/Hebbian_theory">Hebbian Learning</a>, proposed by Donald Hebb. Hebbian theory basically says:</p>
<blockquote>
<p>If fire together, then wire together</p>
</blockquote>
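<p>The expansion of Eq.7 into pairwise terms is easy to check numerically. The sketch below assumes that \(W\) is symmetric with a zero diagonal (the usual convention for BMs, though not spelled out above), so that the factor \(\frac{1}{2}\) leaves each pair counted exactly once. The function names are hypothetical.</p>

```python
import numpy as np

def bm_energy(x, W):
    """Vectorized BM energy of Eq.7: -1/2 * x^T W x."""
    return -0.5 * x @ W @ x

def bm_energy_pairwise(x, W):
    """The same energy written as a sum over unordered pairs (i, j)."""
    D = len(x)
    return -sum(x[i] * W[i, j] * x[j] for i in range(D) for j in range(i + 1, D))

rng = np.random.default_rng(0)
D = 5
W = np.triu(rng.normal(size=(D, D)), k=1)   # keep the strictly upper triangle
W = W + W.T                                 # symmetric, zero diagonal (assumed)
x = rng.integers(0, 2, size=D)              # a random binary configuration

e_vec = bm_energy(x, W)
e_pair = bm_energy_pairwise(x, W)           # agrees with e_vec up to rounding
```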
<p>How do we learn this model then? We have already seen the general way of computing the gradient. We have \(\displaystyle{ \frac{\partial E_{\mathcal{G}}}{\partial W} = -\mathbf{x}\mathbf{x}^T }\). So let’s use Eq.5 to derive a learning rule:</p>
\[W := W - \lambda \cdot \left( \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{x}^T ] - \mathbb{E}_{\mathbf{x}\sim \mathrm{Gibbs}(p_{W})}[ -\mathbf{x}\mathbf{x}^T ] \right)\]
<p>Equipped with Gibbs sampling, it is pretty easy to implement now. But my description of the Gibbs sampling algorithm was very general. We have to specialize it for implementing BM. Remember that conditional density we talked about (Eq.6) ? For the specific energy function of BM (Eq.7), that has a very convenient and tractable form:</p>
\[p(x_j = 1\vert \mathbf{x}_{-j}; W) = \sigma\left(W_{-j}^T\cdot \mathbf{x}_{-j}\right)\]
<p>where \(\sigma(\cdot)\) is the Sigmoid function and \(W_{-j}\in\mathbb{R}^{D-1}\) denotes the vector of parameters connecting \(x_j\) with the rest of the variables \(\mathbf{x}_{-j}\in\mathbb{R}^{D-1}\). I am leaving the proof for the readers; it’s not hard, maybe a bit lengthy [Hint: Just put the BM energy function in Eq.6 and keep simplifying]. This particular form makes the nodes behave somewhat like computation units (i.e., neurons) as shown in Fig.5 below:</p>
<center>
<figure>
<img width="25%" style="padding-top: 20px;" src="/public/posts_res/17/bm_conditional.png" />
<figcaption>Fig.5: The computational view of BM showing its dependencies by arrows.</figcaption>
</figure>
</center>
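<p>Before moving on, here is a rough sketch of how the pieces of this section fit together: the sigmoid conditional, a brute-force check of it against Eq.6, one Gibbs sweep, and one step of the learning rule. It is an illustration rather than a reference implementation. All function names are hypothetical, and \(W\) is assumed to be symmetric with a zero diagonal.</p>

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bm_energy(x, W):
    return -0.5 * x @ W @ x                      # Eq.7

def bm_conditional(x, W, j):
    """p(x_j = 1 | x_{-j}) = sigmoid(W_{-j}^T x_{-j})."""
    mask = np.arange(len(x)) != j                # every index except j
    return sigmoid(W[j, mask] @ x[mask])

def bm_conditional_bruteforce(x, W, j):
    """The same quantity computed directly from Eq.6."""
    x1, x0 = x.copy(), x.copy()
    x1[j], x0[j] = 1, 0
    num = np.exp(-bm_energy(x1, W))
    return num / (num + np.exp(-bm_energy(x0, W)))

def bm_gibbs_sweep(x, W, rng):
    """One full Gibbs sweep over all D units, updating x in place."""
    for j in range(len(x)):
        x[j] = int(bm_conditional(x, W, j) > rng.random())
    return x

def bm_update(W, data, model_samples, lr):
    """One step of the learning rule; rows of both arrays are binary vectors."""
    pos = data.T @ data / len(data)                              # E_data[x x^T]
    neg = model_samples.T @ model_samples / len(model_samples)   # E_model[x x^T]
    W_new = W + lr * (pos - neg)
    np.fill_diagonal(W_new, 0.0)                 # keep "no self-connection" (assumed)
    return W_new

rng = np.random.default_rng(1)
D = 4
W = np.triu(rng.normal(size=(D, D)), k=1)
W = W + W.T
x = rng.integers(0, 2, size=D)
x_next = bm_gibbs_sweep(x.copy(), W, rng)
```

<p>In a training loop one would collect many <code class="language-plaintext highlighter-rouge">bm_gibbs_sweep</code> outputs (after burn-in) as <code class="language-plaintext highlighter-rouge">model_samples</code> and pass them, together with a batch of data vectors, to <code class="language-plaintext highlighter-rouge">bm_update</code>.</p>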
<h2 id="boltzmann-machine-with-latent-variables">Boltzmann Machine with latent variables</h2>
<p>To add more expressiveness to the model, we can introduce latent/hidden variables. They are not observed, but help in <em>explaining</em> the visible variables (see my <a href="/blog-tut/2019/11/20/inference-in-pgm.html">Directed PGM</a> article). However, all variables are still fully connected to each other (shown below in Fig.6(a)).</p>
<p><strong>[ A little disclaimer that as we have already covered a lot of founding ideas, I am going to go over this a bit faster. You may have to look back and find analogies with our previous formulations ]</strong></p>
<center>
<figure>
<img width="70%" style="padding-top: 20px;" src="/public/posts_res/17/hbm_diagram.png" />
<figcaption>Fig.6: (a) Undirected graph of BM with hidden units (shaded ones are visible). (b) Computational view of the model while computing conditionals. </figcaption>
</figure>
</center>
<p>Suppose we have \(K\) hidden units and \(D\) visible ones. The energy function is defined very similarly to that of the normal BM. Now it contains separate terms for visible-hidden (\(W\in\mathbb{R}^{D\times K}\)), visible-visible (\(V\in\mathbb{R}^{D\times D}\)) and hidden-hidden (\(U\in\mathbb{R}^{K\times K}\)) interactions. We compactly represent them as \(\Theta \triangleq \{ W, U, V \}\).</p>
\[E_{\mathcal{G}}(\mathbf{x}, \mathbf{h}; \Theta) = -\mathbf{x}^T W \mathbf{h} - \frac{1}{2} \mathbf{x}^T V \mathbf{x} - \frac{1}{2} \mathbf{h}^T U \mathbf{h}\]
<p>The motivation for such an energy function is very similar to that of the original BM. However, our probabilistic form of the model is no longer Eq.4, but the marginalized joint distribution.</p>
\[p(\mathbf{x}; \Theta) = \sum_{\mathbf{h}\in\mathrm{Dom}(\mathbf{H})} p(\mathbf{x}, \mathbf{h}; \Theta)
= \sum_{\mathbf{h}\in\mathrm{Dom}(\mathbf{H})} \frac{\exp(-E_{\mathcal{G}}(\mathbf{x}, \mathbf{h}))}{\sum_{\mathbf{x}',\mathbf{h}'\in\mathrm{Dom}(\mathbf{X}, \mathbf{H})} \exp(-E_{\mathcal{G}}(\mathbf{x}', \mathbf{h}'))}\]
<p>It might look a bit scary, but it’s just marginalized over the hidden state-space. Very surprisingly though, the conditionals have forms pretty similar to those of the original BM:</p>
\[\begin{align}
p(h_k = 1\vert \mathbf{x}, \mathbf{h}_{-k}) = \sigma( W_{[:,k]}\cdot\mathbf{x} + U_{-k}\cdot\mathbf{h}_{-k} ) \\
p(x_j = 1\vert \mathbf{h}, \mathbf{x}_{-j}) = \sigma( W_{[j,:]}\cdot\mathbf{h} + V_{-j}\cdot\mathbf{x}_{-j} )
\end{align}\]
<p>Hopefully the notations are clear. If they are not, try comparing with the ones we used before. I recommend the reader to try proving it as an exercise. The diagram in Fig.6(b) hopefully adds a bit more clarity. It shows a similar computation graph for the conditionals we saw before in Fig.5.</p>
<p>Coming to the gradients, they also take forms similar to the original BM; the only difference is that now we have more parameters:</p>
\[\begin{align}
W &:= W - \lambda \cdot \left( \mathbb{E}_{\mathbf{x,h}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{h}^T ] - \mathbb{E}_{\mathbf{x,h}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{x}\mathbf{h}^T ] \right)\\
V &:= V - \lambda \cdot \left( \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{D}}}[ -\mathbf{x}\mathbf{x}^T ] - \mathbb{E}_{\mathbf{x}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{x}\mathbf{x}^T ] \right)\\
U &:= U - \lambda \cdot \left( \mathbb{E}_{\mathbf{h}\sim p_{\mathcal{D}}}[ -\mathbf{h}\mathbf{h}^T ] - \mathbb{E}_{\mathbf{h}\sim \mathrm{Gibbs}(p_{\Theta})}[ -\mathbf{h}\mathbf{h}^T ] \right)
\end{align}\]
<p>If you are paying attention, you might notice something strange: how do we compute the terms \(\mathbb{E}_{\mathbf{h}\sim p_{\mathcal{D}}}\) (in the positive phase)? We don’t have hidden vectors in our dataset, right? Actually, we do have visible vectors \(\mathbf{x}^{(i)}\) in the dataset and we can get an approximate <em>complete data</em> (visible plus hidden) density as</p>
\[p_{\mathcal{D}}(\mathbf{x}^{(i)}, \mathbf{h}) = p_{\mathcal{D}}(\mathbf{x}^{(i)}) \cdot p_{\Theta}(\mathbf{h}\vert \mathbf{x}^{(i)})\]
<p>Basically, we sample a visible vector from our dataset and use the conditional to sample a hidden vector. We fix the visible vector and then sample the hidden vector one component at a time (using Gibbs sampling).</p>
<p>For jointly sampling a visible and hidden vector from the model (for the negative phase), we also use Gibbs sampling just as before. We sample all of the visible and hidden RVs component by component, starting the iteration from any random values. <strong>There is a clever hack though</strong>. We can start the Gibbs iteration by fixing the visible vector to a real data point from our dataset (and not anything random). It turns out this is extremely useful and efficient for getting samples quickly from the model distribution. This algorithm is famously known as “<a href="https://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf">Contrastive Divergence</a>” and has long been used in practical implementations.</p>
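<p>Both phases can be sketched as follows. This is a simplified illustration, not a tuned implementation: the function names and the number of sweeps are made up, and \(U\) and \(V\) are assumed to be symmetric with zero diagonals so that a unit never feeds into its own conditional.</p>

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_hidden_clamped(x, h, W, U, rng, sweeps=10):
    """Positive phase: keep x fixed and Gibbs-sample h one unit at a time."""
    h = h.copy()
    for _ in range(sweeps):
        for k in range(len(h)):
            # zero diagonal of U (assumed) means U[k] @ h ignores h_k itself
            p = sigmoid(W[:, k] @ x + U[k] @ h)
            h[k] = int(p > rng.random())
    return h

def sample_joint(x, h, W, U, V, rng, sweeps=10):
    """Negative phase: Gibbs-sample every visible and hidden unit."""
    x, h = x.copy(), h.copy()
    for _ in range(sweeps):
        for j in range(len(x)):
            p = sigmoid(W[j, :] @ h + V[j] @ x)
            x[j] = int(p > rng.random())
        for k in range(len(h)):
            p = sigmoid(W[:, k] @ x + U[k] @ h)
            h[k] = int(p > rng.random())
    return x, h

rng = np.random.default_rng(0)
D, K = 4, 3
W = rng.normal(size=(D, K))
U, V = np.zeros((K, K)), np.zeros((D, D))       # no lateral connections in this demo
x_data = rng.integers(0, 2, size=D)             # stands in for one dataset vector
h_pos = sample_hidden_clamped(x_data, np.zeros(K, dtype=int), W, U, rng)
# Contrastive-Divergence-style start: begin the chain at the data, not at noise
x_neg, h_neg = sample_joint(x_data, h_pos, W, U, V, rng, sweeps=1)
```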
<h2 id="restricted-boltzmann-machine-rbm">“Restricted” Boltzmann Machine (RBM)</h2>
<p>Here comes the all important RBM, which is probably one of the most famous energy based models of all time. But, guess what, I am not going to describe it bit by bit. We have already covered enough that we can quickly build on top of them.</p>
<p>RBM is basically the same as a Boltzmann Machine with hidden units but with <em>one big difference</em> - it doesn’t have visible-visible AND hidden-hidden interactions, i.e.</p>
\[U = \mathbf{0}, V = \mathbf{0}\]
<p>If you do just that, Boooom ! You get RBMs (see its graphical diagram in Fig.7(a)). It makes the formulation much simpler. I am leaving it entirely for the reader to do the majority of the math. Just get rid of \(U\) and \(V\) from all our formulations in the last section and you are done. Fig.7(b) shows the computational view of RBM while computing conditionals.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/17/rbm_diag_and_cond.png" />
<figcaption>Fig.7: (a) Graphical diagram of RBM. (b) Arrows just show computation deps</figcaption>
</figure>
</center>
<p>Let me point out one nice consequence of this model: given the hidden nodes, the conditional for each visible node is independent of the other visible nodes, and the same is true for the hidden nodes given the visible ones.</p>
\[\begin{align}
p(h_k = 1\vert \mathbf{x}) = \sigma( W_{[:,k]}\cdot\mathbf{x} )\\
p(x_j = 1\vert \mathbf{h}) = \sigma( W_{[j,:]}\cdot\mathbf{h} )
\end{align}\]
<p>That means they can be computed in parallel</p>
\[\begin{align}
p(\mathbf{h}\vert \mathbf{x}) = \prod_{k=1}^K p(h_k\vert \mathbf{x}) = \sigma( W^T\cdot\mathbf{x} )\\
p(\mathbf{x}\vert \mathbf{h}) = \prod_{j=1}^D p(x_j\vert \mathbf{h}) = \sigma( W\cdot\mathbf{h} )
\end{align}\]
<p>Moreover, the Gibbs sampling steps become super easy to compute. We just have to iterate the following steps:</p>
<ol>
<li>Sample a hidden vector \(\mathbf{h}^{(t)}\sim p(\mathbf{h}\vert \mathbf{x}^{(t-1)})\)</li>
<li>Sample a visible vector \(\mathbf{x}^{(t)}\sim p(\mathbf{x}\vert \mathbf{h}^{(t)})\)</li>
</ol>
<p>This makes RBM an attractive choice for practical implementation.</p>
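<p>The two alternating steps, combined with the Contrastive Divergence trick of starting the chain at the data, can be sketched in NumPy as below. This is a bare-bones illustration under a few assumptions: no bias terms (as in the formulation above), \(W\in\mathbb{R}^{D\times K}\), a single Gibbs iteration per update (commonly called CD-1), and made-up function names.</p>

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_x(X, W, rng):
    """Sample all hidden units in parallel; X is (N, D), W is (D, K)."""
    p = sigmoid(X @ W)
    return (p > rng.random(p.shape)).astype(float)

def sample_x_given_h(H, W, rng):
    """Sample all visible units in parallel; H is (N, K)."""
    p = sigmoid(H @ W.T)
    return (p > rng.random(p.shape)).astype(float)

def cd1_update(X, W, lr, rng):
    """One CD-1 step: the Gibbs chain starts at the data and runs one iteration."""
    H0 = sample_h_given_x(X, W, rng)     # positive phase: h ~ p(h | x_data)
    X1 = sample_x_given_h(H0, W, rng)    # one block-Gibbs iteration ...
    H1 = sample_h_given_x(X1, W, rng)    # ... yields the negative-phase sample
    pos = X.T @ H0 / len(X)              # estimate of E_data[x h^T]
    neg = X1.T @ H1 / len(X)             # estimate of E_model[x h^T]
    return W + lr * (pos - neg)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(8, 6)).astype(float)   # toy binary "dataset"
W = 0.01 * rng.normal(size=(6, 3))
W_new = cd1_update(X, W, lr=0.1, rng=rng)
```

<p>Because every unit in a layer is sampled with one matrix product, a whole mini-batch can be processed at once, which is exactly what makes RBMs cheap compared to the fully connected BM.</p>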
<hr />
<p>Whoahh ! That was a heck of an article. I encourage everyone to try working out the RBM math more rigorously by themselves and also implement it in a familiar framework. Alright, that’s all for this article.</p>
<h4 id="references">References</h4>
<ol>
<li><a href="https://www.cs.toronto.edu/~hinton/csc321/readings/boltz321.pdf">Boltzmann Machine, by G. Hinton, 2007</a></li>
<li><a href="https://www.crim.ca/perso/patrick.kenny/BMNotes.pdf">Notes on Boltzmann Machine, by Patrick Kenny</a></li>
<li><a href="http://deeplearning.net/tutorial/rbm.html">deeplearning.net documentation</a></li>
<li><a href="https://www.youtube.com/watch?v=2fRnHVVLf1Y&list=PLiPvV5TNogxKKwvKb1RKwkq2hm7ZvpHz0">Hinton’s coursera course</a></li>
<li><a href="https://www.deeplearningbook.org/">Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville</a></li>
</ol>Ayan DasWe talked extensively about Directed PGMs in my earlier article and also described one particular model following the principles of Variational Inference (VI). There exists another class of models conveniently represented by Undirected Graphical Models which are practiced relatively less than modern methods of Deep Learning (DL) in the research community. They are also characterized as Energy Based Models (EBM) because, as we shall see, they rely on something called Energy Functions. In the early days of this Deep Learning renaissance, we discovered a few extremely powerful models which helped DL to gain momentum. The class of models we are going to discuss has far more theoretical support than modern day Deep Learning, which as we know, largely relied on intuition and trial-and-error. In this article, I will introduce you to the general concept of Energy Based Models (EBMs), their difficulties and how we can get over them. Also, we will look at a specific family of EBM known as Boltzmann Machines (BM) which are very well known in the literature.Pixelor: A Competitive Sketching AI Agent. So you think you can sketch?2020-07-30T00:00:00+00:002020-07-30T00:00:00+00:00https://ayandas.me/pubs/2020/07/30/pub-8<center>
<a target="_blank" class="pubicon" href="https://dl.acm.org/doi/pdf/10.1145/3414685.3417840">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper
</a>
<a target="_blank" class="pubicon" href="https://dl.acm.org/doi/abs/10.1145/3414685.3417840">
<i class="fa fa-files-o fa-3x"></i>Suppl.
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="3">Demo</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="4">Code/Data</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/8.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">We present the first competitive drawing agent Pixelor that exhibits human-level performance at a Pictionary-like sketching game, where the participant whose sketch is recognized first is a winner. Our AI agent can autonomously sketch a given visual concept, and achieve a recognizable rendition as quickly or faster than a human competitor. The key to victory for the agent is to learn the optimal stroke sequencing strategies that generate the most recognizable and distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the optimal stroke order that maximizes early recognizability of human training sketches. Second, this order is used to supervise the training of a sequence-to-sequence stroke generator. Our key technical contributions are a tractable search of the exponential space of orderings using neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an optimal-transport loss to accommodate the multi-modal nature of the optimal stroke distribution. Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game, under both AI and human judging of early recognition. To analyze the impact of human competitors’ strategies, we conducted a further human study with participants being given unlimited thinking time and training in early recognizability by feedback from an AI judge. The study shows that humans do gradually improve their strategies with training, but overall Pixelor still matches human performance.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
<center>
<!-- Visit this to create the oEmbed link -->
<!-- https://iframely.com/domains/speakerdeck -->
<h2>Slides for my SIGGRAPH Asia 2020 talk</h2>
<div style="left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.1972%;">
<iframe src="https://speakerdeck.com/player/159a510c082643ea89a012555fdfcc67" style="top: 0; left: 5%; width: 90%; height: 90%; position: absolute; border: 0;" allowfullscreen="" scrolling="no" allow="encrypted-media">
</iframe>
</div>
<p>PS: Reusing any of these slides would require permission from the author.</p>
</center>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<h2>Full talk (15 mins) at SIGGRAPH Asia 2020</h2>
<iframe width="800" height="450" src="https://www.youtube.com/embed/oSk2x5HuCA8" frameborder="1" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<br />
<h4>Watch a <a href="https://www.youtube.com/watch?v=E_Aclms4g-w" target="_blank">short summary</a> video instead</h4>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="3">
<div class="accordion-item__container">
<h2><a href="http://surrey.ac:9999/">Try out the Demo</a> (screenshot below)</h2>
<figure>
<img width="75%" src="/public/pub_res/8_2.gif" />
</figure>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="4">
<div class="accordion-item__container">
<center>
<a target="_blank" class="pubicon" href="https://github.com/dasayan05/neuralsort-siggraph">
<i class="fa fa-file-code-o fa-3x"></i>NeuralSort repo
</a>
<a target="_blank" class="pubicon" href="https://github.com/AyanKumarBhunia/sketch-transformerMMD">
<i class="fa fa-file-code-o fa-3x"></i>Transformer MMD repo
</a>
<br />
<h2>The "SlowSketch" dataset</h2>
<img border="2px" width="80%" src="/public/pub_res/8_3.png" alt="SlowSketch" />
<a target="_blank" class="pubicon" href="https://drive.google.com/u/0/uc?export=download&confirm=n4LZ&id=1mWEY7vFkOw790DwUtqcTX8fHzNBP_b1J">
<i class="fa fa-database fa-3x"></i>SlowSketch
</a>
</center>
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{pixelor20siga,
author = {Bhunia, Ayan Kumar and Das, Ayan and Muhammad, Umar Riaz and Yang, Yongxin and Hospedales, Timothy M. and Xiang, Tao and Gryaditskaya, Yulia and Song, Yi-Zhe},
title = {Pixelor: A Competitive Sketching AI Agent. so You Think You Can Sketch?},
year = {2020},
publisher = {Association for Computing Machinery},
volume = {39},
number = {6},
journal = {ACM Trans. Graph.},
articleno = {166},
numpages = {15}
}
</code></pre></div></div>Ayan Kumar BhuniaPaper Suppl.rlx: A modular Deep RL library for research2020-06-27T00:00:00+00:002020-06-27T00:00:00+00:00https://ayandas.me/projs/2020/06/27/rlx-deep-rl-library<p><code class="language-plaintext highlighter-rouge">rlx</code> is a Deep RL library written on top of PyTorch and built for <em>educational and research</em> purposes. The majority of libraries/codebases for Deep RL are geared more towards reproducing state-of-the-art algorithms on very specific tasks (e.g. Atari games), but <code class="language-plaintext highlighter-rouge">rlx</code> is NOT. It is supposed to be more expressive and modular. Rather than treating RL algorithms as black boxes, <code class="language-plaintext highlighter-rouge">rlx</code> adopts an API that exposes more granular operations to the user, which makes writing new algorithms easier. It is also useful for adding task-specific engineering to a known algorithm (RL is known to be very sensitive to small implementation details).</p>
<p><a href="https://github.com/dasayan05/rlx" target="_blank" class="fa fa-github fa-3x" style="float: right;"></a></p>
<p>Concisely, <code class="language-plaintext highlighter-rouge">rlx</code> is supposed to</p>
<ol>
<li>Be generic (i.e., can be adopted for any task at hand)</li>
<li>Have modular lower-level components exposed to users</li>
<li>Be easy to implement new algorithms</li>
</ol>
<p>For the sake of completeness, it also provides a few popular algorithms as baselines (more to be added soon). Here’s a basic example of a PPO (with clipping) implementation with <code class="language-plaintext highlighter-rouge">rlx</code></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">base_rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">episode</span><span class="p">(</span><span class="n">horizon</span><span class="p">)</span> <span class="c1"># sample an episode as a 'Rollout' object
# 'rewards' and 'logprobs' for all timesteps
</span><span class="n">base_rewards</span><span class="p">,</span> <span class="n">base_logprobs</span> <span class="o">=</span> <span class="n">base_rollout</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">base_rollout</span><span class="p">.</span><span class="n">logprobs</span>
<span class="n">base_returns</span> <span class="o">=</span> <span class="n">base_rollout</span><span class="p">.</span><span class="n">mc_returns</span><span class="p">()</span> <span class="c1"># Monte-carlo estimates of 'returns'
</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k_epochs</span><span class="p">):</span>
    <span class="c1"># 'evaluate' an episode against a policy and get a new 'Rollout' object
</span>    <span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">evaluate</span><span class="p">(</span><span class="n">base_rollout</span><span class="p">)</span>
    <span class="n">logprobs</span><span class="p">,</span> <span class="n">entropy</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">logprobs</span><span class="p">,</span> <span class="n">rollout</span><span class="p">.</span><span class="n">entropy</span> <span class="c1"># get 'logprobs' and 'entropy' for all timesteps
</span>    <span class="n">values</span><span class="p">,</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">others</span> <span class="c1"># .. also 'value' estimates
</span>
    <span class="n">ratios</span> <span class="o">=</span> <span class="p">(</span><span class="n">logprobs</span> <span class="o">-</span> <span class="n">base_logprobs</span><span class="p">.</span><span class="n">detach</span><span class="p">()).</span><span class="n">exp</span><span class="p">()</span>
    <span class="n">advantage</span> <span class="o">=</span> <span class="n">base_returns</span> <span class="o">-</span> <span class="n">values</span>
    <span class="n">policyloss</span> <span class="o">=</span> <span class="o">-</span> <span class="n">torch</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">ratios</span> <span class="o">*</span> <span class="n">advantage</span><span class="p">.</span><span class="n">detach</span><span class="p">(),</span> <span class="n">torch</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="n">ratios</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">clip</span><span class="p">,</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">clip</span><span class="p">)</span> <span class="o">*</span> <span class="n">advantage</span><span class="p">.</span><span class="n">detach</span><span class="p">())</span>
    <span class="n">valueloss</span> <span class="o">=</span> <span class="n">advantage</span><span class="p">.</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">policyloss</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">valueloss</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">-</span> <span class="n">entropy</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">*</span> <span class="mf">0.01</span>
    <span class="n">agent</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">agent</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
</code></pre></div></div>
<p>This is all you have to write to get PPO running.</p>
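<p>For reference, the clipped surrogate objective that PPO maximizes, \(\min(r\cdot A, \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)\cdot A)\), can be written for a single timestep in plain Python. This sketch is independent of <code class="language-plaintext highlighter-rouge">rlx</code>; the helper name <code class="language-plaintext highlighter-rouge">ppo_clip_objective</code> and the default clip value are assumptions made for illustration.</p>

```python
import math

def ppo_clip_objective(logprob_new, logprob_old, advantage, clip=0.2):
    """Per-timestep clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logprob_new - logprob_old)          # importance ratio r
    clipped = max(1.0 - clip, min(ratio, 1.0 + clip))    # r clipped to [1 - eps, 1 + eps]
    return min(ratio * advantage, clipped * advantage)

# r = 1.5 with a positive advantage: the gain is capped at (1 + clip) * A
gain = ppo_clip_objective(math.log(1.5), 0.0, advantage=2.0)
# r = 1.5 with a negative advantage: the unclipped, more pessimistic term is kept
penalty = ppo_clip_objective(math.log(1.5), 0.0, advantage=-2.0)
```

<p>The two cases show why the advantage has to be multiplied in before taking the minimum: the sign of the advantage decides which of the two terms is the smaller one.</p>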
<h2 id="design-and-api">Design and API</h2>
<p>The user needs to provide a parametric function that defines the computation at <em>each time-step</em> and follows a specific signature (i.e., <code class="language-plaintext highlighter-rouge">rlx.Parametric</code>). <code class="language-plaintext highlighter-rouge">rlx</code> will take care of the rest, e.g., tying them up to form full rollouts, preserving recurrence (it works seamlessly with recurrent policies) etc.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PolicyValueModule</span><span class="p">(</span><span class="n">rlx</span><span class="p">.</span><span class="n">Parametric</span><span class="p">):</span>
    <span class="s">""" Recurrent policy network with state-value (baseline) prediction """</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">states</span><span class="p">):</span>
        <span class="c1"># Recurrent state from the last time-step will come in automatically
</span>        <span class="n">recur_state</span><span class="p">,</span> <span class="n">state</span> <span class="o">=</span> <span class="n">states</span>
        <span class="p">...</span>
        <span class="n">action1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">Normal</span><span class="p">(...)</span>
        <span class="n">action2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(...)</span>
        <span class="n">state_value</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_value_net</span><span class="p">(...)</span>
        <span class="k">return</span> <span class="n">next_recur_state</span><span class="p">,</span> <span class="n">rlx</span><span class="p">.</span><span class="n">ActionDistribution</span><span class="p">(</span><span class="n">action1</span><span class="p">,</span> <span class="n">action2</span><span class="p">,</span> <span class="p">...),</span> <span class="n">state_value</span>
<span class="n">network</span> <span class="o">=</span> <span class="n">PolicyValueModule</span><span class="p">(...)</span>
</code></pre></div></div>
<p>While the <code class="language-plaintext highlighter-rouge">next_recur_state</code> and <code class="language-plaintext highlighter-rouge">state_value</code> are optional (i.e., can be <code class="language-plaintext highlighter-rouge">None</code>), a multi-component action distribution needs to be returned. <code class="language-plaintext highlighter-rouge">rlx</code> will take care of sampling from it and computing log-probabilities. The first two return values are mandatory (though the recurrent state may be <code class="language-plaintext highlighter-rouge">None</code>); the rest are optional. You can return any number of quantities after the first two as <em>extras</em> - they will all be tracked.</p>
<hr />
<p>The design is centered around the primary data structure <code class="language-plaintext highlighter-rouge">Rollout</code> which can hold a sequence of experience tuples <code class="language-plaintext highlighter-rouge">(state, action, reward)</code>, action distributions and any arbitrary quantity returned from the <code class="language-plaintext highlighter-rouge">rlx.Parametric.forward()</code>. <code class="language-plaintext highlighter-rouge">Rollout</code> internally keeps track of the computation graph (if necessary/requested). One has to sample a <code class="language-plaintext highlighter-rouge">Rollout</code> instance by running the agent in the environment. The rollout can then provide quantities like log-probs and anything else that was tracked, upon request.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">set_grad_enabled</span><span class="p">(...):</span>
    <span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">network</span><span class="p">).</span><span class="n">episode</span><span class="p">(...,</span> <span class="n">dry</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">rollout</span><span class="p">.</span><span class="n">mc_returns</span><span class="p">()</span> <span class="c1"># populate its 'returns' property to naive Monte-Carlo returns
</span>    <span class="n">logprobs</span><span class="p">,</span> <span class="n">returns</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">logprobs</span><span class="p">,</span> <span class="n">rollout</span><span class="p">.</span><span class="n">returns</span>
    <span class="n">values</span><span class="p">,</span> <span class="o">=</span> <span class="n">rollout</span><span class="p">.</span><span class="n">others</span> <span class="c1"># any 'extra' quantity computed will be available as rollout.others
</span></code></pre></div></div>
<p>We can enable/disable gradients by the pytorch way (i.e., <code class="language-plaintext highlighter-rouge">torch.set_grad_enabled(..)</code> etc.).</p>
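<p>To make “naive Monte-Carlo returns” concrete: the return at step \(t\) is the (discounted) sum of rewards from \(t\) onward. The sketch below is NOT the <code class="language-plaintext highlighter-rouge">rlx</code> implementation; it is a standalone, hypothetical function, and the discount factor <code class="language-plaintext highlighter-rouge">gamma</code> is an assumed parameter.</p>

```python
def mc_returns(rewards, gamma=0.99):
    """Discounted reward-to-go G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running    # fold in the return of the next step
        returns.append(running)
    return returns[::-1]                 # restore chronological order

returns = mc_returns([1.0, 0.0, 2.0], gamma=0.5)   # -> [1.5, 1.0, 2.0]
```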
<p>The flag <code class="language-plaintext highlighter-rouge">dry=True</code> means the rollout instance will only hold <code class="language-plaintext highlighter-rouge">(state, action, reward)</code> tuples and nothing else. This design allows the rollouts to be re-evaluated against another policy - as required by some algorithms (like PPO). Dry rollouts cannot offer logprobs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 'rollout' is not dry, it has computation graph attached
</span><span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">other_policy</span><span class="p">).</span><span class="n">evaluate</span><span class="p">(</span><span class="n">dry_rollout</span><span class="p">)</span>
</code></pre></div></div>
<p>This API has another benefit. One can sample an episode from a policy in dry-mode, then <code class="language-plaintext highlighter-rouge">.vectorize()</code> it and re-evaluate it against the same policy. This brings computational benefits.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="n">dry_rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">episode</span><span class="p">(...,</span> <span class="n">dry</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">dry_rollout_vec</span> <span class="o">=</span> <span class="n">dry_rollout</span><span class="p">.</span><span class="n">vectorize</span><span class="p">()</span> <span class="c1"># internally creates a batch dimension for efficient processing
</span><span class="n">rollout</span> <span class="o">=</span> <span class="n">agent</span><span class="p">(</span><span class="n">policy</span><span class="p">).</span><span class="n">evaluate</span><span class="p">(</span><span class="n">dry_rollout_vec</span><span class="p">)</span>
</code></pre></div></div>
<p>If the rollout is not dry and gradients were enabled, one can directly do a backward pass:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">rollout</span><span class="p">.</span><span class="n">logprobs</span> <span class="o">*</span> <span class="n">rollout</span><span class="p">.</span><span class="n">returns</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
<span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
</code></pre></div></div>
<hr />
<p>As you might have noticed, the network is not a part of the agent. In fact, the agent only has a copy of the environment and nothing else. One needs to <em>augment</em> the agent with a network in order for it to sample episodes. This design allows us to easily run the agent using a different policy, for example, a “behavior policy” in off-policy RL:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>behaviour_rollout = agent(behavior_policy).episode(...)
behaviour_logprobs = behaviour_rollout.logprobs # record them for computing importance ratio afterwards
</code></pre></div></div>
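<p>To make the "importance ratio" remark concrete, here is a minimal sketch in plain Python. It does not use the <code class="language-plaintext highlighter-rouge">rlx</code> API: the helper <code class="language-plaintext highlighter-rouge">importance_ratios</code> and the numbers are hypothetical, and the only assumption is that we hold per-step log-probabilities of the same actions under both the target policy and the behavior policy.</p>

```python
import math

def importance_ratios(target_logprobs, behaviour_logprobs):
    """Per-step ratios pi(a|s) / b(a|s), computed from log-probabilities."""
    return [math.exp(lp_t - lp_b)
            for lp_t, lp_b in zip(target_logprobs, behaviour_logprobs)]

# Hypothetical log-probabilities of the same three actions under each policy
behaviour_logprobs = [math.log(0.5), math.log(0.25), math.log(0.8)]
target_logprobs = [math.log(0.25), math.log(0.5), math.log(0.8)]

# A ratio below 1 down-weights a step, a ratio above 1 up-weights it
ratios = importance_ratios(target_logprobs, behaviour_logprobs)
print(ratios)
```

<p>Working with the difference of log-probabilities, rather than dividing probabilities, is the usual choice because it stays numerically stable when probabilities are tiny.</p>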
<hr />
<p><code class="language-plaintext highlighter-rouge">Rollout</code> has a nice API which is useful for writing customized algorithms or implementation tricks. We can</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># shuffle rollouts ..
</span><span class="n">rollout</span><span class="p">.</span><span class="n">shuffle</span><span class="p">()</span>
<span class="c1"># .. index/slice them
</span><span class="n">rollout</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># remove the end-state
</span><span class="n">rollout</span><span class="p">[:</span><span class="mi">100</span><span class="p">]</span> <span class="c1"># recurrent rollouts can be too long (RNNs have long-term memory problems)
</span>
<span class="c1"># .. or even concat them
</span><span class="p">(</span><span class="n">rollout1</span> <span class="o">+</span> <span class="n">rollout2</span><span class="p">).</span><span class="n">vectorize</span><span class="p">()</span>
</code></pre></div></div>
<p>NOTE: I will write more docs if I get time. Follow the algorithm implementations at <code class="language-plaintext highlighter-rouge">rlx/algos/*</code> for more API usage.</p>
<h2 id="installation-and-usage">Installation and usage</h2>
<p>Right now, there is no <code class="language-plaintext highlighter-rouge">pip</code> package; it’s just this repo. You can install it by cloning it and running</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install .
</code></pre></div></div>
<p>For example usage, follow the <code class="language-plaintext highlighter-rouge">main.py</code> script. You can test an algorithm by</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python main.py --algo ppo --policytype rnn --batch_size 16 --max_episode 5000 --horizon 200 --env CartPole-v0 --standardize_return
</code></pre></div></div>
<p>The meaning of batch-size is a little different here. It is the number of rollouts over which the gradient is averaged (currently that’s how it’s done).</p>
<h2 id="experiments">Experiments</h2>
<ul>
<li>Basic environments</li>
</ul>
<p>The “Incomplete”-prefixed environments are examples of POMDPs. Their state representations have been masked to create partial observability. They can only be solved by recurrent policies.</p>
<center>
<img src="/public/proj_res/4/exp.png" />
</center>
<ul>
<li>A slightly modified (simplified) <code class="language-plaintext highlighter-rouge">SlimeVolleyGym-v0</code> <a href="https://github.com/hardmaru/slimevolleygym">environment by David Ha</a>. An MLP agent trained with PPO learns to play volleyball through self-play; the example is provided at <code class="language-plaintext highlighter-rouge">examples/slime.py</code>.</li>
</ul>
<center>
<img width="80%" src="/public/proj_res/4/volley.gif" />
</center>
<hr />
<h2 id="plans">Plans</h2>
<p>Currently <code class="language-plaintext highlighter-rouge">rlx</code> has the following algorithms, but it is <strong>under active development</strong>.</p>
<ol>
<li>Vanilla REINFORCE</li>
<li>REINFORCE with Value-baseline</li>
<li>A2C</li>
<li>PPO with clipping</li>
<li>OffPAC</li>
</ol>
<h4 id="todo">TODO:</h4>
<ol>
<li>More SOTA algorithms (DQN, DDPG, etc.) to be implemented</li>
<li>Create a uniform API/interface to support Q-learning algorithms</li>
<li>Multiprocessing/Parallelization support</li>
</ol>
<h4 id="contributions">Contributions</h4>
<p>You are more than welcome to contribute anything.</p>Ayan Dasrlx is a Deep RL library written on top of PyTorch & built for educational and research purpose. Majority of the libraries/codebases for Deep RL are geared more towards reproduction of state-of-the-art algorithms on very specific tasks (e.g. Atari games etc.), but rlx is NOT. It is supposed to be more expressive and modular. Rather than making RL algorithms as black-boxes, rlx adopts an API that tries to expose more granular operation to the users which makes writing new algorithms easier. It is also useful for implementing task specific engineering into a known algorithm (as we know RL is very sensitive to small implementation engineerings).BézierSketch: A generative model for scalable vector sketches2020-05-22T00:00:00+00:002020-05-22T00:00:00+00:00https://ayandas.me/pubs/2020/05/22/pub-7<center>
<a target="_blank" class="pubicon" href="https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123710630.pdf">
<i class="fa fa-file-pdf-o fa-3x"></i>Paper
</a>
<a target="_blank" class="pubicon" href="https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123710630-supp.pdf">
<i class="fa fa-files-o fa-3x"></i>Suppl.
</a>
</center>
<p><br /></p>
<div>
<center>
<section class="accordion">
<section class="accordion-tabs">
<button class="accordion-tab accordion-active" data-actab-group="0" data-actab-id="0">Abstract</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="1">Slides</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="2">Video Talk</button>
<button class="accordion-tab " data-actab-group="0" data-actab-id="3">Code/Data</button>
</section>
<section class="accordion-content">
<article class="accordion-item accordion-active" data-actab-group="0" data-actab-id="0">
<div class="accordion-item__container">
<img src="/public/pub_res/7.png" style="width: 40%; float: left; margin: 15px; " />
<p style="text-align: justify;">The study of neural generative models of human sketches is a fascinating contemporary modeling problem due to the links between sketch image generation and the human drawing process. The landmark SketchRNN provided breakthrough by sequentially generating sketches as a sequence of waypoints. However this leads to low-resolution image generation, and failure to model long sketches. In this paper we present BézierSketch, a novel generative model for fully vector sketches that are automatically scalable and high-resolution. To this end, we first introduce a novel inverse graphics approach to stroke embedding that trains an encoder to embed each stroke to its best fit Bézier curve. This enables us to treat sketches as short sequences of parameterized strokes and thus train a recurrent sketch generator with greater capacity for longer sketches, while producing scalable high-resolution results. We report qualitative and quantitative results on the Quick, Draw! benchmark.</p>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="1">
<div class="accordion-item__container">
<center>
<!-- Visit this to create the oEmbed link -->
<!-- https://iframely.com/domains/speakerdeck -->
<h2>Slides for my ECCV '20 talk</h2>
<div style="left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.1972%;">
<iframe src="https://speakerdeck.com/player/b27372d9cf7f4f5ebb9a90bb2469b36f" style="top: 0; left: 5%; width: 90%; height: 90%; position: absolute; border: 0;" allowfullscreen="" scrolling="no" allow="encrypted-media">
</iframe>
</div>
<p>PS: Reusing any of these slides would require permission from the author.</p>
</center>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="2">
<div class="accordion-item__container">
<h2>Full talk at ECCV 2020</h2>
<iframe width="800" height="450" src="https://www.youtube-nocookie.com/embed/g2zzaLr2VfQ" frameborder="1" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</article>
<article class="accordion-item " data-actab-group="0" data-actab-id="3">
<div class="accordion-item__container">
<center>
<a target="_blank" class="pubicon" href="https://github.com/dasayan05/stroke-ae">
<i class="fa fa-file-code-o fa-3x"></i>Github repo
</a>
</center>
</div>
</article>
</section>
</section>
</center>
</div>
<p><br /></p>
<script type="text/javascript" src="/public/js/tabing.js"></script>
<h2 id="want-to-cite-this-paper-">Want to cite this paper ?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@InProceedings{das2020bziersketch,
title = {BézierSketch: A generative model for scalable vector sketches},
author = {Ayan Das and Yongxin Yang and Timothy Hospedales and Tao Xiang and Yi-Zhe Song},
booktitle = {The European Conference on Computer Vision (ECCV)},
year = {2020}
}
</code></pre></div></div>Ayan DasPaper Suppl.Introduction to Probabilistic Programming2020-05-05T00:00:00+00:002020-05-05T00:00:00+00:00https://ayandas.me/blog-tut/2020/05/05/probabilistic-programming<p>Welcome to another tutorial about probabilistic models, after <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">a primer on PGMs</a> and <a href="https://ayandas.me/blog-tut/2020/01/01/variational-autoencoder.html">VAE</a>. However, I am particularly excited to discuss a topic that doesn’t get as much attention as traditional Deep Learning does. The idea of <strong>Probabilistic Programming</strong> has long been there in the ML literature and got enriched over time. Before it creates confusion, let’s declutter it right now - it’s not really writing traditional “programs”, rather it’s building <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">Probabilistic Graphical Models</a> (PGMs), but <em>equipped with imperative programming style</em> (i.e., iterations, branching, recursion etc). Just like Automatic Differentiation allowed us to compute derivative of arbitrary computation graphs (in PyTorch, TensorFlow), Black-box methods have been developed to “solve” probabilistic programs. In this post, I will provide a generic view on why such a language is indeed possible and how such black-box solvers are materialized. At the end, I will also introduce you to one such <em>Universal</em> Probabilistic Programming Language, <a href="http://pyro.ai/">Pyro</a>, that came out of <a href="https://www.uber.com/us/en/uberai/">Uber’s AI lab</a> and started gaining popularity.</p>
<h1 id="overview">Overview</h1>
<p>Before I dive into details, let’s get the bigger picture clear. It is highly advisable to read any good reference about PGMs before you proceed - my <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">previous article</a> for example.</p>
<h3 id="generative-view--execution-trace">Generative view & Execution trace</h3>
<p>Probabilistic Programming is NOT really what we usually think of as <em>programming</em> - i.e., completely deterministic execution of hard-coded instructions which does exactly what it’s told and nothing more.
Rather, it is about building PGMs (must read <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">this</a>) which model our belief about the data generation process. We, as users of such a language, would express a model in an imperative form which would encode all our uncertainties in the way we want. Here is a (toy) example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model</span><span class="p">(</span><span class="n">theta</span><span class="p">):</span>
    <span class="n">A</span> <span class="o">=</span> <span class="n">Bernoulli</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">];</span> <span class="n">theta</span><span class="p">)</span>
    <span class="n">P</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">A</span>
    <span class="k">if</span> <span class="n">A</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
        <span class="n">B</span> <span class="o">=</span> <span class="n">Uniform</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">B</span> <span class="o">=</span> <span class="n">Uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">P</span><span class="p">)</span>
    <span class="n">C</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">A</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span>
</code></pre></div></div>
<p>If you assume this to be a valid program (for now), this is what we are talking about here - all our traditional “variables” become “random variables” (RVs) and have uncertainty associated with them in the form of probability distributions. Just to give you a taste of its flexibility, here are the constituent elements we encountered</p>
<ol>
<li>Various different distributions are available (e.g., Normal, Bernoulli, Uniform etc.)</li>
<li>We can do deterministic computation (i.e., \(P = 2 * A\))</li>
<li>Condition RVs on other RVs (i.e., \(C\vert B \sim \mathcal{N}(B, 1)\))</li>
<li>Imperative style branching allows dynamic structure of the model …</li>
</ol>
<p>Below is a graphical representation of the model defined by the above program.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/16/example_exectrace.png" />
</figure>
</center>
<p>Just like the invocation of a traditional compiler on a traditional program produces the desired output, this (probabilistic) program can be executed by means of “ancestral sampling”. I ran the program 5 times and each time I got samples from all my RVs. Each such “forward” run is often called an <em>execution trace</em> of the model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="p">(</span><span class="mf">0.5</span><span class="p">))</span>
<span class="p">(</span><span class="mf">1.000</span><span class="p">,</span> <span class="mf">2.000</span><span class="p">,</span> <span class="mf">0.318</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.069</span><span class="p">)</span>
<span class="p">(</span><span class="o">-</span><span class="mf">1.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.156</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.822</span><span class="p">)</span>
<span class="p">(</span><span class="mf">1.000</span><span class="p">,</span> <span class="mf">2.000</span><span class="p">,</span> <span class="mf">0.594</span><span class="p">,</span> <span class="mf">0.865</span><span class="p">)</span>
<span class="p">(</span><span class="mf">1.000</span><span class="p">,</span> <span class="mf">2.000</span><span class="p">,</span> <span class="mf">1.100</span><span class="p">,</span> <span class="mf">1.079</span><span class="p">)</span>
<span class="p">(</span><span class="o">-</span><span class="mf">1.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.000</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.262</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.403</span><span class="p">)</span>
</code></pre></div></div>
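<p>The toy program is written in an idealized syntax, but ancestral sampling itself is easy to reproduce. Below is a minimal runnable sketch using only Python's standard library. It is not how a probabilistic language implements things internally, and it rests on one assumption that the toy syntax leaves open: \(A = +1\) is drawn with probability \(\theta\) and \(A = -1\) otherwise.</p>

```python
import random

def model(theta, rng):
    # Assumption: A = +1 with probability theta, otherwise A = -1
    A = 1 if rng.random() < theta else -1
    P = 2 * A                      # deterministic computation
    if A == -1:                    # branching decides the support of B
        B = rng.uniform(P, 0)
    else:
        B = rng.uniform(0, P)
    C = rng.gauss(B, 1)            # C | B ~ Normal(B, 1)
    return A, P, B, C

rng = random.Random(0)             # fixed seed so the traces are repeatable
traces = [model(0.5, rng) for _ in range(5)]
for trace in traces:
    print(tuple(round(v, 3) for v in trace))
```

<p>Each call visits the RVs in parent-before-child order, which is exactly what makes one run an execution trace.</p>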
<p>This is the so-called “generative view” of a model. We typically use the leaf-nodes of PGMs as our observed data. The rest of the graph makes up the “latent factors” of the model, which we either know or want to estimate. In general, a practical PGM can often be encapsulated as a set of latent nodes \(\mathbf{Z} \triangleq \{ Z_1, Z_2, \cdots, Z_H \}\) and visible nodes \(\mathbf{X} \triangleq \{ X_1, X_2, \cdots, X_V \}\) related probabilistically as
<br />
\[
\mathbf{Z} \rightarrow \mathbf{X}
\]</p>
<h3 id="training-and-inference">Training and Inference</h3>
<p>From now on, we’ll use the general notation rather than the specific example. The model may be parametric. For example, we had the Bernoulli success probability \(\theta\) in our toy example. The full joint probability is given as</p>
<p>\[
\mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) = \mathbb{P}_{\theta}(\mathbf{Z}) \cdot \mathbb{P}_{\theta}(\mathbf{X}\vert \mathbf{Z})
\]</p>
<p>We would like to do two things:</p>
<ol>
<li>Estimate model parameters \(\theta\) from data</li>
<li>Compute the posterior, i.e., infer latent variables given data</li>
</ol>
<p>As discussed in my <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">PGM article</a>, both of them are infeasible due to the fact that</p>
<ol>
<li>Direct log-likelihood maximization is not possible because of the presence of latent variables</li>
<li>For continuous distributions on latent variables, the posterior is intractable</li>
</ol>
<p>The way forward is to resort to <em>Variational Inference</em> and maximize our very familiar <strong>E</strong>vidence <strong>L</strong>ower <strong>BO</strong>und (ELBO) objective to estimate the model parameters and also a set of variational parameters which help build a proxy for the original posterior \(\mathbb{P}_{\theta}(\mathbf{Z}\vert \mathbf{X})\). Mathematically, we choose a known and tractable family of distributions \(\mathbb{Q}_{\phi}(\mathbf{Z})\) (parameterized by variational parameters \(\phi\)) to approximate the posterior. The learning process is facilitated by maximizing the following</p>
<p>\[
\mathrm{ELBO}(\theta, \phi) \triangleq \mathbb{E}_{\mathbb{Q}_{\phi}} \bigl[\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z}) \bigr]
\]</p>
<p>by estimating gradients w.r.t all its parameters</p>
<p>\[\tag{1}
\nabla_{[\theta, \phi]} \mathrm{ELBO}(\theta, \phi)
\]</p>
<h1 id="black-box-variational-inference">Black-Box Variational Inference</h1>
<p><br />
If you have gone through my <a href="https://ayandas.me/blog-tut/2019/11/20/inference-in-pgm.html">PGM article</a>, you might think you’ve seen these before. Actually, you’re right ! There is really nothing new to this. What we really need for establishing a Probabilistic Programming framework is <strong>a unified way to implement the ELBO optimization for ANY given problem</strong>. And by “problem” I mean the following:</p>
<ol>
<li>A model specification \(\mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X})\) written in a probabilistic language (like we saw before)</li>
<li>An optional (parameterized) “Variational Model” \(\mathbb{Q}_{\phi}(\mathbf{Z})\), famously known as a “Guide”</li>
<li>And .. the observed data \(\mathcal{D}\), of course</li>
</ol>
<!-- Very importantly, we CAN NOT make any *assumptions* about the inner structure of either the "model" or the "guide". This motivated the research on a "Black-box" method for solving such probabilistic programs. Please realize that this is exactly how "traditional compilers" (like C, Python) are built - they make no assumption about the symantic meaning/structure of your program .. they just check for syntactic validity. -->
<p>But, how do we compute (1) ? The apparent problem is that the gradient w.r.t. \(\phi\) is required, but \(\phi\) also parameterizes the distribution under which the expectation is taken. To mitigate this, we make use of the famous “log-derivative” trick (it actually goes by many other names, REINFORCE among them). For notational simplicity let’s denote \(f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \triangleq \log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z})\) and continue from (1)</p>
\[\sum_{\mathbf{Z}} \nabla_{[\theta, \phi]} \bigg[ \mathbb{Q}_{\phi}(\mathbf{Z}) \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]
=\sum_{\mathbf{Z}} \bigg[ \nabla_{\phi} \mathbb{Q}_{\phi}(\mathbf{Z}) \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi)
+\mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \nabla_{[\theta, \phi]}f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]\]
\[=\sum_{\mathbf{Z}} \bigg[ \color{red}{\mathbb{Q}_{\phi}(\mathbf{Z})} \cdot \frac{\nabla_{\phi} \mathbb{Q}_{\phi}(\mathbf{Z})}{\color{red}{\mathbb{Q}_{\phi}(\mathbf{Z})}} \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi)
+\mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \nabla_{[\theta, \phi]}f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]\]
\[=\sum_{\mathbf{Z}} \mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \bigg[ \color{red}{\nabla_{\phi} \log\mathbb{Q}_{\phi}(\mathbf{Z})} \cdot f(\mathbf{Z}, \mathbf{X}; \theta, \phi)
+\nabla_{[\theta, \phi]}f(\mathbf{Z}, \mathbf{X}; \theta, \phi) \bigg]\]
\[\tag{2}
= \mathbb{E}_{\mathbb{Q}_{\phi}} \bigg[ \nabla_{[\theta, \phi]} \bigg( \underbrace{\log\mathbb{Q}_{\phi}(\mathbf{Z}) \cdot \overline{f(\mathbf{Z}, \mathbf{X}; \theta, \phi)}
+f(\mathbf{Z}, \mathbf{X}; \theta, \phi)}_\text{Surrogate Objective} \bigg) \bigg]\]
<p>Eq. (2) shows that the trick helped the \(\nabla_{[\theta, \phi]}\) to penetrate the \(\mathbb{E}[\cdot]\), but in the process, it replaced the original \(f\) with a “<a href="https://arxiv.org/abs/1506.05254">surrogate</a> function” \(f_{surr} \triangleq \overline{f}\cdot\log\mathbb{Q}+f\) where the <em>bar</em> protects a quantity from differentiation. Equation (2) is all we need - it provides an insight on how to make the gradient estimation practical. In fact, it can be proven theoretically that a sample-based estimate of this expectation is an unbiased estimate of the true gradient in Equation (1).</p>
<p>Succinctly, we run the Guide \(L\) times to record a set of \(L\) execution-traces (i.e., samples \(\mathbf{\widehat{Z}}\sim\mathbb{Q}_{\phi}\)) and compute the following Monte-Carlo approximation to Equation (2)</p>
<p>\[\tag{3}
\nabla_{[\theta, \phi]} \mathrm{ELBO}(\theta, \phi) \approx \frac{1}{L} \sum_{\mathbf{\widehat{Z}}\sim\mathbb{Q}_{\phi}} \left[ \nabla_{[\theta, \phi]} f_{surr}(\mathbf{\widehat{Z}}, \mathcal{D}) \right]_{\theta=\theta_{old}, \phi=\phi_{old}}
\]</p>
<p>The nice thing about Equation (2) (or equivalently Equation (3)) is that we got the differentiation operator right on top of a deterministic function (i.e., \(f_{surr}\)). It means we can construct \(f_{surr}\) as a computation graph and take advantage of modern-day automatic differentiation engines. Here’s how the computation graph and the graphical model are linked</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/16/gm_cg.png" />
</figure>
</center>
<p>Last but not the least, let’s look at the function \(f_{surr}\) which is basically built on the log-density terms \(\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X})\) and \(\log \mathbb{Q}_{\phi}(\mathbf{Z})\). We need a way to compute them flexibly. Please remember that the model and guide are written in a <em>language</em> and hence we have access to their graph-structure. A clever software implementation can harness this structure to estimate the log-densities (and eventually \(f_{surr}\)).</p>
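<p>To see Equation (3) in action, here is a small sketch on a hand-picked toy problem (my own choice for illustration, not something a PPL would hard-code): prior \(Z \sim \mathcal{N}(0, 1)\), likelihood \(X \vert Z \sim \mathcal{N}(Z, 1)\) and guide \(\mathbb{Q}_{\mu}(Z) = \mathcal{N}(\mu, 1)\) with a single variational parameter \(\mu\). For this model the ELBO gradient has the closed form \(x - 2\mu\), so the estimate can be checked. The derivatives are written by hand here; a real framework would get them from automatic differentiation.</p>

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def log_normal(value, mean):
    """Log-density of a unit-variance Normal."""
    return -0.5 * (value - mean) ** 2 - 0.5 * LOG_2PI

def score_function_grad(mu, x, num_samples, rng):
    """Monte-Carlo estimate of d ELBO / d mu via the surrogate objective."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, 1.0)          # one execution trace of the guide
        # f = log P(z, x) - log Q_mu(z)
        f = log_normal(z, 0.0) + log_normal(x, z) - log_normal(z, mu)
        dlogq_dmu = z - mu              # gradient of log Q_mu(z) w.r.t. mu
        df_dmu = -(z - mu)              # gradient of f with the sample z held fixed
        total += dlogq_dmu * f + df_dmu # gradient of f_surr at this sample
    return total / num_samples

rng = random.Random(0)
estimate = score_function_grad(mu=0.0, x=2.0, num_samples=100_000, rng=rng)
exact = 2.0 - 2.0 * 0.0                 # closed form for this toy model: x - 2 * mu
print(estimate, exact)
```

<p>The estimate only settles close to the exact value with a large number of samples, which hints at the variance issue this estimator is known for.</p>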
<p>I claimed before that the gradient estimates are unbiased. However, such a generic way of computing the gradient introduces high variance in the estimate and makes things unstable for complex models. There are a few tricks used widely to get around this. But please note that such tricks always exploit model-specific structure. Three such tricks are presented below.</p>
<h3 id="i-re-parameterization">I. Re-parameterization</h3>
<p>We might get lucky that \(\mathbb{Q}_{\phi}(\mathbf{Z})\) is <a href="https://arxiv.org/abs/1312.6114">re-parameterizable</a>. What that means is the expectation w.r.t \(\mathbb{Q}_{\phi}(\mathbf{Z})\) can be made free of its parameters and by doing so the gradient operator can be pushed inside without going through the log-derivative trick.
So, let’s step back a bit and consider the original ELBO gradient in (1). Assuming re-parameterizable nature, the following can be done
\[
\nabla_{[\theta, \phi]} \mathbb{E}_{\mathbb{Q}_{\phi}} \bigg[\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z}) \bigg] = \nabla_{[\theta, \phi]} \mathbb{E}_{Q(\mathbf{\epsilon})} \bigg[\log \mathbb{P}_{\theta}(G_{\phi}(\epsilon), \mathbf{X}) - \log \mathbb{Q}_{\phi}(G_{\phi}(\epsilon)) \bigg]
\]
\[
= \mathbb{E}_{Q(\mathbf{\epsilon})} \bigg[\nabla_{[\theta, \phi]} \bigg( \log \mathbb{P}_{\theta}(G_{\phi}(\mathbf{\epsilon}), \mathbf{X}) - \log \mathbb{Q}_{\phi}(G_{\phi}(\epsilon)) \bigg) \bigg]
\]</p>
<p>where \(Q(\mathbf{\epsilon})\) is an independent source of randomness and \(G_{\phi}\) is the deterministic map that turns a noise sample into \(\mathbf{Z}\). Computing this expectation with an empirical average (just like Eq. (3)) gives us a better (variance-reduced) estimate of the true gradient of the ELBO.</p>
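<p>As a rough illustration, consider a toy setup chosen purely for this sketch: prior \(Z \sim \mathcal{N}(0, 1)\), likelihood \(X \vert Z \sim \mathcal{N}(Z, 1)\), guide \(\mathcal{N}(\mu, 1)\), re-parameterized as \(G_{\mu}(\epsilon) = \mu + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, 1)\). Each per-sample gradient is then \(x - 2\mu - 2\epsilon\), so the estimator has mean \(x - 2\mu\) and a per-sample variance of exactly 4. The derivative is written out by hand instead of relying on an autodiff engine.</p>

```python
import random

def reparam_grad(mu, x, num_samples, rng):
    """Pathwise estimate of d ELBO / d mu using z = G_mu(eps) = mu + eps."""
    grads = []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)       # parameter-free noise source Q(eps)
        z = mu + eps                    # G_mu(eps)
        # d/dmu of log P(z, x) = -z^2/2 - (x - z)^2/2 + const, with dz/dmu = 1;
        # log Q_mu(G_mu(eps)) = -eps^2/2 + const does not depend on mu.
        grads.append(-z + (x - z))
    mean = sum(grads) / num_samples
    variance = sum((g - mean) ** 2 for g in grads) / (num_samples - 1)
    return mean, variance

rng = random.Random(0)
mean, variance = reparam_grad(mu=0.0, x=2.0, num_samples=20_000, rng=rng)
print(mean, variance)
```

<p>Because the randomness enters only through \(\epsilon\), the gradient flows through the sample itself and no score term multiplies \(f\).</p>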
<h3 id="ii-rao-blackwellization">II. Rao-Blackwellization</h3>
<p>This is another well-known variance reduction technique. The underlying mathematics is somewhat involved, so I will only sketch the idea. It requires the full variational distribution to have some kind of factorization. A specific case is the mean-field assumption, i.e.</p>
<p>\[
\mathbb{Q}_{\phi}(\mathbf{Z}) = \prod_i Q_{\phi_i}(Z_i)
\]</p>
<p>With a little effort, we can pull out the gradient estimator for each of these \(\phi_i\) parameters from (2). They look something like this</p>
\[\nabla_{\phi_i} \mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{\mathbb{Q}_{\phi}} \bigg[ \nabla_{\phi_i} \log\mathbb{Q}_{\phi_i}(Z_i) \cdot \bigg( \overline{\log \mathbb{P}_{\theta}(\mathbf{Z}, \mathbf{X}) - \log \mathbb{Q}_{\phi}(\mathbf{Z})} \bigg)
+\cdots \bigg]\]
<p>The quantity under the bar still contains all the factors because it is shielded from the gradient operator. Likewise, the expectation sits outside the gradient operator, so it is still taken over all factors. At this point, Rao-Blackwellization offers a variance-reduced estimate of the above gradient, i.e.,</p>
\[\nabla_{\phi_i} \mathrm{ELBO}(\theta, \phi) \approx \mathbb{E}_{\mathbb{Q}_{\phi}^{(i)}} \bigg[ \nabla_{\phi_i} \log\mathbb{Q}_{\phi_i}(Z_i) \cdot \bigg( \overline{\log \mathbb{P}^{(i)}_{\theta}(\mathbf{Z}^{(i)}, \mathbf{X}) - \log \mathbb{Q}_{\phi_i}(Z_i)} \bigg)
+\cdots \bigg]\]
<p>where \(\mathbf{Z}^{(i)}\) is the set of variables that forms the “Markov blanket” of \(Z_i\) w.r.t. the structure of the guide, \(\mathbb{Q}_{\phi}^{(i)}\) is the part of the variational distribution that depends on \(\mathbf{Z}^{(i)}\) and \(\mathbb{P}^{(i)}_{\theta}(\mathbf{Z}^{(i)}, \cdot)\) collects the factors of the model that involve \(\mathbf{Z}^{(i)}\).</p>
<h3 id="iii-explicit-enumeration-for-discrete-rvs">III. Explicit enumeration for Discrete RVs</h3>
<p>When exploiting the graph structure of the guide to simplify (1), we might end up with a term like this due to factorization in the guide density</p>
<p>\[
\mathbb{E}_{Z_i\sim\mathbb{Q}_{\phi_i}(Z_i)} \bigl[ f(\cdot) \bigr]
\]</p>
<p>If it happens that the variable \(Z_i\) is discrete with the size of its state space reasonably small (e.g., a \(d=5\) dimensional binary RV has \(2^5 = 32\) states), we can replace the sampling-based empirical expectation with the true expectation, where we evaluate a sum over its entire state-space</p>
<p>\[
\sum_{Z_i} \mathbb{Q}_{\phi_i}(Z_i)\cdot f(\cdot)
\]</p>
<p>So make sure the state-space is reasonable in size. This helps reduce the variance quite a bit.</p>
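<p>The enumeration idea can be sketched in a few lines. The example below assumes, purely for illustration, that \(Z_i\) is a vector of five independent Bernoulli variables with hypothetical success probabilities, and it evaluates the sum over all \(2^5 = 32\) states instead of sampling.</p>

```python
from itertools import product

def exact_expectation(probs, f):
    """Sum Q(z) * f(z) over every state of independent Bernoulli variables."""
    total = 0.0
    for z in product((0, 1), repeat=len(probs)):
        q = 1.0
        for z_i, p_i in zip(z, probs):
            q *= p_i if z_i == 1 else 1.0 - p_i   # probability of this state
        total += q * f(z)
    return total

probs = [0.1, 0.3, 0.5, 0.7, 0.9]       # hypothetical guide parameters
# Expected number of ones; analytically this is just sum(probs)
expected_ones = exact_expectation(probs, f=sum)
print(expected_ones)
```

<p>The loop runs over \(2^d\) states, which is why this only pays off when the state space is small.</p>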
<p>Whew ! That’s a lot of maths. But the good thing is, you hardly ever have to think about them in detail because software engineers have put tremendous effort into making these algorithms as easily accessible as possible via libraries. We are going to take a brief look at one of them.</p>
<h1 id="pyro-universal-probabilistic-programming"><code class="language-plaintext highlighter-rouge">Pyro</code>: Universal Probabilistic Programming</h1>
<p><a href="http://pyro.ai/">Pyro</a> is a probabilistic programming framework that allows users to write flexible models in terms of a simple API. Pyro is written in Python and uses the popular PyTorch library for its internal representation of the computation graph and as its automatic differentiation engine. Pyro is quite expressive due to the fact that it allows the model/guide to have fully imperative flow. Its core API consists of these functionalities</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">pyro.param()</code> for defining learnable parameters.</li>
<li><code class="language-plaintext highlighter-rouge">pyro.distributions</code> (commonly imported as <code class="language-plaintext highlighter-rouge">dist</code>) contains a large collection of probability distributions.</li>
<li><code class="language-plaintext highlighter-rouge">pyro.sample()</code> for sampling from a given distribution.</li>
</ol>
<p>Let’s take a concrete example and work it out.</p>
<h4 id="problem-mixture-of-gaussian">Problem: Mixture of Gaussian</h4>
<p>MoG (Mixture of Gaussians) is a relatively simple but widely studied probabilistic model. It has an important application in soft-clustering. For the sake of simplicity we assume we only have two mixture components. The generative view of the model is basically this: we flip a coin (latent) with bias \(\rho\) and depending on the outcome \(C\in \{ 0, 1 \}\) we sample data from one of the two Gaussians \(\mathcal{N}(\mu_0, \sigma_0)\) and \(\mathcal{N}(\mu_1, \sigma_1)\)</p>
\[C_i \sim Bernoulli(\rho) \\
X_i \sim \mathcal{N}(\mu_{C_i}, \sigma_{C_i})\]
<p>where \(i = 1 \cdots N\) is the data index and \(\theta \triangleq \{ \rho, \mu_0, \sigma_0, \mu_1, \sigma_1 \}\) is the set of model parameters we need to learn. This is how you write that in Pyro:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">pyro</span>
<span class="kn">import</span> <span class="nn">pyro.distributions</span> <span class="k">as</span> <span class="n">dist</span>

<span class="k">def</span> <span class="nf">model</span><span class="p">(</span><span class="n">data</span><span class="p">):</span> <span class="c1"># Take the observation
</span>    <span class="c1"># Define coin bias as parameter. That's what 'pyro.param' does
</span>    <span class="n">rho</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"rho"</span><span class="p">,</span> <span class="c1"># Give it a name for Pyro to track properly
</span>                     <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">0.5</span><span class="p">]),</span> <span class="c1"># Initial value
</span>                     <span class="n">constraint</span><span class="o">=</span><span class="n">dist</span><span class="p">.</span><span class="n">constraints</span><span class="p">.</span><span class="n">unit_interval</span><span class="p">)</span> <span class="c1"># Has to be in [0, 1]
</span>    <span class="c1"># Define both means and stds with fixed initial values
</span>    <span class="n">means</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"M"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.5</span><span class="p">,</span> <span class="mf">3.</span><span class="p">]))</span>
    <span class="n">stds</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"S"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">]),</span>
                      <span class="n">constraint</span><span class="o">=</span><span class="n">dist</span><span class="p">.</span><span class="n">constraints</span><span class="p">.</span><span class="n">positive</span><span class="p">)</span> <span class="c1"># std deviation must be positive
</span>
    <span class="k">with</span> <span class="n">pyro</span><span class="p">.</span><span class="n">plate</span><span class="p">(</span><span class="s">"data"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span> <span class="c1"># Mark conditional independence
</span>        <span class="c1"># construct a Bernoulli and sample from it.
</span>        <span class="n">c</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">dist</span><span class="p">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="n">rho</span><span class="p">))</span> <span class="c1"># c \in {0, 1}
</span>        <span class="n">c</span> <span class="o">=</span> <span class="n">c</span><span class="p">.</span><span class="nb">type</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">)</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">],</span> <span class="n">stds</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="c1"># pick a mean and std as per 'c'
</span>        <span class="n">pyro</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="s">"x"</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">obs</span><span class="o">=</span><span class="n">data</span><span class="p">)</span> <span class="c1"># sample data (also mark it as observed)
</span></code></pre></div></div>
<p>Due to the discrete and low-dimensional nature of the latent variable \(C\), computing the posterior is in general tractable for this problem. But let’s assume it is not. The true posterior \(\mathbb{P}(C_i\vert X_i)\) is the quantity known as the “assignment” that reveals the latent factor, i.e., what the coin toss result was when a given \(X_i\) was sampled. We define a guide on \(C\), parameterized by variational parameters \(\phi \triangleq \{ \lambda_i \}_{i=1}^N\)</p>
\[C_i \sim Bernoulli(\lambda_i)\]
<p>In Pyro, we define a guide that encodes this</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">guide</span><span class="p">(</span><span class="n">data</span><span class="p">):</span> <span class="c1"># Guide doesn't use the data values; it only needs N
</span>    <span class="k">with</span> <span class="n">pyro</span><span class="p">.</span><span class="n">plate</span><span class="p">(</span><span class="s">"data"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span> <span class="c1"># conditional independence
</span>        <span class="c1"># Define variational parameters \lambda_i (one for every data point)
</span>        <span class="n">lam</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">param</span><span class="p">(</span><span class="s">"lam"</span><span class="p">,</span>
                         <span class="n">torch</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)),</span> <span class="c1"># randomly initialized
</span>                         <span class="n">constraint</span><span class="o">=</span><span class="n">dist</span><span class="p">.</span><span class="n">constraints</span><span class="p">.</span><span class="n">unit_interval</span><span class="p">)</span> <span class="c1"># \in [0, 1]
</span>        <span class="n">c</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="c1"># Careful, this name HAS TO BE the same as in the model
</span>                        <span class="n">dist</span><span class="p">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="n">lam</span><span class="p">))</span>
</code></pre></div></div>
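<p>Because \(C_i\) is a single coin flip, the exact posterior that the guide is trying to approximate can also be written down with Bayes’ rule, which gives a handy sanity check for the learned \(\lambda_i\). The following is a minimal NumPy sketch and not part of the Pyro program; the helper names <code>normal_pdf</code> and <code>exact_assignment</code> are made up for illustration, and the parameter values are simply the initial values used in the model.</p>

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def exact_assignment(x, rho, means, stds):
    """P(C = 1 | x) for the two-component mixture, via Bayes' rule."""
    p1 = rho * normal_pdf(x, means[1], stds[1])          # coin shows 1
    p0 = (1.0 - rho) * normal_pdf(x, means[0], stds[0])  # coin shows 0
    return p1 / (p0 + p1)

# Hypothetical query points: the two component means and their midpoint
x = np.array([1.5, 2.25, 3.0])
post = exact_assignment(x, rho=0.5, means=[1.5, 3.0], stds=[0.5, 0.5])
```

<p>With equal standard deviations and a fair coin, a point halfway between the two means gets an assignment of exactly one half, while points near a mean are assigned almost entirely to that component.</p>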
<p>We generate some synthetic data from the following simulator to train our model on.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">getdata</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">mean1</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span> <span class="n">mean2</span><span class="o">=-</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">std1</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">std2</span><span class="o">=</span><span class="mf">0.5</span><span class="p">):</span>
    <span class="n">D1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">N</span><span class="o">//</span><span class="mi">2</span><span class="p">,)</span> <span class="o">*</span> <span class="n">std1</span> <span class="o">+</span> <span class="n">mean1</span>
    <span class="n">D2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">N</span><span class="o">//</span><span class="mi">2</span><span class="p">,)</span> <span class="o">*</span> <span class="n">std2</span> <span class="o">+</span> <span class="n">mean2</span>
    <span class="n">D</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">D1</span><span class="p">,</span> <span class="n">D2</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">D</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">D</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">))</span>
</code></pre></div></div>
<p>Finally, Pyro requires a bit of boilerplate to set up the optimization</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">getdata</span><span class="p">(</span><span class="mi">200</span><span class="p">)</span> <span class="c1"># 200 data points
</span><span class="n">pyro</span><span class="p">.</span><span class="n">clear_param_store</span><span class="p">()</span>
<span class="n">optim</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">({})</span>
<span class="n">svi</span> <span class="o">=</span> <span class="n">pyro</span><span class="p">.</span><span class="n">infer</span><span class="p">.</span><span class="n">SVI</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">guide</span><span class="p">,</span> <span class="n">optim</span><span class="p">,</span> <span class="n">pyro</span><span class="p">.</span><span class="n">infer</span><span class="p">.</span><span class="n">Trace_ELBO</span><span class="p">())</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">):</span>
    <span class="n">svi</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div></div>
<p>That’s pretty much all we need. I have plotted (1) the ELBO loss, (2) the variational parameter \(\lambda_i\) for every data point, (3) the two Gaussians in the model and (4) the coin bias as the training progresses.</p>
<center>
<figure>
<img width="100%" style="padding-top: 20px;" src="/public/posts_res/16/example_loss.gif" />
</figure>
</center>
<p>The full code is available in this gist: <a href="https://gist.github.com/dasayan05/aca3352cd00058511e8372912ff685d8">https://gist.github.com/dasayan05/aca3352cd00058511e8372912ff685d8</a>.</p>
<hr />
<p>That’s all for today. Hopefully I was able to convey the bigger picture of probabilistic programming, which is quite useful for modelling lots of problems. The following references are the sources I consulted while writing this article. Interested readers are encouraged to check them out.</p>
<ol>
<li><a href="http://pyro.ai/examples/svi_part_iii.html">Pyro’s VI tutorial</a></li>
<li><a href="https://arxiv.org/abs/1401.0118">Black Box variational inference</a></li>
<li><a href="https://arxiv.org/abs/1506.05254">Gradient Estimation Using Stochastic Computation Graphs</a></li>
<li><a href="https://arxiv.org/abs/1701.03757">Deep Probabilistic Programming</a></li>
<li><a href="https://arxiv.org/abs/1810.09538">Pyro: Deep Universal Probabilistic Programming</a></li>
</ol>
Ayan Das
Patterns of Randomness
2020-04-15T00:00:00+00:00
https://ayandas.me/blog-tut/2020/04/15/patterns-of-randomness
<p>Welcome, folks! This is an article I had been planning to write for a long time. I finally managed to get it done while locked at home due to the global COVID-19 situation. So, it’s basically something fun, interesting, attractive and hopefully understandable to most readers. To be specific, my plan is to dive into the world of finding visually appealing patterns in different areas of mathematics. I am gonna introduce you to four distinct mathematical concepts by means of which we can generate artistic patterns that are very soothing to human eyes. Most of these use random numbers as the underlying principle of generation. These are not necessarily very useful in real-life problem solving but are widely loved by artists as a tool for content creation. They are sometimes referred to as <em>Mathematical Art</em>. I will deliberately keep the fine-grained details out of the way so that the article is accessible to a larger audience. In case you want to reproduce the content in this post, here is the <a href="https://github.com/dasayan05/patterns-of-randomness">code</a>. <strong>Warning: This post contains some fairly large images which may take some time to load in your browser; so be patient</strong>.</p>
<h1 id="-random-walk--brownian-motion-">[ Random Walk & Brownian Motion ]</h1>
<p>Let’s start with something simple. Consider a Random Variable \(\mathbf{R}_t\) (\(t\) being time) with support \(\{ -1, +1\}\) with equal probability on both of its possible values. Think of it as a <em>score</em> you get at time \(t\) which can be either \(+1\) or \(-1\) as a result of an unbiased coin-flip. In terms of probability:</p>
<p>\[
\mathbb{P}\bigl[ \mathbf{R}_t = +1 \bigr] = \mathbb{P}\bigl[ \mathbf{R}_t = -1 \bigr] = \frac{1}{2}
\]</p>
<p>Realization (samples) of \(\mathbf{R}_t\) for \(t=0 \rightarrow T (=10)\) would look like
\[
\bigl[ +1, -1, -1, +1, -1, -1, -1, +1, +1, -1, +1 \bigr]
\]</p>
<p>Let us define another Random Variable \(\mathbf{S}_t\) which is nothing but an accumulator of \(\mathbf{R}_t\) till time \(t\). So, by definition</p>
<p>\[
\mathbf{S}_t = \sum_{i=0}^t \mathbf{R}_i
\]
Realization of \(\mathbf{S}_t\) corresponding to above \(\mathbf{R}_t\) sequence would look like
\[
\bigl[ +1, 0, -1, 0, -1, -2, -3, -2, -1, -2, -1 \bigr]
\]</p>
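<p>The accumulation above is just a cumulative sum, which can be checked in a couple of lines. This is a small NumPy sketch; the helper name <code>random_walk</code> and the fixed seed are my own choices for illustration.</p>

```python
import numpy as np

# The realization of R_t shown above, for t = 0..10
R = np.array([+1, -1, -1, +1, -1, -1, -1, +1, +1, -1, +1])
S = np.cumsum(R)  # S_t = R_0 + R_1 + ... + R_t

def random_walk(T, rng):
    """Sample R_0..R_T uniformly from {-1, +1} and return the walk S_0..S_T."""
    steps = rng.choice([-1, 1], size=T + 1)
    return np.cumsum(steps)

rng = np.random.default_rng(0)  # fixed seed only for reproducibility
walk = random_walk(10, rng)
```

<p>Here <code>S</code> reproduces the sequence listed above, and <code>walk</code> is a fresh realization that moves by exactly one unit at every time step.</p>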
<p>This is popularly known as the <strong>Random Walk</strong>. With the basics ready, let us have two such random walks namely \(\mathbf{S}^x_t\) and \(\mathbf{S}^y_t\) and treat them as \(X\) and \(Y\) coordinates of a <em>Random Vector</em> namely \(\displaystyle{ \bar{\mathbf{S}}_t \triangleq \begin{bmatrix} \mathbf{S}^x_t \\ \mathbf{S}^y_t \end{bmatrix} }\).</p>
<p>So far it all looks nice and mathy, right? Here’s the fun part. Let me keep the time (i.e., \(t\)) running and keep track of the path that the vector \(\bar{\mathbf{S}}_t\) traces on a 2D plane</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/15/2d_disc_brown.gif" />
</figure>
</center>
<p>It will create a cool random checkerboard-like pattern as time goes on. Looking at the tip (the ‘dot’), you might see it as a tiny particle. As it happens, this is a discretized version of a continuous <a href="http://www1.lsbu.ac.uk/water/Brownian.html">phenomenon observed in real microscopic particles in fluid</a>, famously known as <strong>Brownian Motion</strong>.</p>
<p>Real Brownian Motion is continuous. Let’s work it out, but very briefly. We divide an arbitrary time interval \([0, T]\) into \(N\) small intervals of length \(\displaystyle{ \Delta t = \frac{T}{N} }\) and have a modified score Random Variable \(\mathbf{R}_t\) with support \(\displaystyle{ \left\{ +\sqrt{\frac{T}{N}}, -\sqrt{\frac{T}{N}} \right\} }\) with equal probability as before. We still have the same definition of \(\mathbf{S}_t = \sum_{i=0}^t \mathbf{R}_i\). It so happens that as we approach the limiting case of</p>
<p>\[
N \rightarrow \infty,\text{ and consequently } \sqrt{\frac{T}{N}} \rightarrow 0\text{ and } \Delta t\rightarrow 0
\]</p>
<p>the walk turns into its continuous analogue, the <strong>Brownian Motion</strong>. Similar to the discrete case, if we trace the path of \(\displaystyle{ \bar{\mathbf{S}}_t \triangleq \begin{bmatrix} \mathbf{S}^x_t \\ \mathbf{S}^y_t \end{bmatrix} }\) with large \(N\) (yes, in practice we cannot go to infinity, sorry), patterns like this will emerge</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px;" src="/public/posts_res/15/brown.gif" />
</figure>
</center>
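<p>A path of this kind can be simulated by drawing scaled steps for both coordinates and accumulating them. This is a rough sketch of the construction described above, not the code behind the figures; the function name <code>brownian_path_2d</code> and the defaults for \(T\), \(N\) and the seed are assumptions.</p>

```python
import numpy as np

def brownian_path_2d(T=1.0, N=10_000, seed=0):
    """Approximate 2D Brownian motion on [0, T] with N steps of size sqrt(T/N)."""
    rng = np.random.default_rng(seed)
    step = np.sqrt(T / N)
    # One column of scores for S^x and one for S^y, each entry +step or -step
    R = rng.choice([-step, step], size=(N, 2))
    return np.cumsum(R, axis=0)

path = brownian_path_2d()
```

<p>Plotting <code>path[:, 0]</code> against <code>path[:, 1]</code> with any line-plot routine gives one such jittery trajectory; changing the seed gives a different one.</p>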
<p>To make it more artistic, I took an even bigger \(N\) and ran the simulation for quite a while and got quite beautiful jittery patterns. Random numbers being at the heart of the phenomenon, we’ll get different patterns in different runs. Here are two such simulation results:</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/brownian_full.png" />
</figure>
</center>
<p><strong>Want to learn more ?</strong></p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Brownian_motion">Wikipedia</a></li>
<li><a href="https://en.wikipedia.org/wiki/Geometric_Brownian_motion">Geometric BM</a></li>
<li><a href="https://en.wikipedia.org/wiki/It%C3%B4_calculus">Stochastic Calculus</a></li>
</ol>
<h1 id="-dynamical-systems--chaos-">[ Dynamical Systems & Chaos ]</h1>
<p>A Dynamical System is defined by a state space \(\mathbb{R}^n\) and a system dynamics (a function \(\mathbf{F}\)). A state \(\mathbf{x}\in\mathbb{R}^n\) is a specific (abstract) configuration of a system and the dynamics determines how the state “evolves” over time. The dynamics is often represented by a <a href="https://en.wikipedia.org/wiki/Differential_equation">differential equation</a> that specifies the change of state over time. So,</p>
<p>\[
\mathbf{F}(\mathbf{x}, t) \triangleq \frac{d\mathbf{x}}{dt}
\]</p>
<p>The state of the system at any point in time is determined by solving an Initial Value Problem (IVP) starting from an initial state \(\mathbf{x}_0\). We then compute successive states for \(t\gt 0\) by repeatedly taking small steps of size \(\Delta t\) (the Euler method)</p>
<p>\[
\mathbf{x}_{t+\Delta t} = \mathbf{x}_t + \Delta t \cdot \mathbf{F}(\mathbf{x}_t, t)
\]</p>
<p>Having a sufficiently small \(\Delta t\) ensures proper evolution of states.</p>
<p>Now this may seem quite trivial, at least to those who have studied Differential Equations. But there are specific cases of \(\mathbf{F}\) which lead to an evolution of states whose trajectory is surprisingly beautiful. For reasons that are beyond the scope of this article, such behaviour is called <strong>Chaos</strong>. There is a specific branch of dynamical systems (named “<a href="https://en.wikipedia.org/wiki/Chaos_theory">Chaos Theory</a>”) that deals with the characteristics of such chaotic systems. Below are three such chaotic systems with their trajectories visualized in 3D state space. To be specific, we take each system with an initial state (they are very sensitive to initial states), compute successive states with a small enough \(\Delta t\) and visualize them as a continuous path in 3D. The corresponding figures depict an animation of the evolution of states over time as well as the whole trajectory all at once.</p>
<h3 id="lorentz-system">Lorenz System</h3>
<p>\[
\frac{d\mathbf{x}}{dt} = \bigl[ \sigma (y-x), x(\rho - z) - y, xy - \beta z \bigr]^T
\]
\[
\text{with }\sigma = 10, \beta = \frac{8}{3}, \rho = 28 \text{, and } \mathbf{x}_0 = \bigl[ 1,1,1 \bigr]
\]</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/lorentz.gif" />
</figure>
</center>
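<p>For the curious, a trajectory like the one above can be computed with a few lines of step-by-step integration. This is a minimal sketch that uses plain Euler steps; the step size, the number of steps and the function names are assumptions, and the code behind the animation may well use a different integrator.</p>

```python
import numpy as np

def lorenz(x, sigma=10.0, beta=8.0 / 3.0, rho=28.0):
    """Right-hand side of the Lorenz system for a state x = [x, y, z]."""
    px, py, pz = x
    return np.array([sigma * (py - px), px * (rho - pz) - py, px * py - beta * pz])

def euler_trajectory(F, x0, dt=0.01, steps=5000):
    """Repeatedly apply x <- x + dt * F(x) and record every visited state."""
    xs = np.empty((steps + 1, len(x0)))
    xs[0] = x0
    for i in range(steps):
        xs[i + 1] = xs[i] + dt * F(xs[i])
    return xs

traj = euler_trajectory(lorenz, [1.0, 1.0, 1.0])
```

<p>Each row of <code>traj</code> is one state, so the three columns can be handed directly to a 3D line plot. Swapping <code>lorenz</code> for another right-hand side gives the other systems in this section.</p>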
<h3 id="rössler-system">Rössler System</h3>
<p>\[
\frac{d\mathbf{x}}{dt} = \bigl[ -(y+z), x+Ay, B+xz-Cz \bigr]^T
\]
\[
\text{with }A=0.2, B=0.2, C=5.7 \text{, and } \mathbf{x}_0 = \bigl[ 1,1,1 \bigr]
\]</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/roseller.gif" />
</figure>
</center>
<h3 id="halvorsen-system">Halvorsen System</h3>
<p>\[
\frac{d\mathbf{x}}{dt} = \bigl[ -ax-4y-4z-y^2, -ay-4z-4x-z^2, -az-4x-4y-x^2 \bigr]^T
\]
\[
\text{with }a=1.89 \text{, and } \mathbf{x}_0 = \bigl[ -1.48, -1.51, 2.04 \bigr]
\]</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/helvorsen.gif" />
</figure>
</center>
<p><strong>Want to learn more ?</strong></p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Differential_equation">Differential Equation</a>, <a href="https://en.wikipedia.org/wiki/Dynamical_system">Dynamical System</a></li>
<li><a href="https://en.wikipedia.org/wiki/Chaos_theory">Chaos Theory</a></li>
<li><a href="https://en.wikipedia.org/wiki/Attractor">Attractors</a>, <a href="http://www.stsci.edu/~lbradley/seminar/attractors.html">Strange Attractors</a></li>
<li><a href="https://en.wikipedia.org/wiki/Lorenz_system">Lorenz System</a>, <a href="https://en.wikipedia.org/wiki/R%C3%B6ssler_attractor">Rössler System</a>, <a href="https://www.dynamicmath.xyz/calculus/velfields/Halvorsen/">Halvorsen System</a></li>
</ol>
<h1 id="-complex-fourier-series-">[ Complex Fourier Series ]</h1>
<p>We all know about Fourier Series, right? But I am sure not all of you have seen this artistic side of it. Well, this isn’t really about Fourier series itself, but Fourier series helps in creating these patterns.</p>
<p>We know the following to be the “synthesis equation” of complex fourier series</p>
<p>\[
f(t) = \sum_{n=-\infty}^{+\infty} c_n e^{j \frac{2\pi n}{T} t} \in \mathbb{C}
\]</p>
<p>which represents the synthesis of a periodic function \(f(t)\) of period \(T\) from its frequency components \(\mathbf{C} \triangleq \left[ c_{-\infty}, \cdots, c_{-2}, c_{-1}, c_{0}, c_{+1}, c_{+2}, \cdots, c_{+\infty} \right]\). Often, as a practical measure, we crop the infinite summation to a limited range \([ -N, N ]\). Furthermore, let’s consider \(T=1\) without loss of generality. So, we see \(f(t)\) as a function parameterized by the frequency components \(\mathbf{C} \in \mathbb{C}^{2N+1}\)</p>
<p>\[
f(t, \mathbf{C}) \approx \sum_{n=-N}^{+N} c_n e^{j 2\pi n t} \in \mathbb{C}
\]</p>
<p>By doing this, we can make complex-valued functions by plugging in different \(\mathbf{C}\) and running \(t=0\rightarrow 1\). However, not every \(\mathbf{C}\) leads to something visually appealing. A particular feature of an object that appeals to human eyes is “Symmetry”. We are gonna exploit this here. A little refresher on Fourier series will make you realize that if the coefficients are real-valued, then \(f(1-t, \mathbf{C})\) is the complex conjugate of \(f(t, \mathbf{C})\), so the traced curve is symmetric about the real axis. And that’s all we need.</p>
<p>We pick a random \(\mathbf{C} \in \mathbb{R}^{2N+1}\) (see, they are real numbers now), run the clock \(t=0\rightarrow 1\) and trace the path travelled by the complex point \(f(t, \mathbf{C}) \in \mathbb{C}\) as time progresses. It creates patterns like the ones shown below</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/fourier_6.gif" />
</figure>
</center>
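<p>The truncated synthesis equation translates almost literally into code. Below is a small NumPy sketch; the function name <code>fourier_curve</code>, the choice of standard normal coefficients and the number of time samples are assumptions made for illustration.</p>

```python
import numpy as np

def fourier_curve(C, t):
    """Evaluate f(t, C) = sum_{n=-N}^{N} c_n exp(j 2 pi n t) for every t."""
    N = (len(C) - 1) // 2
    n = np.arange(-N, N + 1)
    return np.exp(2j * np.pi * np.outer(t, n)) @ C

rng = np.random.default_rng(0)
N = 6
C = rng.standard_normal(2 * N + 1)  # real-valued coefficients
t = np.linspace(0.0, 1.0, 1001)
f = fourier_curve(C, t)             # complex points tracing the pattern
```

<p>Plotting the real part of <code>f</code> against its imaginary part gives a closed curve that is mirrored across the real axis, which is exactly the symmetry that real coefficients buy us.</p>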
<p>There is one way to customize these: the value of \(N\). We know that \(c_n\) can be interpreted as the magnitude of the \(n^{th}\) frequency component. A large value of \(N\) implies the introduction of more high frequencies into the time-domain signal. This visually leads to \(f(t)\) having finer details (i.e., more curves and bends). Lowering the value of \(N\) clears out these fine details and the path becomes smoother and smoother. The image below shows decreasing values of \(N = 10 \rightarrow 6\) along columns. You can see the patterns losing details as we go right. And just like before, every run will create different patterns as they are solely controlled by random coefficients.</p>
<center>
<figure>
<img width="100%" style="padding-top: 20px;" src="/public/posts_res/15/fourier_10_6.png" />
</figure>
</center>
<p><strong>Want to learn more ?</strong></p>
<ol>
<li><a href="http://www.ee.ic.ac.uk/hp/staff/dmb/courses/E1Fourier/00300_ComplexFourier.pdf">Complex Fourier Series</a></li>
<li><a href="http://www.jezzamon.com/fourier/">Fourier patterns</a></li>
<li><a href="https://www.youtube.com/watch?v=ds0cmAV-Yek">Visualizing fourier series</a></li>
<li><a href="https://www.youtube.com/watch?v=r6sGWTCMz2k&t=725s">Amazing Video by 3Blue1Brown</a></li>
</ol>
<h1 id="-mandelbrot--julia-set-">[ Mandelbrot & Julia set ]</h1>
<p>These two sets are very important in the study of “Fractals”, objects with self-repeating patterns. Fractals are extremely popular concepts in certain branches of mathematics but they are mostly famous for their eye-catching visual appearance. If you ever come across an article about fractals, you are likely to see some of the most artistic patterns you’ve ever seen in the context of mathematics. Diving into the details of fractals and self-repeating patterns will open a vast world of “Mathematical Art”. In this article, however, I can only show you a tiny bit of it: two sets, namely the “Mandelbrot” and “Julia” sets. Let’s start with the <em>all important function</em></p>
<p>\[
f_C(z) = z^2 + C
\]</p>
<p>where \(C, f_C(z), z \in \mathbb{C}\) are complex numbers. This apparently simple complex-valued function is at the heart of these sets. All it does is square its argument and add the complex number that the function is parameterized with. Also, we denote by \(f^{(k)}_C(z)\) the \(k\) times repeated application of the function on a given \(z\), i.e.</p>
<p>\[
f^{(k)}_C(z) = f_C(\cdots f_C(f_C(z)))
\]</p>
<h3 id="mandelbrot-set">Mandelbrot Set</h3>
<p>With these basic definitions in hand, the <strong>Mandelbrot set</strong> (named after mathematician <a href="https://en.wikipedia.org/wiki/Benoit_Mandelbrot">Benoit Mandelbrot</a>) is the set of all \(C\in\mathbb{C}\) for which
\[
\lim_{k\rightarrow\infty} \vert f^{(k)}_C(0+0j) \vert < \infty
\]</p>
<p>Simply put, there is a set of values for \(C\) where if you repeatedly apply \(f_C\) on zero (i.e. \(0+0j\)), the output <em>does not diverge</em>. All such values of \(C\) make up the so-called “Mandelbrot Set”. Every value of \(C\) can be characterized by how many repeated applications of \(f_C(\cdot)\) it can tolerate before the absolute value goes higher than a predefined “<em>escape radius</em>”, let’s call it \(r\in\mathbb{R}\); for values of \(C\) that never escape, the count is capped at some maximum number of iterations in practice. This creates a loose sense of “strength” of a certain \(C\) that can be written as</p>
<p>\[
\mathbb{K}(C) = \max_{\vert f^{(k)}_C(0+0j) \vert \leq r} k
\]</p>
<p>It might all look strange, but if you treat the integer \(\mathbb{K}(C)\) as a grayscale intensity value for a grid of points on the 2D complex plane (i.e., an image), you will get a picture similar to this (Don’t get confused, the picture is indeed grayscale; I added Matplotlib’s <a href="https://matplotlib.org/tutorials/colors/colormaps.html"><code class="language-plaintext highlighter-rouge">plt.cm.twilight_shifted</code></a> colormap for enhancing the visual appeal). The grid is in the range \((-2.5+1.5j) \rightarrow (1.5-1.5j)\) and the escape radius is \(r=2.5\).</p>
<center>
<figure>
<img width="100%" style="padding-top: 20px;" src="/public/posts_res/15/mandelbrot_thumbnail.png" />
</figure>
</center>
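<p>The \(\mathbb{K}\) diagram is cheap to compute on a grid. The sketch below follows the definition above with the stated grid range and escape radius; the function name <code>escape_counts</code>, the image resolution and the iteration cap of 50 are assumptions.</p>

```python
import numpy as np

def escape_counts(C, r=2.5, max_iter=50):
    """For every C, count how many applications of f_C (starting at 0) stay within r."""
    Z = np.zeros_like(C)                  # f^(0)_C(0) = 0 for every grid point
    K = np.zeros(C.shape, dtype=int)
    alive = np.ones(C.shape, dtype=bool)  # points that have not escaped yet
    for _ in range(max_iter):
        Z[alive] = Z[alive] ** 2 + C[alive]
        alive &= np.abs(Z) <= r
        K[alive] += 1
    return K

re = np.linspace(-2.5, 1.5, 400)
im = np.linspace(1.5, -1.5, 300)
grid = re[None, :] + 1j * im[:, None]     # top-left is -2.5+1.5j, bottom-right is 1.5-1.5j
K = escape_counts(grid)
```

<p>Passing <code>K</code> to an image-display routine with the colormap of your choice yields the familiar shape. Points inside the set saturate at the iteration cap.</p>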
<p>What is so fascinating about this pattern is the fact that it is self-repeating. If you zoom into a small portion of the image, you would see the same pattern again.</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/mandelbrot_zoom.png" />
</figure>
</center>
<h3 id="julia-set">Julia Set</h3>
<p>Another very similar concept exists, called the “Julia Set”, which exhibits a visually similar \(\mathbb{K}\) diagram. Unlike the Mandelbrot set, we consider a \(z\in\mathbb{C}\) to be in the Julia set \(\mathbf{J}_C\) if</p>
<p>\[
\lim_{k\rightarrow\infty} \vert f^{(k)}_C(z) \vert < \infty
\]</p>
<p>Please note that this time the set is parameterized by \(C\) and we are interested in how the <em>argument of the function</em> behaves under repeated application of \(f_C(\cdot)\). Now things from here are similar. We define a similar “strength” for every \(z\in\mathbb{C}\) as</p>
<p>\[
\mathbb{K}_C(z) = \max_{\vert f^{(k)}_C(z) \vert \leq r} k
\]</p>
<p>Please note that as a result of this new definition, the \(\mathbb{K}\) diagram is parameterized by \(C\), i.e., we will get a different image for each \(C\). In principle, we can visualize such images for different \(C\) (they are indeed pretty cool), but let’s go a bit further than that. We will vary \(C\) along a trajectory, produce the \(\mathbb{K}\) diagram for each \(C\) and view them as an animation. This creates an amazing visual effect. Technically, I varied \(C\) along a circle of radius \(R = 0.75068\), i.e., \(C = R e^{j\theta}\) with \(\theta = 0\rightarrow 2\pi\)</p>
<center>
<figure>
<img width="80%" style="padding-top: 20px;" src="/public/posts_res/15/julia1.gif" />
</figure>
</center>
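<p>The animation frames can be sketched in the same way, now iterating from every grid point \(z\) with a fixed \(C\) per frame. The radius \(R\) is the one quoted above; the grid extent, the escape radius, the iteration cap, the number of frames and the name <code>julia_counts</code> are all assumptions.</p>

```python
import numpy as np

def julia_counts(Z0, C, r=2.5, max_iter=50):
    """For every starting point z, count how many applications of f_C stay within r."""
    Z = Z0.astype(complex)                # astype returns a copy, Z0 is left untouched
    K = np.zeros(Z.shape, dtype=int)
    alive = np.ones(Z.shape, dtype=bool)
    for _ in range(max_iter):
        Z[alive] = Z[alive] ** 2 + C
        alive &= np.abs(Z) <= r
        K[alive] += 1
    return K

axis = np.linspace(-1.5, 1.5, 200)
Z0 = axis[None, :] + 1j * axis[::-1, None]
R = 0.75068
thetas = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
frames = [julia_counts(Z0, R * np.exp(1j * theta)) for theta in thetas]
```

<p>Each entry of <code>frames</code> is one \(\mathbb{K}\) diagram; showing them in sequence (with many more values of \(\theta\)) gives the animation effect.</p>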
<p><strong>Want to know more ?</strong></p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Mandelbrot_set">Mandelbrot set</a></li>
<li><a href="https://en.wikipedia.org/wiki/Julia_set">Julia set</a></li>
<li><a href="https://en.wikipedia.org/wiki/Fractal">Fractals</a></li>
</ol>
<hr />
<p>Alright then! That is pretty much it. Due to constraints of time, space and scope, it’s not possible to explain everything in detail in one article. There are plenty of resources available online (I have already provided some links) which might be useful in case you are interested. Feel free to explore the details of whatever you learnt today. If you would like to reproduce the diagrams and images, please use the code here <a href="https://github.com/dasayan05/patterns-of-randomness">https://github.com/dasayan05/patterns-of-randomness</a> (sorry, the code is a bit messy; you will have to figure it out).</p>
Ayan Das
Neural Ordinary Differential Equation (Neural ODE)
2020-03-20T00:00:00+00:00
https://ayandas.me/blog-tut/2020/03/20/neural-ode
<p>Neural Ordinary Differential Equation (Neural ODE) is a very recent and first-of-its-kind idea that emerged at <a href="https://nips.cc/Conferences/2018">NeurIPS 2018</a>. The authors, four researchers from the University of Toronto, reformulated the parameterization of deep networks with differential equations, particularly first-order ODEs. The idea evolved from the fact that ResNet, a very popular deep network, possesses quite a bit of similarity with ODEs in its core structure. <a href="https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations">The paper</a> also offered an efficient algorithm to train such ODE structures as a part of a larger computation graph. The architecture is flexible and memory-efficient for learning. Since it is a bit non-trivial from a deep network standpoint, I decided to dedicate this article to explaining it in detail, making it easier for everyone to understand. Understanding the whole algorithm requires a fair bit of rigorous mathematics, especially ODEs and their algebraic understanding, which I will try to cover at the beginning of the article. I also provide a (simplified) PyTorch implementation that is easy to follow.</p>
<h2 id="ordinary-differential-equations-ode">Ordinary Differential Equations (ODE)</h2>
<p><br />
\(\mathbf{Definition}\): Let’s put Neural ODEs aside for a moment and take a refresher on ODEs themselves. Because of their unpopularity in the deep learning community, chances are that you haven’t looked at them since high school. We will focus our discussion on first-order ODEs, which take the generic form</p>
<p>\[
\frac{dx}{dt} = f(x, t)
\]</p>
<p>where \(\displaystyle{ x,t,\frac{dx}{dt} \in \mathbb{R} }\). Please recall that ODEs are differential equations that involve only one independent variable, which in our case is \(t\). Geometrically, such an ODE represents a <em>family of curves/functions</em> \(x(t)\), also called the <em>solutions</em> of the ODE. The function \(f(x, t)\), often called the <em>dynamics of the system</em>, denotes a common characteristic of all the solutions. Specifically, it gives the first derivative (slope) of every solution at the point \((x, t)\). An example would make things clear: let’s say the dynamics of an ODE is \(\displaystyle{ f(x, t) = 2xt }\). Separating variables, \(\displaystyle{ \frac{dx}{x} = 2t\,dt }\), and integrating both sides shows that the family of solutions is \(\displaystyle{ x(t) = k\cdot e^{t^2} }\) for any value of \(k\).</p>
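<p>If you want to convince yourself without doing the calculus, a quick numerical check works too. The snippet below is only a sketch (plain Python, the function names are mine): it compares a finite-difference slope of \(x(t) = k\cdot e^{t^2}\) against the dynamics \(f(x, t) = 2xt\) for a few arbitrary values of \(k\) and \(t\).</p>

```python
import math

def dynamics(x, t):
    """Dynamics f(x, t) = 2xt from the example above."""
    return 2.0 * x * t

def solution(t, k):
    """Claimed family of solutions x(t) = k * exp(t^2)."""
    return k * math.exp(t * t)

def slope_mismatch(k, t, eps=1e-6):
    """Gap between a central-difference slope of x(t) and f(x(t), t)."""
    numeric_slope = (solution(t + eps, k) - solution(t - eps, k)) / (2.0 * eps)
    return abs(numeric_slope - dynamics(solution(t, k), t))

# Every member of the family (any k) should satisfy the ODE at every t.
worst = max(slope_mismatch(k, t)
            for k in (-2.0, 0.5, 3.0)
            for t in (-1.0, 0.0, 0.7, 1.5))
print("largest slope mismatch:", worst)
```

<p>The mismatch should only be as large as the finite-difference error, whichever \(k\) you pick.</p>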
<p><br />
\(\mathbf{System\ of\ ODEs}\): Just like any other algorithm in Deep Learning, we can (and we have to) go beyond the \(\mathbb{R}\) space and establish similar ODEs in higher dimensions. A <em>system of ODEs</em> with dependent variables \(x_1, x_2, \cdots x_d \in \mathbb{R}\) and independent variable \(t \in \mathbb{R}\) can be written as</p>
<p>\[
\frac{dx_1}{dt} = f_1(x_1,x_2,\cdots,x_d,t); \frac{dx_2}{dt} = f_2(x_1,x_2,\cdots,x_d,t); \cdots
\]</p>
<p>With a vectorized notation of \(\mathbf{x} \triangleq [ x_1, x_2, \cdots, x_d ]^T \in \mathbb{R}^d\) and \(\mathbf{f}(\mathbf{x}, t) \triangleq [ f_1, f_2, \cdots, f_d ]^T \in \mathbb{R}^d\), we can write</p>
<p>\[
\frac{d\mathbf{x}}{dt} = \mathbf{f}(\mathbf{x}, t)
\]</p>
<p>The dynamics \(\mathbf{f}(\mathbf{x}, t)\) can be seen as a <strong>Vector Field</strong> where, given any \(\mathbf{x} \in \mathbb{R}^d\), \(\mathbf{f} \in \mathbb{R}^d\) denotes its derivative with respect to \(t\) (loosely called the “gradient” in the rest of this article). The independent variable \(t\) can often be regarded as <strong>time</strong>. For example, Fig.1 shows the \(\mathbb{R}^2\) space and a dynamics \(\mathbf{f}(\mathbf{x}, t) = \tanh(W\mathbf{x} + b)\) defined on it. Please note that it is a time-invariant system, i.e., the dynamics is independent of \(t\). A system with time-dependent dynamics would have a different gradient at a given \(\mathbf{x}\) depending on the time at which you visit it.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/14/vector_field.png" />
<figcaption>Fig.1: A vector field in 2D space denoting the dynamics of an ODE</figcaption>
</figure>
</center>
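<p>To make the “vector field” picture concrete, here is a small sketch that evaluates \(\tanh(W\mathbf{x} + b)\) on a coarse grid in \(\mathbb{R}^2\). Please treat the numbers as placeholders: the \(W\) and \(b\) below are hypothetical (a simple rotation-like matrix I picked), not the ones used to draw Fig.1.</p>

```python
import math

# Hypothetical 2x2 weight matrix and bias; the values behind Fig.1 are not given.
W = [[0.0, -1.0],
     [1.0,  0.0]]
b = [0.0, 0.0]

def dynamics(x, t):
    """f(x, t) = tanh(Wx + b); 't' is accepted but ignored (time-invariant)."""
    return [math.tanh(sum(W[i][j] * x[j] for j in range(2)) + b[i])
            for i in range(2)]

# Sample the field on a coarse 3x3 grid: one arrow (vector) per grid point.
points = [[x1, x2] for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 0.0, 1.0)]
field = [dynamics(p, t=0.0) for p in points]

for p, v in zip(points, field):
    print(p, "->", [round(c, 3) for c in v])
```

<p>Because the system is time-invariant, calling <code class="language-plaintext highlighter-rouge">dynamics</code> with any other value of <code class="language-plaintext highlighter-rouge">t</code> returns exactly the same arrows.</p>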
<p><br />
\(\mathbf{Initial\ Value\ Problem}\): Although I showed the solution of an extremely simple system with dynamics \(f(x, t) = 2xt\), most practical systems are far from it. Systems with higher dimensions and complicated dynamics are very difficult to solve analytically. This is when we resort to <em>numerical methods</em>. A specific way of solving any ODE numerically is to solve an <strong>Initial Value Problem</strong> where, given the system (dynamics) and an <em>initial condition</em>, one can iteratively “trace” the solution. I emphasized the term <em>trace</em> because that’s what it is. Think of it as dropping a small particle on the vector field at some point and letting it <em>flow according to the gradients</em> at every point it visits.</p>
<center>
<figure>
<img width="60%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/14/trace.png" />
<figcaption>Fig.2: Solving for two solutions with two different initial conditions</figcaption>
</figure>
</center>
<p>Fig.2 shows that two different initial conditions (red dots) lead to two different curves/solutions (a small segment of each curve is shown). These curves/solutions are from the family of curves represented by the system whose dynamics is shown with black arrows. Numerical methods differ in how accurately they do the “tracing” and how much error they tolerate, ranging from naive schemes to modern numerical solvers. We will focus on one of the simplest yet popular methods, known as <strong>Forward Euler’s method</strong>, for the sake of simplicity. The algorithm simply does the following: it starts from a given initial state \(\mathbf{x}_0\) at \(t=0\), literally goes in the direction of the gradient at that point, i.e. \(\mathbf{f}(\mathbf{x}=\mathbf{x}_0, t=0)\), and keeps doing so till \(t=N\) using a small step size of \(\Delta t \triangleq t_{i+1} - t_i\). The following iterative update rule summarizes everything (the subscript \(t+1\) denotes the state one step, i.e. \(\Delta t\), after \(t\))</p>
<p>\[
\mathbf{x}_{t+1} = \mathbf{x}_t + \Delta t \cdot \mathbf{f}(\mathbf{x}_t, t)
\]</p>
<p>In case you haven’t noticed, the formula can be obtained trivially from the discretized version of analytic derivative</p>
<p>\[
\mathbf{f}(\mathbf{x}, t) = \frac{d\mathbf{x}}{dt} \approx \frac{\mathbf{x}_{t+1} - \mathbf{x}_t}{\Delta t}
\]</p>
<p>If you look at Fig.2 closely enough, you would see that the red curves are made up of discrete segments, which is a result of solving an initial value problem using Forward Euler’s method.</p>
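<p>The update rule above fits in a few lines of code. The following is a minimal sketch in plain Python (function and variable names are my own) that traces the solution of \(f(x, t) = 2xt\) from the initial condition \(x(0) = 1\), for which we know the exact answer \(x(1) = e\), using a coarse and a fine step size.</p>

```python
import math

def forward_euler(dynamics, x0, t_start, t_end, n_steps):
    """Trace the solution from (t_start, x0) to t_end with n_steps Euler updates."""
    dt = (t_end - t_start) / n_steps
    x, t = x0, t_start
    trajectory = [x]
    for _ in range(n_steps):
        x = x + dt * dynamics(x, t)   # x_{t+1} = x_t + dt * f(x_t, t)
        t = t + dt
        trajectory.append(x)
    return trajectory

def dynamics(x, t):
    return 2.0 * x * t

exact = math.exp(1.0)   # x(1) = k * exp(1^2) with k = 1
coarse = forward_euler(dynamics, 1.0, 0.0, 1.0, n_steps=10)[-1]
fine = forward_euler(dynamics, 1.0, 0.0, 1.0, n_steps=1000)[-1]

print("error with   10 steps:", abs(coarse - exact))
print("error with 1000 steps:", abs(fine - exact))
```

<p>A smaller \(\Delta t\) gives a more faithful trace at the cost of more evaluations of the dynamics; that trade-off is exactly what the more sophisticated solvers try to manage.</p>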
<h2 id="motivation-of-neural-ode">Motivation of Neural ODE</h2>
<p>Let’s look at the core structure of <a href="https://arxiv.org/abs/1512.03385">ResNet</a>, an extremely popular deep network that almost revolutionized deep network architecture. The most distinctive structural component of ResNet is its residual blocks, which compute “increments” on top of the previous layer’s activation instead of computing activations directly. If the activation of layer \(t\) is \(\mathbf{h}_t\) then</p>
<p>\[ \tag{1}
\mathbf{h}_{t+1} = \mathbf{h}_t + \mathbf{F}(\mathbf{h}_t; \theta_t)
\]</p>
<p>where \(\mathbf{F}(\cdot)\) is the residual function (increment on top of the last layer). I am pretty sure the reader can feel where it’s going. Yes, the residual architecture resembles the forward Euler’s method on an ODE with dynamics \(\mathbf{F}(\cdot)\). Having \(N\) such residual layers is similar to executing \(N\) steps of forward Euler’s method with step size \(\Delta t = 1\). The idea of Neural ODE is to “<em>parameterize the dynamics of this ODE explicitly rather than parameterizing every layer</em>”. So we can have</p>
<p>\[
\frac{d\mathbf{h}_t}{dt} = \mathbf{F}(\mathbf{h}_t, t; \theta)
\]</p>
<p>and \(N\) successive layers can be realized by \(N\)-step forward Euler evaluations. As you can guess, we can choose \(N\) as per our requirement and in the limiting case we can think of it as an infinite-layer (\(N \rightarrow \infty\)) network. You must understand, though, that such a parameterization cannot provide infinite capacity, as the parameters are shared across steps and finite in number. Fig.3 below depicts the resemblance of ResNet with forward Euler iteration.</p>
<p><br /></p>
<center>
<figure>
<img width="75%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/14/ode_core_idea.png" />
<figcaption>Fig.3: Resemblance of ResNet and Forward Euler's method.</figcaption>
</figure>
</center>
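<p>The resemblance can be checked directly. In the sketch below, the residual function is a toy stand-in of my own (an element-wise \(\tanh\) with a single scalar weight, no dependence on \(t\)), not an actual ResNet block. When every residual block shares the same parameter, a stack of \(N\) blocks and \(N\) forward Euler steps with \(\Delta t = 1\) compute the very same thing.</p>

```python
import math

def F(h, theta):
    """Hypothetical residual function: element-wise tanh with a scalar weight."""
    return [math.tanh(theta * v) for v in h]

def resnet_forward(h0, thetas):
    """N residual blocks, block t having its own parameter theta_t (Eq.1)."""
    h = list(h0)
    for theta_t in thetas:
        h = [a + b for a, b in zip(h, F(h, theta_t))]
    return h

def euler_forward(h0, theta, n_steps, dt=1.0):
    """N forward Euler steps on dh/dt = F(h; theta) with one shared theta."""
    h = list(h0)
    for _ in range(n_steps):
        h = [a + dt * b for a, b in zip(h, F(h, theta))]
    return h

h0 = [0.5, -1.0]
res_out = resnet_forward(h0, thetas=[0.3, 0.3, 0.3])       # 3 tied layers
ode_out = euler_forward(h0, theta=0.3, n_steps=3, dt=1.0)  # 3 Euler steps

print(res_out)
print(ode_out)
```

<p>Once the layers are written this way, nothing stops us from using a smaller <code class="language-plaintext highlighter-rouge">dt</code> and more steps, which is where the “continuous depth” view comes from.</p>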
<h2 id="parameterization-and-forward-pass">Parameterization and Forward pass</h2>
<p>We already went over this in the last section, but let me put it more formally one more time. An “ODE Layer” is basically characterized by its dynamics function \(\mathbf{F}(\mathbf{h}_t, t; \theta)\), which can be realized by a (deep) neural network. This network takes as input the “current state” \(\mathbf{h}_t\) (basically the activation) and time \(t\), and produces the “direction” (i.e., \(\displaystyle{ \mathbf{F}(\mathbf{h}_t, t; \theta) = \frac{d\mathbf{h}_t}{dt} }\)) in which the state should go next. A full forward pass through this layer is essentially executing an \(N\)-step Forward Euler on the ODE with an “initial state” (aka “input”) \(\mathbf{h}_0\). \(N\) is a hyperparameter we choose and can be compared to “depth” in a standard deep neural network. Following the original paper’s convention (with a bit of python-style syntax), we write the forward pass as</p>
<p>\[ \tag{2}
\mathbf{h}_N = \mathrm{ODESolve}(start\_state=\mathbf{h}_0, dynamics=\mathbf{F}, t\_start=0, t\_end=N; \theta)
\]</p>
<p>where the “ODESolve” is <em>any</em> iterative ODE solver algorithm and not just Forward Euler. By the end of this article you’ll understand why the specific machinery of Euler’s method is not essential.</p>
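<p>To see what “any solver” means in code, here is a rough sketch of an <code class="language-plaintext highlighter-rouge">ode_solve</code> function that mirrors the signature of Eq.2 and takes the stepping rule as an argument. All names here are my own assumptions, the state is a plain scalar, and \(\theta\) is assumed to live inside the <code class="language-plaintext highlighter-rouge">dynamics</code> callable. I plug in Forward Euler and the (also textbook) explicit midpoint rule.</p>

```python
import math

def euler_step(dynamics, h, t, dt):
    """One Forward Euler step."""
    return h + dt * dynamics(h, t)

def midpoint_step(dynamics, h, t, dt):
    """One explicit midpoint step (second-order accurate)."""
    h_mid = h + 0.5 * dt * dynamics(h, t)
    return h + dt * dynamics(h_mid, t + 0.5 * dt)

def ode_solve(start_state, dynamics, t_start, t_end, n_steps=100, step=euler_step):
    """Integrate from t_start to t_end; works backward too if t_end < t_start."""
    dt = (t_end - t_start) / n_steps   # negative when running backward in time
    h, t = start_state, t_start
    for _ in range(n_steps):
        h = step(dynamics, h, t, dt)
        t += dt
    return h

def decay(h, t):
    """Toy dynamics dh/dt = -h, whose exact solution is h(t) = h(0) * exp(-t)."""
    return -h

exact = math.exp(-1.0)
h_euler = ode_solve(1.0, decay, 0.0, 1.0, step=euler_step)
h_midpoint = ode_solve(1.0, decay, 0.0, 1.0, step=midpoint_step)

print("Euler error:   ", abs(h_euler - exact))
print("midpoint error:", abs(h_midpoint - exact))
```

<p>The caller never needs to know which stepping rule is used inside; it only sees a start state going in and an end state coming out.</p>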
<p>Coming to the backward pass, a naive solution you might be tempted to offer is to back-propagate through the operations of the solver. I mean, look at the iterative update equation Eq.1 of an ODE Solver (for now just Forward Euler): everything is indeed differentiable! But then it is no better than ResNet, at least from a memory cost point of view. Note that backpropagating through a ResNet (and so with any standard deep network) requires storing the intermediate activations to be used later in the backward pass. This storage is responsible for the memory complexity of backpropagation being linear in the number of layers (i.e., \(\mathcal{O}(L)\)). This is where the authors proposed a brilliant idea to make it \(\mathcal{O}(1)\) by not storing the intermediate states.</p>
<p><br /></p>
<center>
<figure>
<img width="45%" style="padding-top: 20px; border: 2px solid black;" src="/public/posts_res/14/ode_block.png" />
<figcaption>Fig.3: Block diagram of ODE Layer.</figcaption>
</figure>
</center>
<h2 id="adjoint-method-and-the-backward-pass">“Adjoint method” and the backward pass</h2>
<p>Just like any other computational graph associated with a deep network, we get a gradient signal coming from the loss. Let’s denote the incoming gradient at the end of the ODE layer as \(\displaystyle{ \frac{d \mathcal{L}}{d \mathbf{h}_N} }\), where \(\mathcal{L}\) is a scalar loss. All we have to do is use this incoming gradient to compute \(\displaystyle{ \frac{d \mathcal{L}}{d\theta} }\) and perform an SGD (or any variant) step. A bunch of parameter updates in the right direction would cause the dynamics to change and consequently the whole trajectory (i.e., trace) except the input. Fig.4 shows a graphical representation of the same. Please note that for simplicity, the loss has been calculated using \(\mathbf{h}_N\) itself. To be specific, the loss (green dotted line) is the Euclidean distance between \(\mathbf{h}_N\) and its (available) ground truth \(\mathbf{\widehat{h}}_N\).</p>
<p><br /></p>
<center>
<figure>
<img width="75%" style="padding-top: 20px; border: 0px solid black;" src="/public/posts_res/14/optim_goal.png" />
<figcaption>Fig.4: Effect of updating parameters of the dynamics.</figcaption>
</figure>
</center>
<p>In order to accomplish our goal of computing the parameter gradients, we define a quantity \(\mathbf{a}_t\), called the “Adjoint state”</p>
<p>\[
\mathbf{a}_t \triangleq \frac{d\mathcal{L}}{d\mathbf{h}_t}
\]</p>
<p>Compared to a standard neural network, this is basically the gradient of the loss \(\mathcal{L}\) w.r.t. all intermediate activations (states of the ODE). It is indeed a generalization of a quantity I mentioned earlier, i.e., the incoming gradient into the layer \(\displaystyle{ \frac{d\mathcal{L}}{d\mathbf{h}_N} = \mathbf{a}_N }\). Although we cannot compute this quantity independently for every timestep, a bit of rigorous mathematics (refer to appendix B.1 of the <a href="https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations">original paper</a>) can show that the adjoint state follows a differential equation with the dynamics function</p>
<p>\[ \tag{3}
\mathbf{F}_a(\mathbf{a}_t, \mathbf{h}_t, t, \theta) \triangleq \frac{d\mathbf{a}_t}{dt} = -\mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \mathbf{h}_t}
\]</p>
<p>and that’s good news! We now have the dynamics that \(\mathbf{a}_t\) follows and an initial value \(\mathbf{a}_N\) (the value at the extreme end \(t = N\)). That means we can run an ODE solver backward in time from \(t = N \rightarrow 0\) and calculate all \(\mathbf{a}_t\) in succession, like this</p>
<p>\[
\mathbf{a}_{N-1}, \cdots, \mathbf{a}_0 = \mathrm{ODESolve}(\mathbf{a}_N, \mathbf{F}_a, N, 0; \theta)
\]</p>
<p>Please look at Eq.2 for the signature of the “ODESolve” function. This time we also produced all intermediate states of the solver as output. An intuitive visualization of the adjoint state and its dynamics is given in Fig.5 below.</p>
<center>
<figure>
<img width="75%" style="padding-top: 20px; border: 0px solid black;" src="/public/posts_res/14/adj_viz.png" />
<figcaption>Fig.5: An intuitive visualization of the adjoint state and its dynamics.</figcaption>
</figure>
</center>
<p>The quantity on the right hand side of Eq.3 is a vector-Jacobian product where \(\displaystyle{ \frac{\partial \mathbf{F}}{\partial \mathbf{h}_t} }\) is the Jacobian matrix. Given the functional form of \(\mathbf{F}\), this can be readily computed using the current state \(\mathbf{h}_t\) and the latest parameter value. But wait a minute! I said before that we are not storing the intermediate \(\mathbf{h}_t\) values. Where do we get them now? The answer is that we can compute them again. Please remember that we still have \(\mathbf{F}\) with us along with an extreme value \(\mathbf{h}_N\) (the output of the forward pass). We can run another ODE backwards in time, from \(t=N\rightarrow 0\). Essentially we can fuse the two ODEs together</p>
<p>\[
[ \mathbf{a}_{N-1}; \mathbf{h}_{N-1} ], \cdots, [ \mathbf{a}_0; \mathbf{h}_0 ] = \mathrm{ODESolve}([ \mathbf{a}_N; \mathbf{h}_N ], [ \mathbf{F}_a; \mathbf{F} ], N, 0; \theta)
\]</p>
<p>It’s basically executing two update equations for two ODEs in one “for loop” traversing from \(N\rightarrow 0\). The intermediate values of \(\mathbf{h}_t\) wouldn’t be exactly the same as what we got in the forward pass (because no numerical solver is of infinite precision), but they are indeed good approximations.</p>
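<p>How good are these approximations? A small experiment gives a feel for it. The sketch below (plain Python, scalar state, and a made-up dynamics function of my own) runs Forward Euler from \(t = 0\) to \(t = 1\), then runs it backward from the end state, and measures how far the recovered \(\mathbf{h}_0\) is from the true one.</p>

```python
import math

def euler_solve(dynamics, h, t_start, t_end, n_steps):
    """Forward Euler from t_start to t_end (dt is negative if t_end < t_start)."""
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        h = h + dt * dynamics(h, t)
        t += dt
    return h

def dynamics(h, t):
    """Hypothetical time-dependent dynamics, chosen only for illustration."""
    return math.tanh(h) + 0.5 * t

def roundtrip_error(n_steps, h0=0.3, t_end=1.0):
    """Solve forward, then backward from the end state, and compare with h0."""
    h_end = euler_solve(dynamics, h0, 0.0, t_end, n_steps)
    h0_recovered = euler_solve(dynamics, h_end, t_end, 0.0, n_steps)
    return abs(h0_recovered - h0)

print("round-trip error,   10 steps:", roundtrip_error(10))
print("round-trip error, 1000 steps:", roundtrip_error(1000))
```

<p>The recovered state is not exact, but the gap shrinks as the step size gets smaller, which is the price we pay for not storing the trajectory.</p>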
<p>Okay, what about the parameters of the model (dynamics) ? How do we get to our ultimate goal, \(\displaystyle{ \frac{d\mathcal{L}}{d\theta} }\) ?</p>
<p>Let’s define another quantity very similar to the adjoint state, i.e., the parameter gradient of the loss at every step of the ODE solver</p>
<p>\[
\mathbf{a}^{\theta}_t \triangleq \frac{d\mathcal{L}}{d\mathbf{\theta}_t}
\]</p>
<p>A point to note here is that \(\theta_t = \theta\), as the parameters do not change during a trajectory. Instead, these quantities signify <em>local influences</em> of the parameters at each step of computation. This is very similar to a roll-out of an RNN in time, where parameters are shared across time steps. With a proof very similar to that of the adjoint state, it can be shown that</p>
<p>\[
\mathbf{a}^{\theta}_t = \mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta}
\]</p>
<p>Just like with shared-weight RNNs, we can compute the full parameter gradient as a combination of the local influences</p>
<p>\[\tag{4}
\frac{d\mathcal{L}}{d\theta} = \int_{0}^{N} \mathbf{a}^{\theta}_t dt = \int_{0}^{N} \mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta} dt
\]</p>
<p>The quantity \(\displaystyle{ \mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta} }\) is another vector-Jacobian product and can be evaluated using the values of \(\mathbf{h}_t\), \(\mathbf{a}_t\) and the latest parameter \(\theta\). So do we need another pass over the whole trajectory, as Eq.4 consists of an integral? <strong>Fortunately, NO</strong>. Let me bring your attention to the fact that whatever we need to compute this vector-Jacobian product is already being computed in the fused ODE we saw before. Furthermore, we can rewrite Eq.4 as</p>
<p>\[\tag{5}
\frac{d\mathcal{L}}{d\theta} = \mathbf{0} - \int_{N}^{0} \mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta} dt
\]</p>
<p>I hope you are seeing what I am seeing. This is equivalent to solving yet another ODE (backwards in time, again!) with dynamics \(\displaystyle{ \mathbf{F}_{\theta}(\mathbf{a}_t, \mathbf{h}_t, \theta, t) \triangleq -\mathbf{a}_t \frac{\partial \mathbf{F}}{\partial \theta} }\) and initial state \(\mathbf{a}^{\theta}_N = \mathbf{0}\). The end state \(\mathbf{a}^{\theta}_0\) of this ODE completes the whole integral in Eq.5 and therefore is equal to \(\displaystyle{\frac{d\mathcal{L}}{d\theta}}\). Just like last time, we can fuse this ODE with the last two combined</p>
<p>\[
[ \mathbf{a}_{N-1}; \mathbf{h}_{N-1}; \_ ], \cdots, [ \mathbf{a}_0; \mathbf{h}_0; \mathbf{a}^{\theta}_0 ] = \mathrm{ODESolve}([ \mathbf{a}_N; \mathbf{h}_N; \mathbf{0} ], [ \mathbf{F}_a; \mathbf{F}; \mathbf{F}_{\theta} ], N, 0; \theta)
\]</p>
<center>
<figure>
<img width="85%" style="padding-top: 20px; border: 0px solid black;" src="/public/posts_res/14/full_diagram.png" />
<figcaption>Fig.6: A pictorial representation of the forward and backward pass with all its ODEs.</figcaption>
</figure>
</center>
<p>Take some time to digest the final 3-way ODE and make sure you get it, because that is pretty much it. Once we get the parameter gradient, we can continue with the normal stochastic gradient update rule (SGD or family). Additionally, you may want to pass \(\mathbf{a}_0\) to the computation graph that comes before our ODE layer. A representative diagram containing a clear picture of all the ODEs and their interdependencies is shown above in Fig.6.</p>
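<p>Before going to PyTorch, it may help to see the whole procedure on a problem small enough to check by hand. The sketch below is a toy of my own and not the paper’s code: a scalar ODE with dynamics \(F(h; \theta) = \theta h\) and loss \(\mathcal{L} = \frac{1}{2}(h_N - \widehat{h}_N)^2\), for which \(\partial F / \partial h = \theta\) and \(\partial F / \partial \theta = h\) can be written down directly. Since \(h_N = h_0 e^{\theta N}\) in closed form, the exact gradient is available for comparison.</p>

```python
import math

def forward(h0, theta, t_end, n_steps):
    """Forward pass: Euler on dh/dt = F(h; theta) = theta * h. Only h_N is kept."""
    dt = t_end / n_steps
    h = h0
    for _ in range(n_steps):
        h = h + dt * theta * h
    return h

def backward(h_end, a_end, theta, t_end, n_steps):
    """Fused 3-way ODE [a_t; h_t; a^theta_t], solved from t = t_end down to 0."""
    dt = -t_end / n_steps              # negative step: time runs backward
    a, h, grad_theta = a_end, h_end, 0.0
    for _ in range(n_steps):
        F_a = -a * theta               # Eq.3 with dF/dh = theta
        F = theta * h                  # re-computes the state trajectory
        F_theta = -a * h               # dynamics of Eq.5 with dF/dtheta = h
        a, h, grad_theta = a + dt * F_a, h + dt * F, grad_theta + dt * F_theta
    return a, h, grad_theta

h0, theta, target, t_end, n_steps = 1.0, 0.5, 2.0, 1.0, 2000

h_end = forward(h0, theta, t_end, n_steps)
a_end = h_end - target                 # dL/dh_N for L = 0.5 * (h_N - target)^2
a0, h0_recovered, dL_dtheta = backward(h_end, a_end, theta, t_end, n_steps)

# Closed form for comparison: h_N = h0 * exp(theta * T)
exact_h_end = h0 * math.exp(theta * t_end)
exact_grad = (exact_h_end - target) * h0 * t_end * exact_h_end

print("adjoint gradient:", dL_dtheta)
print("exact gradient:  ", exact_grad)
```

<p>Note that <code class="language-plaintext highlighter-rouge">backward</code> never sees any intermediate state from the forward pass; it only receives \(h_N\) and \(\mathbf{a}_N\), and returns \(\mathbf{a}_0\) and the parameter gradient.</p>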
<h2 id="pytorch-implementation">PyTorch Implementation</h2>
<p>Implementing this algorithm is a bit tricky due to its non-conventional approach to gradient computation, especially if you are using a library like PyTorch, which adheres to a specific model of computation. I am providing a very simplified implementation of the ODE Layer as a PyTorch <code class="language-plaintext highlighter-rouge">nn.Module</code>. Because this post has already become quite long and stuffed with maths and new concepts, I am leaving it here. I am putting the core part of the code (well commented) here just for reference, but a complete application can be found on this <a href="https://github.com/dasayan05/neuralode-pytorch">GitHub repo of mine</a>. My implementation is quite simplified, as I have hard-coded the “Forward Euler” method as the only choice of ODE solver. Feel free to contribute to my repo.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#############################################################
# Full code at https://github.com/dasayan05/neuralode-pytorch
#############################################################
</span>
<span class="kn">import</span> <span class="nn">torch</span>


<span class="k">class</span> <span class="nc">ODELayerFunc</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">Function</span><span class="p">):</span>
    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="n">z0</span><span class="p">,</span> <span class="n">t_range_forward</span><span class="p">,</span> <span class="n">dynamics</span><span class="p">,</span> <span class="o">*</span><span class="n">theta</span><span class="p">):</span>
        <span class="n">delta_t</span> <span class="o">=</span> <span class="n">t_range_forward</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">t_range_forward</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># get the step size
</span>
        <span class="n">zt</span> <span class="o">=</span> <span class="n">z0</span><span class="p">.</span><span class="n">clone</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">tf</span> <span class="ow">in</span> <span class="n">t_range_forward</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span> <span class="c1"># Forward Euler's method, one step per interval (so we stop exactly at t_end)
</span>            <span class="n">f</span> <span class="o">=</span> <span class="n">dynamics</span><span class="p">(</span><span class="n">zt</span><span class="p">,</span> <span class="n">tf</span><span class="p">)</span>
            <span class="n">zt</span> <span class="o">=</span> <span class="n">zt</span> <span class="o">+</span> <span class="n">delta_t</span> <span class="o">*</span> <span class="n">f</span> <span class="c1"># update
</span>
        <span class="n">context</span><span class="p">.</span><span class="n">save_for_backward</span><span class="p">(</span><span class="n">zt</span><span class="p">,</span> <span class="n">t_range_forward</span><span class="p">,</span> <span class="n">delta_t</span><span class="p">,</span> <span class="o">*</span><span class="n">theta</span><span class="p">)</span>
        <span class="n">context</span><span class="p">.</span><span class="n">dynamics</span> <span class="o">=</span> <span class="n">dynamics</span> <span class="c1"># 'save_for_backward()' won't take it, so..
</span>
        <span class="k">return</span> <span class="n">zt</span> <span class="c1"># final evaluation of 'zt', i.e., zT
</span>
    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="n">adj_end</span><span class="p">):</span>
        <span class="c1"># Unpack the stuff saved in forward pass
</span>        <span class="n">zT</span><span class="p">,</span> <span class="n">t_range_forward</span><span class="p">,</span> <span class="n">delta_t</span><span class="p">,</span> <span class="o">*</span><span class="n">theta</span> <span class="o">=</span> <span class="n">context</span><span class="p">.</span><span class="n">saved_tensors</span>
        <span class="n">dynamics</span> <span class="o">=</span> <span class="n">context</span><span class="p">.</span><span class="n">dynamics</span>
        <span class="n">t_range_backward</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">flip</span><span class="p">(</span><span class="n">t_range_forward</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">,])</span> <span class="c1"># Time runs backward
</span>
        <span class="n">zt</span> <span class="o">=</span> <span class="n">zT</span><span class="p">.</span><span class="n">clone</span><span class="p">().</span><span class="n">requires_grad_</span><span class="p">()</span>
        <span class="n">adjoint</span> <span class="o">=</span> <span class="n">adj_end</span><span class="p">.</span><span class="n">clone</span><span class="p">()</span>
        <span class="n">dLdp</span> <span class="o">=</span> <span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">theta</span><span class="p">]</span> <span class="c1"># Parameter grads (an accumulator)
</span>
        <span class="k">for</span> <span class="n">tb</span> <span class="ow">in</span> <span class="n">t_range_backward</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span> <span class="c1"># one step per interval, mirroring the forward pass
</span>            <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">set_grad_enabled</span><span class="p">(</span><span class="bp">True</span><span class="p">):</span>
                <span class="c1"># above 'set_grad_enabled()' is required for the graph to be created ...
</span>                <span class="n">f</span> <span class="o">=</span> <span class="n">dynamics</span><span class="p">(</span><span class="n">zt</span><span class="p">,</span> <span class="n">tb</span><span class="p">)</span>
                <span class="c1"># ... and be able to compute all vector-jacobian products
</span>                <span class="n">adjoint_dynamics</span><span class="p">,</span> <span class="o">*</span><span class="n">dldp_</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">([</span><span class="o">-</span><span class="n">f</span><span class="p">],</span> <span class="p">[</span><span class="n">zt</span><span class="p">,</span> <span class="o">*</span><span class="n">theta</span><span class="p">],</span> <span class="n">grad_outputs</span><span class="o">=</span><span class="p">[</span><span class="n">adjoint</span><span class="p">])</span>
            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dldp_</span><span class="p">):</span>
                <span class="n">dLdp</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">dLdp</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">delta_t</span> <span class="o">*</span> <span class="n">p</span> <span class="c1"># update param grads
</span>            <span class="n">adjoint</span> <span class="o">=</span> <span class="n">adjoint</span> <span class="o">-</span> <span class="n">delta_t</span> <span class="o">*</span> <span class="n">adjoint_dynamics</span> <span class="c1"># update the adjoint
</span>            <span class="n">zt</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">zt</span><span class="p">.</span><span class="n">data</span> <span class="o">-</span> <span class="n">delta_t</span> <span class="o">*</span> <span class="n">f</span><span class="p">.</span><span class="n">data</span> <span class="c1"># Forward Euler's (backward in time)
</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">adjoint</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="o">*</span><span class="n">dLdp</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">ODELayer</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">dynamics</span><span class="p">,</span> <span class="n">t_start</span> <span class="o">=</span> <span class="mf">0.</span><span class="p">,</span> <span class="n">t_end</span> <span class="o">=</span> <span class="mf">1.</span><span class="p">,</span> <span class="n">granularity</span> <span class="o">=</span> <span class="mi">25</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dynamics</span> <span class="o">=</span> <span class="n">dynamics</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">t_start</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_end</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">granularity</span> <span class="o">=</span> <span class="n">t_start</span><span class="p">,</span> <span class="n">t_end</span><span class="p">,</span> <span class="n">granularity</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">t_range</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">t_start</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_end</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">granularity</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">input</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">ODELayerFunc</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_range</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">dynamics</span><span class="p">,</span> <span class="o">*</span><span class="bp">self</span><span class="p">.</span><span class="n">dynamics</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
</code></pre></div></div>
<p>That’s all for today. See you.</p>Ayan Das