Tip

Please note that this blog is published as part of the **Blogpost Track** at the International Conference on Learning Representations (ICLR) 2024. To read the official version, click on the button below.

In case you haven’t checked the first part of this two-part blog, please read Score Based Models (SBMs).

In thermodynamics, “diffusion” refers to the flow of particles from high-density regions towards low-density regions. In statistics, the meaning of diffusion is quite similar: the process of transforming a complex distribution into a simple (predefined) distribution on the same domain. Succinctly, a transformation such that

where the symbol means “implies”. There is a formal way to come up with a specific pair that satisfies Equation 1 for *any* distribution . In simple terms, we can take *any* distribution and transform it into a known (simple) density by means of a known transformation . By “formal way”, I was referring to a Markov chain and its stationary distribution: **repeated application** of a transition kernel on the samples of *any* distribution leads to samples from if the following holds

We can relate our original diffusion process in Equation 1 to a Markov chain by defining it as the repeated application of the transition kernel over discrete time

From the properties of the stationary distribution, we have . In practice, we can keep the number of iterations to a sufficiently large finite number .
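As a toy illustration of this idea (my own sketch, not from any of the papers discussed here), the following repeatedly applies a Gaussian transition kernel, with a hypothetical constant noise level `beta`, to samples from an arbitrary bimodal distribution; after enough repetitions the samples are indistinguishable from a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# samples from an arbitrary "complex" initial distribution (a bimodal mixture)
x = np.concatenate([rng.normal(-4.0, 0.5, 5000), rng.normal(6.0, 1.0, 5000)])

beta = 0.05  # hypothetical constant noise level of the transition kernel
for _ in range(500):
    # one application of the Gaussian transition kernel:
    #   x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)

# after enough repetitions, x holds samples from the stationary N(0, 1)
```

Note that the stationary distribution depends only on the kernel, not on the bimodal mixture we started from.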

So far, we have confirmed that there is indeed an iterative way (refer to Equation 2) to convert samples from a complex distribution into a known (simple) prior. Even though we talked only in terms of generic densities, there is one very attractive choice of pair (shown in Sohl-Dickstein et al. (2015)) due to its simplicity and tractability

For obvious reasons, it is known as **Gaussian Diffusion**. I purposefully changed the notation of the random variables to make it more explicit. is a predefined decaying schedule proposed by Sohl-Dickstein et al. (2015). A pictorial depiction of the diffusion process is shown in Figure 1.

We proved the existence of a stochastic transform that guarantees the diffusion process in Equation 1. Please realize that the diffusion process does not depend on the initial density (as ); the only requirement is being able to sample from it. This is the core idea behind Diffusion Models - we use any data distribution (let's say ) of our choice as the complex initial density. This leads to the **forward diffusion** process

This process is responsible for “destructuring” the data and turning it into an isotropic Gaussian (almost structureless). Please refer to the figure below (red part) for a visual demonstration.

However, this isn’t very useful by itself. What would be useful is doing the opposite, i.e. starting from isotropic Gaussian noise and turning it into data - that is generative modelling (blue part of the figure above). Since the forward process is fixed (non-parametric) and guaranteed to exist, it is very much possible to invert it. Once inverted, we can use it as a generative model as follows

Fortunately, the theory of Markov chains guarantees that for Gaussian diffusion, there indeed exists a **reverse diffusion** process . The original paper from Sohl-Dickstein et al. (2015) shows how a parametric model of diffusion can be learned from the data itself.

The stochastic “forward diffusion” and “reverse diffusion” processes described so far can be well expressed in terms of Probabilistic Graphical Models (PGMs). A series of random variables defines each of them, with the forward process being fully described by Equation 3. The reverse process is expressed by a parametric graphical model very similar to that of the forward process, but in reverse

Each of the reverse conditionals is structurally Gaussian and responsible for learning to revert the corresponding step in the forward process, i.e. . The means and covariances of these reverse conditionals are neural networks with parameters shared over timesteps. Just like any other probabilistic model, we wish to minimize the negative log-likelihood of the model distribution under the expectation of the data distribution

which isn’t quite computable in practice due to its dependence on more random variables. With a fair bit of mathematical manipulation, Sohl-Dickstein et al. (2015) (section 2.3) showed it to be a lower bound of another easily computable quantity

which is easy to compute and optimize. The expectation is over the joint distribution of the entire forward process. Getting a sample boils down to executing the forward diffusion on one sample . All quantities inside the expectation are tractable and available to us in closed form.

Even though we can train the model with the lower bound shown above, a few more simplifications are possible. The first one is due to Sohl-Dickstein et al. (2015), in an attempt to reduce variance: they showed that the lower bound can be further simplified and re-written as the following

There is a subtle approximation involved (the edge case of in the summation) in the above expression, which is due to Ho, Jain, and Abbeel (2020) (sections 3.3 and 3.4). The noticeable change in this version is that all conditionals of the forward process are now additionally conditioned on . Earlier, the corresponding quantities had high uncertainty/variance due to the different possible choices of the starting point , which are now suppressed by the additional knowledge of . Moreover, it turns out that has a closed form

The exact form (refer to Equation 7 of Ho, Jain, and Abbeel (2020)) is not important for a holistic understanding of the algorithm. The only thing to note is that it additionally contains (fixed numbers) and is a function of only. Moving on, we do the following on the last expression for

- Use the closed form of in Equation 4 with (design choice for making things simple)
- Expand the KL divergence formula
- Convert into expectation (over ) by scaling with a constant

.. and arrive at a simpler form

For the second simplification, we look at the forward process in a bit more detail. There is an amazing property of forward diffusion with Gaussian noise - the distribution of the noisy sample can be readily calculated given real data , without touching any other steps.

This is a consequence of the forward process being completely known and having a well-defined probabilistic structure (Gaussian noise). By means of (Gaussian) reparameterization, we can also derive an easy way of sampling any directly from a standard Gaussian noise vector
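This closed-form property can be sketched numerically. The schedule below is a hypothetical linear one in the spirit of Ho, Jain, and Abbeel (2020); the point is only that a noisy sample at any timestep is one multiply-add away from the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# a hypothetical linear beta schedule with T = 1000 steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product \bar{alpha}_t

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form, skipping the ancestral chain."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=4)         # a "data" sample
eps = rng.normal(size=4)        # one standard Gaussian noise vector
x500 = q_sample(x0, 500, eps)   # jump straight to timestep 500
```

Notice that at the final timestep the coefficient on the data is nearly zero, i.e. the sample is almost pure noise, consistent with the forward process destroying all structure.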

As a result, need not be sampled with ancestral sampling (refer to Equations 2 & 3); it only requires computing Equation 6 with all in **any order**. This further simplifies the expectation in Equation 5 to (changes highlighted in blue)

This is the final form that can be implemented in practice as suggested by Ho, Jain, and Abbeel (2020).

Ho, Jain, and Abbeel (2020) uncovered a link between Equation 7 and a particular score-based model known as the Noise Conditioned Score Network (NCSN) (Song and Ermon 2019). With the help of the reparameterized form of (Equation 6) and the functional form of , one can easily (with a few simplification steps) reduce Equation 7 to

The above equation is a simple regression with being the parametric model (a neural net in practice); the quantity in blue is its regression target. Without loss of generality, we can slightly modify the definition of the parametric model to be . The only “moving part” in the new parameterization is ; the rest (i.e. and ) is explicitly available to the model. This leads to the following form

The expression in red can be discarded without any effect on performance (as suggested by Ho, Jain, and Abbeel (2020)). I have further approximated the expectation over time-steps with a sample average. If you look at the final form, you may notice a surprising resemblance to the Noise Conditioned Score Network (NCSN) (Song and Ermon 2019). Please refer to my blog on score models. Below I pin-point the specifics:

- The time-steps resemble the increasing “noise-scales” in NCSN. Recall that the noise increases as the forward diffusion approaches the end.
- The expectation (for each scale) holistically matches that of the denoising score matching objective, i.e. . In the case of Diffusion Models, the noisy sample can be readily computed using the noise vector (refer to Equation 6).
- Just like NCSN, the regression target is the noise vector for each time-step (or scale).
- Just like NCSN, the learnable model depends on the noisy sample and the time-step (or scale).
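Putting the pieces together, here is a minimal, hypothetical training step for this simplified objective. The tiny MLP `eps_net` and the time-conditioning (concatenating a normalized timestep to the input) are my own placeholder choices, standing in for the real noise-prediction network:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# placeholder noise-prediction network eps_theta(x_t, t) for 2-D toy data
eps_net = torch.nn.Sequential(
    torch.nn.Linear(2 + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
)
opt = torch.optim.Adam(eps_net.parameters(), lr=1e-3)

x0 = torch.randn(128, 2)             # a batch of "data"
t = torch.randint(0, T, (128,))      # timesteps sampled uniformly
eps = torch.randn_like(x0)           # the regression target

ab = alpha_bar[t].unsqueeze(-1)
xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # closed-form q(x_t | x_0)
pred = eps_net(torch.cat([xt, t.unsqueeze(-1) / T], dim=-1))

loss = ((eps - pred) ** 2).mean()    # the simplified noise-regression objective
opt.zero_grad(); loss.backward(); opt.step()
```

Everything except `eps_net` is fixed by the forward process; the entire learning problem is the regression on the last three lines.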

Inspired by the connection between Diffusion Models and score-based models, Song, Kingma, et al. (2021) proposed to use infinitely many noise scales (equivalently, time-steps). At first, it might look like a trivial increase in the number of steps/scales, but there turned out to be a principled way to achieve it, namely Stochastic Differential Equations (SDEs). Song, Kingma, et al. (2021) reworked the whole formulation considering a continuous SDE as the forward diffusion. Interestingly, it turned out that the reverse process is also an SDE that depends on the score function.

Quite simply, finite time-steps/scales (i.e. ) are replaced by infinitely many segments (of length ) within the time-range . Instead of at every discrete time-step/scale, we define a continuous random process indexed by continuous time . We also replace the discrete-time conditionals in Equation 3 with continuous analogues. But this time, we define the “increments” at each step rather than absolute values, i.e. the transition kernel specifies *what to add* to the previous value. Specifically, we define a general form of **continuous forward diffusion** with

If you have ever studied SDEs, you might recognize that Equation 8 resembles the Euler–Maruyama numerical solver for SDEs. Considering to be the “drift function”, the “diffusion function” and the discrete differential of the Wiener process , in the limit of , the following SDE can be recovered

A visualization of the continuous forward diffusion in 1D is given in Figure 3 for a set of samples (different colors).

Song, Kingma, et al. (2021) (section 3.4) proposed a few different choices named Variance Exploding (VE), Variance Preserving (VP) and sub-VP. The one that resembles Equation 3 (discrete forward diffusion) in continuous time and ensures proper diffusion, i.e. , is . This particular SDE is termed the “Variance Preserving (VP) SDE” since the variance of is finite as long as the variance of is finite (Appendix B of Song, Kingma, et al. (2021)). We can enforce the covariance of to be simply by standardizing our dataset.
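To make the continuous forward diffusion concrete, here is a small Euler–Maruyama simulation of the VP SDE. The linear schedule `beta(t)` is an illustrative assumption; the takeaway is that whatever the initial samples, they end up approximately standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta(t):
    # hypothetical linear noise schedule on t in [0, 1]
    return 0.1 + 19.9 * t

# Euler-Maruyama discretization of the VP SDE:
#   dx = -0.5 * beta(t) * x * dt + sqrt(beta(t)) * dw
N = 1000
dt = 1.0 / N
x = rng.normal(2.0, 0.1, size=5000)   # a narrow, off-center "data" distribution
for i in range(N):
    t = i * dt
    x = x - 0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * rng.normal(size=x.shape)

# x is now approximately standard normal, whatever the initial samples were
```

This is the continuous analogue of the discrete forward chain simulated earlier, with the drift shrinking the sample and the diffusion injecting fresh noise at every infinitesimal step.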

An old (but remarkable) result due to Anderson (1982) shows that the above forward diffusion can be reversed even in closed form, thanks to the following SDE

Hence, **reverse diffusion** is simply solving the above SDE in reverse time with initial state , leading to . The only missing part is the score, i.e. . Thankfully, we have already seen how score estimation works, and that is pretty much what we do here. There are two ways, as explained in my blog on score models. I briefly go over them below in the context of continuous SDEs:

The *easier* way is to use the Hutchinson trace-estimator based score matching proposed by Song et al. (2020) called “Sliced Score Matching”.

Very similar to NCSN, we define a parametric score network dependent on continuous time/scale . Starting from data samples , we can generate the rest of the forward chain simply by executing a solver (refer to Equation 8) on the SDE at any required precision (discretization).

There is the other “Denoising score matching (DSM)” way of training , which is slightly more complicated. At its core, the DSM objective for continuous diffusion is a continuous analogue of the discrete DSM objective.

Remember that in the case of continuous diffusion, we never explicitly modelled the reverse conditionals . The reverse diffusion was defined rather implicitly (Equation 9). Hence, the quantity in blue, unlike its discrete counterpart, isn't very easy to compute *in general*. However, due to Särkkä and Solin (2019), there is an easy closed form for it when the drift function is **affine**. Thankfully, our specific choice of is indeed affine.

Since the conditionals are Gaussian (again), it's pretty easy to derive the expression for . I leave it for the reader to try.

One of the core reasons score models exist is that they bypass the need for training with explicit log-likelihoods, which are difficult to compute for a large range of powerful models. It turns out that in the case of continuous diffusion models, there is an indirect way to evaluate that very log-likelihood. Let's focus on the “generative process” of continuous diffusion models, i.e. the **reverse diffusion** in Equation 9. What we would like to compute is when is generated by solving the SDE in Equation 9 backwards with . Even though it is hard to compute marginal likelihoods for any , it turns out there exists a **deterministic ODE (Ordinary Differential Equation)** counterpart of the SDE in Equation 9 whose marginal likelihoods *match* those of the SDE for every

Note that the above ODE is essentially the same SDE but without the source of randomness. After learning the score (as usual), we simply drop-in replace the SDE with the above ODE. Now, thanks to Chen et al. (2018), this problem has already been solved. It is known as a Continuous Normalizing Flow (CNF), whereby given , we can calculate by solving the following ODE with any numerical solver for
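To see the deterministic ODE in action, here is a toy sketch (entirely my own construction, not from the paper) where the data distribution is a 1D Gaussian, so the score of every intermediate marginal is known analytically. Integrating the probability-flow ODE backwards from prior samples then recovers the data distribution without any injected noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta(t):                     # hypothetical linear schedule
    return 0.1 + 19.9 * t

def B(t):                        # integral of beta from 0 to t
    return 0.1 * t + 19.9 * t ** 2 / 2

def score(x, t):
    # analytic score of the VP marginal p_t when the data is N(2, 0.25)
    m = 2.0 * np.exp(-B(t) / 2)
    v = 0.25 * np.exp(-B(t)) + (1.0 - np.exp(-B(t)))
    return -(x - m) / v

# integrate the ODE  dx/dt = f(x,t) - 0.5 g(t)^2 * score  backwards from t=1 to t=0
N = 1000
dt = 1.0 / N
x = rng.normal(size=5000)        # start from the prior N(0, 1)
for i in range(N, 0, -1):
    t = i * dt
    dxdt = -0.5 * beta(t) * (x + score(x, t))
    x = x - dxdt * dt            # Euler step backwards in time

# x now approximates samples from the data distribution N(2, 0.25)
```

Because the dynamics are deterministic, each prior sample maps to a unique data sample, which is exactly what makes the CNF change-of-variables likelihood computation possible.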

Please remember that this way of computing the log-likelihood is merely a utility and cannot be used to train the model. A more recent paper (Song, Durkan, et al. 2021), however, shows a way to train SDE-based continuous diffusion models by directly optimizing (a bound on) the log-likelihood under some conditions, which may be the topic of another article. I encourage readers to explore it themselves.

Anderson, Brian DO. 1982. “Reverse-Time Diffusion Equation Models.” *Stochastic Processes and Their Applications* 12 (3): 313–26.

Chen, Tian Qi, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. 2018. “Neural Ordinary Differential Equations.” In *NeurIPS*.

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” In *NeurIPS*.

Särkkä, Simo, and Arno Solin. 2019. *Applied Stochastic Differential Equations*. Vol. 10. Cambridge University Press.

Sohl-Dickstein, Jascha, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.” In *ICML*.

Song, Yang, Conor Durkan, Iain Murray, and Stefano Ermon. 2021. “Maximum Likelihood Training of Score-Based Diffusion Models.” In *Advances in Neural Information Processing Systems*, edited by A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan.

Song, Yang, and Stefano Ermon. 2019. “Generative Modeling by Estimating Gradients of the Data Distribution.” In *Advances in Neural Information Processing Systems*, 11895–907.

Song, Yang, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. 2020. “Sliced Score Matching: A Scalable Approach to Density and Score Estimation.” In *Uncertainty in Artificial Intelligence*, 574–84. PMLR.

Song, Yang, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. “Score-Based Generative Modeling Through Stochastic Differential Equations.” In *ICLR*.

BibTeX citation:

```
@online{das2021,
author = {Das, Ayan},
title = {An Introduction to {Diffusion} {Probabilistic} {Models}},
date = {2021-12-04},
url = {https://ayandas.me//blogs/2021-12-04-diffusion-prob-models.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2021. “An Introduction to Diffusion Probabilistic
Models.” December 4, 2021. https://ayandas.me//blogs/2021-12-04-diffusion-prob-models.html.

The next part of this two-part blog is Diffusion Probabilistic Models

Traditional log-likelihood based approaches define a parametric generative process in terms of a graphical model and maximize the joint density w.r.t. its parameters

The joint density is often quite complex and sometimes intractable. For intractable cases, we maximize a surrogate objective based on, e.g., Variational Inference. We achieve the above in practice by moving the parameters in the direction where the expected log-likelihood increases the most at every step . The expectation is computed empirically at points sampled from our dataset, i.e. the unknown data distribution

With a trained set of parameters , we sample from the model with ancestral sampling by exploiting its graphical structure

There is one annoying requirement in both (1) and (2): the parametric model must be a valid density. We ensure such a requirement by building the model only by carefully combining known densities like Gaussian, Bernoulli, Dirichlet etc. Even though these are largely sufficient in terms of expressiveness, it might feel a bit too restrictive from a system designer's perspective.

A new and emerging class of generative models, namely “Score based models (SBMs)”, entirely sidesteps log-likelihood modelling and approaches the problem in a different way. Specifically, SBMs attempt to learn a *navigation map* on the data space which guides any point in that space to reach a region highly probable under the data distribution . A little but careful thought on this would lead us to something formally known as the *score function*. The “score” of an arbitrary point in the data space is essentially the gradient of the *true* data log-likelihood at that point

Please be careful and notice that the quantity on the right-hand side of (3), i.e. , is **not** the same as , the quantity we encountered earlier (in the MLE setup), even though they look structurally similar.

Given *any* point in the data space, the score tells us which direction to navigate if we would like to see a region with higher likelihood. Unsurprisingly, if we take a little step in the direction suggested by the score, we get a point that is slightly more likely under . This is why I termed it a “navigation map”, as in a guiding document that tells us the direction of the “treasure” (i.e. real data samples). All an SBM does is try to approximate the true score function via a parametric proxy

As simple as it might sound, we construct a regression problem with as regression targets. We minimize the following loss

This is known as *Score Matching*. Once trained, we simply keep moving in the direction suggested by , starting from any random point, over a finite time horizon . In practice, we move with a little bit of stochasticity - the formal procedure is known as *Langevin Dynamics*.

is the injected Gaussian noise. If as , this process guarantees that is a true sample from . In practice, we run this process for a finite number of steps and assign according to a decaying schedule. Please refer to the original paper for a detailed discussion.
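As a quick sanity check (a toy of my own, not from the paper), we can run Langevin Dynamics with the analytically known score of a standard normal, grad log p(x) = -x. Regardless of the starting points, the chain ends up sampling from N(0, 1) up to a small discretization bias:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # analytic score of a standard normal target: grad log p(x) = -x
    return -x

eps = 0.1                            # step size (kept fixed here for simplicity)
x = rng.uniform(-10.0, 10.0, 5000)   # arbitrary random starting points
for _ in range(1000):
    x = x + 0.5 * eps * score(x) + np.sqrt(eps) * rng.normal(size=x.shape)

# x now holds approximate samples from N(0, 1) (up to discretization bias)
```

In a real SBM, `score` would be the learned network; everything else about the sampler stays the same.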

Looks all good. But, there are two problems with optimizing .

**Problem 1:** The very obvious one: we don't have access to the true scores . No one knows the exact form of .

**Problem 2:** The not-so-obvious one: the expectation is a bit problematic. Ideally, the objective must encourage learning the scores all over the data space (i.e. for every ). But this isn't possible with an expectation over only the data distribution. The regions of the data space which are unlikely under do not get enough supervisory signal.

Aapo Hyvärinen, 2005 solved the first problem quite elegantly: he proposed the *Implicit Score Matching* objective and showed it to be equivalent to under some mild regularity conditions. The following remarkable result was shown in the original paper

The reason it's known as “remarkable” is the fact that it does not require the true target scores *at all*. All we need is to compute an expectation w.r.t. the data distribution, which can be implemented using finite samples from our dataset. One practical problem with this objective is the amount of computation involved in the Jacobian . Later, Song et al., 2019 proposed to use Hutchinson's trace estimator, a stochastic estimator for the trace of a matrix, which simplified the objective further

where is a standard multivariate Gaussian RV. This objective is computationally advantageous when used in conjunction with automatic differentiation frameworks (e.g. PyTorch), which can efficiently compute the vector-Jacobian product (VJP), namely .
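A minimal PyTorch sketch of this objective (my own illustrative version, with a throwaway network) shows how a single VJP replaces the full Jacobian trace:

```python
import torch

def ssm_loss(score_net, x):
    # Hutchinson-estimator form:  E[ v^T (ds/dx) v + 0.5 * ||s(x)||^2 ],  v ~ N(0, I)
    x = x.detach().requires_grad_(True)
    s = score_net(x)
    v = torch.randn_like(x)                 # random probe vector
    # vector-Jacobian product v^T (ds/dx) via one backward pass
    vjp = torch.autograd.grad((s * v).sum(), x, create_graph=True)[0]
    return ((vjp * v).sum(-1) + 0.5 * (s ** 2).sum(-1)).mean()

net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2)
)
loss = ssm_loss(net, torch.randn(64, 2))
loss.backward()   # gradients flow through the VJP thanks to create_graph=True
```

The cost per sample is one extra backward pass, instead of one backward pass per input dimension for the exact trace.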

In a different approach, Pascal Vincent, 2011 investigated the “unsuspected link” between Score Matching and Denoising Autoencoders. This work led to a very efficient and practical objective that is used even in cutting-edge score based models. Termed “Denoising Score Matching (DSM)”, this approach mitigates both problems 1 & 2 described above, and does so quite elegantly.

To get rid of problem 2, DSM proposes to simply use a noise-perturbed version of the dataset, i.e. replace with where

The above equation basically tells us to create a perturbed/corrupted version of the original dataset by adding simple isotropic Gaussian noise whose strength is controlled by , the standard deviation of the Gaussian. Since the Gaussian distribution spans the entire space, corrupted data samples populate much more of the space and help the parameterized score function learn in regions that were originally unreachable under . The denoising objective simply becomes

With a crucial proof shown in the appendix of the original paper, we can have an equivalent (changes shown in magenta) version of as

Note that we now need original-corrupt data pairs in order to compute the expectation, which is quite trivial to do. Also realize that the term is not the data score but related only to the pre-specified noise model with quite an easy analytic form

The score function we learn this way isn't actually for our original data distribution , but rather for the corrupted data distribution . The strength decides how well it aligns with the original distribution. If is large, we end up learning a too-corrupted version of the data distribution; if is small, we no longer get the nice property out of the noise perturbation - so there is a trade-off. Recently, this trade-off has been utilized for learning robust score based models.

Moreover, Eq. 5 has a very intuitive interpretation, and this is where Pascal Vincent, 2011 uncovered the link between DSM and Denoising Autoencoders. A closer look at Eq. 5 reveals that has a learning target of , which can be interpreted as a vector pointing from the corrupted sample towards the real sample. Succinctly, the score function is trying to learn how to “de-noise” a corrupted sample - that's precisely what Denoising Autoencoders do.
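The DSM objective can be sketched in a few lines (a hedged illustration with a throwaway network; `score_net` stands for any network mapping points to scores):

```python
import torch

def dsm_loss(score_net, x0, sigma):
    # denoising score matching for the Gaussian noise model q(x~|x) = N(x, sigma^2 I)
    x_tilde = x0 + sigma * torch.randn_like(x0)   # corrupted sample
    target = (x0 - x_tilde) / sigma ** 2          # analytic grad_{x~} log q(x~|x)
    return 0.5 * ((score_net(x_tilde) - target) ** 2).sum(-1).mean()

net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2)
)
loss = dsm_loss(net, torch.randn(64, 2), sigma=0.1)
loss.backward()
```

Note that the regression target is fully known given the original-corrupt pair; no true data score is ever needed.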

The idea presented in Song et al., 2020 is to have different noise-perturbed data distributions (with different ) and one score function for each of them. The noise strengths are chosen to be , so that is the most corrupted distribution and is the least. Also, instead of having separate score functions, we use one shared score function conditioned on , i.e. .

We finally learn the shared score function from the ensemble of distributions

where is same as Eq. 5 but uses the shared score network parameterized by

In order to sample, Song et al., 2020 proposed a modified version of Langevin Dynamics termed as “Annealed Langevin Dynamics”. The idea is simple: we start from a random sample and run the Langevin Dynamics (Eq. 4) using for steps. We use the final sample as the initial starting point for the next Langevin Dynamics with . We repeat this process till we get the final sample from . The intuition here is to sample at coarse level first and gradually fine-tune it to get high quality samples. The exact algorithm is depicted in Algorithm 1 of Song et al., 2020.
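The annealed procedure can be sketched as follows. Here I use a toy analytic score (data concentrated at the origin, so the sigma-perturbed score is simply -x / sigma^2); the schedule constants are made up for illustration, not the ones from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

sigmas = np.geomspace(1.0, 0.01, 10)   # sigma_1 > ... > sigma_L (hypothetical)

def score(x, sigma):
    # toy "learned" score: the sigma-perturbed data distribution is N(0, sigma^2)
    return -x / sigma ** 2

eps0 = 1e-4                            # base step size (illustrative)
x = rng.uniform(-2.0, 2.0, 5000)       # random initialization
for sigma in sigmas:
    step = eps0 * (sigma / sigmas[-1]) ** 2   # anneal the step size per noise level
    for _ in range(200):
        z = rng.normal(size=x.shape)
        x = x + 0.5 * step * score(x, sigma) + np.sqrt(step) * z

# x concentrates near the data (the origin), with residual noise ~ sigma_L
```

Each outer iteration hands its final samples to the next (smaller) noise level, which is exactly the coarse-to-fine intuition described above.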

Recently, Song et al., 2021 have established a surprising connection between Score Models, Diffusion Models and Stochastic Differential Equations (SDEs). Diffusion Models are another rising class of generative models, fundamentally similar to score based models but with some notable differences. Since we did not discuss Diffusion Models in this article, we cannot fully explain the connection and how to properly utilize it. However, I would like to show a brief preview of where exactly SDEs show up within the material discussed in this article.

Stochastic Differential Equations (SDEs) are stochastic dynamical systems with state , characterized by a *drift function* and a *diffusion function*

where denotes the Wiener process and . Discretizing the above equation in time yields

To find a connection now, it is only a matter of comparing Eq. 6 with Eq. 4. The sampling process defined by Langevin Dynamics is essentially an SDE discretized in time with

In another article, An introduction to Diffusion Probabilistic Models, we explore Diffusion Models along with their connection to SDEs.

BibTeX citation:

```
@online{das2021,
author = {Das, Ayan},
title = {Generative Modelling with {Score} {Functions}},
date = {2021-07-14},
url = {https://ayandas.me//blogs/2021-07-14-generative-model-score-function.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2021. “Generative Modelling with Score
Functions.” July 14, 2021. https://ayandas.me//blogs/2021-07-14-generative-model-score-function.html.

This is easy to answer, but just for the sake of completeness - we are interested in computing derivatives of a function because of its requirement in the update rule of Gradient Descent (or any of its successors):

where is the set of all parameters, is the data and is the function (typically a loss) we want to differentiate. Our ultimate goal is to compute given the *structural form* of . The standard way of doing this is to use “Automatic Differentiation” (AutoDiff or AD), or rather, a special case of it called “Backpropagation”. It is called Backpropagation only when the function is scalar, which is mostly true in the cases we care about.

We will now see how gradients of a complex function (given its full specification) can be computed as a sequence of primitive operations. Let’s explain this with an example for simplicity: We have two inputs (just symbols) and a description of the *scalar* function we want to differentiate:

We can think of as a series of smaller computations with intermediate results, like this

I changed the pure math notation to a more programmatic one, but the meaning remains the same. In order to compute gradients, we *augment* these computations and create something called a “pullback” function as an additional by-product.

Mathematically, the actual computation and pullback creation can be written together symbolically as:

You can think of the *functional* as the internals of the Backpropagation framework which mutates all the computation units to produce an extra entity. A pullback function is a function that takes as input the gradient w.r.t. the output of the corresponding function and returns the gradient w.r.t. the inputs of that function:

It is really nothing but a different view of the chain-rule of differentiation:

We must also realize that computing a pullback may require values from the forward pass. For example, computing may need evaluating at the given value of . After getting access to , we can compute the derivatives of w.r.t. by invoking the pullback functions in the proper (reverse) order

Please note that is actually and hence .
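Here is the running example written out explicitly in plain Python: the forward pass computes F(a, b) = a / (1 + b^2) and records a pullback that replays the chain rule in reverse (variable names `y1`, `y2`, `y3` mirror the intermediate computations above):

```python
def F_with_pullback(a, b):
    # forward pass of F(a, b) = a / (1 + b^2), keeping intermediates in scope
    y1 = b ** 2
    y2 = 1 + y1
    y3 = a / y2

    def pullback(dy3):
        # chain rule, invoked in reverse order of the forward pass
        da = dy3 / y2               # d(a/y2)/da = 1/y2
        dy2 = -dy3 * a / y2 ** 2    # d(a/y2)/dy2 = -a/y2^2
        dy1 = dy2                   # d(1+y1)/dy1 = 1
        db = dy1 * 2 * b            # d(b^2)/db = 2b
        return da, db

    return y3, pullback

y, B = F_with_pullback(1.0, 2.0)
da, db = B(1.0)   # seed with dF/dF = 1
```

The closure over `y2`, `a` and `b` is exactly the “values from the forward pass” that the pullback needs.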

There are a couple of different ways of implementing the theory described above. The de-facto way of doing it (as of this day) is something known as a “tape-based” implementation. `PyTorch` and `Tensorflow Eager Execution` are probably the most popular examples of this type.

In tape-based systems, the function is specified by its full structural form. Moreover, it requires *runtime execution* in order to compute anything (be it output or derivatives). Such system keeps track of every computation via a recorder or “tape” (that’s why the name) and builds an internal computation graph. Later, when requested, the tape stops recording and works its way backwards through the recorded tape to compute derivatives.

A tape-based system requires users to provide the function as a description of its computations following certain guidelines. These guidelines are provided by the specific AutoDiff framework we use. Take `PyTorch` for example - we write the series of computations using the API provided by `PyTorch`:

```
import torch

class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.b0 = ...

    def forward(self, a):
        y1 = torch.pow(self.b0, 2)  # y1 = b0^2
        y2 = torch.add(1, y1)       # y2 = 1 + y1
        y3 = torch.div(a, y2)       # y3 = a / y2
        return y3
```

Think of the framework as an entity which is solely responsible for doing all the derivative computations. You just can't carelessly use `math.sum()` (or anything else) instead of `torch.sum()`, or omit the base class `torch.nn.Module`. You have to stick to the guidelines `PyTorch` laid out to be able to make use of it. When done with the definition, we can run the forward and backward passes using actual data

```
model = Network(...)
F = model(a0)
F.backward()
# 'model.b0.grad' & 'a0.grad' available
```

This will cause the framework to trigger the following sequence of computations one after another

The first and last 3 lines of computation are the “forward pass” and the “backward pass” of the model respectively. Frameworks like `PyTorch` and `Tensorflow` typically work in this way when `.forward()` and `.backward()` calls are made in succession. Note that since we are explicitly executing a forward pass, the framework will cache the necessary values required for executing the pullbacks in the backward pass. An overall diagram is shown below for clarification.

As of now, it might not seem that big of a problem for a regular PyTorch user (me included). The problem intensifies when you have a non-ML code base with a complicated physics model (for example) like this

```
import math
from other_part_of_my_model import sub_part

def helper_function(x):
    if something:
        return helper_function(sub_part(x - 1))  # recursive call
    ...

def complex_physics_model(parameters, input):
    math.do_awesome_thing(parameters, helper_function(input))
    ...
    return output
```

.. and you want to use it within your `PyTorch` model and differentiate it. There is no way you can do this so easily without spending your time to `PyTorch`-ify it first.

There is another serious problem with this approach: the framework cannot “*see*” any computation ahead of time. For example, when the execution thread reaches the `torch.sum()` function, it has no idea that it is about to encounter `torch.div()`. The reason it's important is that the framework has no way of optimizing the computation - it *has to* execute the exact sequence of computations verbatim. For example, if the function description is given as , this type of framework will waste its resources executing lots of operations which will ultimately yield (both in the forward and backward direction) something trivial.

Differentiable Programming (DiffProg) offers a very elegant solution to both the problems described in the previous section. **DiffProg allows you to write arbitrary code without following any guidelines and still be able to differentiate it.** At the current state of DiffProg, the majority of successful systems use something called “Source Code Transformation”.

Source code transformation is a technique used extensively in the field of compiler design. It takes a piece of code written in some high-level language (like C++, Python etc.) and emits a *compiled* version of it, typically in a relatively lower-level language (like Assembly, Bytecode, IRs etc.). Specifically, the input to a DiffProg system is a description of as *source code* written in some language with defined input/output. The output of the system is the source code of the derivative of w.r.t. its inputs (i.e., ). The input program has full liberty to use the native primitives of the programming language: built-in functions, conditional statements, recursion, `struct`-like facilities, memory read/write constructs and pretty much anything else that the language offers.

Using our generic notation, we can write down such a system as

where and are the given function and its derivative in the form of *source code* (bear with me if it doesn’t make sense at this point). Just like before, the *source code* for the pullback may require some intermediate variables from that of . For a concrete example, the following would be a (hypothetical) valid DiffProg system:

```
>>> # following string contains the source code of F(.)
>>> input_prog = """
def F(a, b):
    y1 = b ** 2
    y2 = 1 + y1
    return a / y2
"""
>>> y, B = diff_prog(input_prog, a=1., b=2.)
>>> print(y)
0.2
>>> exec(B) # get the derivative function as a live object in current session
>>> print(dF(1.)) # 'dF()' is defined in source code 'B'
(0.2, -0.16)
```

Please pay attention to the fact that both of the problems discussed for tape-based systems are effectively solved now:

- We no longer need to be under the umbrella of a framework, as we can directly work with native code. In the above example, the source code of the given function is written in plain Python. The example shows the overall pullback source-code (i.e., `B`) and also its explicitly compiled form (i.e., `dF`). Optionally, a DiffProg system can produce a readily compiled derivative function.
- The DiffProg system can “see” the whole source-code at once, giving it the opportunity to run various optimizations. As a result, both the given program and the derivative program can be much faster than with standard tape-based approaches.
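As a toy illustration of what “seeing the whole source” means (this uses Python's standard `ast` module, not any DiffProg system), a program can be parsed into a structure that exposes every operation before anything is executed:

```python
import ast

# Source of a small function as a string, as a DiffProg system would receive it.
src = """
def F(a, b):
    y1 = b ** 2
    y1 = 1 + y1
    return a / y1
"""

tree = ast.parse(src)  # the whole program is now a data structure

# Walk the tree and collect every assignment target - the analyser "sees"
# all of them at once, including the redefinition of y1.
assigns = [node.targets[0].id for node in ast.walk(tree) if isinstance(node, ast.Assign)]
print(assigns)  # ['y1', 'y1']
```

Having the full structure up front is exactly what lets a compiler-like system plan optimizations that an op-by-op tape never could.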

Although I showed the examples in Python for ease of understanding, it doesn’t really have to be Python. The theory of DiffProg is very general and can be adapted to any language. In fact, Python is NOT the language of choice for some of the first successful DiffProg systems. The one we are going to talk about is written in a relatively new language called Julia. The Julia language and its compiler provide excellent support for meta-programming, i.e. manipulating/analysing/constructing Julia programs within Julia itself. This allows Julia to offer a DiffProg system that is much more flexible than naively parsing strings (like the toy example shown above). We will look into the specifics of the Julia language and its DiffProg framework called “Zygote” later in this article. But before that, we will look at a few details of the general compiler machinery required to implement DiffProg systems. Since this article is mostly targeted towards people from an ML/DL background, I will try my best to be reasonable about the details of compiler design.

A compiler (or compiler-like system) analyses a given source code by parsing it as a string. It then creates a large and complex data structure (known as the Abstract Syntax Tree, or AST) containing control flow, conditionals and every other fundamental language construct. This structure is further compiled down to a relatively low-level representation comprising the core flow of the source program, known as the “Intermediate Representation (IR)”. One of its fundamental purposes is to replace all variable names with unique IDs. Given a source code like this

```
function F(a, b)
    y1 = b ^ 2
    y1 = 1 + y1
    return a / y1
end
```

a compiler can turn it into an IR (hypothetical) like

```
function F(%1, %2)
    %3 = %2 ^ 2
    %3 = 1 + %3
    return %1 / %3
```

where `%N` is a unique placeholder for a variable. However, this particular form is a little inconvenient to analyse in practice due to the possibility of symbol redefinition (e.g. the variable `y1` in the above example). Modern compilers (including Julia’s) use a slightly improved IR, called “*SSA (Static Single Assignment) form IR*”, which assigns each variable only once and often introduces extra unique symbols in order to achieve that.

```
function F(%1, %2)
    %3 = %2 ^ 2
    %4 = 1 + %3
    return %1 / %4
```

Please note how it used an extra unique ID (i.e. `%4`) in order to avoid re-assignment (of `%3`). It has been shown that such SSA-form IR (rather than direct source code) can be differentiated, and a corresponding “Derivative IR” can be retrieved. The obvious way of crafting the derivative IR of is to use the derivative IRs of its constituent operations, similar to what is done in the tape-based method. The biggest difference is the fact that everything is now in terms of source code (or rather IR, to be precise). The compiler could craft the derivative program like this

```
function dF(%1, %2)
    # IR for forward pass
    %3, B1 = J(pow, %2, 2)
    %4, B2 = J(add, 1, %3)
    _, B3 = J(div, %1, %4)
    # IR for backward pass
    %5, %6 = B3(1)
    %7 = B2(%6)
    %8 = B1(%7)
    return %5, %8
```

The structure of the above code may resemble the sequence of computations in Eq.2, but it is very different in terms of implementation (refer to Fig.2 below). The code (IR) is constructed at compile time by a compiler-like framework (the DiffProg system). The derivative IR is then passed on to an IR-optimizer which can squeeze every bit of performance out of it via various optimizations like dead-code elimination, operation fusion and more advanced ones. Finally, it is compiled down to machine code.
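For intuition, the hypothetical `J`/pullback IR above can be mimicked in plain Python (a sketch under my own made-up conventions, not any real system's API): `J` runs an op and returns its value plus a pullback closure, and `dF` chains the pullbacks in reverse, exactly as the IR does with `B1`, `B2`, `B3`.

```python
def J(op, x, y):
    """Run op(x, y) and return (value, pullback)."""
    if op == "pow":   # x ** y, differentiating w.r.t. x only (y is a constant here)
        return x ** y, lambda g: g * y * x ** (y - 1)
    if op == "add":   # x + y, differentiating w.r.t. y (x is the constant 1)
        return x + y, lambda g: g
    if op == "div":   # x / y, differentiating w.r.t. both inputs
        return x / y, lambda g: (g / y, -g * x / y ** 2)
    raise ValueError(op)

def dF(a, b):
    # forward pass, mirroring the IR
    y3, B1 = J("pow", b, 2)
    y4, B2 = J("add", 1, y3)
    _,  B3 = J("div", a, y4)
    # backward pass, seeded with 1
    da, g4 = B3(1.0)
    g3 = B2(g4)
    db = B1(g3)
    return da, db

print(dF(1.0, 2.0))  # (0.2, -0.16), matching the earlier diff_prog example
```

The difference between this sketch and a real DiffProg system is that the latter generates `dF` as code at compile time, rather than interpreting closures at run time.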

**`Zygote`: A DiffProg framework for Julia**

Julia is a particularly interesting language when it comes to implementing a DiffProg framework. There are solid reasons why `Zygote`, one of the most successful DiffProg frameworks, is written in Julia. I will try to articulate a few of them below:

- **Just-In-Time (JIT) compiler:** Julia’s efficient Just-in-time (JIT) compiler compiles one statement at a time and runs it immediately before moving on to the next one, striking a balance between interpreted and compiled languages.
- **Type specialization:** Julia allows writing generic/optional/loosely-typed functions that can later be instantiated using concrete types. High-density computations specifically benefit from this by casting every computation in terms of `Float32/Float64`, which can in turn produce specialized instructions (e.g. `AVX`, `MMX`, `AVX2`) for modern CPUs.
- **Pre-compilation:** The peculiar feature that benefits `Zygote` the most is Julia’s habit of keeping track of the compilations that have already been done in the current session and NOT doing them again. Since DL/ML is all about computing gradients over and over again, `Zygote` compiles and optimizes the derivative program (IR) just once and runs it repeatedly (which is blazingly fast) with different values of the parameters.
- **LLVM IR:** Julia uses the LLVM compiler infrastructure as its backend and hence emits LLVM IR, which is known to be highly performant and is used by many other prominent languages.

Now, let’s see `Zygote`’s primary API, which is surprisingly simple. The central API of `Zygote` is the function `Zygote.gradient(..)`, whose first argument is the function to be differentiated (written in native Julia), followed by the argument(s) at which the gradient is to be computed.

```
julia> using Zygote

julia> function F(x)
           return 3x^2 + 2x + 1
       end

julia> gradient(F, 5)
(32,)
```

That is basically computing the derivative of F at x = 5, i.e. F′(5) = 6 × 5 + 2 = 32.
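A quick sanity check of that number (plain Python, nothing to do with Zygote): a central finite difference around x = 5 should recover the same gradient.

```python
def F(x):
    return 3 * x ** 2 + 2 * x + 1  # same polynomial as the Julia example

h = 1e-6
fd = (F(5 + h) - F(5 - h)) / (2 * h)  # central finite difference
print(round(fd, 3))  # approximately 32.0, matching gradient(F, 5)
```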

For debugging purposes, we can see the *actual* SSA-form IR for a given function and its pullback. The actual IR is a bit more complex in reality than what I showed, but similar in high-level structure. We can peek into the IR of the above function

```
julia> Zygote.@code_ir F(5)
1: (%1, %2)
  %3 = Core.apply_type(Base.Val, 2)
  %4 = (%3)()
  %5 = Base.literal_pow(Main.:^, %2, %4)
  %6 = 3 * %5
  %7 = 2 * %2
  %8 = %6 + %7 + 1
  return %8
```

.. and also its “Adjoint”. The adjoint in Zygote is basically the pullback we’ve been seeing all along.

```
julia> Zygote.@code_adjoint F(5)
Zygote.Adjoint(1: (%3, %4 :: Zygote.Context, %1, %2)
  %5 = Core.apply_type(Base.Val, 2)
  %6 = Zygote._pullback(%4, %5)
  ...
  # please run yourself to see the full code
  ...
  %13 = Zygote.gradindex(%12, 1)
  %14 = Zygote.accum(%6, %10)
  %15 = Zygote.tuple(nothing, %14)
  return %15)
```

I have established throughout this article that the function can literally be any arbitrary program written in native Julia using standard language features. Let’s see another toy (but meaningful) program using more flexible Julia code.

```
using LinearAlgebra  # for norm()
using Zygote

struct Point
    x::Float64
    y::Float64
    Point(x::Real, y::Real) = new(convert(Float64, x), convert(Float64, y))
end

# Define operator overloads for '+', '*', etc.

function distance(p₁::Point, p₂::Point)::Float64
    δp = p₁ - p₂
    norm([δp.x, δp.y])
end

p₁, p₂ = Point(2, 3.0), Point(-2., 0)
p = Point(-1//3, 1.0)

# initial parameters
K₁, K₂ = 0.1, 0.1

for i = 1:1000  # no. of epochs
    # compute gradients
    @time δK₁, δK₂ = Zygote.gradient(K₁, K₂) do k₁::Float64, k₂::Float64
        p̂ = p₁ * k₁ + p₂ * k₂
        distance(p̂, p)  # scalar output of the function
    end
    # update parameters
    K₁ -= 1e-3 * δK₁
    K₂ -= 1e-3 * δK₂
end

@show K₁, K₂
# shows "(K₁, K₂) = (0.33427804653861276, 0.4996408206795386)"
```

The above program is basically solving the following (pretty simple) problem

where and . By choosing these specific numbers, I guaranteed that there is a solution for .
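The same toy problem can be solved with hand-written gradients in plain Python, which makes it easy to check what Zygote should converge to. This sketch uses the *squared* distance instead of the norm (my own simplification, so the gradient stays smooth at the optimum); the solution is the same.

```python
# minimize || k1*p1 + k2*p2 - p ||^2 over (k1, k2) by gradient descent
p1, p2, p = (2.0, 3.0), (-2.0, 0.0), (-1/3, 1.0)
k1, k2 = 0.1, 0.1
lr = 0.01

for _ in range(5000):
    # residual vector r = k1*p1 + k2*p2 - p
    rx = k1 * p1[0] + k2 * p2[0] - p[0]
    ry = k1 * p1[1] + k2 * p2[1] - p[1]
    # gradients of rx^2 + ry^2 w.r.t. k1 and k2
    dk1 = 2 * (rx * p1[0] + ry * p1[1])
    dk2 = 2 * (rx * p2[0] + ry * p2[1])
    k1 -= lr * dk1
    k2 -= lr * dk2

print(round(k1, 4), round(k2, 4))  # approximately 0.3333 0.5
```

The exact solution (k₁, k₂) = (1/3, 1/2) agrees with the numbers the Julia/Zygote run printed above.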

If you look at the program at a glance, you will notice that it is almost entirely written in native Julia, using a structure (`struct Point`), built-in functions (`norm()`, `convert()`), memory access constructs (`δp.x`, `δp.y`), etc. The only usage of Zygote is the `Zygote.gradient()` call in the heart of the loop. BTW, I omitted the operator overloading functions due to space restrictions.

I am not showing the IR codes for this one; you are free to execute `@code_ir` and `@code_adjoint` on the function implicitly defined using the `do .. end` block. One thing I would like to show is the execution speed, to back up my earlier argument about “precompilation”. The time-measuring macro (`@time`) shows this

```
11.764279 seconds (26.50 M allocations: 1.342 GiB, 4.58% gc time)
0.000025 seconds (44 allocations: 2.062 KiB)
0.000026 seconds (44 allocations: 2.062 KiB)
0.000007 seconds (44 allocations: 2.062 KiB)
0.000006 seconds (44 allocations: 2.062 KiB)
0.000005 seconds (44 allocations: 2.062 KiB)
0.000005 seconds (44 allocations: 2.062 KiB)
```

Did you see how the execution time dropped by an astonishingly high margin? That’s Julia’s precompilation at work - it compiled the derivative program only once (on its first encounter) and produced highly efficient code to be reused later. It might not be a big surprise if you already know Julia, but it is definitely a huge advantage for a DiffProg framework.

Okay, that’s about it today. See you next time. The following references have been used for preparing this article:

- “Don’t Unroll Adjoint: Differentiating SSA-Form Programs”, Michael Innes, arXiv/1810.07951.
- Talk by Michael Innes @ Julia london user group meetup.
- Talk by Elliot Saba & Viral Shah @ Microsoft research.
- Zygote’s documentation & Julia’s documentation.

BibTeX citation:

```
@online{das2020,
author = {Das, Ayan},
title = {Differentiable {Programming:} {Computing} Source-Code
Derivatives},
date = {2020-09-08},
url = {https://ayandas.me//blogs/2020-09-08-differentiable-programming.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2020. “Differentiable Programming: Computing
Source-Code Derivatives.” September 8, 2020. https://ayandas.me//blogs/2020-09-08-differentiable-programming.html.

The story starts when we try to model a number of Random Variables (RVs) in a graph but we only have weak knowledge about which variables are related, not the direction of influence. Direction is a necessary requirement for Directed PGMs. For example, consider a lattice of atoms (Fig.1(a)) where only neighbouring atoms influence each other’s spins, but it is unclear what the directions of the influences are. For simplicity, we will use a simpler model (Fig.1(b)) for demonstration purposes.

We model a set of random variables (in our example, ) whose connections are defined by a graph and which have *“potential functions”* defined on each of its maximal cliques . The total potential of the graph is defined as

is an arbitrary instantiation of the set of RVs denoted by . The potential functions are basically “affinity” functions on the state space of the cliques: given a state of a clique , the corresponding potential function returns the *viability of that state*, i.e. how likely that state is. Potential functions are somewhat analogous to the conditional densities of Directed PGMs, except for the fact that potentials are *arbitrary non-negative values* - they don’t necessarily sum to one. For a concrete example, the graph in Fig.1(b) can be factorized as . If we assume the variables are binary RVs, the potential function corresponding to that clique could, at its simplest, be a table like this:

Just like every other model in machine learning, the potential functions can be parameterized, leading to

Semantically, a potential denotes how likely a given state is. So, the higher the potential, the more likely the state.

When we are defining a model, however, it is inconvenient to choose a constrained (non-negative valued) parameterized function. We can easily reparameterize each potential function in terms of an **energy** function where

The exponentiation enforces the potentials to be always non-negative, and thus we are free to choose an *unconstrained* energy function. One question you might ask - “why the negative sign?”. Frankly, there is no functional purpose to that negative sign. All it does is *revert the semantic meaning* of the parameterized function. When we were dealing in terms of potentials, a state which is more likely had a higher potential. Now it’s the opposite - states that are more likely have less energy. Does this semantics sound familiar? It comes from Physics, where we deal with “energies” (potential, kinetic, etc.) which are *bad*, i.e. less energy means a more stable system.

Such reparameterization affects the semantics of Eq.1 as well. Putting Eq.2 into Eq.1 yields

Here we defined to be the energy of the whole model. Please note that the reparameterization helped us to convert the relationship between individual cliques and whole graph *from multiplicative (Eq.1) to additive (Eq.3)*. This implies that when we design energy functions for such undirected models, we design energy functions for each individual cliques and just add them.

All this is fine .. well .. unless we need to do things like *sampling*, *computing log-likelihood* etc. Then our energy-based parameterization fails, because it is not easy to incorporate an un-normalized function into probabilistic frameworks. So we need a way to get back to probabilities.

The obvious way to convert the un-normalized potentials of the model into a normalized distribution is to explicitly normalize Eq.3 over its domain

This is the probabilistic model implicitly defined by the energy functions over the whole state-space. [We will discuss shortly. Consider it to be 1 for now]. If the reader is familiar with Statistical Mechanics at all, they might find it extremely similar to the `Boltzmann Distribution`. Here’s what Wikipedia says:

In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution) is a probability distribution or probability measure that gives the probability that a system will be in a certain state as a function of that state’s energy …

From now on, Eq.4 will be the sole connection between energy-space and probability-space. We can now forget about potential functions. A 1-D example of an energy function and the corresponding probability distribution is shown below:

The denominator of Eq.4 is often known as the “Partition Function” (denoted as ). Whatever the name, it is quite difficult to compute in general, because the summation grows exponentially with the size of the state space of .

A hyper-parameter called “temperature” (denoted as ) is often introduced into Eq.4, which also has its roots in the original Boltzmann Distribution from Physics. A decrease in temperature gathers the probability mass near the lowest-energy regions. If not specified, consider for the rest of the article.
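A tiny numeric illustration of Eq.4 with a temperature (the three energies here are made up): probabilities are exponentiated negative energies, scaled by temperature and normalized by the partition function.

```python
import math

def boltzmann(energies, T=1.0):
    weights = [math.exp(-e / T) for e in energies]
    Z = sum(weights)  # the partition function (trivial to compute for 3 states)
    return [w / Z for w in weights]

E = [0.0, 1.0, 2.0]
probs_warm = boltzmann(E, T=1.0)  # mass spread across states
probs_cold = boltzmann(E, T=0.1)  # mass piled onto the lowest-energy state
print([round(p, 3) for p in probs_warm])
print([round(p, 3) for p in probs_cold])
```

Lowering T from 1.0 to 0.1 moves essentially all probability mass onto the lowest-energy state, which is exactly the effect described above.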

The question now is - how do I learn the model given a dataset? Let’s say my dataset has samples: . An obvious way to derive a learning algorithm is to minimize the Negative Log-Likelihood (NLL) loss of the model over our dataset

Computing gradient w.r.t. parameters yields

Take a few minutes to digest Eq.5 - it is a very important result and worth discussing a bit further. The first term in Eq.5 is often known as the “Positive Phase” and the second term as the “Negative Phase”. The only difference between them, as you can see, is the distribution over which the expectation is taken. The first expectation is over the *data distribution* - essentially picking samples from our dataset. The second expectation is over the *model distribution* - sampling from the model with the current parameters. To understand their semantic interpretation, we need to see them in isolation. For the sake of explanation, consider the two terms separately, each yielding a parameter update rule

The first update rule basically tries to change the parameters so as to minimize the energy function at points *coming from the data*. The second one tries to maximize (notice the difference in sign) the energy function at points *coming from the model*. The original update rule (combining both of them) has both of these effects working simultaneously. The minimum of the loss landscape occurs when our model discovers the data distribution, i.e. . At this point, the positive and negative phases are approximately the same and the gradient becomes zero (i.e., no more progress). Below is a clear picture of the update process: the algorithm *pushes the energy down* at places where the original data lies; it also *pulls the energy up* at places where the *model thinks* the original data lies.

Whatever the interpretation, as I mentioned before, the denominator of (see Eq.4) is intractable in the general case, so computing the expectation in the negative phase is extremely hard. In fact, that is the only difficulty that makes this algorithm practically challenging.

As we saw in the last section, the only difficulty we have in implementing Eq.5 is not being able to sample from an intractable density (Eq.4). It turns out, however, that the *conditional densities* of a small subset of variables given the others are indeed tractable in most cases. That is because, for conditionals, the partition function cancels out. The conditional density of one variable (say ) given the others (let’s denote them by ) is:

I excluded the parameter symbols for notational brevity. The summation in the denominator is not as scary as the one in Eq.4 - it is only over one variable. We take advantage of this and wisely choose a sampling algorithm that uses conditional densities: Gibbs Sampling. Well, I am not going to prove it here - you either have to take my word for it OR read about it in the link provided. For the sake of this article, just believe that the following works.

To sample , we iteratively execute the following for iterations

- We have a sample from last iteration as
- We then pick one variable at a time (in some order) and sample from its conditional given the others: . Please note that once we have sampled a variable, we fix its value to the latest sample; otherwise we keep the value from the previous iteration.

We can start this process by setting to anything. If is sufficiently large, the samples towards the end are true samples from the density . To understand this a bit more rigorously, I **highly recommend** going through this. You might be curious as to why this algorithm is iterative - that’s because Gibbs sampling belongs to the MCMC family of algorithms, which have something called a “Burn-in period”.
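Here is a minimal Gibbs sampler for a made-up two-variable binary model p(x₁, x₂) ∝ exp(w·x₁·x₂). Each conditional p(xᵢ = 1 | other) reduces to a sigmoid of the other variable's contribution, so we never need the partition function; yet the long-run samples match the exact joint.

```python
import math, random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

random.seed(0)
w = 1.0
x1, x2 = 0, 0
hits, n, burn_in = 0, 50000, 1000

for t in range(n + burn_in):
    x1 = 1 if random.random() < sigmoid(w * x2) else 0  # sample x1 | x2
    x2 = 1 if random.random() < sigmoid(w * x1) else 0  # sample x2 | latest x1
    if t >= burn_in:  # discard the burn-in period
        hits += x1 * x2

print(round(hits / n, 2))  # close to e / (3 + e), which is about 0.48
```

The exact value P(x₁ = x₂ = 1) = e/(3 + e) comes from summing the four unnormalized weights (1, 1, 1, e) by hand, so the sampler can be checked against it.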

Now that we have pretty much everything needed, let’s explore some popular model based on the general principles of EBMs.

The Boltzmann Machine (BM) is one particular model that has been in the literature for a long time. BM is the simplest member of its family and is used for modelling a binary random vector with components . All RVs are connected to all others by an undirected graph .

By design, BM has a fully connected graph and hence only one maximal clique (containing all RVs). The energy function used in BM is possibly the simplest one you can imagine:

Upon expanding the vectorized form (the reader is encouraged to try), we can see each term for all as the contribution of a pair of RVs to the whole energy function, with as the “connection strength” between them. If a pair of RVs turns on together more often, a high value of would help reduce the total energy. So, by means of learning, we expect to see going up if fire together. This phenomenon is the founding idea of a closely related learning strategy called Hebbian Learning, proposed by Donald Hebb. Hebbian theory basically says:

If fire together, then wire together

How do we learn this model then? We have already seen the general way of computing the gradient. We have . So let’s use Eq.5 to derive a learning rule:

Equipped with Gibbs sampling, it is pretty easy to implement now. But my description of the Gibbs sampling algorithm was very general; we have to specialize it for BM. Remember that conditional density we talked about (Eq.6)? For the specific energy function of BM (Eq.7), it has a very convenient and tractable form:

where is the Sigmoid function and denotes the vector of parameters connecting with the rest of the variables . I am leaving the proof to the readers; it’s not hard, maybe a bit lengthy [Hint: just put the BM energy function into Eq.6 and keep simplifying]. This particular form makes the nodes behave somewhat like computation units (i.e., neurons), as shown in Fig.5 below:
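In code, one round of these sigmoid conditional updates for a toy BM might look like this (the symmetric, zero-diagonal weight matrix is made up, and constant factors depend on the exact energy convention):

```python
import math, random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

random.seed(1)
# made-up symmetric weights with zero diagonal (no self-connections)
W = [[0.0, 0.8, -0.3],
     [0.8, 0.0, 0.5],
     [-0.3, 0.5, 0.0]]
x = [1, 0, 1]  # arbitrary starting state

for _ in range(10):          # a few Gibbs sweeps
    for i in range(len(x)):  # visit each unit in turn
        # total input to unit i from all the other units
        activation = sum(W[i][j] * x[j] for j in range(len(x)) if j != i)
        x[i] = 1 if random.random() < sigmoid(activation) else 0

print(x)  # one joint sample from the (approximate) model distribution
```

Each unit behaves like a little stochastic neuron: it sums its weighted inputs, squashes through a sigmoid, and flips a coin.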

To add more expressiveness to the model, we can introduce latent/hidden variables. They are not observed, but help *explain* the visible variables (see my Directed PGM article). However, all variables are still fully connected to each other (shown below in Fig.6(a)).

**[ A little disclaimer: as we have already covered a lot of the founding ideas, I am going to go over this a bit faster. You may have to look back and find analogies with our previous formulations ]**

Suppose we have hidden units and visible ones. The energy function is defined very similarly to that of the normal BM, but now it contains separate terms for visible-hidden (), visible-visible () and hidden-hidden () interactions. We compactly represent them as .

The motivation for such an energy function is very similar to the original BM. However, the probabilistic form of our model is no longer Eq.4, but the marginalized joint distribution.

It might look a bit scary, but it’s just marginalized over the hidden state-space. Very surprisingly though, the conditionals have pretty similar forms to the original BM:

Hopefully the notations are clear - if not, try comparing with the ones we used before. I recommend the reader try proving it as an exercise. The diagram in Fig.6(b) hopefully adds a bit more clarity; it shows a computation graph for the conditionals similar to the one we saw before in Fig.5.

Coming to the gradients, they also take similar forms to the original BM - the only difference is that now we have more parameters

If you are paying attention, you might notice something strange: how do we compute the terms (in the positive phase)? We don’t have hidden vectors in our dataset, right? Actually, we do have the visible vectors in the dataset, and we can get an approximate *complete data* (visible plus hidden) density as

Basically, we sample a visible vector from our dataset and use the conditional to sample a hidden vector. We fix the visible vector and then sample the hidden vector one component at a time (using Gibbs sampling).

For jointly sampling a visible and hidden vector from the model (for the negative phase), we also use Gibbs sampling, just as before. We sample all of the visible and hidden RVs component by component, starting the iteration from any random values. **There is a clever hack though**: we can start the Gibbs iteration by fixing the visible vector to a real sample from our dataset (and not anything random). It turns out this is extremely useful and efficient for quickly getting samples from the model distribution. This algorithm is famously known as “Contrastive Divergence” and has long been used in practical implementations.

Here comes the all-important RBM, which is probably one of the most famous energy based models of all time. But, guess what, I am not going to describe it bit by bit - we have already covered enough to quickly build on top of it.

RBM is basically the same as a Boltzmann Machine with hidden units, but with *one big difference* - it doesn’t have visible-visible AND hidden-hidden interactions, i.e.

If you do just that - Boooom! - you get RBMs (see the graphical diagram in Fig.7(a)). It makes the formulation much simpler. I am leaving the majority of the math entirely to the reader: just get rid of and from all our formulations in the last section and you are done. Fig.7(b) shows the computational view of the RBM while computing conditionals.

Let me point out one nice consequence of this model: the conditional for each visible node is independent of the other visible nodes, and the same is true for the hidden nodes.

That means they can be computed in parallel.

Moreover, the Gibbs sampling steps become super easy to compute. We just have to iterate the following steps:

- Sample a hidden vector
- Sample a visible vector

This makes RBM an attractive choice for practical implementation.
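Those two alternating block updates can be sketched in Python (sizes and weights are made up, and bias terms are omitted for brevity): because the hidden units are conditionally independent given the visibles (and vice versa), each block is sampled in one parallel step rather than unit by unit.

```python
import math, random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

random.seed(2)
n_v, n_h = 4, 3  # made-up sizes
W = [[random.gauss(0, 0.1) for _ in range(n_h)] for _ in range(n_v)]
v = [random.randint(0, 1) for _ in range(n_v)]  # random starting visible vector

for _ in range(5):  # a few alternating block-Gibbs steps
    # sample the whole hidden vector h | v in one shot
    h = [1 if random.random() < sigmoid(sum(W[i][j] * v[i] for i in range(n_v))) else 0
         for j in range(n_h)]
    # sample the whole visible vector v | h in one shot
    v = [1 if random.random() < sigmoid(sum(W[i][j] * h[j] for j in range(n_h))) else 0
         for i in range(n_v)]

print(v, h)  # one (v, h) sample after a few alternations
```

Compare this with the unit-by-unit sweep needed for a full BM: the bipartite structure is exactly what buys the parallelism.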

Whoahh! That was a heck of an article. I encourage everyone to try working out the RBM math more rigorously by themselves, and also to implement it in a familiar framework. Alright, that’s all for this article.

BibTeX citation:

```
@online{das2020,
author = {Das, Ayan},
title = {Energy {Based} {Models} {(EBMs):} {A} Comprehensive
Introduction},
date = {2020-08-13},
url = {https://ayandas.me//blogs/2020-08-13-energy-based-models-one.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2020. “Energy Based Models (EBMs): A Comprehensive
Introduction.” August 13, 2020. https://ayandas.me//blogs/2020-08-13-energy-based-models-one.html.

Before I dive into the details, let’s get the bigger picture clear. It is highly advisable to read a good reference on PGMs before you proceed - my previous article, for example.

Probabilistic Programming is NOT really what we usually think of as *programming* - i.e., completely deterministic execution of hard-coded instructions that does exactly what it’s told and nothing more. Rather, it is about building PGMs (must read this) which model our belief about the data generation process. We, as users of such a language, express a model in an imperative form which encodes all our uncertainties in the way we want. Here is a (toy) example:

```
def model(theta):
    A = Bernoulli([-1, 1]; theta)
    P = 2 * A
    if A == -1:
        B = Uniform(P, 0)
    else:
        B = Uniform(0, P)
    C = Normal(B, 1)
    return A, P, B, C
```

If you assume this to be a valid program (for now), this is what we are talking about here - all our traditional “variables” become “random variables” (RVs) and have uncertainty associated with them in the form of probability distributions. Just to give you a taste of its flexibility, here are the constituent elements we encountered:

- Various different distributions are available (e.g., Normal, Bernoulli, Uniform etc.)
- We can do deterministic computation (i.e., )
- Condition RVs on another RVs (i.e., )
- Imperative style branching allows dynamic structure of the model …

Below is a graphical representation of the model defined by the above program.

Just like the invocation of a traditional compiler on a traditional program produces the desired output, this (probabilistic) program can be executed by means of “ancestral sampling”. I ran the program 5 times and each time I got samples from all my RVs. Each such “forward” run is often called an *execution trace* of the model.

```
>>> for _ in range(5):
...     print(model(0.5))
(1.000, 2.000, 0.318, -0.069)
(-1.000, -2.000, -1.156, -2.822)
(1.000, 2.000, 0.594, 0.865)
(1.000, 2.000, 1.100, 1.079)
(-1.000, -2.000, -0.262, -0.403)
```

This is the so-called “generative view” of a model. We typically use the leaf nodes of PGMs as our observed data, and the rest of the graph can be the “latent factors” of the model, which we either know or want to estimate. In general, a practical PGM can often be encapsulated as a set of latent nodes and visible nodes related probabilistically as

From now on, we’ll use the general notation rather than the specific example. The model may be parametric - for example, we had the Bernoulli success probability in our toy example. The full joint probability is given as

We would like to do two things:

1. Estimate the model parameters from data
2. Compute the posterior, i.e., infer the latent variables given data

As discussed in my PGM article, both of these are infeasible:

1. Log-likelihood maximization is not possible because of the presence of latent variables
2. For continuous distributions on the latent variables, the posterior is intractable

The way forward is to take the help of *Variational Inference* and maximize our very familiar **E**vidence **L**ower **BO**und (ELBO) objective to estimate the model parameters, along with a set of variational parameters which help build a proxy for the original posterior . Mathematically, we choose a known and tractable family of distributions (parameterized by variational parameters ) to approximate the posterior. The learning process is facilitated by maximizing the following

by estimating gradients w.r.t all its parameters

If you have gone through my PGM article, you might think you’ve seen these before. You’re right! There is really nothing new here. What we really need for establishing a Probabilistic Programming framework is **a unified way to implement the ELBO optimization for ANY given problem**. And by “problem” I mean the following:

- A model specification written in a probabilistic language (like we saw before)
- An optional (parameterized) “Variational Model” , famously known as a “Guide”
- And .. the observed data , of course

But how do we compute (1)? The apparent problem is that the gradient w.r.t. the variational parameters is required, but those parameters appear in the expectation itself. To mitigate this, we make use of the famous “log-derivative” trick (it actually has many other names, like REINFORCE, etc.). For notational simplicity, let’s denote and continue from (1)

Eq. (2) shows that the trick helped the gradient operator penetrate the expectation, but in the process it replaced the original integrand with a “surrogate function”, where the *bar* protects a quantity from differentiation. Equation (2) is all we need - it provides an insight into how to make the gradient estimation practical. In fact, it can be proven that this gradient is an unbiased estimate of the true gradient in Equation (1).

Succinctly, we run the Guide times to record a set of execution-traces (i.e., samples ) and compute the following Monte-Carlo approximation to Equation (2)
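A toy numeric sketch of this score-function (REINFORCE) estimator, with everything made up for illustration: the guide is a Bernoulli over z ∈ {0, 1} with success probability σ(λ), and f(z) = z stands in for the surrogate. The Monte-Carlo average over traces should match the analytically known gradient.

```python
import math, random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

random.seed(0)
lam = 0.0
p = sigmoid(lam)     # guide: q(z=1) = sigmoid(lambda)
n = 100000

est = 0.0
for _ in range(n):
    z = 1 if random.random() < p else 0  # one "execution trace" of the guide
    score = z - p                        # d/d(lambda) of log q(z) for this Bernoulli
    est += z * score                     # f(z) * score, summed over traces
est /= n

true_grad = p * (1 - p)  # d/d(lambda) of E_q[z] = sigmoid(lambda), computed exactly
print(round(est, 3), round(true_grad, 3))  # the two should be close
```

The estimate agrees with the exact gradient on average (unbiasedness), but its sample-to-sample spread is what the variance-reduction tricks below attack.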

The nice thing about Equation (2) (or equivalently Equation (3)) is that we got the differentiation operator right on top of a deterministic function (i.e., ). It means we can construct as a computation graph and take advantage of modern-day automatic differentiation engines. Figure 1 shows how the computation graph and the graphical model are linked

Last but not least, let’s look at the function , which is basically built on the log-density terms and . We need a way to compute them flexibly. Remember that the model and the guide are written in a *language*, and hence we have access to their graph structure. A clever software implementation can harness this structure to estimate the log-densities (and eventually ).

I claimed before that the gradient estimates are unbiased. However, such a generic way of computing the gradient introduces high variance in the estimate and makes things unstable for complex models. There are a few tricks widely used to get around this. But please note that such tricks always exploit model-specific structure. Three such tricks are presented below.

We might get lucky that is re-parameterizable. What that means is the expectation w.r.t can be made free of its parameters and by doing so the gradient operator can be pushed inside without going through the log-derivative trick. So, let’s step back a bit and consider the original ELBO gradient in (1). Assuming re-parameterizable nature, the following can be done

where is an independent source of randomness. Computing this expectation with an empirical average (just like Eq. 2) gives us a better (variance-reduced) estimate of the true gradient of the ELBO.

This is another well-known variance reduction technique. It is mathematically somewhat involved, so I will explain it simply without making it confusing. It requires the full variational distribution to have some kind of factorization. A specific case is when we have the mean-field assumption, i.e.

With a little effort, we can pull out the gradient estimator for each of these parameters from (2). They look something like this

The quantity under the bar still has all the factors because it is immune to the gradient operator. And because the expectation sits outside the gradient operator, it too contains all factors. At this point, Rao-Blackwellization offers a variance-reduced estimate of the above gradient, i.e.,

where is the set of variables that forms the “Markov blanket” of w.r.t. the structure of the guide, is the part of the variational distribution that depends on and are the factors of the model that involve .

While exploiting the graph structure of the guide to simplify (1), we might end up with a term like this due to factorization in the guide density

If it happens that the variable is discrete with a reasonably small state space (e.g., a -dimensional binary RV has states), we can replace sampling-based empirical expectations with the true expectation, evaluating a sum over its entire state space

So make sure the state space is reasonable in size. This helps reduce the variance quite a bit.
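Here is a hedged sketch of what “exact enumeration” means in code (the helper names are hypothetical; libraries like Pyro automate this kind of enumeration):

```python
import numpy as np

def exact_expectation(log_prob, f, n_bits):
    # Replace the Monte-Carlo average with an exact sum over all 2**n_bits
    # binary states -- feasible only when the state space is small.
    total = 0.0
    for s in range(2 ** n_bits):
        state = np.array([(s >> i) & 1 for i in range(n_bits)], dtype=float)
        total += np.exp(log_prob(state)) * f(state)
    return total

# Toy case: 3 independent Bernoulli(0.5) bits, f = number of ones.
log_p = lambda s: 3 * np.log(0.5)                    # uniform over all 8 states
ev = exact_expectation(log_p, lambda s: s.sum(), 3)  # exact expectation = 1.5
```

The result is exact (zero variance), at the price of a sum that grows exponentially in the number of discrete dimensions.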

Whew! That’s a lot of maths. But the good thing is, you hardly ever have to think about it in detail because software engineers have put tremendous effort into making these algorithms easily accessible via libraries. We are going to take a brief look at one of them.

`Pyro`: Universal Probabilistic Programming

Pyro is a probabilistic programming framework that allows users to write flexible models in terms of a simple API. Pyro is written in Python and uses the popular PyTorch library for its internal representation of computation graphs and as an auto-differentiation engine. Pyro is quite expressive due to the fact that it allows the model/guide to have fully imperative control flow. Its core API consists of these functionalities:

- `pyro.param()` for defining learnable parameters.
- `pyro.dist` contains a large collection of probability distributions.
- `pyro.sample()` for sampling from a given distribution.

Let’s take a concrete example and work it out.

MoG (Mixture of Gaussians) is a relatively simple but widely studied probabilistic model. It has an important application in soft-clustering. For the sake of simplicity we assume we have only two mixture components. The generative view of the model is basically this: we flip a coin (latent) with bias and depending on the outcome we sample data from either of the two Gaussians and

where is data index, is the set of model parameters we need to learn. This is how you write that in Pyro:

```
def model(data): # Take the observations
    # Define coin bias as a parameter. That's what 'pyro.param' does
    rho = pyro.param("rho", # Give it a name for Pyro to track properly
                     torch.tensor([0.5]), # Initial value
                     constraint=dist.constraints.unit_interval) # Has to be in [0, 1]
    # Define both means and stds with random initial values
    means = pyro.param("M", torch.tensor([1.5, 3.]))
    stds = pyro.param("S", torch.tensor([0.5, 0.5]),
                      constraint=dist.constraints.positive) # std deviation cannot be negative
    with pyro.plate("data", len(data)): # Mark conditional independence
        # construct a Bernoulli and sample from it
        c = pyro.sample("c", dist.Bernoulli(rho)) # c \in {0, 1}
        c = c.type(torch.LongTensor)
        X = dist.Normal(means[c], stds[c]) # pick a mean as per 'c'
        pyro.sample("x", X, obs=data) # sample data (also mark it as observed)
```

Due to the discrete and low-dimensional nature of the latent variable , this problem is in general tractable in terms of computing the posterior. But let’s assume it is not. The true posterior is the quantity known as the “assignment” that reveals the latent factor, i.e., what the coin toss result was when a given was sampled. We define a guide on , parameterized by variational parameters

In Pyro, we define a guide that encodes this

```
def guide(data): # Guide doesn't require data; just needs the value of N
    with pyro.plate("data", len(data)): # conditional independence
        # Define variational parameters \lambda_i (one for every data point)
        lam = pyro.param("lam",
                         torch.rand(len(data)), # randomly initialized
                         constraint=dist.constraints.unit_interval) # \in [0, 1]
        c = pyro.sample("c", # Careful, this name HAS TO be the same to match the model
                        dist.Bernoulli(lam))
```

We generate some synthetic data from the following simulator to train our model on.

```
def getdata(N, mean1=2.0, mean2=-1.0, std1=0.5, std2=0.5):
    D1 = np.random.randn(N//2,) * std1 + mean1
    D2 = np.random.randn(N//2,) * std2 + mean2
    D = np.concatenate([D1, D2], 0)
    np.random.shuffle(D)
    return torch.from_numpy(D.astype(np.float32))
```

Finally, Pyro requires a bit of boilerplate to set up the optimization

```
data = getdata(200) # 200 data points
pyro.clear_param_store()
optim = pyro.optim.Adam({})
svi = pyro.infer.SVI(model, guide, optim, pyro.infer.Trace_ELBO())
for t in range(10000):
    svi.step(data)
```

That’s pretty much all we need. I have plotted (1) the ELBO loss, (2) the variational parameter for every data point, (3) the two Gaussians in the model and (4) the coin bias as training progresses.

The full code is available in this gist: https://gist.github.com/dasayan05/aca3352cd00058511e8372912ff685d8.

That’s all for today. Hopefully I was able to convey the bigger picture of probabilistic programming, which is quite useful for modelling lots of problems. The following lists the sources of information used while writing this article. Interested readers are encouraged to check them out.

BibTeX citation:

```
@online{das2020,
author = {Das, Ayan},
title = {Introduction to {Probabilistic} {Programming}},
date = {2020-05-05},
url = {https://ayandas.me//blogs/2020-04-30-probabilistic-programming.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2020. “Introduction to Probabilistic
Programming.” May 5, 2020. https://ayandas.me//blogs/2020-04-30-probabilistic-programming.html.

Let’s start with something simple. Consider a Random Variable ( being time) with support with equal probability on both of its possible values. Think of it as a *score* you get at time which can be either or as a result of an unbiased coin-flip. In terms of probability:

Realization (samples) of for would look like

Let us define another Random Variable which is nothing but an accumulator of till time . So, by definition

Realization of corresponding to above sequence would look like

This is popularly known as the **Random Walk**. With the basics ready, let us take two such random walks, namely and , and treat them as the and coordinates of a *Random Vector*, namely .
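A minimal NumPy sketch of such a 2D random walk (variable names are mine):

```python
import numpy as np

def random_walk_2d(T, rng=None):
    # Two independent +/-1 score sequences S_t (one per coordinate),
    # accumulated into the (x, y) position of the random vector.
    rng = rng or np.random.default_rng(0)
    steps = rng.choice([-1, 1], size=(T, 2))  # the per-step scores for x and y
    return np.cumsum(steps, axis=0)           # the accumulators, i.e., the walk

path = random_walk_2d(1000)  # 1000 steps of a 2D random walk
```

Plotting `path[:, 0]` against `path[:, 1]` traces out the checkerboard-like pattern discussed next.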

So far it all looks nice and mathy, right? Here’s the fun part. Let me keep the time (i.e., ) running and keep track of the path that the vector traces on a 2D plane

It will create a cool random checkerboard-like pattern as time goes on. Looking at the tip (the ‘dot’), you might see it as a tiny particle. It so happens that this is a discretized version of a continuous phenomenon observed in real microscopic particles in fluids, famously known as **Brownian Motion**.

Real Brownian Motion is continuous. Let’s work it out, but very briefly. We divide an arbitrary time interval into small intervals of length and have a modified score Random Variable with support with equal probability as before. We still have the same definition of . It so happens that as we approach the limiting case of

it gives us the continuous analogue of **Brownian Motion**. Similar to the discrete case, if we trace the path of with large (yes, in practice we cannot go to infinity, sorry), patterns like this will emerge

To make it more artistic, I took an even bigger and ran the simulation for quite a while and got quite beautiful jittery patterns. Random numbers being at the heart of the phenomenon, we’ll get different patterns in different runs. Here are two such simulation results:

**Want to learn more ?**

Dynamical Systems are defined by a state space and a system dynamics (a function ). A state is a specific (abstract) configuration of a system and the dynamics determines how the state “evolves” over time. The dynamics is often represented by a differential equation that specifies the change of state over time. So,

The true state of the system at some point in time is determined by solving an Initial Value Problem (IVP) starting from an initial state . We then solve for consecutive states with as

Having a sufficiently small ensures proper evolution of the states.

Now this may seem quite trivial, at least to those who have studied Differential Equations. But there are specific cases of which lead to an evolution of states whose trajectory is surprisingly beautiful. For reasons that are beyond the scope of this article, these are called **Chaos**. There is a specific branch of dynamical systems (named “Chaos Theory”) that deals with the characteristics of such chaotic systems. Below are three such chaotic systems with their trajectories visualized in 3D state space. To be specific, we take each system with an initial state (they are very sensitive to initial states), compute successive states with a small enough and visualize them as a continuous path in 3D. The corresponding figures depict an animation of the evolution of states over time as well as the whole trajectory all at once.
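To make the recipe concrete, here is a sketch using the Lorenz system, a classic chaotic dynamics with its standard parameters (the article’s three systems appear only in figures, so Lorenz is my assumption of a representative example):

```python
import numpy as np

# The Lorenz system -- a classic chaotic dynamics with its standard parameters.
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return np.array([sigma * (y - x),
                     x * (rho - z) - y,
                     x * y - beta * z])

def simulate(f, s0, dt=0.01, steps=5000):
    # Compute successive states with small steps, exactly as described above
    traj = [np.asarray(s0, dtype=float)]
    for _ in range(steps):
        traj.append(traj[-1] + dt * f(traj[-1]))
    return np.stack(traj)

traj = simulate(lorenz, [1.0, 1.0, 1.0])  # a (steps+1) x 3 trajectory in state space
```

Plotting the three columns of `traj` as a 3D curve reveals the famous butterfly-shaped attractor.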

**Want to learn more ?**

We all know about Fourier Series, right! But I am sure not all of you have seen this artistic side of it. Well, these patterns aren’t really about Fourier series itself, but Fourier series helps in creating them.

We know the following to be the “synthesis equation” of complex fourier series

which represents the synthesis of a periodic function of period from its frequency components . Often, as a practical measure, we crop the infinite summation to a limited range . Furthermore, let’s consider without loss of generality. So, we see as a function parameterized by the frequency components

By doing this, we can make complex-valued functions by putting in different and running . However, not all choices lead to anything visually appealing. A particular feature of an object that appeals to the human eye is “Symmetry”. We are gonna exploit this here. A little refresher on Fourier series will make you realize that if the coefficients are real-valued, then has a symmetry property. And that’s all we need.

We pick random (see, it’s real numbers now), run the clock and trace the path travelled by the complex point as time progresses. It creates patterns like the ones shown below
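A small NumPy sketch of this construction (the coefficient scale, number of terms and point count are arbitrary choices of mine):

```python
import numpy as np

def fourier_path(K=10, n_points=2000, rng=None):
    # Random REAL coefficients c_k for k = -K..K give a curve symmetric
    # about the real axis, per the symmetry property mentioned above.
    rng = rng or np.random.default_rng(0)
    c = rng.standard_normal(2 * K + 1)      # real-valued coefficients
    k = np.arange(-K, K + 1)
    t = np.linspace(0.0, 1.0, n_points)     # one period, taking T = 1
    # Synthesis equation: f(t) = sum_k c_k * exp(2j*pi*k*t)
    return (c[None, :] * np.exp(2j * np.pi * k[None, :] * t[:, None])).sum(axis=1)

path = fourier_path()  # complex points tracing the closed curve
```

Plotting `path.real` against `path.imag` draws one such pattern; real coefficients guarantee the mirror symmetry (f(-t) is the conjugate of f(t)) and periodicity closes the curve.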

There is one way to customize these - the value of . As we know, has the interpretation of the magnitude of a frequency component. A large value of implies the introduction of more high frequencies into the time-domain signal. This visually leads to finer details (i.e., more curves and bendings). Lowering the value of clears out these fine details and the path becomes more and more flat. The image below shows decreasing values of along the columns. You can see the patterns losing detail as we go right. And just like before, every run will create different patterns as they are solely controlled by random coefficients.

**Want to learn more ?**

These two sets are very important in the study of “Fractals” - objects with self-repeating patterns. Fractals are extremely popular concepts in certain branches of mathematics, but they are mostly famous for their eye-catching visual appearance. If you ever come across an article about fractals, you are likely to see some of the most artistic patterns you’ve ever seen in the context of mathematics. Diving into the details of fractals and self-repeating patterns will open up a vast world of “Mathematical Art”. Although, in this article, I can only show you a tiny bit of it - two sets, namely the “Mandelbrot” and “Julia” sets. Let’s start with the *all important function*

where are complex numbers. This apparently simple complex-valued function is at the heart of these sets. All it does is square its argument and add the complex number that the function is parameterized with. Also, we denote as the -times repeated application of the function on a given , i.e.

With these basic definitions in hand, the **Mandelbrot set** (named after mathematician Benoit Mandelbrot) is the set of all for which

Simply put, there is a set of values for where, if you repeatedly apply on zero (i.e. ), the output *does not diverge*. All such values of make up the so-called “Mandelbrot Set”. The values of that do not diverge can be characterized by how many repeated applications of they can tolerate before their absolute value exceeds a predefined “*escape radius*”, let’s call it . This creates a loose sense of “strength” of a certain that can be written as

It might all look strange, but if you treat the integer as a grayscale intensity value for a grid of points on the 2D complex plane (i.e., an image), you will get a picture similar to this (don’t get confused, the picture is indeed grayscale; I added PyPlot’s `plt.cm.twilight_shifted` colormap for enhancing the visual appeal). The grid is in the range and the escape radius is .
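In code, the escape-time “strength” computation can be sketched as follows (the grid range, resolution and escape radius below are illustrative stand-ins of mine for the unspecified values above):

```python
import numpy as np

def mandelbrot_strength(grid, R=2.0, max_iter=50):
    # Escape-time "strength": how many applications of f(z) = z^2 + c
    # (starting from z = 0) each point c tolerates before |z| exceeds R.
    z = np.zeros_like(grid)
    strength = np.zeros(grid.shape, dtype=int)
    for _ in range(max_iter):
        mask = np.abs(z) <= R              # points that haven't escaped yet
        z[mask] = z[mask] ** 2 + grid[mask]
        strength += mask                   # surviving points gain one unit
    return strength

# Illustrative grid over roughly [-2, 1] x [-1.5, 1.5] on the complex plane
x, y = np.meshgrid(np.linspace(-2, 1, 300), np.linspace(-1.5, 1.5, 300))
img = mandelbrot_strength(x + 1j * y)
```

Rendering `img` with `plt.imshow` (and a colormap of your choice) reproduces the familiar Mandelbrot picture: points in the set saturate at `max_iter`, while divergent points fade out quickly.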

What is so fascinating about this pattern is the fact that it is self-repeating. If you zoom into a small portion of the image, you would see the same pattern again.

Another very similar concept exists, called the “Julia Set”, which exhibits similar visual structure. Unlike the Mandelbrot set, we consider a to be in the Julia set if

Please note that this time the set is parameterized by and we are interested in how the *argument of the function* behaves under repeated application of . From here, things are similar. We define a similar “strength” for every as

Please note that as a result of this new definition, the diagram is parameterized by , i.e., we will get a different image for different . In principle, we can visualize such images for different (they are indeed pretty cool), but let’s go a bit further than that. We will vary along a trajectory, produce the diagram for each and view them as an animation. This creates an amazing visual effect. Technically, I varied along a circle of radius , i.e., with
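The Julia-set variant differs only in what is iterated; a sketch (the circle radius, the angle and the grid below are arbitrary picks of mine, not the article’s values):

```python
import numpy as np

def julia_strength(grid, c, R=2.0, max_iter=50):
    # Same escape-time idea, but now each grid point is the *argument* z0
    # of the iteration, with a single fixed parameter c for the whole image.
    z = grid.copy()
    strength = np.zeros(grid.shape, dtype=int)
    for _ in range(max_iter):
        mask = np.abs(z) <= R
        z[mask] = z[mask] ** 2 + c
        strength += mask
    return strength

x, y = np.meshgrid(np.linspace(-1.5, 1.5, 300), np.linspace(-1.5, 1.5, 300))
theta = 2.0                                        # one point on the c-trajectory
img = julia_strength(x + 1j * y, 0.8 * np.exp(1j * theta))  # c on a circle of radius 0.8
```

Sweeping `theta` over one full turn and rendering each `img` as a frame yields the animation described above.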

**Want to know more ?**

Alright then! That is pretty much it. Due to constraints of time, space and scope it’s not possible to explain everything in detail in one article. There are plenty of resources available online (I have already provided some links) which might be useful in case you are interested. Feel free to explore the details of whatever new you learnt today. If you would like to reproduce the diagrams and images, please use the code here https://github.com/dasayan05/patterns-of-randomness (sorry, the code is a bit messy; you’ll have to figure it out).

BibTeX citation:

```
@online{das2020,
author = {Das, Ayan},
title = {Patterns of {Randomness}},
date = {2020-04-15},
url = {https://ayandas.me//blogs/2020-04-15-patterns-of-randomness.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2020. “Patterns of Randomness.” April 15, 2020.
https://ayandas.me//blogs/2020-04-15-patterns-of-randomness.html.

Let’s put Neural ODEs aside for a moment and take a refresher on ODEs themselves. Because of their unpopularity in the deep learning community, chances are that you haven’t looked at them since high school. We will focus our discussion on first-order linear ODEs, which take the generic form

where . Please recall that ODEs are differential equations that involve only one independent variable, which in our case is . Geometrically, such an ODE represents a *family of curves/functions* , also called the *solutions* of the ODE. The function , often called the *dynamics of the system*, denotes a common characteristics of all the solutions. Specifically, it denotes the first-derivative (slope) of all the solutions. An example would make things clear: let’s say the dynamics of an ODE is . With the help of basic calculus, we can see the family of solutions are for any value of .

Just like any other algorithm in Deep Learning, we can (and we have to) go beyond space and establish similar ODEs in higher dimensions. A *system of ODEs* with dependent variables and independent variable can be written as

With a vectorized notation of and , we can write

The dynamics can be seen as a **Vector Field** where given any , denotes its gradient with respect to . The independent variable can often be regarded as **time**. For example, Fig.1 shows the space and a dynamics defined on it. Please note that it is a time-invariant system, i.e., the dynamics is independent of . A system with time-dependent dynamics would have a different gradient on a given depending on which time you visit it.

Although I showed the solution of an extremely simple system with dynamics , most practical systems are far from it. Systems with higher dimension and complicated dynamics are very difficult to solve analytically. This is when we resort to *numerical methods*. A specific way of solving any ODE numerically is to solve an **Initial Value Problem**, where given the system (dynamics) and an *initial condition*, one can iteratively “trace” the solution. I emphasized the term *trace* because that’s what it is. Think of it as dropping a small particle on the vector field at some point and letting it *flow according to the gradients* at every point.

Fig.2 shows how two different initial conditions (red dots) lead to two different curves/solutions (a small segment of each curve is shown). These curves/solutions are from the family of curves represented by the system whose dynamics is shown with black arrows. Different numerical methods are available, varying in how well they do the “tracing” and how much error they tolerate. Starting from naive ones, we have modern numerical solvers to tackle initial value problems. For the sake of simplicity, we will focus on one of the simplest yet popular methods, known as **Forward Euler’s method**. The algorithm simply does the following: it starts from a given initial state at and literally goes in the direction of the gradient at that point, i.e. , and keeps doing so till using a small step size of . The following iterative update rule summarizes everything

In case you haven’t noticed, the formula can be obtained trivially from the discretized version of analytic derivative

If you look at Fig.2 closely enough, you will see the red curves are made up of discrete segments, which is a result of solving an initial value problem using Forward Euler’s method.
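A few lines of Python make the tracing concrete (a toy example of mine: the dynamics f(z, t) = z, whose exact solution through z(0) = 1 is e^t):

```python
import numpy as np

def forward_euler(f, z0, t0=0.0, t1=1.0, n_steps=1000):
    # Trace the solution from an initial value: z <- z + dt * f(z, t)
    dt = (t1 - t0) / n_steps
    z, t = float(z0), t0
    for _ in range(n_steps):
        z += dt * f(z, t)   # step in the direction of the gradient
        t += dt
    return z

# dz/dt = z with z(0) = 1 has the analytic solution z(t) = e^t,
# so the traced value at t = 1 should be close to e ~ 2.71828
z1 = forward_euler(lambda z, t: z, 1.0)
```

Shrinking the step size (increasing `n_steps`) drives the traced value toward the analytic solution, which is exactly the error/step-size trade-off mentioned above.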

Let’s look at the core structure of ResNet, an extremely popular deep network that almost revolutionized deep network architecture. The most unique structural component of ResNet is its residual blocks, which compute “increments” on top of the previous layer’s activation instead of computing activations directly. If the activation of layer is , then

where is the residual function (the increment on top of the last layer). I am pretty sure the reader can feel where this is going. Yes, the residual architecture resembles Forward Euler’s method on an ODE with dynamics . Having such residual layers is similar to executing steps of Forward Euler’s method with step size . The idea of Neural ODE is to “*parameterize the dynamics of this ODE explicitly rather than parameterizing every layer*”. So we can have

and successive layers can be realized by -step Forward Euler evaluations. As you can guess, we can choose as per our requirement, and in the limiting case we can think of it as an infinite-layer network. Although you must understand that such a parameterization cannot provide infinite capacity, as the parameters are shared and finite in number. Fig.3 below depicts the resemblance of ResNet to Forward Euler iteration.

Although we already went over this in the last section, let me put it more formally one more time. An “ODE Layer” is basically characterized by its dynamics function , which can be realized by a (deep) neural network. This network takes as input the “current state” (basically an activation) and time , and produces the “direction” (i.e., ) in which the state should go next. A full forward pass through this layer is essentially executing an -step Forward Euler on the ODE with an “initial state” (aka “input”) . is a hyperparameter we choose and can be compared to the “depth” of a standard deep neural network. Following the original paper’s convention (with a bit of python-style syntax), we write the forward pass as

where “ODESolve” is *any* iterative ODE solver algorithm and not just Forward Euler. By the end of this article you’ll understand why the specific machinery of Euler’s method is not essential.

Coming to the backward pass, a naive solution you might be tempted to offer is to back-propagate through the operations of the solver. I mean, look at the iterative update equation Eq.1 of an ODE solver (for now just Forward Euler) - everything is indeed differentiable! But then it is no better than ResNet, at least not from a memory-cost point of view. Note that backpropagating through a ResNet (and so through any standard deep network) requires storing the intermediate activations to be used later in the backward pass. This operation is responsible for the memory complexity of backpropagation being linear in the number of layers (i.e., ). This is where the authors proposed a brilliant idea to make it by not storing the intermediate states.

Just like any other computational graph associated with a deep network, we get a gradient signal coming from the loss. Let’s denote the incoming gradient at the end of the ODE layer as , where is a scalar loss. All we have to do is use this incoming gradient to compute and perform an SGD (or any variant) step. A bunch of parameter updates in the right direction would cause the dynamics to change and consequently the whole trajectory (i.e., trace), except the input. Fig.4 shows a graphical representation of the same. Please note that for simplicity, the loss has been calculated using itself. To be specific, the loss (green dotted line) is the Euclidean distance between and its (available) ground truth .

In order to accomplish our goal of computing the parameter gradients, we define a quantity , called the “Adjoint state”

Comparing to a standard neural network, this is basically the gradient of the loss w.r.t. all intermediate activations (states of the ODE). It is indeed a generalization of a quantity I mentioned earlier, i.e., the incoming gradient into the layer . Although we cannot compute this quantity independently for every timestep, a bit of rigorous mathematics (refer to appendix B.1 of the original paper) can show that the adjoint state follows a differential equation with dynamics function

and that’s good news! We now have the dynamics that follows and an initial value (the value at the extreme end ). That means we can run an ODE solver backward in time from and calculate all in succession, like this

Please look at Eq.2 for the signature of the “ODESolve” function. This time we also produced all intermediate states of the solver as output. An intuitive visualization of the adjoint state and its dynamics is given in Fig.5 below.

The quantity on the right-hand side of Eq.3 is a vector-Jacobian product, where is the Jacobian matrix. Given the functional form of , this can be readily computed using the current state and the latest parameter values. But wait a minute! I said before that we are not storing the intermediate values. Where do we get them now? The answer is: we can compute them again. Please remember that we still have with us, along with the extreme value (the output of the forward pass). We can run another ODE backwards in time starting from . Essentially, we can fuse the two ODEs together

It’s basically executing two update equations for two ODEs in one “for loop” traversing from . The intermediate values of won’t be exactly the same as what we got in the forward pass (because no numerical solver has infinite precision), but they are indeed good approximations.

Okay, what about the parameters of the model (dynamics) ? How do we get to our ultimate goal, ?

Let’s define another quantity very similar to the adjoint state, i.e., the parameter gradient of the loss at every step of the ODE solver

The point to note here is that , as the parameters do not change during a trajectory. Instead, these quantities signify the *local influences* of the parameters at each step of computation. This is very similar to a roll-out of an RNN in time, where parameters are shared across time steps. With a proof very similar to that of the adjoint state, it can be shown that

Just like shared-weight RNNs, we can compute the full parameter gradient as a combination of local influences

The quantity is another vector-Jacobian product and can be evaluated using the values of , and the latest parameters . So do we need another pass over the whole trajectory, as Eq.4 consists of an integral? **Fortunately, NO**. Let me bring your attention to the fact that whatever we need to compute this vector-Jacobian product is already being computed in the fused ODE we saw before. Furthermore, we can tweak Eq.4 as

I hope you are seeing what I am seeing. This is equivalent to solving yet another ODE (backwards in time, again!) with dynamics and initial state . The end state of this ODE completes the whole integral in Eq.5 and therefore is equal to . Just like last time, we can fuse this ODE with the last two combined

Take some time to digest the final 3-way ODE and make sure you get it. Because that is pretty much it. Once we get the parameter gradient, we can continue with the normal stochastic gradient update rule (SGD or family). Additionally, you may want to pass to the computation graph that comes before our ODE layer. A representative diagram containing a clear picture of all the ODEs and their interdependencies is shown above.

Implementing this algorithm is a bit tricky due to its non-conventional approach to gradient computation, especially if you are using a library like PyTorch which adheres to a specific model of computation. I am providing a very simplified implementation of the ODE Layer as a PyTorch `nn.Module`. Because this post has already become quite long and stuffed with maths and new concepts, I am leaving it here. I am putting the core part of the code (well commented) below just for reference, but a complete application can be found in this GitHub repo of mine. My implementation is quite simplified, as I have hard-coded Forward Euler as the only choice of ODE solver. Feel free to contribute to my repo.

```
#############################################################
# Full code at https://github.com/dasayan05/neuralode-pytorch
#############################################################
import torch

class ODELayerFunc(torch.autograd.Function):

    @staticmethod
    def forward(context, z0, t_range_forward, dynamics, *theta):
        delta_t = t_range_forward[1] - t_range_forward[0] # get the step size
        zt = z0.clone()
        for tf in t_range_forward: # Forward Euler's method
            f = dynamics(zt, tf)
            zt = zt + delta_t * f # update
        context.save_for_backward(zt, t_range_forward, delta_t, *theta)
        context.dynamics = dynamics # 'save_for_backward()' won't take it, so ..
        return zt # final evaluation of 'zt', i.e., zT

    @staticmethod
    def backward(context, adj_end):
        # Unpack the stuff saved in the forward pass
        zT, t_range_forward, delta_t, *theta = context.saved_tensors
        dynamics = context.dynamics
        t_range_backward = torch.flip(t_range_forward, [0,]) # Time runs backward
        zt = zT.clone().requires_grad_()
        adjoint = adj_end.clone()
        dLdp = [torch.zeros_like(p) for p in theta] # Parameter grads (an accumulator)
        for tb in t_range_backward:
            with torch.set_grad_enabled(True):
                # 'set_grad_enabled()' is required for the graph to be created ...
                f = dynamics(zt, tb)
                # ... and be able to compute all vector-jacobian products
                adjoint_dynamics, *dldp_ = torch.autograd.grad([-f], [zt, *theta], grad_outputs=[adjoint])
            for i, p in enumerate(dldp_):
                dLdp[i] = dLdp[i] - delta_t * p # update param grads
            adjoint = adjoint - delta_t * adjoint_dynamics # update the adjoint
            zt.data = zt.data - delta_t * f.data # Forward Euler's (backward in time)
        return (adjoint, None, None, *dLdp)

class ODELayer(torch.nn.Module):

    def __init__(self, dynamics, t_start=0., t_end=1., granularity=25):
        super().__init__()
        self.dynamics = dynamics
        self.t_start, self.t_end, self.granularity = t_start, t_end, granularity
        self.t_range = torch.linspace(self.t_start, self.t_end, self.granularity)

    def forward(self, input):
        return ODELayerFunc.apply(input, self.t_range, self.dynamics, *self.dynamics.parameters())
```

That’s all for today. See you.

BibTeX citation:

```
@online{das2020,
author = {Das, Ayan},
title = {Neural {Ordinary} {Differential} {Equation} {(Neural} {ODE)}},
date = {2020-03-20},
url = {https://ayandas.me//blogs/2020-03-20-neural-ode.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2020. “Neural Ordinary Differential Equation (Neural
ODE).” March 20, 2020. https://ayandas.me//blogs/2020-03-20-neural-ode.html.

A quick recap would make going forward easier.

Given a Directed PGM with continuous latent variable and observed variable , the inference problem for turned out to be intractable because of the form of its posterior

To solve this problem, VI defines a *parameterized approximation* of , i.e., and formulates it as an optimization problem

The objective can further be simplified as

is precisely the objective we maximize. The can best be explained by decomposing it into two factors. One of them takes care of maximizing the expected conditional log-likelihood (of the data given the latent) and the other arranges the latent space in a way that it matches a predefined distribution.

For a detailed explanation, go through the previous article.

Variational Autoencoder (VAE) was first proposed in the paper titled “Auto-Encoding Variational Bayes” by D. P. Kingma & Max Welling. The paper proposes two things:

- A parameterized *inference model* instead of just
- The reparameterization trick to achieve efficient training

As we go along, I will try to convey the fact that these are essentially developments on top of the general VI framework we learnt earlier. I will focus on how each of them is related to VI in the following (sub)sections.

The idea is to replace the generically parameterized in the VI framework with a data-driven model , named the *Inference model*. What does that mean? It basically means that we are no longer interested in the unconditional distribution on , but instead we want a conditional distribution on given the observed data. Please recall our “generative view” of the model

With the inference model in hand, we now have an “inference view” as follows

It means, we can do inference just by ancestral sampling after our model is trained. Of course, we don’t know the real , so we consider a parameterized approximation as I already mentioned.

These two “views”, when combined, form the basis of the Variational Autoencoder (See *Fig.1: Subfig.1*).

The “combined model” shown above gives us insight into the training process. Please note that the model starts from (a data sample from our dataset) - generates via the Inference model - and then maps it back to again using the Generative model (See *Fig.1: Subfig.2*). I hope the reader can now guess why it’s called an Autoencoder! So, we clearly have a computational advantage here: we can perform training on a per-sample basis, just like inference. This is not true for many approximate inference algorithms of the pre-VAE era.

So, succinctly, all we have to do is a “forward pass” through the model (yes, the two sampling equations above) and maximize , where is a sample we got from the inference model. Note that we need to parameterize the generative model as well (with ). In general, we almost always choose and to be fully-differentiable functions like neural networks (See *Fig.1: Subfig.3* for a cleaner diagram). Now we go back to our objective function from the VI framework. To formalize the training objective for VAE, we just need to replace by in the VI framework (please compare the equations with the recap section)

And the objective

Then,

As usual, is a chosen distribution that we want the structure of to follow; it is often a *standard Gaussian/Normal* (i.e., )

The specific parameterization of reveals that we predict a distribution in the forward pass just by predicting its parameters.

The first term of is relatively easy; it's a loss function that we have used a lot in machine learning - the *log-likelihood*. Very often it is just the MSE loss between the predicted and original data . What about the second term? It turns out that we can have a closed-form solution for it. Because I don't want unnecessary maths to clutter this post, I am just putting down the formula for the readers to look at. But I would highly recommend looking at the proof in Appendix B of the original VAE paper. It's not hard, believe me. So, putting the proper values of and into the KL term, we get

Please note that are the individual dimensions of the predicted mean and std vectors. We can easily compute this in the forward pass and add it to the log-likelihood (first term) to get the full (ELBO) loss.
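For concreteness, the closed-form KL term can be sketched in a few lines of NumPy. This is only an illustration (the `kl_to_standard_normal` helper name is mine), assuming a diagonal-Gaussian posterior and a standard-normal prior as in the text:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Closed form: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2).
    """
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# When the predicted posterior already matches the prior, the KL term vanishes.
mu = np.zeros(4)
log_var = np.zeros(4)                        # sigma = 1 in every dimension
print(kl_to_standard_normal(mu, log_var))    # 0.0
```

Shifting the predicted mean away from zero (or the variance away from one) makes the term strictly positive, which is exactly the "pull towards the prior" the objective describes.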

Okay. Let's talk about the forward pass in a bit more detail. Believe me, it's not as easy as it looks. You may have noticed (*Fig.1: Subfig.3*) that the forward pass contains a sampling operation (sampling from ) which is *NOT differentiable*. What do we do now?

I showed before that in forward pass, we get the by sampling from our parameterized inference model. Now that we know the exact form of the inference model, the sampling will look something like this

The idea is basically to make this sampling operation differentiable w.r.t and . In order to do this, we pull a trick like this

This is known as “reparameterization”. We basically rewrite the sampling operation in a way that *separates the source of randomness* (i.e., ) from the deterministic quantities (i.e., and ). This allows the backpropagation algorithm to flow derivatives into and . However, please note that it is still not differentiable w.r.t. , but .. guess what .. we don't need that! Just having derivatives w.r.t. and is enough to flow them backwards and pass them to the parameters of the inference model (i.e., ). Fig.2 should make everything clear if it isn't already.
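As a sketch of the idea (the helper names below are mine, not the paper's code), compare direct sampling with its reparameterized rewrite:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_naive(mu, sigma):
    # Direct sampling: the randomness is entangled with mu and sigma,
    # so gradients cannot flow back through this operation.
    return rng.normal(mu, sigma)

def sample_reparameterized(mu, sigma):
    # Reparameterization: the randomness (eps) is separated from the
    # deterministic quantities, so z is differentiable w.r.t. mu and sigma:
    # dz/dmu = 1 and dz/dsigma = eps, both well-defined given eps.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 0.1])
z = sample_reparameterized(mu, sigma)
```

Both functions draw from the same distribution; only the second exposes a deterministic, differentiable path from the parameters to the sample.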

That’s pretty much it. To wrap up, here is the full forward-backward algorithm for training VAE:

- Given from the dataset, compute .
- Compute a latent sample as
- Compute the full loss as .
- Update parameters as
- Repeat.
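The steps above can be sketched as a single toy forward pass. This is only an illustration with randomly initialized linear maps standing in for the encoder/decoder networks (all names and dimensions are made up); a real implementation would use an autodiff framework to compute the parameter updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions and randomly initialized linear "networks".
x_dim, z_dim = 8, 2
W_enc_mu = rng.standard_normal((z_dim, x_dim)) * 0.1
W_enc_lv = rng.standard_normal((z_dim, x_dim)) * 0.1
W_dec    = rng.standard_normal((x_dim, z_dim)) * 0.1

def vae_loss(x):
    # 1. Inference model predicts the posterior parameters.
    mu, log_var = W_enc_mu @ x, W_enc_lv @ x
    # 2. Reparameterized sample of the latent.
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(z_dim)
    # 3. Generative model reconstructs the data.
    x_hat = W_dec @ z
    # 4. Negative ELBO = reconstruction term (MSE standing in for the
    #    negative log-likelihood) + KL term in closed form.
    recon = np.sum((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return recon + kl

x = rng.standard_normal(x_dim)
loss = vae_loss(x)   # a finite scalar; its gradient would drive the update step
```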

BibTeX citation:

```
@online{das2020,
author = {Das, Ayan},
title = {Foundation of {Variational} {Autoencoder} {(VAE)}},
date = {2020-01-01},
url = {https://ayandas.me//blogs/2020-01-01-variational-autoencoder.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2020. “Foundation of Variational Autoencoder
(VAE).” January 1, 2020. https://ayandas.me//blogs/2020-01-01-variational-autoencoder.html.

A Directed PGM, also known as a Bayesian Network, is a set of random variables (RVs) associated with a graph structure (a DAG) expressing conditional independence (CI assumptions) among them. Without the CI assumptions, one would have to model the joint distribution over all the RVs directly, which would be difficult. Fig. 1 shows a typical DAG expressing the conditional independence among the set of participating RVs . With the CI assumptions in place, we can write the joint distribution over as

In general, joint distribution over a set of RVs with CI assumptions encoded in a graph can be written/factorized as

where is the set of parents of node according to graph . One can easily verify that the factorization above resembles the general formula.

A key idea in Directed PGMs is the way we sample from them. We use something known as **Ancestral Sampling**. Unlike a joint distribution over all random variables (), a graph structure (i.e., ) breaks it down into multiple factors which then need to be synchronized to get a full sample from the graph. Here's how we do it:

1. We start with RVs with no parent (according to ). We sample from them as usual.
2. Plug the samples into the conditionals involving those RVs. Sample the new RVs from those conditionals.
3. Plug in the samples from step 2 and keep sampling until all variables are sampled.

So, in the example depicted in Fig. 1

So, we get one sample as . In contrast, a full joint distribution needs to be sampled all at once
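The ancestral-sampling procedure can be sketched for a hypothetical two-node chain (the tabular distributions below are made up for illustration):

```python
import random

random.seed(0)

# A hypothetical two-node chain x -> y with tabular distributions.
p_x = {0: 0.3, 1: 0.7}
p_y_given_x = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}

def sample_categorical(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def ancestral_sample():
    # Step 1: sample the parentless root x from its marginal.
    x = sample_categorical(p_x)
    # Step 2: plug x into the conditional and sample y from p(y|x).
    y = sample_categorical(p_y_given_x[x])
    return x, y

x, y = ancestral_sample()
```

Each call yields one full joint sample (x, y) without ever materializing the joint table, which is the whole point of the factorization.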

Now, as we have a clean way of representing a complicated distribution in the form of a graph structure, we can parameterize each individual distribution in the factorized form to parameterize the whole joint distribution. Parameterizing the distribution in the example in Fig. 1

For convenience, we will write the above factorization as where is the set of all parameters.

For learning, we require a set of data samples (i.e., a dataset) collected from an unknown data generating distribution . So, a dataset where each data sample is drawn as

The likelihood of the data samples under our model signifies the probability that the samples came from our model. It is simply

The goal of learning is to get a good estimate of . We do it by maximizing the likelihood (or, often log-likelihood) w.r.t , which we call **Maximum Likelihood Estimation** or **MLE** in short
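As a minimal illustration of MLE (a toy Bernoulli model, not tied to the graph in Fig. 1; the data is made up), we can recover the parameter by brute-force maximization of the log-likelihood:

```python
import numpy as np

data = np.array([1, 1, 0, 1, 1, 0, 1, 1])   # hypothetical coin-flip dataset

def log_likelihood(theta, data):
    # log p(D | theta) = sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

# MLE by brute-force grid search over the parameter.
grid = np.linspace(0.01, 0.99, 99)
theta_mle = grid[np.argmax([log_likelihood(t, data) for t in grid])]
# For a Bernoulli, the closed-form MLE is just the sample mean: 6/8 = 0.75.
```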

Inference in a Directed PGM refers to estimating a set of RVs given another set of RVs in a graph . To do inference, we need to have an already learnt model. Inference is like “answering queries” after gathering knowledge (i.e., learning). For example, in our running example, one may ask, “What is the value of given and ?”. The question can be answered by constructing the conditional distribution using the definition of conditional probability

In case a deterministic answer is desired, one can figure out the expected value of under the above distribution
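Here is a sketch of such a query on a made-up tabular joint over two binary RVs, using the definition of conditional probability and then taking the expectation (helper names are mine):

```python
import numpy as np

# Hypothetical joint distribution p(a, b) over two binary RVs as a table.
joint = np.array([[0.1, 0.3],
                  [0.2, 0.4]])     # joint[a, b], entries sum to 1

def posterior_a_given_b(b):
    # Definition of conditional probability: p(a | b) = p(a, b) / p(b).
    return joint[:, b] / joint[:, b].sum()

def expected_a_given_b(b):
    # Deterministic answer: expected value of a under the conditional.
    post = posterior_a_given_b(b)
    return np.sum(np.arange(len(post)) * post)

print(posterior_a_given_b(1))   # p(a | b=1)
```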

It's quite important to understand this point. In this section, I won't tell you anything new per se, but will repeat some of the things I explained in the **Ancestral Sampling** subsection above.

Given a finite set of data (i.e., a dataset), we start our model-building process by asking ourselves one question: “*How could my data have been generated?*”. The answer to this question is precisely “*our model*”. The model (i.e., the graph structure) we build is essentially our belief of how the data was generated. **Well, we might be wrong.** The data may have been generated in some other way, but we always start with a belief - our model. The reason I started modelling my data with the graph structure shown in Fig.1 is that I believe all my data (i.e., ) was generated as follows:

Or, equivalently

The EM algorithm solves the above problem. Although this tutorial is not focused on the EM algorithm, I will give a brief idea of how it works. Remember where we got stuck last time? We didn't have in our dataset and so couldn't perform normal MLE on the model. That's literally the only thing that stopped us. The core idea of the EM algorithm is to estimate using the model and the we have in our dataset, and then use that estimate to perform normal MLE.

The **Expectation (E) step** estimates from a given using the model

where

And then, the **Maximization (M) step** plugs that into the likelihood and performs standard MLE. The likelihood looks like

By repeating the *E & M steps iteratively*, we can get an optimal solution for the parameters and eventually discover the latent factors in the data.
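Here is a toy sketch of the E & M steps for a two-component Gaussian mixture (synthetic data, fixed unit variances, made-up initial guesses), just to show the iteration pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from two well-separated unit-variance Gaussians.
data = np.concatenate([rng.normal(-4, 1, 500), rng.normal(4, 1, 500)])

mu = np.array([-1.0, 1.0])      # initial guesses for the two means
pi = np.array([0.5, 0.5])       # mixing weights

for _ in range(50):
    # E-step: posterior responsibility of each component for each point,
    # i.e. the estimate of the latent assignment given current parameters.
    lik = pi * np.exp(-0.5 * (data[:, None] - mu) ** 2)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: standard MLE updates using those soft assignments.
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)

# mu should now be close to the true means (-4, 4): the latent cluster
# assignments have been discovered without ever being observed.
```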

Apart from the learning problem, which involves estimating the whole joint distribution, there exists another problem worth solving on its own - the **inference problem**, i.e., estimating the latent factor given an observation. For example, we may want to estimate the “pose” of an object given its image in an unsupervised way, or estimate the identity of a person given his/her facial photograph (our last example). Although we have seen how to perform inference in the EM algorithm, I am rewriting it here for convenience.

Taking up the same example of latent variable (i.e., ), we *infer* as

This quantity is also called the **posterior**.

For continuous , we have an integral instead of a summation

If you are a keen observer, you might notice an apparent problem with the inference - it will be computationally intractable, as it involves a *summation/integration over a high-dimensional vector with potentially unbounded support*. For example, if the latent variable denotes a continuous “pose” vector of length , the denominator will contain a -dimensional integral over . At this point, you might realize that even the EM algorithm suffers from this intractability problem.
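To see the offending integral concretely, here is a toy 1-D model (my own choice: z ~ N(0,1), x|z ~ N(z,1)) where the evidence is tractable both analytically and by brute-force Monte Carlo; the point is that this brute force is exactly what stops scaling with dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model where the integral IS tractable: z ~ N(0,1), x|z ~ N(z,1).
# Then the evidence p(x) = ∫ p(x|z) p(z) dz is analytically N(x; 0, 2).
x = 1.3

def gauss_pdf(v, mean, var):
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Monte Carlo estimate of the evidence: average p(x|z) over prior samples.
z = rng.standard_normal(200_000)
evidence_mc = gauss_pdf(x, z, 1.0).mean()
evidence_true = gauss_pdf(x, 0.0, 2.0)
# The two agree in 1-D; with a D-dimensional z the same estimator would need
# exponentially many samples, which is the intractability in question.
```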

Finally, here we are. This is the one algorithm I was most excited to explain, because some of the ground-breaking ideas of this field were born out of it. Variational Inference (VI), although present in the literature for a long time, has recently shown very promising results on problems involving latent variables and deep structure. In the next post, I will go into some of those specific algorithms, but not today. In this article, I will go over the basic framework of VI and how it works.

The idea is really simple: **If we can’t get a tractable closed-form solution for , we’ll approximate it**.

Let the approximation be and we can now form this as an optimization problem:

By choosing a family of distributions flexible enough to model and optimizing over , we can push the approximation towards the real posterior. is the KL-divergence, a measure of dissimilarity between two probability distributions (not a true distance, as it is asymmetric).
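For intuition, the KL-divergence between two discrete distributions can be computed directly from its definition (the numbers below are made up):

```python
import numpy as np

def kl_divergence(q, p):
    # KL(q || p) = sum_z q(z) * log( q(z) / p(z) )
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * np.log(q / p))

p = [0.25, 0.25, 0.5]
q = [0.4, 0.1, 0.5]

print(kl_divergence(q, q))   # 0.0: KL vanishes iff the distributions match
print(kl_divergence(q, p))   # strictly positive otherwise
```

Note that KL(q||p) and KL(p||q) generally differ, which is why it measures dissimilarity rather than being a true metric.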

Now let’s expand the KL-divergence term

We can compute the first two terms in the above expansion, but oh lord! The third term is the *same annoying (intractable) integral* we were avoiding before. What do we do now? This seems to be a deadlock!

Please recall that our original objective was a minimization problem over . We can pull a little trick here - **we can optimize only the first two terms and ignore the third term**. How?

Because the third term is independent of . So, we just need to minimize

Or equivalently, maximize (just flip the two terms)

This term, usually called the ELBO, is quite famous in the VI literature, and you have just witnessed what it looks like and where it came from. Taking a deeper look at the yields even further insight

Now, please consider looking at the last equation for a while because that is what all our efforts led us to. The last equation is totally tractable and also solves our problem. What it basically says is that maximizing (which is a proxy objective for our original optimization problem) is equivalent to maximizing the conditional data likelihood (which we can choose in our graphical model design) and simultaneously pushing our approximate posterior (i.e., ) towards a prior over . The prior is basically how the true latent space is organized. Now the immediate question might arise: “Where do we get from?”. The answer is, we can just choose any distribution as a hypothesis. It will be our belief of how the space is organized.

There is one more interpretation (see figure 5) of the KL-divergence expansion that is interesting to us. Rewriting the KL-expansion and substituting definition, we get

As we know that for any two distributions, the following inequality holds

So, the that we vowed to maximize is a **lower bound** on the observed data log-likelihood. That's amazing, isn't it! Just by maximizing the , we can implicitly get closer to our dream of maximum (log-)likelihood estimation - *the tighter the bound, the better the approximation*.
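We can check the lower-bound property numerically on a toy model where the true evidence is known (my own choice: z ~ N(0,1), x|z ~ N(z,1), so p(x) = N(x; 0, 2); the helper names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

x = 0.8

def log_gauss(v, mean, var):
    return -0.5 * ((v - mean) ** 2 / var + np.log(2 * np.pi * var))

def elbo(q_mean, q_var, n=100_000):
    # ELBO = E_q[ log p(x|z) ] - KL( q(z) || p(z) ), estimated by Monte Carlo.
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(n)
    expected_loglik = log_gauss(x, z, 1.0).mean()
    kl = 0.5 * (q_var + q_mean**2 - 1.0 - np.log(q_var))  # KL(N(m,s^2)||N(0,1))
    return expected_loglik - kl

log_evidence = log_gauss(x, 0.0, 2.0)
# Any choice of q gives a lower bound; the exact posterior N(x/2, 1/2)
# makes the bound tight, illustrating "tighter bound, better approximation".
loose = elbo(0.0, 1.0)      # q = prior: valid but loose
tight = elbo(x / 2, 0.5)    # q = exact posterior: (nearly) tight
```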

Okay! Way too much math for today. This is, overall, what Variational Inference looks like. Numerous directions of research have emerged from this point onwards. It's impossible to talk about all of them, but a few directions which succeeded in grabbing the attention of the community with their amazing formulations and results will be discussed in later parts of this tutorial series. One of them is the “Variational AutoEncoder” (VAE). Stay tuned.

- “Variational Inference: A Review for Statisticians”, David M. Blei, Alp Kucukelbir, Jon D. McAuliffe
- “Pattern Recognition and Machine Learning”, C.M. Bishop
- “Machine Learning: A Probabilistic Perspective”, Kevin P. Murphy

If you happen to be from the scientific community, you must have gone through at least one document (maybe in the form of a `.pdf` or a printed paper) which is the result of years of developments in typesetting. If you are from a technical/research background, chances are that you have even used `LaTeX`. Let me assure you that `LaTeX` is neither the beginning nor the end of the entire typesetting ecosystem. In this article, I will provide a brief introduction to what typesetting is and what modern tools are available for use. Specifically, the most popular members of the `TeX` family will be introduced, including `LaTeX`.
Although many people (including the ones who use it one way or the other) do not recognize this, typesetting is an **art**. Technically, it's defined as the process of arranging various symbols (letters, numbers & special characters), called glyphs, on a physical paper or in a digital medium in a way that is appealing as a *reading* material. I emphasized the word “reading” because that's the key - typesetting aims to produce documents that are *pleasant to the human eye*. You might ask, “So, what is the big deal here?”. The answer is a set of (technical) terms/phrases which, I am pretty sure, you haven't even heard of: “optimal line length”, “ligatures”, “italic correction”, “hyphenation”, “optimal spacing” etc., which, if not done right, may result in fatigue while reading. Please believe me at this point that there indeed is a science behind deciding what exactly is pleasant to the human eye and what's not. I will try to illustrate a few of them here:

Have a look at the example above: two lines with identical content, one (the top one) produced by a *typesetting* system and the other by a word processor. Did you notice any visual difference? Let me help you.

- **Ligature** is a special *glyph* formed by joining the *glyphs* of two letters (or graphemes). Canonical examples of ligatures are the *grapheme* pairs “ff” and “fi”, both of which happen to be present in the content of the example sentence. The typeset sentence uses just one *glyph* for both “ff” and “fi”, which is not the case with word processors.
- **(Non-)optimal spacing** does affect the appeal of the text when read by the human eye. The word “AVAST” clearly has more space between the letters in the latter case, which indeed is very awful. If you want more of it, look at the character pairs “Fe” and “Ta” in the words “Feline” and “Table” respectively.
- **Hyphenation** is defined as the process of breaking words between lines. A better hyphenation algorithm can produce much less breakage of words in a given paragraph. The reason word processors are not-so-good at it is that their hyphenation algorithm works on a single line and not on the entire paragraph. The below example (Fig.2; taken from here) should be self-explanatory.
- **Typesetting mathematics** is crucial when dealing with scientific documents. Scientific engineers/researchers will not be happy if their complicated equations look ugly. Refer to Fig.3 for a visual comparison of equations.

There are many more of these. It’s difficult to discuss all of them here. If you are interested, read articles like this by expert typographers. In this article, I would rather focus on the tools available for digital typesetting.

Before we begin, it’s good to have an idea about how exactly these typeset documents are produced *digitally*. They are achieved by means of some specially crafted file formats:

- **Device Independent** (`.dvi`): DVI is a format created by **David R. Fuchs** and implemented by **Donald Knuth** as the primary output format of `TeX`. DVIs are binary (encoded) files and are not intended to be readable as text. DVI viewers (e.g. `xdvi`) can recognize and display them.
- **PostScript** (`.ps`): PostScript (in short, “PS”) is a very popular format, created by Adobe and used heavily in the publishing industry. PS is a page description language (yes, it's a full-fledged programming language) that describes a page by means of its commands. It is readable as text because it is source code of a programming language.
- **Portable Document Format** (`.pdf`): Here comes the beast. PDF is a widely used document format used .. well .. everywhere. Created by Adobe, this format is intended to be dependency-free and a complete description of the document, including text, images (raster/vector), fonts and other assets.

## Digital typesetting and history of `TeX`

**Digital typesetting** refers to the process of typesetting in a digital medium to produce high-quality printing material. So, a digital typesetting system must consume the *content* and *formatting* of the material we want to print and produce a DVI/PS/PDF which can then be used by traditional printers.

It all started when this guy, Donald E. Knuth, felt the need for a reliable typesetting system because he had a bad experience typesetting his book (The Art of Computer Programming). Around 1977, while at Stanford, Knuth developed the very first version of `TeX` - a digital typesetting engine that allows users to describe the *content* and *formatting* of a printing material by means of text files and can produce `.dvi`s. TeX (pronounced as *tech*) is much like a programming language which takes source code as input and produces a beautifully typeset document. Although `TeX` is a Turing-complete programming language, it is mostly used as a *description language* which is flexible enough to describe not only the content of the document but also granular formatting details. TeX is highly popular in academia because of its ability to beautifully typeset mathematical notations and symbols. The core TeX engine uses quite sophisticated algorithms to address the problems (the ones I described before, like “optimal spacing”, “italic correction” etc.) which make a document unpleasant to the human eye. Although many things are automatic, TeX provides users with *granular control* over formatting details.

## `TeX`: The core typesetting engine

Okay, enough of history and vague descriptions. Let me introduce you to the language of `TeX`. In case you want to follow along, please install any complete TeX distribution (`MikTeX`, `TeXLive` etc.) and you'll get all the required tools ready for use. Here's a simple TeX program (adapted from the TeXBook):

```
% filename: story.tex
\hsize=3in
\centerline{\bf A Short Story}
\vskip 6pt
\centerline{\sl Ayan Das}
\vskip .5cm
{\parindent=1em
\indent Once upon a time, in a distant galaxy called \"O\"o\c c, there lived a computer named R.~J. Drofnats.}
{\parindent=2em
\indent Mr.~Drofnats---or ``R. J.,'' as he preferred to be called---was happiest when he was at work typesetting beautiful documents using \TeX.}
\bye
```

if *compiled* with the TeX engine

```
prompt$ tex story.tex
This is TeX, Version 3.14159265 (MiKTeX 2.9.7000 64-bit)
(story.tex [1] )
Output written on story.dvi (1 page, 680 bytes).
Transcript written on story.log.
```

produces a `.dvi`

which when opened with a DVI viewer, will look like this:

Although the whole point of this tutorial is not to teach you TeX in detail, I do want you to get a feel for how TeX accomplishes fine-quality typesetting programmatically. Here goes the explanation of the source code:

- The very first control sequence (yes, that's what they are called), `\hsize`, determines the width of the text area. Look how narrow the text is; it's just 3 inches!
- The control sequence `\centerline`, as you can guess, centers a text. That `\bf` is there to make all text **boldface** inside its *enclosing braces*. Try to guess what that `\sl` in the next-to-next line is for.
- A couple of `\vskip`s are there to make *vertical gaps*. We can use several units of length (inches, points etc.) as per our convenience.
- `\parindent` decides how much space to put for *paragraph indentation*. The first paragraph has `\parindent=1em` and the next one has `\parindent=2em`, which is quite evident from the output.
- Did you notice the *accents* and how they are written in the source code (`\"O\"o\c c`)?
- The `~` sign represents a single *space* but with an extra instruction given to TeX not to *break the line at that point* while running its optimal line-breaking algorithm.

I seriously have no intention to make this any longer, but I can't resist showing you the level of *granularity* TeX offers:

- The control sequence `\centerline` is not a primary one - it is defined using a more fundamental concept called **Glue**. Think of glues as *virtual springs* which have *stretchability* and *shrinkability*. Think of the *centering of a text* as putting two identical springs of *infinite stretchability* horizontally on both sides of the text. In equilibrium, the text will be centered. Seems strange, right? That `\centerline` can then be (roughly) defined as

```
% This is how you define a control sequence with one argument
\def\centerline#1{
\hskip0pt plus 1fil #1\hskip0pt plus 1fil
}
```

where those `\hskip0pt plus 1fil`s are the two glues/springs I mentioned earlier. Try to figure out what it exactly means (Hint: `1fil` means a length of “infinity with strength 1”).

- Another control sequence I want to bring your attention to is that `\TeX` at the very end of the second paragraph, which produces the special `TeX` logo. It might seem like a primary command but it's not - the proper placing of that ‘E’ can be done with more fundamental commands like `\lower` and `\kern`:

```
\def\TeX{
T\kern-.2em\lower.5ex\hbox{E}\kern-.13em X
}
```

`\kern` is there to produce a given amount of horizontal space. A negative number will cause the next character to overlap. `\lower`, as you can guess, is for *lowering* the following *box* from its horizontal baseline by the given amount.

I hope I have successfully conveyed the essence of `TeX` and the granularity/flexibility it offers. We will now move on to other members of the `TeX` family.

## `LaTeX`: A layer of abstraction

If you have ever heard about or worked with any member of the TeX family, chances are it is `LaTeX`. `LaTeX` was designed by **Leslie B. Lamport** (`LaTeX` happens to be an abbreviation of `Lamport TeX`) around 1983 as a document management system. It focuses heavily on *separating the content from formatting*, as this helps users focus more on the content. `LaTeX` is technically a gigantic *macro package* of `TeX` whose primary motive is to provide users with *document management* capabilities like “automated page numbering”, “automatic (sub)section formatting”, “automatic table-of-contents generation”, “easy referencing mechanisms” etc. This allows users not to worry about putting proper page numbers every time they create a new page, or adding a new entry to the table of contents every time they add a new section to the document.

Plain TeX would require you to do at least this much to produce a numbered list:

```
{\parindent=2em
\indent 1. First point}
\vskip1em
{\parindent=2em
\indent 2. Second point}
\vskip1em
{\parindent=2em
\indent 3. Third point}
\bye
```

whereas `LaTeX`'s abstraction allows you to *focus much more on the content* rather than on formatting details. Here's how you would do it in LaTeX:

```
\begin{enumerate}
\item First point
\item Second point
\item Third point
\end{enumerate}
```

Apart from being more readable and content-focused, the LaTeX version is much more feature-complete as it handles all possible situations you might get into. Similarly, (sub-)sectioning is just as easy:

```
\documentclass{article}
\begin{document}
\section{Introduction}
\subsection{Problem Statement}
Content for 'Problem Statement'
\subsection{History}
Content for 'History'
\subsection{Motivation}
Content for 'Motivation'
\section{Details}
\subsection{Analysis}
Content for 'Analysis'
\subsection{Experiments}
Content for 'Experiments'
\end{document}
```

will produce

## `pdf(La)TeX`

There exist two members of the TeX family, namely `pdfTeX` and `pdfLaTeX`, which are essentially similar engines to `TeX` and `LaTeX` respectively but produce `.pdf`s directly instead of `.dvi`s. They may use some modern/advanced features that only PDFs offer. These two are extremely popular, as the demand for `.pdf`s is significantly higher than that for `.dvi`s. `pdf(La)TeX` is a separate program, implemented independently from `(La)TeX`. They can be accessed via the command-line programs

```
prompt$ pdftex file.tex
# produces file.pdf
```

and

```
prompt$ pdflatex file.tex
# produces file.pdf
```

## `LuaLaTeX`: When `LaTeX` meets `Lua`

A successful attempt at extending the `pdfTeX` engine by embedding `Lua` in it was `LuaTeX` (beware of the spelling, it's not `LuaLaTeX`). This engine, if used with the `LaTeX` format, assumes the name `LuaLaTeX`. `Lua(La)TeX` is primarily used when a little more *dynamicity/flexibility* is required in the *source code*. Now that we understand what `(La)TeX` programs look like and how they work, I will directly go on to show some code rather than beating around the bush.

Before that, I would like to bring your attention to something which (La)TeX is not so good at. (La)TeX is known to be difficult when it comes to general-purpose programming. To express very basic programming logic, TeX needs a lot of unnecessary commands which are neither convenient nor readable as source code. One very important logical block that almost every sensible program contains is a **for loop**. Here's what TeX and LaTeX need, respectively, to accomplish it.

```
\newcount\myvar % a command to define a variable
\myvar=1
\loop
\the\myvar % a command to access a variable, I mean seriously !
\advance\myvar1
\ifnum\myvar<5
\repeat
\bye
```

and

```
..
\usepackage{forloop}
..
\newcounter{ct}
\forloop{ct}{1}{\value{ct} < 5}
{
\thect\
}
```

If interested, you may try to read and understand it line by line. But we can agree on one thing - it's nowhere near convenient or readable. Although the latter one is somewhat easier to interpret, it takes a separate package (called `forloop`) to get there.

Now, here’s how Lua helps.

```
\documentclass{article}
\begin{document}
\directlua{
for i = 1, 10
do
tex.print(i .. ' ')
end
}
\end{document}
```

The very basic `LuaLaTeX` command that bridges LaTeX with Lua is at work here. `\directlua` enables users to write arbitrary Lua code inside it. Here's how it works:

- The engine halts interpreting the usual LaTeX commands (i.e., stops typesetting) once it has encountered a `\directlua` block.
- The code inside this block is then fed into a special Lua interpreter for execution.
- The special `tex.print(..)` function (the same `print()` API from standard Lua) injects the characters into a special output stream.
- The `\directlua` block is then replaced by the content of the output stream.
- The LaTeX engine starts its typesetting again from where it was halted.

Take a moment to digest this. Hopefully the explanation is clear enough to understand why the output looks like this

Constructs are available to define pure Lua functions as well. Also, convenient mechanisms are built to translate arguments given to a LaTeX command into equivalent Lua objects. A concrete example is shown below:

```
\documentclass{article}
\usepackage{luacode}
\begin{luacode*}
function intro_helper_tex(name)
tex.print('Hello, my name is ' .. name .. '. I love \\TeX')
end
function intro_helper_latex(name)
tex.print('Hello, my name is ' .. name .. '. I love \\LaTeX')
end
\end{luacode*}
\newcommand{\intro}[2]{
\directlua {
if \luastring{#2} == 'tex' then
intro_helper_tex(\luastring{#1})
elseif \luastring{#2} == 'latex' then
intro_helper_latex(\luastring{#1})
end
}
}
\begin{document}
\intro{Ayan Das}{latex}
\end{document}
```

- `\begin{luacode*} .. \end{luacode*}` is an environment to put pure Lua definitions in. In our example, there are two functions, namely `intro_helper_tex(..)` and `intro_helper_latex(..)`.
- To understand the reason for *escaping* the backslash, go and read the 4th point of the earlier explanation very carefully. The output stream generated by the `tex.print(..)`s has to be **valid (La)TeX code** in order to be successfully parsed subsequently by the LaTeX engine. Escaping the backslash produces, e.g., “\TeX” as a string in the output stream, which is a valid (La)TeX command.
- Coming to the custom command named `\intro`, it takes 2 inputs - your name and your favorite TeX format. They are *translated* to Lua strings via `\luastring{#x}`, where `x` is the argument number of `\intro`.
- Depending on the second argument, the `if .. elseif .. end` block chooses one of the two Lua functions defined earlier.

The output of the above program, if compiled like this

```
prompt$ lualatex funcarg.tex
This is LuaTeX, Version 1.10.0 (MiKTeX 2.9.7000 64-bit)
restricted system commands enabled.
(./funcarg.tex
LaTeX2e <2018-12-01>
...
Output written on funcarg.pdf (1 page, 7187 bytes).
Transcript written on funcarg.log.
```

is

Phew! That was a hell of a lengthy tutorial, but hopefully it conveys the essence of typesetting and the `TeX` family of tools. All the members of the `TeX` family are themselves huge systems to learn about. With the introductory ideas given in this tutorial, it will be easier to read their official documentation available online.

Let's take a simple problem to demonstrate my point. I am given a text file and I want to calculate the number of lines in it. The contents of the text file will be directed as standard input to the code, so the filename cannot be hardcoded or taken as input. This clearly rules out file input-output methods, because in this tutorial I will be focusing on **standard input** (*stdin*) methods.

I will be sharing code in Java, Python and C++. I have used standard functions in all three languages, which will be quite obvious while coding the solutions. I haven't tried to code the most optimized way of doing it, as I am mainly concerned with the standard IO functions. The tests have been conducted on a text file **test.txt** of size **153.1 MB**. My special thanks to Soumyojit Chatterjee for providing a fast and concise Python code and its explanation.

```
import java.io.*;
class test
{
    public static void main(String[] args) throws IOException
    {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        int lines = 0; String line = "";
        while(true)
        {
            line = br.readLine();
            if(line != null)
                lines++;
            else
                break;
        }
        System.out.println("Number of lines = " + lines);
    }
}
```

This is a simple solution to the line-counting problem, and since **BufferedReader** is widely used, I have used it here as well. The code is self-explanatory: I read the input line by line; if no characters are read, I break out of the loop and print the value of `lines`, otherwise I increment `lines` by 1.

```
[rohan@archlinux BlogCodes]$ javac test.java
[rohan@archlinux BlogCodes]$ time java -cp . test < test.txt
Number of lines = 3031040
real 0m0.595s
user 0m0.801s
sys 0m0.094s
```

Subsequent runs give more or less the same time.

```
import sys
print('Number of lines =', sum(1 for _ in sys.stdin))
```

This solution given by Soumyojit Chatterjee is also very intuitive: a generator expression yields 1 for every line in the standard input, and the 1’s are summed to obtain the total number of lines.

```
[rohan@archlinux BlogCodes]$ time python test.py < test.txt
Number of lines = 3031040
real 0m0.516s
user 0m0.485s
sys 0m0.027s
```

```
#include <iostream>
#include <string>

int main()
{
    size_t lines = 0;
    std::string line = "";
    while(getline(std::cin, line))
        lines++;
    std::cout << "Number of lines: " << lines << '\n';
}
```

In this C++ solution, I am using the getline function to extract every line of the input into a string variable, incrementing a counter by one for each line as long as there is data left in the standard input stream.

```
[rohan@archlinux BlogCodes]$ c++ test.cpp -O3
[rohan@archlinux BlogCodes]$ time ./a.out < test.txt
Number of lines: 3031040
real 0m1.792s
user 0m1.767s
sys 0m0.023s
```

| Language | Time |
|---|---|
| Java | 0.595s |
| Python | 0.516s |
| C++ | 1.792s |

From the results, it is evident that C++ is lagging far behind here, which is very bad considering that C++ is a fast, compiled language used in performance-critical situations, and is lower level than both Python and Java.

Before trying to explain the reason for its slowness, let’s look at the optimized code first:

```
#include <iostream>
#include <string>

int main()
{
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(0);
    size_t lines = 0;
    std::string line = "";
    while(getline(std::cin, line))
        lines++;
    std::cout << "Number of lines: " << lines << '\n';
}
```

Testing this code we obtain:

```
[rohan@archlinux BlogCodes]$ c++ test.cpp -O3
[rohan@archlinux BlogCodes]$ time ./a.out < test.txt
Number of lines: 3031040
real 0m0.111s
user 0m0.080s
sys 0m0.030s
```

A drastic 16x improvement just by adding two more lines!!! Let’s try to understand what’s going on here which brings such a massive improvement.

Historically, C++ was designed as an extension to C. So much so that C++ was initially known as *C with Classes* before being renamed to *C++* in 1983. To this day, backwards compatibility with C and with older standards of C++ is of **big importance** to the ISO C++ Committee. For this reason, the C streams for input-output and the C++ iostreams need to be synchronized, so that no undefined behaviour occurs when both are used in the same code. By default, C++ streams are synchronized with their C-stream counterparts, i.e. the moment any operation is applied to a C++ stream, the same operation is also applied to the corresponding C stream. This allows the free mixing of C and C++ streams in the same code, but it comes at a big performance penalty, as seen above. When synchronized, the IO operations are effectively unbuffered (every operation goes straight through to the C stream) and are thread-safe by default. This unbuffered behaviour and the synchronization with C streams is the real reason why C++ iostreams are so slow.

The line `std::ios_base::sync_with_stdio(false);` removes this very synchronization between C and C++ streams; the C++ streams and the C streams then maintain their buffers independently. The removal of the synchronization, and the switch from unbuffered to buffered behaviour, gives a big speedup. The next line `std::cin.tie(0);` unties cin from cout, so cout is no longer flushed automatically before every input operation on cin. It is generally not required, because removing the synchronization is usually enough to get the big speedup, but untying the streams gives slightly more speedup in most cases.

On reading the above paragraph, it might appear that if we get such a big speedup, why not add these two lines every time we use C++ iostreams? Sounds too good to be true??? Had it been so, the synchronization would have been turned off by default. Let’s look at some big caveats of removing the synchronization to get a better understanding of the concept.

Here I will show the problem which arises on removing synchronization between C and C++ streams. Let’s investigate a seemingly innocent-looking code:

```
#include <iostream>
#include <stdio.h>

int main()
{
    std::ios_base::sync_with_stdio(false);
    for(int i = 0; i < 10; i++)
        if(i % 2 == 0)
            std::cout << i << '\n';
        else
            printf("%d\n", i);
}
```

Can you try to predict the output by yourself without looking at the answer below??? If you guessed the program to print the digits from 0 to 9 sequentially with one digit per line, you are surely mistaken!!!

```
[rohan@archlinux BlogCodes]$ c++ test.cpp
[rohan@archlinux BlogCodes]$ ./a.out
1
3
5
7
9
0
2
4
6
8
```

The reason for this strange output should be obvious by now. Since the synchronization between C and C++ streams has been removed, both maintain their buffers independently of each other. So when we write `printf("%d\n", i);` or `std::cout << i << '\n';`, the respective buffer is filled. After the loop terminates, and before the program ends, the output buffers need to be emptied. While putting the data to the output stream, the C stream gets preference here, so the odd numbers present in the output buffer of printf get printed first, followed by the data in the output buffer of cout. This preference is purely implementation-dependent and might vary across different platforms.

Let’s now try to understand the problem faced on using `std::cin.tie(0);`. To do that, let us investigate another piece of code:

```
#include <iostream>
#include <string>

int main()
{
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(0);
    std::string name;
    int age;
    std::cout << "Enter your name: \n";
    std::cin >> name;
    std::cout << "Enter your age: \n";
    std::cin >> age;
    std::cout << "Your name: " << name << '\n';
    std::cout << "Your age: " << age << '\n';
}
```

Please try to predict the input-output behaviour of the program before going forward. As you might expect by now, the behaviour won’t be straightforward. If you thought that the program would first prompt you with **Enter your name:** and wait for your name, then prompt **Enter your age:** and wait for your age, and finally print your name and age after the **Your name:** and **Your age:** prompts respectively; you are wrong again!!! Please watch the video carefully to understand the peculiar behaviour:

The video should be self-explanatory. The peculiarity occurs because with `std::cin.tie(0)` the tie between cin and cout has been removed. The data in the output buffer of cout is thus not printed until the buffer is full or the program reaches its end. This is why the prompts **Enter your name:** and **Enter your age:** are sent to the output buffer but do not get printed while the program asks for input from the standard input stream. Only when all the input operations have completed, and there is nothing left to do other than printing the data to the standard output stream, does the data get printed.

These two programs should be enough to demonstrate the strange behaviour when IO synchronization is turned off. So please be careful and think twice before using these functions. With all this knowledge gained, let’s try to solve a problem which needs high-speed input processing.

**Pro tip:** Never turn off synchronization in a header file. It might get included unknowingly, leading to strange behaviour, and may God help us all.

Let’s have a look at the Enormous Input Test problem on Codechef. This is one of the earliest beginner problems on Codechef, and I feel it is a great problem to start our discussion. The problem statement states that the first line contains two space-separated integers **n** and **k**, and the next **n** lines each contain one integer (**t[i]**) not exceeding **10^9**. It is also given that both **n** and **k** are positive integers <= **10^7**. Our job is to find the number of integers **t[i]** which are divisible by **k**. To make it a bit more interesting, we will design our own code to generate the test cases and increase the bounds of **n** and **k** to **10^8**.

```
#include <iostream>
#include <random>

int main()
{
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(0);
    int n, k;
    std::cin >> n >> k;
    std::mt19937_64 rng;
    rng.seed(std::random_device()());
    std::uniform_int_distribution<int> dis(1, 1000000000);
    std::cout << n << ' ' << k << '\n';
    for(int i = 1; i <= n; i++)
        std::cout << dis(rng) << '\n';
}
```

Integers n and k are taken as input, and the Mersenne Twister engine is used as our random number generator. With this, our test case generator is ready; we save the code to a file **generator.cpp** and compile it with the following commands:

```
[rohan@archlinux BlogCodes]$ c++ --version
c++ (GCC) 8.2.1 20181127
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[rohan@archlinux BlogCodes]$ c++ generator.cpp -std=c++17 -Ofast -march=native -o generator
```

I am using GCC 8.2.1, which has support for C++17. In case you are using an older compiler, adjust your flags accordingly. To understand the optimization flags, have a look at the GCC optimization flags. In this case I need neither interactive output nor interoperability with C streams, so I have turned the synchronization off to speed up the generation process. Let us now generate three text files **test1.txt**, **test2.txt** and **test3.txt** with **n** being 10^6, 10^7 and 10^8 respectively.

```
[rohan@archlinux BlogCodes]$ ./generator > test1.txt
1000000 3
[rohan@archlinux BlogCodes]$ ./generator > test2.txt
10000000 4
[rohan@archlinux BlogCodes]$ ./generator > test3.txt
100000000 5
[rohan@archlinux BlogCodes]$ ls -l
total 1071980
-rwxr-xr-x 1 rohan rohan 19792 Apr 8 03:22 generator
-rw-r--r-- 1 rohan rohan 384 Apr 8 03:20 generator.cpp
-rw-r--r-- 1 rohan rohan 9888480 Apr 8 03:48 test1.txt
-rw-r--r-- 1 rohan rohan 98890665 Apr 8 03:48 test2.txt
-rw-r--r-- 1 rohan rohan 988890850 Apr 8 03:49 test3.txt
```

Hurray!!! So we have created three text files of size **9.4 MB**, **94.3 MB** and **943.1 MB** for testing. Let’s now start our attempts at solving this problem. We will be using the **time** command in Linux throughout to maintain uniformity; it gives reasonably accurate results for our purpose.

```
#include <iostream>

int main()
{
    int n, k, count = 0;
    std::cin >> n >> k;
    for(int i = 1; i <= n; i++)
    {
        int x;
        std::cin >> x;
        if(x % k == 0)
            count++;
    }
    std::cout << count << '\n';
}
```

The code is straightforward and doesn’t need any explanation. Let’s compile it and verify that it is correct before testing it with our big text files.

```
[rohan@archlinux BlogCodes]$ c++ cin.cpp -O3 -march=native -o cin
[rohan@archlinux BlogCodes]$ ./generator > tmp.txt
7 5
[rohan@archlinux BlogCodes]$ cat tmp.txt
7 5
7852651
987500822
941299498
165101286
342858592
854548561
978235080
[rohan@archlinux BlogCodes]$ ./cin < tmp.txt
1
```

Thus the code is correct!!! It is clearly evident that there is just one number divisible by 5. Let’s now test it against our big guns :)

```
[rohan@archlinux BlogCodes]$ time ./cin < test1.txt
333392
real 0m0.292s
user 0m0.286s
sys 0m0.004s
[rohan@archlinux BlogCodes]$ time ./cin < test2.txt
2500370
real 0m2.664s
user 0m2.623s
sys 0m0.024s
[rohan@archlinux BlogCodes]$ time ./cin < test3.txt
20002602
real 0m27.761s
user 0m27.563s
sys 0m0.163s
```

Thus our timings for the big guns are **0.292s**, **2.664s** and **27.761s**. Subsequent runs give more or less the same timings. So there goes *cin*; can we do better???

```
#include <iostream>

int main()
{
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(0);
    int n, k, count = 0;
    std::cin >> n >> k;
    for(int i = 1; i <= n; i++)
    {
        int x;
        std::cin >> x;
        if(x % k == 0)
            count++;
    }
    std::cout << count << '\n';
}
```

I am sure that you guessed that this was coming. Turning off synchronization is an obvious solution, because we do not want the output to be interactive here and neither do we need interoperability with C streams. Let’s compile this code and test it against our big guns.

```
[rohan@archlinux BlogCodes]$ c++ cin_nosync.cpp -O3 -march=native -o cin_nosync
[rohan@archlinux BlogCodes]$ time ./cin_nosync < test1.txt
333392
real 0m0.122s
user 0m0.118s
sys 0m0.004s
[rohan@archlinux BlogCodes]$ time ./cin_nosync < test2.txt
2500370
real 0m0.911s
user 0m0.896s
sys 0m0.013s
[rohan@archlinux BlogCodes]$ time ./cin_nosync < test3.txt
20002602
real 0m8.865s
user 0m8.684s
sys 0m0.173s
```

That’s a decent performance boost over normal cin and cout!!! But this was expected. We have improved our timings to **0.122s**, **0.911s** and **8.865s**. Let’s try scanf and printf, which are often recommended as alternatives when cin and cout are slow.

```
#include <cstdio>

int main()
{
    int n, k, count = 0;
    scanf("%d %d", &n, &k);
    for(int i = 1; i <= n; i++)
    {
        int x;
        scanf("%d", &x);
        if(x % k == 0)
            count++;
    }
    printf("%d\n", count);
}
```

Testing it:

```
[rohan@archlinux BlogCodes]$ c++ scanf.cpp -O3 -march=native -o scanf
[rohan@archlinux BlogCodes]$ time ./scanf < test1.txt
333392
real 0m0.088s
user 0m0.081s
sys 0m0.007s
[rohan@archlinux BlogCodes]$ time ./scanf < test2.txt
2500370
real 0m0.802s
user 0m0.791s
sys 0m0.010s
[rohan@archlinux BlogCodes]$ time ./scanf < test3.txt
20002602
real 0m8.244s
user 0m7.989s
sys 0m0.216s
```

The results are quite impressive!!! And we have further improved our timings by a slight margin. We have reached **0.088s**, **0.802s** and **8.244s**. Here we have seen that **scanf** and **printf** have outperformed **cin** and **cout**. But the results may vary slightly on other platforms and compilers.

Till now, we have been dealing with buffered input via scanf and unsynchronized cin, where the data read from the standard input stream is stored internally in a buffer. Let’s now try a new approach where the data from the standard input stream is read one character at a time and, instead of being stored in a buffer, is put to use immediately. We will use **getchar_unlocked** for this purpose.

It is to be kept in mind that **getchar_unlocked** is a *POSIX function* and is available on *POSIX* systems (Linux, Mac etc.). If you are using Windows, use **getchar**, which is cross-platform and part of the C standard library. I will be using **getchar_unlocked** here because it is faster, since it is *not thread-safe* (which does not matter for these sequential programs), and moreover I am using a Linux distribution, so this function is available to me. Let’s look at the usage of this function for our use case:

```
#include <iostream>
#include <cstdio>

int input()
{
    char c = getchar_unlocked();
    while(c <= ' ')
        c = getchar_unlocked();
    int s = 0;
    while(c > ' ')
    {
        s = s * 10 + (c - '0');
        c = getchar_unlocked();
    }
    return s;
}

int main()
{
    int a, b, count = 0;
    a = input();
    b = input();
    while(a--)
    {
        int x = input();
        if(!(x % b))
            ++count;
    }
    std::cout << count << '\n';
    return 0;
}
```

The **input** function reads a positive integer using **getchar_unlocked** and returns it. If you have to use this function to take floating-point numbers or other formats as input, modify the **input** function accordingly. I have deliberately not generalized the **input** function, to keep it absolutely simple and specific to this problem. The ASCII value of **space** is **32**, and the digits **0 - 9** have ASCII values from **48** to **57**. In the **input** function, we read one character at a time and, as long as it is a space or any character before it, we keep reading, since that is redundant data. The moment we get a character with ASCII value greater than 32, we break out of the while loop. We initialize an integer variable *s* with 0, which will store the final integer entered by the user. We then enter another loop, where we keep reading character by character as long as the ASCII value is greater than 32. Since we are dealing with numbers **only**, no other long if-condition is needed. Each input character is converted to the corresponding digit and appended at the end of **s**. For example, say the user entered **956**. Initially s = 0, so `s = 0 * 10 + (57 - 48) = 9`, as the ASCII value of **9** is **57** and that of **0** is **48**. Then, when **5** is read, `s = 9 * 10 + (53 - 48) = 95`. Continuing in this manner, `s = 95 * 10 + (54 - 48) = 956`. This is how the **input** function works. The **main** function is the same as before. Let’s look at how this solution performs:

```
[rohan@archlinux BlogCodes]$ c++ getchar.cpp -O3 -march=native -o getchar
[rohan@archlinux BlogCodes]$ time ./getchar < test1.txt
333392
real 0m0.023s
user 0m0.016s
sys 0m0.007s
[rohan@archlinux BlogCodes]$ time ./getchar < test2.txt
2500370
real 0m0.173s
user 0m0.159s
sys 0m0.014s
[rohan@archlinux BlogCodes]$ time ./getchar < test3.txt
20002602
real 0m1.648s
user 0m1.499s
sys 0m0.147s
```

That’s quite **FAST** and is a big improvement over the previous methods. We have processed a text file of size **943 MB** in **1.65s**, which gives a processing speed of about **572 MB/s**. Hence, this approach works like a charm. We have improved our timings to **0.023s**, **0.173s** and **1.648s**. Can we do any better???

Let’s pause for a moment and think about what we have done till now. We started out with **cin**, which was synchronized with the C streams and unbuffered. We improved it by removing the synchronization, making it thread-unsafe and buffered. Then we tried **scanf**, which gives nearly the same performance. In both these approaches, the input buffer is maintained internally, and we have very little control over it. In our next approach with **getchar_unlocked**, we completely removed the concept of buffers and read the input character by character, processing it directly. This significantly reduced the input time.

Let’s take one more shot at buffered input, with the subtle difference that now we are going to manage the buffer **manually**. We will create and maintain our own buffer of a specific size, read a chunk of characters with every read operation, store that data in our buffer, and then iterate over the buffer and process the characters. In this way, we reduce the number of disk read operations and access the characters from a buffer stored in **RAM**, which is way faster. But then it might come to your mind: why not load the entire file into **RAM** and then iterate over it? This is clearly a good possibility, but is limited to small files only. For very large files, say a file of size **5 GB** or more, loading it into my **8 GB** RAM is clearly not a good idea. So the file is loaded into RAM in chunks which can fit into memory easily. Obviously, the choice of the buffer size holding each chunk of data changes with the amount of RAM available.

To implement the idea of the manual buffered input, we will use **fread**. Let’s have a look at the code to get a proper insight:

```
#include <iostream>
#include <cstdio>
#include <vector>

int main()
{
    int n, k, count = 0, num = 0, ans = 0;
    std::cin >> n >> k;
    std::cin.get();
    size_t size = 1024 * 1024;
    std::vector<char> vec(size);
    while(count < n)
    {
        size_t len = std::fread(vec.data(), sizeof(char), size, stdin);
        if(len == 0)
            break;
        for(size_t i = 0; i < len; i++)
        {
            const char &ch = vec[i];
            if(ch >= '0' && ch <= '9')
                num = num * 10 + ch - '0';
            else
            {
                if(num % k == 0)
                    ans++;
                num = 0;
                count++;
            }
        }
    }
    std::cout << ans << '\n';
}
```

I have used **cin** to take the input for **n** and **k**. Before using **fread** directly, keep in mind that the trailing newline after **k** is not consumed by **cin**, so it will remain in the stream and cause problems in our processing if taken in by **fread**. I have therefore used **cin.get** to grab that trailing newline and let **fread** deal with the remaining characters. After a lot of experimentation, I have found that on my machine a buffer size of 1024 * 1024 (about a million characters) works best in most cases. The rest of the code is quite straightforward and can be understood on careful inspection. The vector of characters **vec** is the buffer we are maintaining. Every digit character read is appended to **num**, the number being built. The moment a newline is encountered, it denotes the end of a line, and hence of a number. So the value of **count** is increased by 1 and, since **num** is now complete, it is checked for divisibility by k and the value of **ans** is increased accordingly. The value of **num** is then reset to 0 and reused. Let’s now have a look at **fread** in action:

```
[rohan@archlinux BlogCodes]$ c++ fread.cpp -O3 -march=native -o fread
[rohan@archlinux BlogCodes]$ time ./fread < test1.txt
333392
real 0m0.018s
user 0m0.007s
sys 0m0.011s
[rohan@archlinux BlogCodes]$ time ./fread < test2.txt
2500370
real 0m0.120s
user 0m0.090s
sys 0m0.030s
[rohan@archlinux BlogCodes]$ time ./fread < test3.txt
20002602
real 0m1.033s
user 0m0.952s
sys 0m0.080s
```

That’s **even faster** than **getchar_unlocked**!!! The effort of maintaining a manual buffer has actually paid off. We have now improved our timings to **0.018s**, **0.120s** and **1.033s**. Let’s have a look at a **pure C++** way of managing buffered input with a manual buffer just like **fread**.

**Note:** For getchar_unlocked and fread, never remove the synchronization between C and C++ streams, as these are C standard library functions using C streams internally, and we have used them in our C++ code.

```
#include <iostream>
#include <vector>

int main()
{
    int n, k, count = 0, num = 0, ans = 0;
    std::cin >> n >> k;
    const size_t buffer_size = 1024 * 1024;
    std::vector<char> buffer(buffer_size);
    std::cin.get();
    while(count < n)
    {
        std::cin.read(buffer.data(), buffer_size);
        int len = std::cin.gcount();
        if(len == 0)
            break;
        for(int i = 0; i < len; i++)
        {
            const char &ch = buffer[i];
            if(ch >= '0' && ch <= '9')
                num = num * 10 + ch - '0';
            else
            {
                if(num % k == 0)
                    ans++;
                num = 0;
                count++;
            }
        }
    }
    std::cout << ans << "\n";
}
```

The code is almost exactly like the fread version, with the exception that **cin.read** takes fewer parameters and, unlike **fread**, does not return the number of characters read; instead we have to use another function, **cin.gcount**, for that purpose. Let’s test this solution against our big guns:

```
[rohan@archlinux BlogCodes]$ c++ cin_read.cpp -O3 -march=native -o cin_read
[rohan@archlinux BlogCodes]$ time ./cin_read < test1.txt
333392
real 0m0.034s
user 0m0.027s
sys 0m0.006s
[rohan@archlinux BlogCodes]$ time ./cin_read < test2.txt
2500370
real 0m0.111s
user 0m0.097s
sys 0m0.014s
[rohan@archlinux BlogCodes]$ time ./cin_read < test3.txt
20002602
real 0m1.024s
user 0m0.909s
sys 0m0.113s
```

That’s slightly faster than fread, but on average **fread** is almost as fast as **cin.read** for practical purposes. Thus we have improved our timings to **0.034s**, **0.111s** and **1.024s**. That’s a big improvement considering that we had started out with a timing of **27.761s** for our **943.1 MB** text file.

| IO Method | Size = 9.4 MB | Size = 94.3 MB | Size = 943.1 MB |
|---|---|---|---|
| cin | 0.292s | 2.664s | 27.761s |
| cin_nosync | 0.122s | 0.911s | 8.865s |
| scanf | 0.088s | 0.802s | 8.244s |
| getchar_unlocked | 0.023s | 0.173s | 1.648s |
| fread | 0.018s | 0.120s | 1.033s |
| cin.read | 0.034s | 0.111s | 1.024s |

But then, I won’t be surprised if you are a bit disappointed with the last three solutions, where the timings are almost equal. So, to quench this thirst for differentiating between them, let’s go beyond the limits and increase the value of **n** to **3 * 10^8** (well beyond the Codechef limit of **10^7**), because even our big guns have fallen short of the potential of these methods.

```
[rohan@archlinux BlogCodes]$ ./generator > test4.txt
300000000 120
[rohan@archlinux BlogCodes]$ time ./getchar < test4.txt
2500623
real 0m4.864s
user 0m4.369s
sys 0m0.490s
[rohan@archlinux BlogCodes]$ time ./fread < test4.txt
2500623
real 0m3.069s
user 0m2.789s
sys 0m0.277s
[rohan@archlinux BlogCodes]$ time ./cin_read < test4.txt
2500623
real 0m2.961s
user 0m2.622s
sys 0m0.323s
```

From the above results, it is pretty evident that **cin.read** ends up being the fastest of the three, followed by **fread** and then **getchar_unlocked**. But there is a small peculiarity I noticed with **cin.read**: you’ll find that I have not removed the synchronization between C and C++ streams in the **cin.read** code, despite using only C++ streams. In fact, removing the synchronization actually slowed it down a bit and put it behind **fread**. This is mainly implementation-dependent and might be different in your case.

By now you should have a fair idea about speeding up your input methods in C++. If **cin** appears to be slow, you know you have a lot of other good options to fall back upon. That’s pretty much everything I had to say about the topic. For further reading, have a look at fgets and sscanf, and try using these functions along with the methods described above to suit your purpose. Also have a look at file input-output in C++ and memory-mapping techniques for big benefits in file IO.

BibTeX citation:

```
@online{markgomes2019,
  author = {Mark Gomes, Rohan and Das, Ayan},
  title = {Speeding up Iostreams in {C++}},
  date = {2019-04-06},
  url = {https://ayandas.me//blogs/2019-04-06-speeding-up-iostreams-in-c++.html},
  langid = {en}
}
```

For attribution, please cite this work as:

Mark Gomes, Rohan, and Ayan Das. 2019. “Speeding up Iostreams in
C++.” April 6, 2019. https://ayandas.me//blogs/2019-04-06-speeding-up-iostreams-in-c++.html.

The three fundamental features which make C++ special are:

- Zero cost abstractions
- Generic programming
- Fairly low level

Being developed on the principle of **you don’t pay for what you don’t use**, C++ allows us to write high-performance, efficient code. A comparison of deep learning software clearly shows that most deep learning libraries use C++ as one of their core programming languages.

For those who want to follow along, I’ll be using a Linux environment throughout with a GCC compiler.

```
[rohan@rohan-pc ~]$ g++ --version
g++ (GCC) 7.4.1 20181207
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[rohan@rohan-pc ~]
```

Before we begin, let’s answer the most important and relevant question in this regard.

You might be wondering: what’s wrong in putting all my source code in source files (with `.cpp`, `.cc`, `.cxx`, `.c++` etc. extensions) and then compiling all of them to build the final executable? Well, there’s nothing wrong, but as the size and complexity of the program grows, you might end up having hundreds of source files. At this scale, the compilation of the project takes a *lot of time*, and doing this every single time severely hampers the traditional compile-run-test-debug cycle, which is impractical.

Moreover, storing large projects purely in the form of source files takes up a lot of space. If you are writing a closed-source library which you want to distribute (for example, the Intel Math Kernel Library, a very fast, closed-source linear algebra library), you would want to ship your library API to other programmers so that they can use it in their code, without releasing your implementation; source files are simply out of the question in this scenario.

The solution to the above problems is to use some sort of binary files, which are present in an encoded form (zeroes and ones) and are compressed, so that they take up less space than raw source files. To be more precise, this is applicable only when we are using `dynamic linking`, because, as we will see, static libraries consume a lot more space. These files can then be somehow *linked* to your source files and not recompiled every time you make changes to your own application source files, thus vastly speeding up the compilation process. In general, whenever we try to use a block of code containing some functions or class definitions etc. which are not defined in our current source file and not present in the header files we included, the linker comes into play.

This technique of linking is so important and widely used, that in case you didn’t know about it, you were using it all this time unknowingly! Let’s investigate a simple hello world code in C++ as a motivational example:

```
#include <iostream>

int main()
{
    std::cout << "Hello World\n";
}
```

And running the commands:

```
[rohan@archlinux LinkingExamples]$ g++ test.cpp
[rohan@archlinux LinkingExamples]$ ./a.out
Hello World
[rohan@archlinux LinkingExamples]$ ldd a.out
linux-vdso.so.1 (0x00007ffcb0f66000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f7c3ce0e000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007f7c3cc89000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f7c3cc6f000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f7c3caab000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f7c3d00c000)
```

Here, libstdc++ is GNU’s implementation of the C++ Standard Library, which is automatically linked dynamically to your executable binary, i.e. `a.out`. The `ldd` command lists the runtime library dependencies. libc is the C standard library, which is also linked dynamically. Let us now try to link the standard libraries statically, so that our new executable `a1.out` does not have any runtime dependencies. This can be conveniently done in GCC with the help of the `-static` flag.

```
[rohan@archlinux LinkingExamples]$ g++ test.cpp -o a1.out -static
[rohan@archlinux LinkingExamples]$ ./a1.out
Hello World
```

So far so good. Let us now run the `ldd` command on `a1.out` and investigate the sizes of the executable binaries:

```
[rohan@archlinux LinkingExamples]$ ldd a1.out
not a dynamic executable
[rohan@archlinux LinkingExamples]$ ls -l
total 2124
-rwxr-xr-x 1 rohan rohan 2150128 Jan 27 04:05 a1.out
-rwxr-xr-x 1 rohan rohan 17056 Jan 27 04:01 a.out
-rw-r--r-- 1 rohan rohan 68 Jan 27 04:01 test.cpp
```

The output of the `ldd` command on `a1.out` is quite expected, but on comparing the sizes of `a.out` and `a1.out`, the results are absolutely astonishing! The size of `a.out` is **17,056 bytes**, whereas that of `a1.out` is **2,150,128 bytes**, which is about 126 times more. That, hopefully, gives you an idea of the size difference between shared and static libraries. In the above example, even though we ourselves have written very little code, the C and C++ standard libraries which we statically linked contain a *lot of code*, which resulted in a very fat binary.

We will now delve into the details of `static` and `dynamic` linking.

In static linking, the linker makes a copy of the library **implementation** and bundles it into a binary file. This binary is then linked with the application code to produce the final binary. Let us understand this concept clearly with a hands-on example.

In this example, we will be creating our own extremely simple **Integer** class, which has a `constructor`, an `add(..)` and a `display(..)` function. Create a header file `example.hpp` with the following code:

```
#include <iostream>

struct Integer
{
    int data;
    Integer(const int &x);
    Integer add(const Integer &x);
    void display();
};
```

Create a source file named `example.cpp` to implement the functions:

```
#include "example.hpp"

Integer::Integer(const int &x)
{
    data = x;
}

Integer Integer::add(const Integer &x)
{
    return Integer(data + x.data);
}

void Integer::display()
{
    std::cout << "Integer value : " << data << '\n';
}
```

Finally, create an application file `test1.cpp` to consume our Integer class:

```
#include "example.hpp"

int main()
{
    Integer a(12), b(45);
    Integer c = a.add(b);
    c.display();
}
```

Our directory tree should now look like:

```
[rohan@archlinux LinkingExamples]$ tree
.
├── example.cpp
├── example.hpp
└── test1.cpp
0 directories, 3 files
```

A static library, by construction, is a library of **object code** which, when linked at compile time, becomes a part of the application with no dependencies at runtime. So let’s create that object code now.

`[rohan@archlinux LinkingExamples]$ g++ -fPIC -c example.cpp`

This creates a new object file `example.o`. The `-c` flag tells GCC to stop after the compilation process and not run the linker, and the `-fPIC` flag stands for **position independent code**, which is required for shared libraries; we will also use it for static linking to maintain uniformity.

Now we will create the static library using the `ar` command.

`[rohan@archlinux LinkingExamples]$ ar -cq libexample.a example.o`

Hooray! Our static library `libexample.a` is now ready. `ar` is GNU’s archive tool; for more details, check out its flags and manual. Our directory structure should now look like:

```
[rohan@archlinux LinkingExamples]$ ls -l
total 20
-rwxrwxrwx 1 rohan rohan 136 Jan 6 21:45 example.hpp
-rw-r--r-- 1 rohan rohan 3240 Jan 26 18:16 example.o
-rwxrwxrwx 1 rohan rohan 233 Jan 6 21:44 ex.cpp
-rw-r--r-- 1 rohan rohan 3468 Jan 26 18:23 libexample.a
-rwxrwxrwx 1 rohan rohan 107 Jan 26 18:01 test1.cpp
```

To get more insight into our static library, let’s generate the symbol table of `libexample.a`:

```
[rohan@archlinux LinkingExamples]$ nm -gC libexample.a
example.o:
U __cxa_atexit
U __dso_handle
U _GLOBAL_OFFSET_TABLE_
U __stack_chk_fail
000000000000001c T Integer::add(Integer const&)
0000000000000078 T Integer::display()
0000000000000000 T Integer::Integer(int const&)
0000000000000000 T Integer::Integer(int const&)
U std::ostream::operator<<(int)
U std::ios_base::Init::Init()
U std::ios_base::Init::~Init()
U std::cout
U std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char)
U std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)
```

Let’s now create our final executable `test1` using our static library. Note that during compilation the linker must be able to find our static library to successfully link it into our executable binary. So we compile our test code `test1.cpp` as:

`[rohan@archlinux LinkingExamples]$ g++ test1.cpp -L. -lexample -o test1`

Please look at the above command very carefully. The `-L` flag gives the path of the directory in which the linker should search for libraries to link; it is written as `-L` followed by the path, and since in our case that is the current directory, we use `-L.`. The `-l` flag gives the name of the library to be linked. Shared/static library names *always* start with *lib*, so that prefix is omitted and we can simply write `-lexample` to link with `libexample.a`.

```
[rohan@archlinux LinkingExamples]$ ./test1
Integer value : 57
[rohan@archlinux LinkingExamples]$ ls -l
total 40
-rwxrwxrwx 1 rohan rohan 136 Jan 6 21:45 example.hpp
-rw-r--r-- 1 rohan rohan 3240 Jan 26 18:16 example.o
-rwxrwxrwx 1 rohan rohan 233 Jan 6 21:44 ex.cpp
-rw-r--r-- 1 rohan rohan 3468 Jan 27 02:53 libexample.a
-rwxr-xr-x 1 rohan rohan 17592 Jan 27 02:53 test1
-rwxrwxrwx 1 rohan rohan 107 Jan 26 18:01 test1.cpp
```

Now it compiles and runs fine. When `libexample.a` is in a different directory, the entire path to it needs to be provided. Let’s run the `ldd` command on `test1`.

```
[rohan@archlinux LinkingExamples]$ ldd test1
linux-vdso.so.1 (0x00007ffeb4ffa000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fbaa4fbf000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007fbaa4e3a000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007fbaa4e20000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007fbaa4c5c000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fbaa51b8000)
```

We can only see `.so` files, i.e. shared libraries; these are the runtime dependencies of `test1`. We cannot find any static library listed as a dependency. Thus you can try deleting `libexample.a` and then running `test1` - it’ll run fine as usual, as the library has already been bundled into our executable binary `test1`. Another important thing to remember is that no matter how many times we make changes to `test1.cpp`, we need not recompile `example.cpp` ever again.

In dynamic linking, the library object code is linked to the executable binary at **runtime**. So now we have a runtime dependency on the library file. If that shared object (`.so` file) is not discoverable at runtime, the executable binary won’t run. Let’s understand dynamic linking with the help of the same example we used to understand static linking.

The shared library **libexample.so** is created using the following command:

`[rohan@archlinux LinkingExamples]$ g++ -shared -o libexample.so example.o`

The `-shared` flag tells GCC to create a shared library. The symbol table of the newly created shared library can be generated in the same manner as before:

```
[rohan@archlinux LinkingExamples]$ nm -gC libexample.so
U __cxa_atexit@@GLIBC_2.2.5
w __cxa_finalize@@GLIBC_2.2.5
w __gmon_start__
w _ITM_deregisterTMCloneTable
w _ITM_registerTMCloneTable
U __stack_chk_fail@@GLIBC_2.4
0000000000001196 T Integer::add(Integer const&)
00000000000011f2 T Integer::display()
000000000000117a T Integer::Integer(int const&)
000000000000117a T Integer::Integer(int const&)
U std::ostream::operator<<(int)@@GLIBCXX_3.4
U std::ios_base::Init::Init()@@GLIBCXX_3.4
U std::ios_base::Init::~Init()@@GLIBCXX_3.4
U std::cout@@GLIBCXX_3.4
U std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char)@@GLIBCXX_3.4
U std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)@@GLIBCXX_3.4
```

Let us now create the final executable binary and run it:

```
[rohan@archlinux LinkingExamples]$ g++ test1.cpp -L. -lexample -o test1
[rohan@archlinux LinkingExamples]$ ./test1
Integer value : 57
```

Now it all looks very familiar and is exactly the same as static linking. But let’s have a look at the directory structure and the sizes of the files:

```
[rohan@archlinux LinkingExamples]$ ls -l
total 56
-rwxrwxrwx 1 rohan rohan 136 Jan 6 21:45 example.hpp
-rw-r--r-- 1 rohan rohan 3240 Jan 26 18:16 example.o
-rwxrwxrwx 1 rohan rohan 233 Jan 6 21:44 ex.cpp
-rwxr-xr-x 1 rohan rohan 16744 Jan 27 02:42 libexample.so
-rwxr-xr-x 1 rohan rohan 17128 Jan 27 02:46 test1
-rwxrwxrwx 1 rohan rohan 107 Jan 26 18:01 test1.cpp
```

If you compare the size of the `test1` executable now with the statically linked case, you’ll notice that previously it was **17,592 bytes** and now it is **17,128 bytes**. The saving of 464 bytes comes from linking the library dynamically at runtime rather than bundling it into `test1`. You might say that 464 bytes is an absolutely negligible amount of memory, but the shared and static libraries we created hardly have any code in them - the Integer class has only a constructor and two methods, all of which are one-liners. As the source code size increases, it begins to create massive differences, like the Hello World program I demonstrated in the beginning. Let’s do something interesting: we will move `libexample.so` to a different directory and then try running `test1` to see what happens!

```
[rohan@archlinux LinkingExamples]$ mv libexample.so ~/libexample.so
[rohan@archlinux LinkingExamples]$ tree
.
├── example.hpp
├── example.o
├── ex.cpp
├── test1
└── test1.cpp
0 directories, 5 files
[rohan@archlinux LinkingExamples]$ ./test1
./test1: error while loading shared libraries: libexample.so: cannot open shared object file: No such file or directory
```

You might have expected this, because the runtime linker cannot find `libexample.so`. This can be fixed by altering the `LD_LIBRARY_PATH` environment variable available in Linux. Since the linker searches for the shared object at runtime, recompilation is not needed; we just have to append a path which the linker can search at runtime to the `LD_LIBRARY_PATH` variable and then run `test1`.

```
[rohan@archlinux LinkingExamples]$ export LD_LIBRARY_PATH=~:$LD_LIBRARY_PATH
[rohan@archlinux LinkingExamples]$ ./test1
Integer value : 57
```

Thus we can edit the path to be searched by the linker at runtime! Both static and shared libraries have their own pros and cons. Static libraries are used when speed is important, the size of the executable binary is not a concern, and no runtime dependencies should be involved. On running the `ldd` command on `test1`, `libexample.so` is clearly listed as a runtime dependency found at `/home/rohan` (my **${HOME}** directory):

```
[rohan@archlinux LinkingExamples]$ ldd test1
linux-vdso.so.1 (0x00007fff1c1f3000)
libexample.so => /home/rohan/libexample.so (0x00007f776e456000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f776e264000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007f776e0df000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f776e0c5000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f776df01000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f776e462000)
```

Shared libraries are used to reduce the size of the executable binary and to share code at runtime. Moreover, it’s extremely useful in making changes to the underlying library at runtime without affecting the client code. Let’s see how it is done.

Let us edit the `ex.cpp` file and add an extra line to the display function:

```
void Integer::display()
{
std::cout << "This is the new display function\n";
std::cout << "Integer value : " << data << '\n';
}
```

We will now compile it, create the shared library and use it in our production code:

```
[rohan@archlinux LinkingExamples]$ g++ -fPIC -c ex.cpp -o example.o
[rohan@archlinux LinkingExamples]$ g++ -shared -o libexample.so example.o
[rohan@archlinux LinkingExamples]$ ./test1
This is the new display function
Integer value : 57
```

That’s it! We don’t have to recompile `test1.cpp`; we simply recompile our shared library. This is a very useful feature of shared libraries: we can easily update our library code and release a newer version without touching the final executable binary. Thus shared libraries are also very important in real-life production environments.

An important topic which is yet to be covered is the **order of linking** the libraries. Have a look at this article to understand the order of static linking.

That’s pretty much everything I had to say about linking in C++. For further reading and better understanding, go through the detailed working of the GNU linker.

BibTeX citation:

```
@online{markgomes2019,
  author = {Mark Gomes, Rohan and Das, Ayan},
  title = {Static and {Dynamic} Linking},
  date = {2019-01-05},
  url = {https://ayandas.me//blogs/2019-01-03-linking-in-c++.html},
  langid = {en}
}
```

For attribution, please cite this work as:

Mark Gomes, Rohan, and Ayan Das. 2019. “Static and Dynamic
Linking.” January 5, 2019. https://ayandas.me//blogs/2019-01-03-linking-in-c++.html.

`python script.py`.

**`Python`**

Before we begin, I want to clear out a very common misconception. It is widely assumed among beginner Python programmers that the term `Python` refers to **the language and/or its interpreter** - that’s NOT technically correct. The term `Python` should refer to the “language” only, i.e., the grammar of the language, and NOT the interpreter (the command-line tool named `python` or `python2.7` or `python3.5` etc.) that runs your Python code. The interpreter you have access to is **one of the** implementations of the language grammar. There is more than one implementation of the Python language interpreter:

- **CPython**: a Python interpreter written in the **C language**
- **PyPy**: a Python interpreter with **Just-in-Time (JIT)** compilation
- **Jython**: a Python interpreter that uses the **Java Virtual Machine (JVM)**
- **IronPython**: a Python interpreter that uses the **.NET framework**

Now, it so happens that `CPython` is the implementation provided by the same group of people (including creator Guido van Rossum) who are responsible for defining Python’s language grammar - a reference implementation, if you will. For several reasons, `CPython` is also the most widely used Python interpreter out there. That’s why it’s fairly reasonable, although not technically correct, to refer to the “Python language” plus the `CPython` interpreter collectively as `Python`. I am going to follow this convention throughout these tutorials.

Apart from clearing up the misconception, there is another reason I said all this at the very beginning: the material I am going to present in this tutorial depends highly on which interpreter we are talking about. Almost everything I talk about here is specific to the `CPython` interpreter.

Before we describe Python’s execution process, we need a clear idea of what **Compilers** and **Interpreters** are and how they differ. In simple terms:

**Compilers** are programs that consume a program written in a high-level language and convert it down to machine code which the native CPU is capable of running directly. C/C++ is a typical example of a “compiled” language, which can be compiled by one of several implementations of a C/C++ compiler (famous ones being gcc by GNU, msvc by Microsoft, Clang from the LLVM project, icc by Intel etc.).

**Interpreters**, on the other hand, read a high-level source program statement by statement and execute it while keeping some kind of context alive. But very often, a so-called “interpreted language” is not truly interpreted from its source program; rather, it is interpreted after the source has been converted down to some form of **intermediate representation**. Precisely, Python falls into this category.
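To make the two-step idea concrete in Python itself, here is a minimal sketch using the built-in `compile()` and `exec()` functions (the source string and the `"<demo>"` filename label are made up for illustration):

```python
# Stage 1: compile a source string down to a code object
# (Python's intermediate representation).
source = "result = 2 + 3"
code_obj = compile(source, "<demo>", "exec")

# Stage 2: hand the code object to the interpreter loop.
namespace = {}
exec(code_obj, namespace)

print(type(code_obj).__name__)   # the intermediate form is a 'code' object
print(namespace["result"])       # 5
```

`compile()` performs only the first step and hands back a `code` object, which `exec()` then interprets; this separation is exactly the two-step process described above.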

**`Bytecodes`: The Intermediate Representation**

Python follows a two-step process, the first being a *compilation* step that converts our nice and sweet-looking Python source code into an intermediate form called `Python bytecode`. Let’s see what it looks like. This little function

```
def f(x):
    # computes (x^12 + x)
    y = x ** 12
    return x + y
```

is compiled by CPython’s `Bytecode compiler` and produces

```
124, 0,
100, 1,
19, 0,
125, 1,
124, 0,
124, 1,
23, 0,
83, 0
```

Wow, they are so readable, aren’t they? Huh, kidding. Of course you cannot make sense of them like this. Let’s make it a bit more readable:

```
LOAD_FAST 0
LOAD_CONST 1
BINARY_POWER
STORE_FAST 1
LOAD_FAST 0
LOAD_FAST 1
BINARY_ADD
RETURN_VALUE
```

If you’ve ever worked with **assembly language** before, you are likely to find similarities between assembly language and the above bytecode (the readable one). And the sequence of integers (`124, 0, 100, ...`) may also look like **machine code** to you. Guess what - you are kind of right. They are *equivalent* to assembly and machine code respectively. If you are wondering how CPython came up with this bytecode representation, you should go through any good resource on general **compiler theory**, because the way CPython generates bytecode is no different from (though relatively simpler than) the way a C/C++ compiler generates assembly code.

**The `Bytecode` instruction set**

Looking at “**LOAD_FAST**”, “**LOAD_CONST**”, “**BINARY_POWER**”, “**BINARY_ADD**” and all those instructions, you might have already suspected that there must be a whole bunch of these. The entire list of CPython’s bytecodes is available in Python’s official documentation. The exact design of CPython’s bytecode instruction set is again version dependent - it might change slightly from version to version. So, for the sake of simplicity, I am choosing `CPython 3.6` for this demonstration.

In `CPython 3.6`, all bytecode instructions are exactly *two bytes long* and have the format

`<INSTRUCTION> <ARGUMENT>`

with one byte each. There are 118 unique instructions in the set, some of which do not care about their argument - as in, they don’t have one; the second byte of such instructions is simply ignored by the VM. Every (human-readable) instruction has an integer representation (below 256, of course, because it’s one byte). One can easily figure out the mapping for a few of them by comparing the not-so-readable and the readable versions of the bytecode above.

```
LOAD_FAST -> 124
LOAD_CONST -> 100
BINARY_POWER -> 19
...
```

There is a magic number defined along with the instruction set which determines whether a particular instruction requires an argument or not: if the integer representation of an instruction is less than that number, it doesn’t have an argument - *that’s the rule*. In `CPython 3.6`, the magic number happens to be 90. “**BINARY_POWER**” does not have an argument (or rather ignores it) because its byte representation (i.e., 19) is less than 90.
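You don’t have to reverse-engineer this mapping by hand: CPython’s standard `dis` module exposes it as `dis.opmap`, and the magic number as `dis.HAVE_ARGUMENT`. A small sketch - the exact numbers are version dependent (90 and the codes shown earlier hold for `CPython 3.6`), and `.get()` is used because some instruction names disappear in newer releases:

```python
import dis

# dis.opmap maps instruction names to their one-byte integer codes,
# and dis.HAVE_ARGUMENT is the "magic number" threshold described above.
for name in ("LOAD_FAST", "LOAD_CONST", "BINARY_POWER"):
    op = dis.opmap.get(name)
    if op is not None:
        print(f"{name} -> {op} (takes argument: {op >= dis.HAVE_ARGUMENT})")
```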

Now that you have seen it all, let me disclose how I compiled the Python program and where I got these bytecodes from. It turns out to be very easy: the Python interpreter allows you to peek into the bytecode from the higher-level Python source code itself.

```
>>> def f(x):
...     y = x ** 12
...     return x + y
...
>>> f.__code__
<code object f at 0x7f976d44e810, file "<...>", line 1>
```

See that `f.__code__` attribute? That’s the gateway to the compiled bytecode. `f.__code__` contains some metadata as well. The raw bytecode can be retrieved from

```
>>> f.__code__.co_code
b'|\x00d\x01\x13\x00}\x01|\x00|\x01\x17\x00S\x00'
```

As it happens, some of the bytes are non-printable characters, so they won’t show up properly on the screen. Converting them to integers will produce

```
>>> [int(x) for x in f.__code__.co_code]
[124, 0, 100, 1, 19, 0, 125, 1, 124, 0, 124, 1, 23, 0, 83, 0]
```

the exact sequence of bytes I showed earlier.
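As a sketch, we can redo that decoding ourselves: pair up the `(opcode, argument)` bytes of `co_code` and translate the opcodes with `dis.opname`. Note that on CPython versions newer than 3.6, the instruction stream compiled for this function will differ from the 3.6 listing shown above:

```python
import dis

def f(x):
    y = x ** 12
    return x + y

raw = f.__code__.co_code
# Each instruction occupies two bytes: (opcode, argument).
pairs = [(dis.opname[op], arg) for op, arg in zip(raw[::2], raw[1::2])]
for name, arg in pairs:
    print(name, arg)
```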

If you are a keen observer and have been looking at the bytecode since you first saw it, you might have noticed a discrepancy between the generated bytecode and the Python source program: the “data” used by the program is missing. In this example, I used a numeric value (12) in the statement `y = x ** 12`. Did you see it anywhere in the compiled bytecode? The point here is that the code and the data of a source program live in different objects. Just like `__code__.co_code` holds the raw bytecode instructions, a few other attributes are responsible for holding the data. One of them is

```
>>> f.__code__.co_consts
(None, 12)
```

which holds our numeric value (12) and a (seemingly useless) `None`. I will explain the other attributes in another tutorial, as they are not really used in our running example. This way of looking at bytecode is fine, but for a more in-depth understanding you need something more convenient - the standard library module called `dis` will help you:

```
>>> import dis # basically 'disassembler'
>>> dis.dis(f)
2 0 LOAD_FAST 0 (x)
2 LOAD_CONST 1 (12)
4 BINARY_POWER
6 STORE_FAST 1 (y)
3 8 LOAD_FAST 0 (x)
10 LOAD_FAST 1 (y)
12 BINARY_ADD
14 RETURN_VALUE
```

If you are really interested, go explore the `dis` module. From my side, a detailed explanation of this output is the agenda of the next tutorial in this series.
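As a small taste of `dis`, the module can also be driven programmatically: `dis.Bytecode` yields one `Instruction` object per bytecode instruction, each carrying fields like `opname`, `arg` and `argrepr` (the exact instruction names you see depend on your CPython version):

```python
import dis

def f(x):
    y = x ** 12
    return x + y

# Iterate over Instruction objects instead of printing a table.
names = [ins.opname for ins in dis.Bytecode(f)]
for ins in dis.Bytecode(f):
    print(ins.offset, ins.opname, ins.argrepr)
```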

After seeing all this, an immediate question you might be tempted to ask is whether these machine code lookalikes can be executed directly on a physical system. **Unfortunately, NO**, they cannot. These bytecodes are not designed to be run on any physical CPU. Rather, they are specially crafted to be consumed by a piece of software which forms the second part of Python’s execution process - the Virtual Machine (VM).

**`Python VM`: Interpreting the Bytecodes**

The Python Virtual Machine (VM) is a separate program which comes into the picture after the `Bytecode compiler` has generated the bytecode. It is literally a **simulation of a physical CPU** - it has a software-defined `stack`, an instruction pointer (IP) and what not. Although other virtual machines may have a lot of other components (registers etc.), the `CPython` VM is entirely based on a `stack` data structure, which is why it is often referred to as a “stack-based” virtual machine.

If you cannot get a feel for what it is, I have a dumb little piece of code to show how the Python VM is literally implemented. In reality, the `CPython` VM is implemented in `C`, but for simplicity I am showing an *equivalent* but *terribly simplified* version in Python itself:

```
def action_BINARY_POWER(args, state):
    # ...

def action_RETURN_VALUE(args, state):
    # ...

INSTRUCTION_SET = {
    'BINARY_POWER': action_BINARY_POWER,
    'RETURN_VALUE': action_RETURN_VALUE,
    # ... a lot of these
    # ...
    'LOAD_CONST': # ...
}

class VirtualCPU:
    # Power-ON for the CPU
    def __init__(self):
        # these represent the 'state' of the virtual CPU
        self.stack = Stack()  # a simulated stack
        self.IP = 0           # a simulated instruction pointer

    # executes one instruction
    def exec_instruction(self, instruction, instruction_arg):
        action = INSTRUCTION_SET[instruction]
        action(instruction_arg, (self.stack, self.IP))

    # a running CPU
    def main_loop(self, stream_of_instructions):
        for instruction, argument in stream_of_instructions:
            self.exec_instruction(instruction, argument)

# an instance
cpu = VirtualCPU()
```

Please spend a minute digesting this. This is almost how the Python VM is implemented internally. Upon being fed a stream of `(instruction, argument)` pairs, `VirtualCPU.main_loop()` will keep iterating over them, executing one instruction at a time. If `main_loop()` is fed our previously compiled bytecode program in a properly arranged data structure

```
cpu.main_loop([
    ('LOAD_FAST', 0),
    ('LOAD_CONST', 1),
    ('BINARY_POWER', None),
    ('STORE_FAST', 1),
    ('LOAD_FAST', 0),
    ('LOAD_FAST', 1),
    ('BINARY_ADD', None),
    ('RETURN_VALUE', None)
])
```

`cpu.exec_instruction(instruction, instruction_arg)` will take one instruction, resolve the corresponding action routine (one of the `action_*` functions) from a well-defined set of *instruction-action pairs*, and execute it. Notice that the `action_*()` functions take the state of the CPU as input (in this case the stack and the IP) because they have to manipulate the CPU’s state. This is exactly what happens inside a real CPU.

For a bit more clarity on the `action_*` functions, let’s look at one instruction and its corresponding action function in a little more detail. Consider the “**BINARY_ADD**” instruction: it `pop`s two elements off the stack and `push`es their sum back onto it. This is roughly how it is implemented:

```
def action_BINARY_ADD(args, state):
    # 'args' is None - it is unused here
    stack, ip = state
    addend = stack.pop()
    augend = stack.pop()
    stack.push(augend + addend)  # manipulates the state
    ip += 1                      # manipulates the state
```

Please remember, this is just to show you the logical steps - the actual CPython interpreter is way more complicated and, of course, written in `C`.

**`.pyc` files**

In principle, every time you run a script, Python has to go through both the *compilation* and *interpretation* steps. If you are running the same script over and over again, it is wasteful to execute the bytecode compilation every time, because a particular Python source file produces the same bytecode every time you compile it.

CPython uses a *caching* mechanism to avoid this. It writes down the bytecode the first time you load a module - yes, it happens at module level. If you load it again without modification, it reads the *cached* bytecode and goes through the interpretation step only. These cached bytecodes are kept inside `.pyc` files, which you must have seen before:

```
prompt $ tree
.
├── __pycache__
│ └── module.cpython-36.pyc # THIS one
└── module.py
1 directory, 2 files
```
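You can ask the interpreter where it would place this cache from within Python, using `importlib.util.cache_from_source` (the `module.py` path here is just an example; no such file needs to exist):

```python
import importlib.util
import sys

# Where would CPython cache the bytecode for this source file?
# The tag embedded in the name (e.g. 'cpython-36') identifies the interpreter.
cache_path = importlib.util.cache_from_source("module.py")
print(cache_path)
print(sys.implementation.cache_tag)
```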

The bytecodes, along with the data, are *serialized* into these `.pyc` files using a very special serialization format called `marshal`, which is strictly an internal format of the Python interpreter implementation and not supposed to be used by application programs. If you really want to know more about `marshal`, see Stephane Wirtel’s video on YouTube.
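Although `marshal` is internal, the standard library does expose it, and a small round-trip sketch makes the `.pyc` mechanism concrete (a real `.pyc` file additionally carries a small header with a magic number and source metadata, and the byte format is specific to each CPython version):

```python
import marshal

# Serialize a code object the way a .pyc file does (minus the header),
# then load it back and execute it.
code = compile("answer = 6 * 7", "<demo>", "exec")
blob = marshal.dumps(code)       # the bytes that would land in a .pyc
restored = marshal.loads(blob)   # what import would read back

namespace = {}
exec(restored, namespace)
print(namespace["answer"])  # 42
```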

BibTeX citation:

```
@online{das2019,
  author = {Das, Ayan},
  title = {Advanced {Python:} {Bytecodes} and the {Python} {Virtual}
    {Machine} {(VM)} - {Part} {I}},
  date = {2019-01-01},
  url = {https://ayandas.me//blogs/2019-01-01-python-compilation-process-overview.html},
  langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2019. “Advanced Python: Bytecodes and the Python
Virtual Machine (VM) - Part I.” January 1, 2019. https://ayandas.me//blogs/2019-01-01-python-compilation-process-overview.html.

From a computational perspective, we are living in the era of parallelism. But unfortunately, achieving parallelism with all its bells and whistles is no easy job, and there are good reasons for that. First, when programmers build up their foundations in the early days of their academic careers, they mostly focus on the *serial* way of thinking, which obviously differs from the so-called `parallel` programming paradigm. Second, they often lack the resources (mostly hardware) capable of handling parallel implementations. The second problem, however, is somewhat of a non-issue nowadays, as most modern personal computers can handle *some form* of parallelism (thread-based). *Some other* form of parallelism is what you are here for - it’s called `distributed computing`.

**`Distributed` computing?**

`Distributed computing` refers to a way of writing programs that make use of several distinct components connected via a network. Typically, large-scale computation is achieved by such an arrangement of computers capable of handling high-density numeric computation. In distributed computing terminology, they are often referred to as `node`s, and a collection of such `node`s forms a `cluster` over the network. These nodes are usually connected via a high-bandwidth network to take full advantage of the distributed architecture. There are many more terminologies and concepts, solely related to networking, which are essential for an in-depth understanding of distributed systems, but due to the limited scope of this tutorial, I am skipping over them. We may encounter a few of them incidentally as we go along.

**How can `Deep learning` benefit from distributed computing?**

Deep learning (DL) is an emerging subfield of Artificial Intelligence (A.I.) which has already grabbed the attention of several industries and organizations. Although Neural Networks, the main workhorse of DL, have been in the literature for quite a while, nobody could utilize their full potential until recently. One of the primary reasons for the sudden boost in popularity has to do with massive computational power, the very idea we are trying to address in this tutorial. Deep learning requires training “Deep Neural Networks (DNNs)” with a massive number of parameters on a huge amount of data. Both the size of the data and of the network demand specialized hardware. With the introduction of the General Purpose GPU (GPGPU), companies like NVIDIA opened up a mammoth of opportunities for academic researchers and industries to accelerate their DL innovation. Even after a decade of this GPU-DL friendship, the amount of data to be processed still gives everyone nightmares; it is almost impossible for a moderate or even a high-end workstation to handle the amount of data needed to train DL networks. Distributed computing is a perfect tool to tackle this scale of data. Here is the core idea:

A properly crafted distributed algorithm can:

- “distribute” the computation (the forward and backward passes of a DL model), along with the data, across multiple *nodes* for coherent processing
- establish an effective synchronization among the *nodes* to achieve consistency among them

**`MPI`: a distributed computing standard**

One more term you have to get used to: **Message Passing Interface**, in short `MPI`. MPI is the workhorse of almost all distributed computing. It is an open standard that defines a set of rules on how the “nodes” talk to each other - that’s exactly what MPI is. It is very crucial to digest the fact that MPI is not a software or a tool; it’s a *specification*. A group of individuals and organizations from academia and industry came forward in the summer of 1991, which eventually led to the creation of the “MPI Forum”. The forum, with a consensus, crafted a *syntactic and semantic specification* of a library to serve as a guideline for different hardware vendors to come up with portable/flexible/optimized implementations. Several hardware vendors have their own *implementations* of `MPI`:

- OpenMPI
- MPICH
- MVAPICH
- Intel MPI
- .. and lot more

In this tutorial, we are going to use Intel MPI as it is very performant and also optimized for Intel platforms.

This is where newcomers often fail. Proper setup of a distributed system is very important. Without proper hardware and network arrangements, it’s pretty much useless even if you have a conceptual understanding of the parallel and distributed programming model. The rest of this post will focus on how to set up a standard distributed computing environment. Although the settings/approach might differ a little depending on your platform/environment/hardware, I will try to keep it as general as possible.

First and foremost, you have to have access to more than one node - preferably servers with high-end CPUs (Intel Xeons, maybe) running a Linux server installation. These are not strict requirements for merely executing distributed programs, but if you really want to take advantage of distributed computing, you *need* them. Also make sure the nodes are connected via a common network. To check that, you can simply `ping` from every node to every other node in the *cluster*; for this, you should (preferably) have static IPs assigned to each of them. For convenience, I would recommend assigning hostnames to all of your nodes so that they can be referred to easily when needed. Just add entries like these to the `/etc/hosts` file of each of your nodes (replace the IPs with your own and pick convenient names):

```
10.9.7.11 miriad2a
10.9.7.12 miriad2b
10.9.7.13 miriad2c
10.9.7.14 miriad2d
```

I have a cluster of 4 nodes, on which I can now do the ping test with the names assigned earlier:

```
cluster@miriad2a:~$ ping miriad2b
PING miriad2b (10.9.7.12) 56(84) bytes of data.
64 bytes from miriad2b (10.9.7.12): icmp_seq=1 ttl=64 time=0.194 ms
64 bytes from miriad2b (10.9.7.12): icmp_seq=2 ttl=64 time=0.190 ms
```

```
cluster@miriad2b:~$ ping miriad2c
PING miriad2c (10.9.7.13) 56(84) bytes of data.
64 bytes from miriad2c (10.9.7.13): icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from miriad2c (10.9.7.13): icmp_seq=2 ttl=64 time=0.236 ms
```

For the proper functioning of distributed systems, it is *highly recommended* to create a separate user account with the **same name** on all the nodes and use it for all of your distributed programming. Although there are ways around it, it is strongly recommended to have a general-purpose user account with the same name.

Look at the last two snippets from my terminal: there is a user named `cluster` on all the nodes, which I use for distributed computing.

**`ssh` connectivity**

Typically, the *synchronization* I talked about earlier takes place over the `ssh` protocol. Make sure you have the `openssh-server` and `openssh-client` packages installed (preferably via `apt-get`). There is one more thing about the `ssh` setup which is **very crucial** for the proper working of distributed computing: you have to have **password-less ssh** from any node to any other node in the cluster. It is a way to ensure “seamless connectivity” among the nodes, as in they can “talk to each other” (synchronization, basically) whenever they need. If you are using *OpenSSH*, it is fairly easy to achieve this. Just use `ssh-keygen` to create a public-private key pair and `ssh-copy-id` to transfer it to the destination.

```
cluster@miriad2a:~$ ssh-keygen
... # creates a private-public key pair
cluster@miriad2a:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub miriad2b
... # ask you to enter password for 'cluster@miriad2b'
```

Once you enter the password, it won’t ask for it anymore, which is exactly what we were after. Notice one thing: the `ssh-copy-id ...` command asked for the password of `cluster@miriad2b` although we never specified the username in the command. This is the perk of having the same username everywhere.

Intel MPI needs to be downloaded (licensing required) and installed on all the nodes at the *exact same location*. This is **IMPORTANT**: you need the exact same version of Intel MPI installed at the same location on every node. For example, all my nodes have the same path for the `mpiexec` executable and they are of the same version.

```
cluster@miriad2a:~$ which mpiexec
/opt/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/bin/mpiexec
cluster@miriad2a:~$ mpiexec -V
Intel(R) MPI Library for Linux* OS, Version 2018 Update 2 Build 20180125 (id: 18157)
Copyright 2003-2018 Intel Corporation.
```

```
cluster@miriad2c:~$ which mpiexec
/opt/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/bin/mpiexec
cluster@miriad2c:~$ mpiexec -V
Intel(R) MPI Library for Linux* OS, Version 2018 Update 2 Build 20180125 (id: 18157)
Copyright 2003-2018 Intel Corporation.
```

This is another crucial requirement: the executables containing our distributed application (the DL training) must reside on a filesystem visible to all the nodes. There are more efficient ways of doing this (with specialized hardware), but for the sake of this tutorial we’ll take the easiest route: mounting a **Network File System (NFS)**. We’ll need a few packages for that: `nfs-common` and `nfs-kernel-server`. Although `nfs-kernel-server` is not required on every node, it’s okay to make a complete setup. The point of setting up NFS is to have one specific directory visible from every node in the cluster; we will do all our distributed work inside that directory.

So now we need to create a directory on one of the nodes, which we’ll call our **master node** because it will be the source of the NFS. Add an entry like this in the `/etc/exports` file:

```
cluster@miriad2a:~$ cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
# ...
/home/cluster/nfs *(rw,sync,no_root_squash,no_subtree_check)
```

`/home/cluster/nfs` is the (empty) directory I created and decided to make the source of the NFS on my master node `miriad2a`.

Now, to mount it, we need to add an entry like this in `/etc/fstab` on all the other nodes (except the *master*, of course):

```
cluster@miriad2b:~$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# ...
miriad2a:/home/cluster/nfs /home/cluster/nfs nfs
```

The entry `miriad2a:/home/cluster/nfs /home/cluster/nfs nfs` means: mount an `nfs` filesystem whose source is the remote location `/home/cluster/nfs` on `miriad2a` at the local location `/home/cluster/nfs`. Keep all the paths the same on every node.

Finally, restart your master node first and then restart all the other ones.

**Important Note:** Make sure you have successfully set up password-less ssh before restarting.

**If everything goes well, you should have a cluster of nodes ready for distributed computing.**

Let’s *briefly* talk about `MPI`’s programming model/interface. Although the original `Intel-MPI` implementation is in the **C language**, I would suggest using the Intel distribution of Python, which comes with a very convenient Python wrapper on top of `Intel-MPI` called `mpi4py` (see the doc here). For the sake of this tutorial, and to make it easy to digest, I have decided to use it for the demonstrations.

Before writing any code, it is essential to understand how to execute it, because executing on a distributed system is clearly different from executing a typical executable on a single machine. You need a way to “distribute” *processes*, i.e., your application program written using `MPI`’s programming interface. Here comes the most important command-line utility in any MPI implementation: `mpiexec`. Let’s see a trivial example of launching distributed processes with `mpiexec`.

```
cluster@miriad2a:~/nfs$ mpiexec -n 2 -ppn 1 -hosts miriad2a,miriad2b hostname
miriad2a
miriad2b
```

Woo hoo .. We just ran our first distributed application. Let’s analyze it thoroughly:

- Although the utility can be invoked from any one of the nodes in a cluster, it is always advisable to choose one *master* node and use it for scheduling. `mpiexec` is basically a distributed scheduler which goes into each of your *slave* nodes (via password-less ssh) and runs the command given to it.
- `-n 2` signifies the total number of processes to launch (with `-ppn 1`, that’s one per node: master plus slave).
- `-ppn 1` signifies the number of *processes per node*. You can spawn more than one process on a single node.
- `-hosts <hostname1>,<hostname2>`, as you can guess, tells `mpiexec` which nodes to run your code on. No need to specify a username here: they all have the same username, so MPI can figure it out.
- The command after that is what we want `mpiexec` to run on all the specified nodes. In this toy example, I only executed the command `hostname` on all the specified nodes. If your application is a Python script, you change it to `mpiexec ... "python <script.py>"`. `mpiexec` copies the given command as-is and executes it via ssh, so it is the user’s responsibility to make sure that the *given command is a valid one in the context of every node individually*. For example, launching a Python program requires every node to have the Python interpreter and all required packages installed.

A trivial example using `mpi4py`:

```
from mpi4py import MPI
import platform

hostname = platform.node()
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send(hostname, dest=1, tag=6)
elif rank == 1:
    received_data = comm.recv(source=0, tag=6)
    print('{} got "{}" from rank 0'.format(platform.node(), received_data))
```

Upon invoking the scheduler from `miriad2a`, we got the following output:

```
cluster@miriad2a:~/nfs$ mpiexec -n 2 -ppn 1 -hosts miriad2a,miriad2b python mpitest.py
miriad2b got "miriad2a" from rank 0
```

- This Python program will be executed on both the specified nodes (`miriad2a` and `miriad2b`) with one process each.
- Both will create a variable called `hostname` which stores their respective hostnames (that’s what `platform.node()` does).
- **Important:** understanding the concepts of `world` and `rank`. The term `world` refers to the collection of all the nodes/processes specified in a particular `mpiexec` invocation. `Rank` is a unique integer assigned by the MPI runtime to each of the *processes*, starting from 0. The order in which hosts are specified in the `-hosts` argument is used to assign the numbers, so in this case the process on `miriad2a` will be assigned **Rank 0** and the one on `miriad2b` **Rank 1**. The object `comm` is a handle to the communicator across all nodes/processes.

- A very common pattern used in distributed programming is

```
if rank == <some rank>:
    # .. do this
elif rank == <another rank>:
    # .. do that
```

which helps us separate the pieces of code to be executed on different ranks (by ranks, I mean processes with that rank).

- In this example, Rank 0 is supposed to `send` a piece of data (i.e., the `hostname` variable) to Rank 1:

```
# send the "hostname" variable to Rank 1 (i.e., dest=1) with tag=6
comm.send(hostname, dest=1, tag=6)
```

Although optional, the `tag` is an (arbitrary) number assigned to a particular message/data to be sent; it is then used by the destination rank for *identification* purposes.

- Rank 1 is supposed to `receive` the data from Rank 0 and print it out. The tag must match, of course:

```
# the tag 6 identifies the message sent by Rank 0
received_data = comm.recv(source=0, tag=6)
# printing stuff
```

Before moving on to `PyTorch` and `Deep Learning` in the next tutorial, you need to have PyTorch installed and *properly linked to your MPI implementation*. I would recommend getting the `PyTorch` source code and compiling it yourself following the official instructions. If you have only one MPI implementation in the usual location, `PyTorch`’s build system is smart enough to detect it and link to it.

Okay then, see you in the next one.

BibTeX citation:

```
@online{das2018,
author = {Das, Ayan},
title = {Deep {Learning} at Scale: {Setting} up Distributed Cluster},
date = {2018-12-28},
url = {https://ayandas.me//blogs/2018-12-28-scalable-deep-learning.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2018. “Deep Learning at Scale: Setting up Distributed
Cluster.” December 28, 2018. https://ayandas.me//blogs/2018-12-28-scalable-deep-learning.html.

In the last post we learned about `Distributed computing` and `MPI`, and also demonstrated the steps of setting up a distributed environment. This post will focus on the practical usage of distributed computing strategies to accelerate the training of Deep Learning (DL) models. To be specific, we will focus on one particular distributed training algorithm (namely `Synchronous SGD`) and implement it using `PyTorch`’s distributed computing API (i.e., `torch.distributed`). I will use 4 nodes for demonstration purposes, but the setup easily scales to more. There are two popular ways of parallelizing Deep Learning models:

- Data parallelism
- Model parallelism

Let’s see what they are.

Model parallelism refers to a model being split into two (or more) parts, i.e., some layers in one part and some in another, which are then executed by placing the parts on different hardware/devices. In this strategy, we typically start with a single piece of data and pass it through the parts one by one. Although placing the parts on different devices does have execution benefits (asynchronous processing of data), it is usually employed to work around memory constraints: models with a very large number of parameters, which are difficult to fit into a single device due to their high memory footprint, benefit from this strategy.

Data parallelism, on the other hand, refers to processing multiple pieces (technically, *batches*) of data through multiple *replicas of the same network* located on different hardware/devices. Unlike model parallelism, each replica is an entire network, not a part of it. This strategy, as you might have guessed, scales up well with increasing amounts of data. But, as the entire network has to reside on a single device, it cannot help models with high memory footprints. The illustration above (taken from here) should make it clear.

Practically, *Data parallelism* is more popular and frequently employed in large organizations for executing production quality DL training algorithms. So, in this tutorial, we will fix our focus on data parallelism.
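To make the contrast concrete, here is a tiny, purely illustrative Python sketch (my own toy, not from any framework; `part1`, `part2` and the `model` composition are made-up names) of how the two strategies split the work:

```python
# Toy illustration only: no real devices involved.
# Model parallelism: the model is split into parts (each part could
# live on a different device) and data flows through them one by one.
def part1(x):
    return 2 * x            # "first half" of the model

def part2(x):
    return x + 1            # "second half" of the model

def model(x):
    return part2(part1(x))  # full model = composition of its parts

# Data parallelism: the *whole* model is replicated, and each replica
# processes its own chunk of the data.
data = [1, 2, 3, 4]
chunk_a, chunk_b = data[:2], data[2:]
outputs = [model(x) for x in chunk_a] + [model(x) for x in chunk_b]

# Both strategies compute the same function over the same data.
assert outputs == [model(x) for x in data]
print(outputs)
```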

The `torch.distributed` API: Okay! Now it’s time to learn the tools of the trade, PyTorch’s `torch.distributed` API. If you remember the lessons from the last post (which you should), we made use of the `mpi4py` package, a convenient wrapper on top of the original `Intel MPI` *C* library. Thankfully, PyTorch offers a very similar interface for using the underlying Intel MPI library and runtime. Needless to say, being a part of PyTorch, the `torch.distributed` API has a very elegant and easy-to-use design. We will now see its basic usage and how to execute it.

First things first, you should check the availability of distributed functionality by

```
import torch.distributed as dist

if dist.is_available():
    # do distributed stuff ...
```

Let’s look at this piece of code, which should make sense without explanation given that you remember the lessons from my last tutorial. But I am going to provide a brief explanation in case it doesn’t.

```
# filename: 'ptdist.py'
import torch
import torch.distributed as dist

def main(rank, world):
    if rank == 0:
        x = torch.tensor([1., -1.]) # Tensor of interest
        dist.send(x, dst=1)
        print('Rank-0 has sent the following tensor to Rank-1')
        print(x)
    else:
        z = torch.tensor([0., 0.]) # A holder for receiving the tensor
        dist.recv(z, src=0)
        print('Rank-1 has received the following tensor from Rank-0')
        print(z)

if __name__ == '__main__':
    dist.init_process_group(backend='mpi')
    main(dist.get_rank(), dist.get_world_size())
```

Executing the above code results in:

```
cluster@miriad2a:~/nfs$ mpiexec -n 2 -ppn 1 -hosts miriad2a,miriad2b python ptdist.py
Rank-0 has sent the following tensor to Rank-1
tensor([ 1., -1.])
Rank-1 has received the following tensor from Rank-0
tensor([ 1., -1.])
```

- The first line to be executed is `dist.init_process_group(backend)`, which basically sets up the software environment for us. It takes an argument specifying which backend to use; as we are using MPI throughout, it’s `backend='mpi'` in our case. There are other backends as well (like `TCP`, `Gloo`, `NCCL`).
- Two parameters need to be retrieved: the `rank` and the `world size`, which is exactly what `dist.get_rank()` and `dist.get_world_size()` do, respectively. Remember, the world size and rank depend entirely on the context of the `mpiexec` invocation. In this case the world size will be 2, and `miriad2a` and `miriad2b` will be assigned rank 0 and 1, respectively.
- `x` is a tensor that Rank 0 intends to send to Rank 1. It does so via `dist.send(x, dst=1)`. Quite intuitive, isn’t it?
- `z` is something that Rank 1 creates before receiving the tensor. Yes, you need an already created tensor of the *same shape* as a *holder* for catching the incoming tensor. The values of `z` will eventually be replaced by the values of `x`.
- Just like `dist.send(..)`, the receiving counterpart is `dist.recv(z, src=0)`, which receives the tensor into `z`.

What we saw in the last section is an example of “*peer-to-peer*” communication, where rank(s) send data to specific rank(s) in a given context. Although this is useful and gives you granular control over the communication, there exist other standard and frequently used *patterns* of communication called `collective`s. A full-fledged explanation of all the `collective`s is beyond the scope of this tutorial, but I will describe one particular collective (known as `all-reduce`) which is of interest to us in the context of the Synchronous SGD algorithm.

The common `collective`s include:

1. Scatter
2. Gather
3. Reduce
4. Broadcast
5. All-reduce

The `all-reduce` collective: **All-reduce** is basically a synchronized form of communication where “*a given reduction operation is applied across all the ranks and the reduced result is made available to all of them*”. The illustration above hopefully makes it clear. Now, time for some code.

```
def main(rank, world):
    if rank == 0:
        x = torch.tensor([1.])
    elif rank == 1:
        x = torch.tensor([2.])
    elif rank == 2:
        x = torch.tensor([-3.])

    dist.all_reduce(x, op=dist.reduce_op.SUM)
    print('Rank {} has {}'.format(rank, x))

if __name__ == '__main__':
    dist.init_process_group(backend='mpi')
    main(dist.get_rank(), dist.get_world_size())
```

when launched in a *world* of 3, it produces

```
cluster@miriad2a:~/nfs$ mpiexec -n 3 -ppn 1 -hosts miriad2a,miriad2b,miriad2c python ptdist.py
Rank 1 has tensor([0.])
Rank 0 has tensor([0.])
Rank 2 has tensor([0.])
```

- The same `if rank == <some rank> .. elif` pattern appears again; in this case it is used to create different tensors on different ranks.
- They all agree to execute an `all-reduce` (note that `dist.all_reduce()` is outside the `if .. elif` block) with “*SUM*” as the reduction operation.
- `x` from every rank is summed up, and the sum is placed inside the same `x` on every rank. That’s how all-reduce works.
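If you don’t have an MPI cluster handy, the semantics of all-reduce can be mimicked in plain Python; this little sketch (my own toy, not part of `torch.distributed`) simulates three ranks and a SUM reduction:

```python
# Simulate the all-reduce (SUM) collective without an MPI runtime:
# every "rank" contributes one value, and after the collective,
# every rank holds the reduced (summed) result.
def all_reduce_sum(rank_values):
    total = sum(rank_values)           # the reduction
    return [total] * len(rank_values)  # result made available to all ranks

# same values as in the torch example above: 1., 2., -3.
values = [1.0, 2.0, -3.0]
reduced = all_reduce_sum(values)
for rank, v in enumerate(reduced):
    print('Rank {} has {}'.format(rank, v))
```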

Well! Enough bare-bones distributed computing; let’s dive into what we are actually here for, Deep Learning. I assume the reader is familiar with the “*Stochastic Gradient Descent (SGD)*” algorithm, often used to train deep learning models. We will now see a variant of SGD (called **Synchronous SGD**) that makes use of the `all-reduce` collective. To lay the foundation, let’s start with the mathematical formulation of standard SGD.

The update equation looks like

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \, \frac{1}{|B|} \sum_{b \in B} \mathcal{L}(b; \theta)$$

where $B$ is a set (mini-batch) of samples, $\theta$ is the set of all parameters, $\eta$ is the learning rate and $\frac{1}{|B|}\sum_{b \in B}\mathcal{L}(b;\theta)$ is some loss function averaged over all samples in $B$.

The core trick that Synchronous SGD relies on is splitting the summation over smaller subsets of the (mini-)batch. Suppose $B$ is split into $R$ subsets $B_1, B_2, \ldots, B_R$ (preferably with the same number of samples in each) such that

$$B = \bigcup_{r=1}^{R} B_r, \qquad B_i \cap B_j = \emptyset \;\text{ for } i \neq j$$

Splitting the summation in the update equation accordingly leads to

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \, \frac{1}{|B|} \sum_{r=1}^{R} \sum_{b \in B_r} \mathcal{L}(b; \theta)$$

Now, as the gradient operator is distributive over the summation operator, we get

$$\theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{r=1}^{R} \left[ \sum_{b \in B_r} \nabla_\theta \mathcal{L}(b; \theta) \right]$$

**What do we get out of this?** Look at the individual terms $\sum_{b \in B_r} \nabla_\theta \mathcal{L}(b;\theta)$ above: they can now be computed independently and summed up to get the original gradient without any loss/approximation. This is where **data parallelism** comes into the picture. Here is the whole story:

- Split the *entire dataset* into $R$ equal chunks. The letter $R$ is used to refer to “*Replica*”.
- Launch $R$ processes/ranks using `MPI` and bind each process to one chunk of the dataset.
- Let each rank compute the gradient using a mini-batch $B_r$ of size $|B_r| = \frac{|B|}{R}$ from its own portion of the data, i.e., rank $r$ computes $g_r = \sum_{b \in B_r} \nabla_\theta \mathcal{L}(b; \theta)$.
- Sum up the gradients of all the ranks and make the resulting gradient available to all of them to proceed further.

The last point should look familiar: that’s exactly the `all-reduce` algorithm. So we have to execute an all-reduce every time all ranks have computed one gradient (on a mini-batch of size $|B_r|$) on their own portion of the dataset. *A subtle point to note here* is that summing up the gradients (on batches of size $|B_r|$) from all the ranks leads to an effective batch size of $R \cdot |B_r| = |B|$.

Okay, enough of theory and mathematics. The readers deserve to see some code now. The following is just the crucial part of the implementation

```
model = LeNet()
# first synchronization of initial weights
sync_initial_weights(model, rank, world_size)

optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.85)

model.train()
for epoch in range(1, epochs + 1):
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()

        # The all-reduce on gradients
        sync_gradients(model, rank, world_size)

        optimizer.step()
```

1. All ranks create their own copy/*replica* of the model with random weights.
2. Individual replicas with random weights *may* start out de-synchronized. It is preferable to synchronize the initial weights among all the replicas, and `sync_initial_weights(..)` does exactly that: let any one rank “send” its weights to its siblings, and the siblings must grab them to initialize themselves.

```
def sync_initial_weights(model, rank, world_size):
    for param in model.parameters():
        if rank == 0:
            # Rank 0 is sending its own weights
            # to all its siblings (1 to world_size)
            for sibling in range(1, world_size):
                dist.send(param.data, dst=sibling)
        else:
            # Siblings must receive the parameters
            dist.recv(param.data, src=0)
```

3. Fetch a mini-batch (of size $|B_r|$) from the rank’s respective portion of the data and compute the forward and backward pass (gradient). An important note as part of the setup: every process/rank should have its own portion of the data visible (usually on its own hard disk, or on a shared filesystem).

4. Execute the `all-reduce` collective on the gradients of each replica, with *summation* as the reduction operation. The `sync_gradients(..)` routine looks like this:

```
def sync_gradients(model, rank, world_size):
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
```

5. After the gradients have been synchronized, every replica can execute an SGD update on its own weights *independently*. The `optimizer.step()` call does the job as usual.

A question might arise: *how do we ensure that the independent updates remain in sync?* If we take a look at the update equation for the first update,

$$\theta^{(1)} = \theta^{(0)} - \eta \, g^{(0)}$$

we already made sure that $\theta^{(0)}$ and $g^{(0)}$ (the all-reduced gradient) are synchronized individually (points 2 and 4 above). For obvious reasons, a linear combination of them will also be in sync ($\eta$ is a constant). A similar argument holds for all consecutive updates.
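The argument can be checked numerically; in this hypothetical sketch, two replicas start with the same parameter and apply the same all-reduced gradient:

```python
# Two replicas, already synchronized: same parameter, same learning
# rate, and (thanks to all-reduce) the same summed gradient.
eta = 0.1
theta_replica = [0.5, 0.5]   # theta on rank 0 and rank 1
g = -16.0                    # identical all-reduced gradient on both ranks

# each replica updates *independently* ...
theta_replica = [t - eta * g for t in theta_replica]

# ... yet they remain in sync after the update
assert theta_replica[0] == theta_replica[1]
print(theta_replica)
```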

The biggest bottleneck for any distributed algorithm is the synchronization/communication part. Being an I/O-bound operation, it usually takes more time than computation. Distributed algorithms are beneficial **only if the synchronization time is significantly less than the computation time**. Let’s make a simple comparison between standard and synchronous SGD to see when the latter is beneficial.

Definitions:

- Size of the entire dataset: $D$
- Mini-batch size: $B$
- Time taken to process (forward and backward pass) one mini-batch: $T_{comp}$
- Time taken for one synchronization (all-reduce): $T_{sync}$
- Number of replicas: $R$
- Time taken for one epoch: $T_{epoch}$

For **non-distributed (standard) SGD**,

$$T_{epoch} = \frac{D}{B} \cdot T_{comp}$$

For **Synchronous SGD**, each replica processes only $\frac{D}{R}$ samples per epoch, but pays the synchronization cost on every mini-batch:

$$T_{epoch} = \frac{D}{R \cdot B} \cdot \left( T_{comp} + T_{sync} \right)$$

So, for the distributed setting to be beneficial over the non-distributed one, we need to have

$$\frac{D}{R \cdot B} \cdot \left( T_{comp} + T_{sync} \right) < \frac{D}{B} \cdot T_{comp}$$

or equivalently

$$\frac{T_{sync}}{T_{comp}} < R - 1 \tag{3}$$

The three factors contributing to the inequality (3) above can be tweaked to extract more and more benefit out of the distributed algorithm.

- $T_{sync}$ can be reduced by connecting the nodes over a high-bandwidth (fast) network.
- $T_{comp}$ is not really an option to tweak, as it is fixed for a given piece of hardware.
- $R$ can be increased by connecting more nodes over the network and having more replicas.
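Plugging in some made-up numbers (all hypothetical) shows the epoch-time comparison in action:

```python
# Epoch-time model for the comparison above (all numbers hypothetical).
def epoch_time_standard(D, B, t_comp):
    # (D / B) mini-batches per epoch, each costing t_comp
    return (D / B) * t_comp

def epoch_time_sync_sgd(D, B, t_comp, t_sync, R):
    # each rank sees D / R samples, and pays t_sync per mini-batch
    return (D / (R * B)) * (t_comp + t_sync)

D, B, t_comp, t_sync, R = 60000, 100, 0.05, 0.02, 4
standard = epoch_time_standard(D, B, t_comp)
distributed = epoch_time_sync_sgd(D, B, t_comp, t_sync, R)

# the distributed run wins exactly when t_sync / t_comp < R - 1
assert (distributed < standard) == (t_sync / t_comp < R - 1)
print(standard, distributed)
```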

That’s it. Hope you enjoyed the tour of a different programming paradigm. See you.

BibTeX citation:

```
@online{das2018,
author = {Das, Ayan},
title = {Deep {Learning} at Scale: {The} “Torch.distributed” {API}},
date = {2018-12-28},
url = {https://ayandas.me//blogs/2019-01-15-scalable-deep-learning-2.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2018. “Deep Learning at Scale: The
‘Torch.distributed’ API.” December 28, 2018. https://ayandas.me//blogs/2019-01-15-scalable-deep-learning-2.html.

Python offers a few lesser-used features, namely `Generator`s, `Decorator`s and `Context Manager`s, among which I have already dedicated a full-fledged post to the first one, `Generator`s. This one will be all about `Decorator`s and some of their lesser-known features/applications. Without further ado, let’s dive in.
What are `Decorator`s? Simply put, `Decorator`s are *functionals*: they transform one function into another. In principle, decorators can perform quite complex transformations, but the specific transformation they are mostly used for is *wrapping*, i.e., a decorator can consume one function (a normal Python function) and put some code *around it*, like so

```
def foo(*args, **kwargs):
    # .. definition of foo()
```

can be transformed into

```
def transformed_foo(*args, **kwargs):
    pre_code(*args, **kwargs)
    foo(*args, **kwargs)
    post_code(*args, **kwargs)
```

Here, `pre_code(...)` and `post_code(..)` signify arbitrary code blocks which execute before and after the `foo(...)` function, respectively. Hence the name `Decorator`: it “decorates” a given function by wrapping around it. A point to note here is that `pre_code(..)` and `post_code(..)` may have access to the parameters intended to be passed into the original function `foo(..)`.
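Put concretely (with made-up `pre_code`/`post_code` bodies; the names just mirror the sketch above), the transformation looks like this when written out as runnable Python:

```python
# Runnable version of the wrapping transform described above.
def pre_code(*args, **kwargs):
    print('about to call foo with', args)

def post_code(*args, **kwargs):
    print('done calling foo with', args)

def foo(*args, **kwargs):
    return sum(args)

def transformed_foo(*args, **kwargs):
    pre_code(*args, **kwargs)       # runs before foo
    result = foo(*args, **kwargs)
    post_code(*args, **kwargs)      # runs after foo
    return result

print(transformed_foo(1, 2, 3))
```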

At this point, a typical example of the `Decorator` syntax would be enough to end the discussion, but it is important to grasp a few more concepts on which the idea of `Decorator`s relies.

`Closure`s and `non-local` variables: `Closure`s typically appear in the context of *nested functions* (a function inside the scope of another function). A `Closure` (or `Closure function`) is basically a *function object* that “remembers” the objects in its *defining scope*; those objects are called `non-local`s to the closure. The canonical example to describe `Closure`s and `non-local`s is:

```
def outer():
    ver = 3.6 # <- non-local to inner()
    lang = 'Python' # <- non-local to inner()
    def inner():
        # inner() has access to 'ver' and 'lang'
        print('{} {}'.format(lang, ver))
    inner()
```

If we call `inner()` inside the `outer()` function, the result is no surprise, as it is equivalent to defining a function (i.e., `inner()`) and calling it in the global scope. *BUT*, what if the `inner` function object (it is a function object until we call it with the `()` syntax) is returned, taken outside its *defining scope* (i.e., `outer()`) and then called?

```
def outer():
    ver = 3.6 # <- non-local to inner()
    lang = 'Python' # <- non-local to inner()
    def inner():
        # inner() has access to 'ver' and 'lang'
        print('{} {}'.format(lang, ver))
    return inner # <- returns the 'inner' function object

f = outer() # <- 'inner' function object is now out of its defining scope
f() # <- and then called
```

A programmer with a decent C/C++ background would be tempted to suggest that this code is erroneous, since the objects inside the `outer()` function (`ver` and `lang`) are no longer alive and the `inner` function object can no longer refer to them when called. **NO! Python is a bit different.** Let me connect the definitions with the example: the `Closure function` object `inner` still has access to its `non-local` objects (defined in its defining scope, i.e., inside the `outer` function) and hence won’t complain when `f` (basically `inner`) is called. The output will be

```
>>> f = outer() # <- 'f' now points to the 'inner' function object
>>> f()
Python 3.6
```

To prove the point of `inner` “remember”-ing its *non-local*s, have a look at this `Python 3`-specific way of accessing the *non-local* objects from a function (object):

```
>>> f.__closure__[0].cell_contents # <- peeking into inner's memory
'Python'
>>> f.__closure__[1].cell_contents # <- peeking into inner's memory
3.6
```

Equipped with the ideas of `Closure`s and `non-local`s, we are now ready to see an example of a `Decorator`.

Defining `Decorator`s:

```
def decorate(func):
    # 'func' is basically an object in the scope of 'decorate()'
    def closure(*args, **kwargs):
        print('Execution begins')
        func(*args, **kwargs)
        print('Execution ends')
    return closure
```

Syntactically, the definition of a `Decorator` is no different from the *closure* example we saw before. The outer function essentially *represents* the `Decorator`, which in this case takes a function object as input and produces another function object; that proves my initial claim about `Decorator`s being *functionals*, doesn’t it? The function object it returns is basically the `closure()` function, which remembers `func` as a *non-local* object and hence can invoke it (after `print('Execution begins')` and before `print('Execution ends')`).

Now all you need is a function to decorate, and to apply the `Decorator` to it, like so

```
def sum_original(*args):
    s = 0
    for arg in args:
        s += arg
    print('summation result is', s)

sum_transformed = decorate(sum_original)
```

Invoking `sum_transformed(...)` will result in

```
>>> sum_transformed(1,2,3,4,5)
Execution begins
summation result is 15
Execution ends
```

Python has a cleaner (and almost always used) syntax for *decorating* a function automatically right after defining it. Note that the name of the *transformed function* then remains the same (i.e., `sum` in the example below). It looks like this:

```
@decorate # <- this means: go decorate the function after defining it
def sum(*args):
    s = 0
    for arg in args:
        s += arg
    print('summation result is', s)

# Here onwards, 'sum' will behave as the transformed/decorated version of it
>>> sum(1,2,3,4,5)
Execution begins
summation result is 15
Execution ends
```

Although it is often not an issue, an able programmer should know about the possible consequences of a feature, if any. *Returning a function object* has an unintended side effect: *it loses its name*. Python, being an extremely dynamic language, stores the names (identifiers) of objects as strings within them. These names can be accessed via the `.__name__` attribute of the corresponding object. Let’s check with a dummy example:

```
def foo():
    pass

>>> foo.__name__
'foo'
```

That’s trivial, isn’t it? Let’s try with our (decorated) `sum` function:

```
>>> sum.__name__
'closure'
```

Oops, what happened ?

Basically, when we returned the `closure` function object from the `decorate(..)` function, it still had `'closure'` in its `.__name__` attribute (because it was born with that name). By collecting the function object with a new identifier (`sum` in this case) outside the scope of `decorate(..)`, only the ownership got transferred; the content (all its attributes) remained the same. So, essentially, the `sum` function object inherited the `.__name__` from `closure`, hence the output.

This can be prevented by decorating the closure function with a decorator from the standard library, `functools.wraps`. This is how it works:

```
from functools import wraps

def decorate(func):
    @wraps(func) # <-- Here, THIS is the way to do it
    def closure(*args, **kwargs):
        print('Execution begins')
        func(*args, **kwargs)
        print('Execution ends')
    return closure

@decorate
def sum(*args):
    s = 0
    for arg in args:
        s += arg
    print('summation result is', s)
```

Now, accessing the `.__name__` attribute of `sum` will result in

```
>>> sum.__name__
'sum'
```

Maybe it would be nice to implement the `functools.wraps` function yourself; I am leaving that to the reader as an exercise.

`Decorator`s with arguments: **Decorator**s, just like normal functions, can have arguments. This is useful in cases where we want to customize the decoration. In our running example, we may want to replace the default decoration messages (i.e., “Execution begins” and “Execution ends”) with our own.

To do this, all you need is a function that *outputs a decorator*. Notice the subtle difference here: we now need a function that returns a *Decorator*, which in turn returns a *closure object* as usual. Yes, you got it right, it’s a two-level nested function:

```
from functools import wraps

def make_decorator(begin_msg, end_msg):
    ########### The Decorator ###########
    def decorate(func):
        @wraps(func)
        def closure(*args, **kwargs):
            print(begin_msg) # <- custom msg
            func(*args, **kwargs)
            print(end_msg) # <- custom msg
        return closure
    #####################################
    return decorate # <-- returns the "Decorator function"

@make_decorator('the journey starts', 'the journey ends')
def sum(*args):
    s = 0
    for arg in args:
        s += arg
    print('summation result is', s)
```

Here, `begin_msg` and `end_msg` act as `non-local`s to the `decorate(..)` function. Invoking `sum(..)` results in:

```
>>> sum(1,2,3,4)
the journey starts
summation result is 10
the journey ends
```

Class `Decorator`s: Much like functions, classes can also be *decorated*, and guess what, the syntax is exactly the same (the `@...` one). But class decorators are, functionally, much more flexible and powerful, as they can potentially change the structure (definition) of the class. To be precise, class decorators can add/remove/modify class members as well as the special functions (the `__xxx__` methods) of a class; in short, they can take the *guts* of the class out or replace them. They have a very common implementation pattern, and this is how they look from a high level:

```
def classdecor(cls):
    # input is a 'class'
    cls.static_attr = new_static_attr # add/modify static attribute
    cls.member_func = new_member_func # add/modify member functions
    do_something(cls)
    return cls # return the 'cls'
```

**IMPORTANT** point to note: Class decorators work on the *class definition* and not on objects/instances (of that class). **The class decorators run before any instance of that class has ever been created**. This is how it looks syntactically, and how it is internally expanded:
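A tiny experiment (my own illustrative example, with hypothetical names) confirms this ordering: the decorator body executes exactly once, at class-definition time, before any instance exists.

```python
events = []

def classdecor(cls):
    events.append('decorator ran')
    cls.static_attr = 42   # add a static attribute to the class
    return cls

@classdecor
class Integer:
    def __init__(self, i):
        events.append('instance created')
        self.i = i

# the decorator has already run, even though no instance exists yet
print(events)               # ['decorator ran']
print(Integer.static_attr)  # 42

i = Integer(9)
print(events)               # ['decorator ran', 'instance created']
```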

```
@classdecor
class Integer:
    # ...
```

is converted to

`Integer = classdecor(Integer)`

Now I will conclude with a complete example (and its explanation) of how class decorators can be used.

```
def decorate(func):
    # Looks familiar? This is our good old function decorator :)
    def closure(*args, **kwargs):
        print('member function begins execution')
        func(*args, **kwargs)
        print('member function ends execution')
    return closure

# This is the "Class decorator"
def classdecor(cls):
    # decorates the ".show()" member function with "decorate"
    cls.show = decorate(cls.show)
    return cls

@classdecor
class Integer:
    def __init__(self, i):
        self.i = i
    def show(self):
        print(self.i)
```

As you can understand, the point of this class (i.e., `Integer`) is to be a simple abstraction on top of `int`. The class decorator is basically consuming the class, replacing its `.show()` function with a *decorated version of it* and returning it back. So, whenever I call `.show()`, this is gonna happen (I think the reader can guess the output):

```
>>> i = Integer(9)
>>> j = Integer(10)
>>> i.show(); j.show()
member function begins execution
9
member function ends execution
member function begins execution
10
member function ends execution
```

This tutorial is intended for `Python` programmers who have familiarity with the standard concepts and syntax of Python. In this three-part tutorial, we will specifically look at three features of Python, namely `Generators`, `Decorators` and `Context Managers`, which, in my opinion, are not heavily used by average python programmers. In my experience, these features are lesser known to programmers whose primary purpose for using Python is not to focus on the language too much but just to get their own applications/algorithms working. This leads to very monotonous, imperative-style code which, in the long run, becomes unmaintainable.

**Python** is unarguably the most versatile and easy-to-use language ever created. Today, python enjoys a huge userbase spread across different fields of science and engineering. The reason for such a level of popularity is the *dynamic nature* of Python. It is almost the opposite of bare-metal languages like `C++`, which are known to be *strongly typed*. Python is able to achieve this *dynamicity* by virtue of very careful and elegant design decisions made by the creators of Python in the early days of development. At the end of this tutorial, the reader is expected to get a feel for Python's dynamicity.

Part-I will be all about `Generators`

**`Generator`s**: Here is a motivating example, a very basic program:

```
for i in [0,1,2,3,4,5,6,7,8,9]:
    print(i)
```

In the above program, a “list object” is created in memory containing all the elements, and then traversed as the loop unrolls. The problem with such an approach is the space requirement for the list object, especially if the list is large. An efficient code would never miss a chance to take advantage of the fact that consecutive elements of the list are *logically related*; in this case, the logical relation being `list[i+1] = list[i] + 1`.

Without knowing anything about `Generator`s, one can come up with an efficient solution to this problem by setting up a *generation process* which will *generate* one element at a time using the logical relation. It may sound complicated at first, but it's as easy as this:

```
i = 0
while i < 10:
    print(i)
    i += 1
```

The way *Generators* work is no different from this. The only thing you need to know is the *syntactical formalism*. And here, Python introduces a new keyword called `yield`. Without complicating things at this moment, the primary purpose of `yield` is to “halt the execution of a function (somewhere) in the middle while keeping its *state* intact”. The *state* of a function at a certain point in time refers to the objects (names and values) present in its immediate scope.

So, let's write a little *Generator* (using `yield`) for our previous example and then describe how this definition of `yield` solves our problem.

So, here it is:

```
def generator(upto):
    i = 0
    while i < upto:
        yield i
        i += 1
```

See, it is basically the same code, just wrapped in a function (*that's important*). The other difference is using `yield i` instead of `print(i)`. This *offloads* the usage of the generated elements to the caller/client who requested the generation. `yield i` basically does two things: it returns its argument (i.e., the value of `i` in this case) and halts the execution at that `yield` statement.

Although it looks like a normal function, the invocation of a Generator is a little different. Using the `yield` keyword *anywhere* inside a function automatically makes it a Generator. Calling the function with proper arguments will return a `Generator object`:

```
>>> g = generator(10)
>>> g
<generator object generator at 0x7f3cc9219fc0>
```

and then, the caller/client has to *request for generating one element* at a time like

```
>>> next(g) # next(...) is a built-in function
0
>>> next(g)
1
>>> g.__next__() # same as "next(g)"
2
```

Starting from the beginning, every time `next(g)` or `g.__next__()` is called, the function keeps executing the code normally until a `yield` is encountered. After encountering a `yield`, its argument is returned as the *return value* of `next(g)` (or `g.__next__()`) and the generator waits for another invocation of `next(g)`. So, the generation process remains alive because of the `while` loop inside the Generator. Although, it is totally possible to NOT have a loop in a Generator at all. You may have code like this as well:

```
def generates_three_elems():
yield 1 # <- returns 1 and execution halts for the first time
yield 2 # <- returns 2 and execution halts for the second time
yield 3 # <- returns 3 and execution halts for the third time
```
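Reproducing the loop-free generator above, we can drive it with `next(...)` or simply collect everything with `list(...)`:

```python
def generates_three_elems():
    yield 1
    yield 2
    yield 3

g = generates_three_elems()
print(next(g))  # 1
print(next(g))  # 2
print(next(g))  # 3
# one more next(g) would now raise StopIteration

print(list(generates_three_elems()))  # [1, 2, 3]
```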

The obvious question now is, “What will happen when the `while` loop finishes and the control flow exits the Generator function?”. This is precisely what is used as the *condition of exhaustion* of the Generator: Python throws a `StopIteration` exception when the Generator exhausts. The caller/client code is supposed to intercept the exception:

```
g = generator(10)
gen_exhausted = False
while not gen_exhausted:
    try:
        elem = next(g)
        # use the generated elements
        do_something_with(elem)
    except StopIteration as e:
        gen_exhausted = True
```

OR, equivalently, the caller/client code may use Python's native `for ... in` construct, which internally takes care of the exception handling:

```
for elem in generator(10):
    print(elem)
```

Both versions will produce the same output:

```
0
1
2
3
4
5
6
7
8
9
```
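Both forms can be checked programmatically. Reproducing `generator(..)` from above (my own sanity check, not from the original post), the `for` loop and the manual `try/except` loop collect identical elements:

```python
def generator(upto):
    i = 0
    while i < upto:
        yield i
        i += 1

# version 1: the for-loop handles StopIteration internally
via_for = [elem for elem in generator(10)]

# version 2: manual exhaustion handling
via_next, g = [], generator(10)
while True:
    try:
        via_next.append(next(g))
    except StopIteration:
        break

print(via_for == via_next == list(range(10)))  # True
```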

**Sending objects into a `Generator`**: There are a couple of lesser-known usages of the `yield` keyword, one of them being a way to *interfere/poke into* the generation process. Essentially, `yield` can be used to introduce caller/client-specified object(s) into a generation request. Here is the code:

```
def generator(upto):
    i = 0
    while i < upto:
        r = yield
        yield i + r
        i += 1
```

The way to generate elements now is:

```
>>> g = generator(10)
>>> next(g); g.send(0.1234)
0.1234
>>> next(g); g.send(0.4312)
1.4312
>>> next(g); g.send(0.5)
2.5
```

Let me explain. The `yield` keyword in the `r = yield` statement will evaluate to the object sent into the generator using `g.send(...)`. The value of `r` is then added to `i` and `yield`ed as usual, which comes out of the generator as the return value of the `.send(...)` call. Also notice that we now have to make the extra effort of executing one `next(g)` before we can get the element from `.send(...)`; this is because the `yield i + r` statement halts the execution, but we need to get to the next `r = yield` statement before `g.send(...)` can be executed. So basically, that extra `next(g)` advances the control flow from the `yield i + r` statement of one iteration to the `r = yield` statement of the next iteration.
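To see the same `.send(...)` mechanics in isolation, here is a minimal running-total generator (my own illustrative example, not from the original post). Because there is only one `yield` here, a single priming `next(...)` suffices, and every subsequent `.send(...)` returns a value:

```python
def accumulator():
    total = 0
    while True:
        x = yield total   # returns the current total, then waits for .send(...)
        total += x

acc = accumulator()
next(acc)             # prime the generator: run up to the first "yield"
print(acc.send(10))   # 10
print(acc.send(5))    # 15
print(acc.send(2.5))  # 17.5
```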

**Throwing `Exception`s into a Generator**: Instead of `.send(...)`-ing object(s) into the generation process, you can send an `Exception` and blow it up from inside. The `g.throw(...)` method is to be used here:

```
>>> g = generator(10)
>>> next(g)
>>> g.throw(StopIteration, "just stop it")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 4, in generator
StopIteration: just stop it
```

What `g.throw(NameOfException, ValueOfException)` does is, it penetrates the generator body and effectively replaces the `r = yield` statement with `raise <NameOfException>(<ValueOfException>)`, which blows up the generator, and the exception propagates out of it as usual. As you might have guessed, it is possible to catch the exception by building an exception handler around `r = yield`, like so:

```
def generator(upto):
    i = 0
    while i < upto:
        try:
            r = yield
        except StopIteration as e:
            print('an exception was caught')
        yield i + r
        i += 1
```
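Here is a smaller, self-contained illustration (my own example): a generator that survives a thrown `ValueError` by catching it at the suspended `yield` and resetting its counter. Note that `g.throw(...)`, like `.send(...)`, returns the next value the generator yields after handling the exception:

```python
def resilient_gen():
    i = 0
    while True:
        try:
            yield i
        except ValueError:
            # g.throw(ValueError) resumes here instead of killing the generator
            i = 0        # reset the counter on error
            continue
        i += 1

g = resilient_gen()
print(next(g), next(g), next(g))   # 0 1 2
print(g.throw(ValueError))         # caught inside; the generator restarts at 0
print(next(g))                     # 1
```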

**Delegating `Generator`s**: Another lesser-known but fairly advanced usage of the `yield` keyword is to *transfer/delegate* the generation process to other generator(s).

```
def generator(upto):
    i = 0
    while i < upto:
        yield i
        i += 1

def delegating_gen():
    yield from generator(5)
    yield from generator(3)
```

Look at that new `yield from` syntax. It does exactly what it literally means. Invoking `delegating_gen()` will create a Generator object which, on generation requests, will generate from `generator(5)` first and then hop onto generating from `generator(3)`. As you might have guessed, the `delegating_gen()` function will be (internally) converted into something like this:

```
def delegating_gen():
    for elem in generator(5):
        yield elem
    for elem in generator(3):
        yield elem
```

Both versions of `delegating_gen()` above will produce the same result:

```
>>> for e in delegating_gen():
...     print(e)
0 # <- generation starts from "generator(5)"
1
2
3
4 # <- "generator(5)" exhausts here
0 # <- generation starts from "generator(3)"
1
2 # <- "generator(3)" exhausts here
```

**The `__next__(..)` and `__iter__(..)` methods**: A regular class can also be set up to behave like a Generator. A class in Python is a generator if it follows the **iterator protocol**, which expects it to implement two specific methods: `__next__(self)` and `__iter__(self)`. The `__next__(self)` method is the way to get one element out of the generator, and the `__iter__(self)` method acts as a switch to *start/reset* the generator. Here is how it works:

```
class Series:
    def __init__(self, upto):
        self.i = -1
        self.upto = upto
    def __iter__(self):
        self.i = -1
        return self
    def __next__(self):
        self.i += 1
        if self.i >= self.upto:
            raise StopIteration # the condition of exhaustion
        return self.i

>>> s = Series(10)
>>> s = iter(s) # starts the generator
>>> next(s) # generates as usual
0
>>> next(s)
1
>>> next(s)
2
```

As one can easily infer from the code snippet, our familiar `next(..)` built-in function essentially calls the `.__next__(..)` member function of the object, and the newly introduced `iter(..)` built-in function calls the `.__iter__(..)` function.
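Once `__next__` raises `StopIteration` on exhaustion (a small addition to the `Series` sketch above), the class becomes usable directly in a `for` loop or with `list(...)`, and the `__iter__` reset makes the object re-iterable:

```python
class Series:
    def __init__(self, upto):
        self.i = -1
        self.upto = upto
    def __iter__(self):
        self.i = -1   # reset, so the object is re-iterable
        return self
    def __next__(self):
        self.i += 1
        if self.i >= self.upto:
            raise StopIteration
        return self.i

s = Series(5)
print(list(s))  # [0, 1, 2, 3, 4]
print(list(s))  # [0, 1, 2, 3, 4]  <- __iter__ resets the state each time
```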

Almost all real-life generator classes have a very similar `__iter__()` function. All it has to do is reset the state of the object and return itself (`self`). It looks more or less like this:

```
class Generator:
    # .. __init__() and __next__() as usual
    def __iter__(self):
        # resets the state of the object
        # ...
        return self # almost always
```

That is pretty much all I had to say about `Generators`. The upcoming Part II will be all about `Decorators`.

BibTeX citation:

```
@online{das2018,
author = {Das, Ayan},
title = {Intermediate {Python:} {Generators,} {Decorators} and
{Context} Managers - {Part} {I}},
date = {2018-11-25},
url = {https://ayandas.me//blogs/2018-11-25-intermediate-python-tutorial.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2018. “Intermediate Python: Generators, Decorators and
Context Managers - Part I.” November 25, 2018. https://ayandas.me//blogs/2018-11-25-intermediate-python-tutorial.html.

In my previous article, I discussed the idea of `capsule`s proposed by Geoffrey Hinton and colleagues, which created a buzz in the deep learning community. In that article, I explained in simple terms the motivation behind the idea of `capsule`s and its (minimal) mathematical formalism. It is highly recommended that you read that article as a prerequisite to this one. In this article, I would like to explain the specific `CapsNet` architecture proposed in the same paper, which managed to achieve state-of-the-art performance on MNIST digit classification.
The architecture diagram in the paper is a pretty good representation of the model, but I'll still try to make things easier by explaining it part by part. If you have gone through my previous article, you should have the least trouble understanding the architecture. The `CapsNet` architecture is composed of 3 layers, which are as follows:

**First convolutional layer**: *This is a usual convolutional layer*. In the case of MNIST digits, a single digit image of shape `28 x 28` is convolved with 256 kernels of shape `9 x 9`. In the paper, the authors decided not to zero-pad the inputs, so the feature maps shrink. The output of this layer is 256 feature maps/activation maps of shape `20 x 20`. *ReLU* has been used as the activation function.

**Second convolutional layer** or the **PrimaryCaps layer**: For a clear understanding, I am breaking this layer into two parts. *This is just another convolutional layer* which applies 256 convolutional kernels of shape `9 x 9` (no zero-padding, like the first one) with stride 2, producing 256 activation maps of shape `6 x 6`. The output of this second convolutional layer (`6 x 6 x 256`) is interpreted as a set of 32 “*capsule activation maps*” with capsule dimension 8. By “*capsule activation map*” I mean an activation map of capsules instead of scalar neurons. The diagram below depicts these capsule activation maps quite clearly. So, we have a total of `6*6*32 = 1152` capsules (each of dimension 8), which are then flattened at the capsule level to make an array of 1152 capsules. Finally, each capsule is passed through a vector non-linearity.

A **capsule-to-capsule layer** or the **DigitCaps layer**: This layer is exactly what I explained in the last part of my previous post. The 1152 (lower-level) capsules are connected to 10 (higher-level) capsules, which gives a total of `1152*10 = 11520` weight matrices $W_{ij}$. The 10 higher-level capsules (of dimension 16) represent the 10 final “*digit/class entities*”. The vector non-linearity is again applied to these capsules so that their lengths can be treated as the probability of existence of a particular digit entity. This layer also has the “*dynamic routing*” in it.

To summarize, an image of shape `28 x 28`, when passed through the `CapsNet`, will produce a set of 10 capsule activations (each of dimension 16), each of which represents the “*explicit pose parameters*” of the digit entity (the class) it is associated with.

The objective function used for training the `CapsNet` is well known to the machine learning community. It is a popular variation of the “Hinge Loss”, namely the “*Squared Hinge loss*” or “*L2-SVM loss*”. I do not intend to explain the loss function in detail; you may want to check this instead. The basic idea is to calculate the lengths (probabilities of existence, between 0 and 1) of the 10 digit capsules, maximizing the one corresponding to the label while minimizing the rest.

The loss for the $n^{th}$ sample is

$$ L^{(n)} = \sum_{k \in K} T^{(n)}_k \max\left(0,\, m^+ - \|v^{(n)}_k\|\right)^2 + \lambda \left(1 - T^{(n)}_k\right) \max\left(0,\, \|v^{(n)}_k\| - m^-\right)^2 $$

where $K$ is the set of classes, $\|v^{(n)}_k\|$ is the length of the digit capsule activation of class $k$ for the $n^{th}$ sample, and $T^{(n)}_k$ is the “*one hot*” encoding of the label of sample $n$. $m^+$ and $m^-$ are the margins of the margin loss, which have been taken to be 0.9 and 0.1 respectively. $\lambda$ decays the loss for the absent digit classes.
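As a sanity check, this margin loss for a single sample can be computed in plain Python (my own illustrative sketch; `LAM = 0.5` as used in the paper):

```python
LAM, M_PLUS, M_MINUS = 0.5, 0.9, 0.1

def margin_loss(v_lens, one_hot):
    # v_lens:  lengths of the digit capsule activations, one per class
    # one_hot: T_k, the one-hot encoded label
    loss = 0.0
    for v, t in zip(v_lens, one_hot):
        loss += t * max(0.0, M_PLUS - v) ** 2 \
              + LAM * (1 - t) * max(0.0, v - M_MINUS) ** 2
    return loss

# a confident, correct prediction incurs no loss
print(margin_loss([0.95, 0.05], [1, 0]))  # 0.0
```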

The authors of the paper decided to use a regularizer in the training process which is basically a “*decoder network*” that reconstructs the input digit images from the (16 dimensional) activity vector of its corresponding class-capsule. It is a simple 3-layer fully connected network with *ReLU* activations in the 2 hidden layers and *Sigmoid* in the last layer. The reconstructed vector is the flattened image of size 784. The dimensions of the layers of the decoder network are shown in the figure below.

Link to my full implementation: https://github.com/dasayan05/capsule-net-TF

Let's dive into some code. Things will make sense as we move on. I am using `python` and the `tensorflow` library to create a static computation graph that represents the `CapsNet` architecture.

Let's start by defining the placeholders: a *batch* of MNIST digit images (of shape `batch x image_w x image_h x image_c`) and the *one hot* encoded labels (of shape `batch x n_class`). An extra (batch) dimension on axis 0 will always be there, as we intend to do batch learning.

```
x = tf.placeholder(tf.float32, shape=(None,image_w,image_h,image_c), name='x')
y = tf.placeholder(tf.float32, shape=(None,n_class), name='y')
```

The shape of `x` has been chosen to be compatible with the convolutional layer API (`tf.layers.conv2d`). The two successive `conv2d`s are as follows:

```
conv1_act = tf.layers.conv2d(x, filters=256, kernel_size=9, strides=1, padding='VALID',
                             kernel_initializer=tf.contrib.layers.xavier_initializer(),
                             activation=tf.nn.relu,
                             use_bias=True, bias_initializer=tf.initializers.zeros) # shape: (B x 20 x 20 x 256)
primecaps_act = tf.layers.conv2d(conv1_act, filters=8*32, kernel_size=9, strides=2, padding='VALID',
                                 kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                 activation=tf.nn.relu,
                                 use_bias=True, bias_initializer=tf.initializers.zeros) # shape: (B x 6 x 6 x 256)
```

Now that we have 256 `primecaps` activation maps (of shape `6 x 6`), we have to arrange them into a set of 32 “*capsule activation maps*” with capsule dimension 8. So we `tf.reshape` and `tf.tile` it. The tiling is required to simplify some future computation.

```
primecaps_act = tf.reshape(primecaps_act, shape=(-1, 6*6*32, 1, 8, 1)) # shape: (B x 1152 x 1 x 8 x 1)
# 10 is for the number of classes/digits
primecaps_act = tf.tile(primecaps_act, [1,1,10,1,1]) # shape: (B x 1152 x 10 x 8 x 1)
```

Next, we apply the vector non-linearity (the `squashing` function proposed in the paper):

```
def squash(x, axis):
    # x: input tensor
    # axis: which axis to squash
    # I didn't use tf.norm() here to avoid numerical instability
    sq_norm = tf.reduce_sum(tf.square(x), axis=axis, keep_dims=True)
    scalar_factor = sq_norm / (1 + sq_norm) / tf.sqrt(sq_norm + eps)
    return tf.multiply(scalar_factor, x)

# axis 3 is the capsule dimension axis
primecaps_act = squash(primecaps_act, axis=3) # squashing won't change shape
```

As we have a total of 1152 capsule activations (all squashed), we are now ready to create the capsule-to-capsule layer, or the `DigitCaps` layer. But we need the affine transformation parameters ($W_{ij}$) to produce the “*prediction vectors*” ($\hat{u}_{j|i}$) for all $i$ and $j$.

```
W = tf.get_variable('W', dtype=tf.float32, initializer=tf.initializers.random_normal(stddev=0.1),
                    shape=(1, 6*6*32, 10, 8, 16)) # shape: (1, 1152, 10, 8, 16)
# bsize: batch size
W = tf.tile(W, multiples=[bsize,1,1,1,1]) # shape: (B x 1152 x 10 x 8 x 16)
```

Calculate the prediction vectors. Applying `tf.matmul` on `W` and `primecaps_act` will matrix-multiply the last two dimensions of each tensor. The last two dimensions of `W` and `primecaps_act` are `16 x 8` (because of the `transpose_a=True` option) and `8 x 1` respectively.

```
u = tf.matmul(W, primecaps_act, transpose_a=True) # shape: (B x 1152 x 10 x 16 x 1)
# reshape it for routing
u = tf.reshape(tf.squeeze(u), shape=(-1, 6*6*32, 16, 10)) # shape: (B x 1152 x 16 x 10)
```

Now, it's time for the routing. We declare the logits ($b_{ij}$) as a `tf.constant` so that they re-initialize on every `sess.run()` call, or in other words, on every batch. After `R` iterations, we get the final capsule activations $v_j$.

```
# bsize: batch size
bij = tf.constant(zeros((bsize, 6*6*32, 10), dtype=float32), dtype=tf.float32) # shape: (B x 1152 x 10)
for r in range(R):
    # making sure sum_cij_over_j is one
    cij = tf.nn.softmax(bij, dim=2) # shape: (B x 1152 x 10)
    s = tf.reduce_sum(u * tf.reshape(cij, shape=(-1, 6*6*32, 1, 10)),
                      axis=1, keep_dims=False) # shape: (B x 16 x 10)
    # v_j = squash(s_j); vector non-linearity
    v = squash(s, axis=1) # shape: (B x 16 x 10)
    if r < R - 1: # bij computation not required at the end
        # reshaping v for further multiplication
        v_r = tf.reshape(v, shape=(-1, 1, 16, 10)) # shape: (B x 1 x 16 x 10)
        # the 'agreement'
        uv_dot = tf.reduce_sum(u * v_r, axis=2)
        # update logits with the agreement
        bij += uv_dot
```

That’s all for the forward pass. All that is left is defining the loss and attaching an optimizer to it. We calculate the classification loss (the squared-hinge loss) after computing the lengths of the 10 digit/class capsules.

```
v_len = tf.sqrt(tf.reduce_sum(tf.square(v), axis=1) + eps) # shape: (B x 10)
MPlus, MMinus = 0.9, 0.1
# this is very much similar to the actual mathematical formula I showed earlier
l_klass = y * (tf.maximum(zeros((1,1),dtype=float32), MPlus-v_len)**2) + \
          lam * (1-y) * (tf.maximum(zeros((1,1),dtype=float32), v_len-MMinus)**2)
# take mean loss over the batch
loss = tf.reduce_mean(l_klass)
# add an optimizer of your choice
optimizer = tf.train.AdamOptimizer()
train_step = optimizer.minimize(loss)
```

I am not showing the reconstruction loss here in this article, but my actual implementation does include it. You should refer to it if anything remains unclear.

BibTeX citation:

```
@online{das2017,
author = {Das, Ayan},
title = {CapsNet Architecture for {MNIST}},
date = {2017-11-26},
url = {https://ayandas.me//blogs/2017-11-26-capsnet-architecture-for-mnist.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2017. “CapsNet Architecture for MNIST.” November
26, 2017. https://ayandas.me//blogs/2017-11-26-capsnet-architecture-for-mnist.html.

Geoffrey Hinton and colleagues recently proposed a new type of computational unit (the `capsules`), which he thinks is a better model of the human brain. In this post, I will try to present an intuitive explanation of this new proposal by Hinton and colleagues.
In the ongoing renaissance of deep learning, if we have to choose one specific neural network model with the most contribution, it has to be the Convolutional Neural Network, or ConvNet as we call it. Popularized by Yann LeCun in the 90's, these models have gone through various modifications and improvements since then, be it from a theoretical or an engineering point of view. But the core idea has remained more or less the same.

The recent fuss about `capsules` really started just after the publication of the paper named Dynamic Routing Between Capsules by Sara Sabour, Nicholas Frosst and Geoffrey Hinton. But it turns out that Hinton had this idea way back in 2011 (Transforming Auto-encoders), though for some reason it didn't catch much attention then. This time it did. Equipped with the idea of capsules, Hinton and team also achieved state-of-the-art performance on the MNIST digit classification dataset.

This article is roughly divided into 3 parts:

- Description of the normal ConvNet model
- What Hinton thinks is wrong with it
- The new idea of capsules

I will briefly discuss ConvNets here. If you are not well aware of the nitty-gritties of ConvNet, I would suggest reading a more detailed article (like this).

Convolutional neural networks are specially designed to exploit the 2D structure of images. Usual ConvNet architectures (Figure-1) have two core operations

Convolution operation or filtering means running a 2D kernel (usually much smaller than the image itself) spatially all over an image; it looks for a specific pattern in the image and generates an activation map (or feature map) which shows the locations where it was able to spot the pattern (Figure-2).

The most frequently used type of pooling, i.e., max-pooling, is used to reduce the size of the feature maps/activation maps (Figure-3) for computational benefit. *But* it has one other purpose (which is exactly what Geoff Hinton has a problem with). Max-pooling is supposed to induce a small translation invariance in the learning process. If an entity in the image is translated by a small amount, the activation map corresponding to that entity will shift equally. But the max-pooled output of the activation map remains unaltered.

After stacking multiple convolution and pooling layers, usually all the neurons are flattened into a one-dimensional array and fed into a multilayer perceptron of proper depth and width.

Trained with a supervised learning procedure, the network will be able to produce different levels of abstracted representations of a given image. For example, trained on a dataset with lots of facial images, a ConvNet will possibly learn to detect lower-level features like edges and corners in its earliest layers. The layers above that will learn to detect smaller facial parts like eyes, noses, etc. And the topmost layer will detect whole faces.

In the above illustration, the lower level convolutional layer is detecting facial parts and the layer above (the next convolutional layer) is detecting faces with the help of information from the layer below. The “*white dots*” in the image denote high responses in the activation map indicating a possible presence of the entity it was searching for. One thing to note, the above illustration is only a pictorial representation and does not *exactly* depict what a ConvNet learns in reality. The two convolutional layers in the figure can be any two successive layers in a deeper convnet.

Geoffrey Hinton, in his recent talks on `capsules`, argued that the present form of ConvNets is not a good representation of what happens in the human brain. Particularly, his disagreement is over the way ConvNets solve the `invariance problem`. As I explained earlier, **pooling** is primarily responsible for inducing translation invariance. The argument against pooling, as stated by Hinton himself, is:

> max-pooling solves the wrong problem - we want **equivariance**, not **invariance**

Consider the network I showed in the last figure. Suppose we have training samples like the one shown in the figure, and also its translated version: translated by a significant amount, so that it is well beyond the capability of the max-pooling layer to produce the same activation map. In this case, the network will have **two different neurons** detecting the same face in two different locations.

If we consider the output of the flatten layer to be a **codified** representation of the image, which we often do, it will be clear that the network is **invariant** to translation, i.e., the “*two neurons*” together are able to predict whether a face is present in either location of the image. The problem with this approach is not only the fact that it is now the responsibility of multiple neurons to detect a single entity, but also that it loses the “*explicit pose information*” (phrase coined by Hinton). Although one must realize that there is implicit information in the network, in the form of “*relative activations of neurons*”, about the location of the entity (the face in this case) in the image. This is what neuroscientists call `place-coding`.

The brain also represents an image with several layers of abstraction, but according to Hinton's hypothesis, it learns to detect entities in each layer with **equivariance**. What this means is that the brain contains some “*modules/units*” for detecting different entities - **just one for each entity**. Such modules/units have the ability to “*explicitly represent*” the “*pose*” of an entity. This is called `rate-coding`. Clearly, a scalar neuron is not enough to provide such representational power.

Geoff Hinton often draws a comparison between the “*human vision system*” and “*computer graphics*” by saying:

> the human brain does the opposite of computer graphics

I will try to explain what he means by that. In a typical computer graphics rendering system, we present a 3D model with a set of vertices/edges (which we often call a **mesh**) and it gets converted into a 2D image which can then be visualized on computer screens. How humans do visual perception is pretty much the opposite: the brain figures out the “*explicit pose information*” about an entity from the 2D image and inverts the rendering to get the **mesh** back. This notion of extracting “*explicit pose information*” aligns quite well with the new “modules/units” that I talked about in the previous section.

As I stated earlier, a neuron model that spits out a scalar value is certainly not enough to represent the explicit pose of an entity. This is where Hinton and team came up with the idea of `capsules`, which is nothing but an *extension of the familiar neuron model*.

> A `capsule` is basically a “**vector-neuron**” which takes in a bunch of vectors and produces a single vector. They are simply “**vector-in-vector-out**” computation units.

These are the mathematical notations I'm gonna use here onwards:

- $u_i$: the input vector of dimension $d$ for capsule $i$ of layer $l$
- $v_j$: the output vector of dimension $d'$ for capsule $j$ of layer $l+1$
- $W_{ij}$: the weight matrix between capsule $i$ of layer $l$ and capsule $j$ of layer $l+1$
- $c_{ij}$: the coupling coefficients between capsule $i$ of layer $l$ and capsule $j$ of layer $l+1$. I'll come to this later in detail.

The pre-activation of capsule $j$ is given by

$$ s_j = \sum_i c_{ij}\, \hat{u}_{j|i} $$

where $\hat{u}_{j|i} = W_{ij} u_i$. Then the activation is given by $v_j = g(s_j)$, where $g(\cdot)$ is a “*vector nonlinearity*”. The $\hat{u}_{j|i}$ is called the “*prediction vector*” from capsule $i$ of layer $l$ to capsule $j$ of layer $l+1$.

The authors of the paper presented one particular vector non-linearity

$$ v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|} $$

which worked in their case, but it certainly is not the only one. Considering the fast pace at which the deep learning community works, it won't take too long to come up with new and improved vector non-linearities.
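To make the behavior of this squashing non-linearity, $v = \frac{\|s\|^2}{1+\|s\|^2}\frac{s}{\|s\|}$, concrete, here is a plain-Python sketch (my own illustrative code, with an `eps` guard added for numerical safety): long vectors keep their direction but get a length just below 1, while short vectors are shrunk towards 0.

```python
import math

def squash(s, eps=1e-9):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    sq = sum(x * x for x in s)
    scale = sq / (1.0 + sq) / math.sqrt(sq + eps)
    return [scale * x for x in s]

def length(v):
    return math.sqrt(sum(x * x for x in v))

print(length(squash([3.0, 4.0])))  # ~0.96: a long vector ends up close to length 1
print(length(squash([0.1, 0.0])))  # ~0.01: a short vector shrinks towards 0
```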

Equipped with the mathematical model of `capsules`, we now have a way to represent the “*explicit pose parameters*” of an entity. One **capsule** in any layer is a *dedicated* unit for detecting a single entity. If the **pose** of the entity changes (shifts, rotates, etc.), the same capsule will remain responsible for detecting that entity, just with a change in its activity vector ($v_j$). For example, a capsule detecting a face will output a different activity vector when the face is rotated. Additionally, the length of the vector (the euclidean distance of the vector from the origin) represents the probability of existence of that entity. The kind of vector non-linearity used in the capsule model ensures that the length of the vector lies between 0 and 1 (interpreted as a probability).

Although we now have a structurally different neuron model, two consecutive `capsule layers` will still learn two levels of abstracted representation, just like a normal ConvNet.

But as we now have more representational power in a single neuron (namely, a capsule), we should exploit it to ensure a *meaningful information flow* between two neurons of successive layers. The parametric structure ($W_{ij}$ and $c_{ij}$) of the capsule has been carefully designed to do exactly that. The reader is advised to take extra care in understanding the next two sub-sections, because they are the *heart* of the idea of capsules.

is a model parameter and learned by a supervised training process just like the connections in the normal neural networks. But, they have a quite different interpretation than that of the scalar neurons. Activities () of each capsule in the lower level gets matrix-multiplied by and produces what the authors of the paper call “**prediction vectors**” (). There is a reason for such a name.

basically performs an “

affine transform” on the incoming capsule activations ( - pose parameters of lower level entities) and makes a “guess” about what the activities of the higher level capsules () could be.

In our running example of face detection, the activity vector (pose parameters) of the eye detector capsule (let's call it $\mathbf{u}_{eye}$) makes a prediction $\hat{\mathbf{u}}_{face|eye}$, which is nothing but a guess of $\mathbf{v}_{face}$, the activity of the face detector capsule (i.e., the pose of the face) in the next layer.

As shown in the illustration above, the pose vectors of the eye, nose and mouth detector capsules in the lower-level layer ($\mathbf{u}_{eye}$, $\mathbf{u}_{nose}$ and $\mathbf{u}_{mouth}$) individually predicted possible poses of the face ($\hat{\mathbf{u}}_{face|eye}$, $\hat{\mathbf{u}}_{face|nose}$ and $\hat{\mathbf{u}}_{face|mouth}$). The mouth detector in the figure needs some extra attention, as it has predicted a face pose which does not "*align*" with the real face pose in the input image. To sum it up, the parameters $W_{ij}$ model a "**part-whole**" relationship between the lower- and higher-level entities (or capsules).
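To make the part-whole transform concrete, here is a minimal NumPy sketch. The capsule dimensions (an 8-D part pose predicting a 16-D whole pose) and the random weights are hypothetical stand-ins for quantities a real network would learn:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: an 8-D "eye" capsule predicting the
# 16-D pose of a "face" capsule in the layer above.
u_eye = rng.standard_normal(8)             # pose of the eye entity
W_eye_face = rng.standard_normal((16, 8))  # learned part-whole transform

# Prediction vector: the eye capsule's guess of the face capsule's pose.
u_hat_face_given_eye = W_eye_face @ u_eye
```

Each lower-level capsule $i$ holds one such matrix per higher-level capsule $j$, so every part gets to cast its own vote about every whole.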

If you understood the previous sub-section properly, it should be clear by now that there can be some lower-level prediction vectors which won't "*align*" with the higher-level capsule activities, indicating that they are not related by a "**part-whole**" relationship. In the paper "Dynamic Routing Between Capsules", the authors proposed a way to "*de-couple*" the connections between such capsules.

The process is fairly simple: take a prediction vector $\hat{\mathbf{u}}_{j|i}$ from the lower level and an activity vector $\mathbf{v}_j$ from the next level, and measure how much they agree by computing an "**agreement**" quantity $a_{ij} = \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$. Judging by the value of $a_{ij}$, we can then "*strengthen*" or "*weaken*" the corresponding connection strength by raising or lowering $c_{ij}$ appropriately.

The process can be thought of as collecting "*high-dimensional votes*" from all the capsules below, matching them against the top-level capsule, and de-coupling the ones which don't agree with it. By "*fading away*" the incoming connections that don't agree, we force the connection parameters ($W_{ij}$) to learn more prominent "**part-whole**" relationships. In the figure below, the incoming prediction vectors ($\hat{\mathbf{u}}_{j|i}$) for a specific $\mathbf{v}_j$ are shown for two consecutive iterations. The intensity of the grey color denotes the "*degree of contribution*" of a particular $\hat{\mathbf{u}}_{j|i}$ towards $\mathbf{v}_j$.

Having the "**prediction vectors**" ($\hat{\mathbf{u}}_{j|i}$) available, the exact way of doing dynamic routing is as follows:

- set $b_{ij} = 0$ for a single training sample
- for $r$ iterations:

  - compute $c_{ij} = \mathrm{softmax}(b_{ij})$ over all $j$
  - compute $\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}$ and $\mathbf{v}_j = \mathrm{squash}(\mathbf{s}_j)$ according to the capsule-model equations
  - compute the "**agreement**" $a_{ij} = \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$ between $\hat{\mathbf{u}}_{j|i}$ and $\mathbf{v}_j$, and update $b_{ij} \leftarrow b_{ij} + a_{ij}$

- take $\mathbf{v}_j$ at the end of the routing iterations

Now, a couple of things to note here:

- $c_{ij}$ is the true coupling coefficient. Having an intermediate $b_{ij}$ (passed through a softmax) makes sure $\sum_j c_{ij} = 1$
- We compute the $c_{ij}$ for each training sample; $c_{ij}$ is not a global model parameter, and it resets to the initial $b_{ij} = 0$ for each sample while training
- We execute the routing in the process of computing $\mathbf{v}_j$; $\mathbf{v}_j$ is the end product of $r$ iterations of routing
- We usually take $r$ to be $3$
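The routing steps above can be sketched end-to-end in NumPy. The capsule counts and dimensions below are arbitrary placeholders; in a real network the prediction vectors $\hat{\mathbf{u}}_{j|i}$ would come from learned $W_{ij}$ matrices rather than a random generator:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-8):
    sq = (s**2).sum(axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / (np.sqrt(sq) + eps)

def routing(u_hat, r=3):
    """Dynamic routing over prediction vectors.
    u_hat: (num_lower, num_upper, dim) array of u_hat_{j|i}."""
    n_i, n_j, _ = u_hat.shape
    b = np.zeros((n_i, n_j))              # logits b_ij, reset per sample
    for _ in range(r):
        c = softmax(b, axis=1)            # couplings from each i sum to 1 over j
        s = (c[..., None] * u_hat).sum(axis=0)   # s_j = sum_i c_ij * u_hat_{j|i}
        v = squash(s)                     # v_j, shape (n_j, dim)
        b = b + (u_hat * v[None]).sum(axis=-1)   # b_ij += u_hat_{j|i} . v_j
    return v

rng = np.random.default_rng(0)
v = routing(rng.standard_normal((6, 3, 16)))  # 6 lower, 3 upper capsules
```

Note that `b` lives only inside the function, matching the point above: the routing logits are recomputed from scratch for every sample, while only the $W_{ij}$ (here faked by the random `u_hat`) would be trained by backpropagation.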

So, that brings us to the end of the general discussion on capsules. In the next article, I will explain the specific **CapsNet** architecture (with a TensorFlow implementation) that has been used for the MNIST digit classification task. See you.

BibTeX citation:

```
@online{das2017,
author = {Das, Ayan},
title = {An Intuitive Understanding of {Capsules}},
date = {2017-11-20},
url = {https://ayandas.me//blogs/2017-11-20-an-intuitive-understanding-of-capsules.html},
langid = {en}
}
```

For attribution, please cite this work as:

Das, Ayan. 2017. “An Intuitive Understanding of Capsules.”
November 20, 2017. https://ayandas.me//blogs/2017-11-20-an-intuitive-understanding-of-capsules.html.