In the previous article, I introduced Directed Probabilistic Graphical Models (PGMs) and a family of algorithms for efficient approximate inference on them. Inference in Directed PGMs with continuous latent variables is intractable in general and requires special attention. That family of algorithms, namely Variational Inference (VI), introduced in the last article, is a general formulation for approximating the intractable posterior in such models. The Variational Autoencoder, famously known as VAE, is an algorithm based on the principles of VI and has gained a lot of attention in the past few years for being extremely efficient. With a few more approximations/assumptions, VAE established a clean mathematical formulation that has later been extended by researchers and used in numerous applications. In this article, I will explain the intuition as well as the mathematical formulation of Variational Autoencoders.
Variational Inference: A recap
A quick recap would make going forward easier.
Given a Directed PGM with continuous latent variable $\mathbf{z}$ and observed variable $\mathbf{x}$, the inference problem for $\mathbf{z}$ turned out to be intractable because of the form of its posterior

$$p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})}{p(\mathbf{x})} = \frac{p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})}{\int p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}}$$

where the marginal $p(\mathbf{x})$ in the denominator involves an intractable integral over the latent space.

To solve this problem, VI defines a parameterized approximation of $p(\mathbf{z}|\mathbf{x})$, i.e., $q_\phi(\mathbf{z})$, and formulates inference as an optimization problem

$$\phi^* = \arg\min_\phi \; D_{KL}\left( q_\phi(\mathbf{z}) \;\|\; p(\mathbf{z}|\mathbf{x}) \right)$$

The objective can further be simplified as

$$D_{KL}\left( q_\phi(\mathbf{z}) \;\|\; p(\mathbf{z}|\mathbf{x}) \right) = \log p(\mathbf{x}) - \underbrace{\Big( \mathbb{E}_{q_\phi(\mathbf{z})}\big[ \log p(\mathbf{x}|\mathbf{z}) \big] - D_{KL}\big( q_\phi(\mathbf{z}) \;\|\; p(\mathbf{z}) \big) \Big)}_{\mathcal{L}(\phi)}$$

Since $\log p(\mathbf{x})$ does not depend on $\phi$, minimizing the KL divergence is equivalent to maximizing $\mathcal{L}(\phi)$, known as the Evidence Lower BOund (ELBO). $\mathcal{L}(\phi)$ is precisely the objective we maximize. The ELBO can best be explained by decomposing it into two factors: one takes care of maximizing the expected conditional log-likelihood (of the data given the latent), and the other arranges the latent space in a way that it matches a predefined distribution $p(\mathbf{z})$.
For a detailed explanation, go through the previous article.
Variational Autoencoder
Variational Autoencoder (VAE) was first proposed in the paper titled “Auto-Encoding Variational Bayes” by D. P. Kingma & Max Welling. The paper proposes two things:
A parameterized inference model $q_\phi(\mathbf{z}|\mathbf{x})$ instead of just $q_\phi(\mathbf{z})$
The reparameterization trick to achieve efficient training
As we go along, I will try to convey the fact that these are essentially developments on top of the general VI framework we learnt earlier. I will focus on how each of them is related to VI in the following (sub)sections.
The “Inference Model”
Fig.1. Subfig.1: The Bayesian Network defining VAE. Subfig.2: The forward pass (abstract) of VAE. Subfig.3: The forward pass of VAE with explicit sampling shown at the end of the encoder
The idea is to replace the generically parameterized $q_\phi(\mathbf{z})$ in the VI framework by a data-driven model $q_\phi(\mathbf{z}|\mathbf{x})$, named the Inference model. What does it mean? It basically means we are no longer interested in an unconditional distribution on $\mathbf{z}$; instead, we want a conditional distribution on $\mathbf{z}$ given the observed data. Please recall our “generative view” of the model

$$\mathbf{z} \sim p(\mathbf{z}), \quad \mathbf{x} \sim p(\mathbf{x}|\mathbf{z})$$

With the inference model in hand, we now have an “inference view” as follows

$$\mathbf{x} \sim \text{Data}, \quad \mathbf{z} \sim p(\mathbf{z}|\mathbf{x})$$

It means we can do inference just by ancestral sampling after our model is trained. Of course, we don’t know the real $p(\mathbf{z}|\mathbf{x})$, so we consider the parameterized approximation $q_\phi(\mathbf{z}|\mathbf{x})$ as I already mentioned.

These two “views”, when combined, form the basis of the Variational Autoencoder (see Fig.1: Subfig.1)

$$\mathbf{x} \sim \text{Data}, \quad \mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x}), \quad \hat{\mathbf{x}} \sim p(\mathbf{x}|\mathbf{z})$$

The “combined model” shown above gives us insight into the training process. Please note that the model starts from $\mathbf{x}$ (a data sample from our dataset) - generates $\mathbf{z}$ via the Inference model - and then maps it back to $\mathbf{x}$ again using the Generative model (see Fig.1: Subfig.2). I hope the reader can now guess why it’s called an Autoencoder! So, we clearly have a computational advantage here: we can perform training on a per-sample basis, just like inference. This is not true for many of the approximate inference algorithms of the pre-VAE era.
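To make the $\mathbf{x} \rightarrow \mathbf{z} \rightarrow \hat{\mathbf{x}}$ round trip concrete, here is a tiny PyTorch sketch of the abstract forward pass (Fig.1: Subfig.2). The layer sizes and the plain linear layers are illustrative assumptions on my part, and the sampling step is deliberately omitted here; we will add it below.

```python
import torch
import torch.nn as nn

# A toy sketch of the x -> z -> x_hat round trip (sizes are illustrative).
x_dim, z_dim = 784, 16
encoder = nn.Linear(x_dim, z_dim)   # stands in for the Inference model q_phi(z|x)
decoder = nn.Linear(z_dim, x_dim)   # stands in for the Generative model p(x|z)

x = torch.rand(1, x_dim)            # a single data sample
z = encoder(x)                      # "encode" x into a latent code
x_hat = decoder(z)                  # "decode" the latent code back into data space
print(x_hat.shape)                  # torch.Size([1, 784])
```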
So, succinctly, all we have to do is a “forward pass” through the model (yes, the two sampling equations above) and maximize $\log p(\mathbf{x}|\mathbf{z})$, where $\mathbf{z}$ is a sample we got from the Inference model. Note that we need to parameterize the generative model as well (with $\theta$). In general, we almost always choose $q_\phi(\mathbf{z}|\mathbf{x})$ and $p_\theta(\mathbf{x}|\mathbf{z})$ to be fully-differentiable functions like Neural Networks (see Fig.1: Subfig.3 for a cleaner diagram). Now we go back to our objective function from the VI framework. To formalize the training objective for VAE, we just need to replace $q_\phi(\mathbf{z})$ by $q_\phi(\mathbf{z}|\mathbf{x})$ in the VI framework (please compare the equations with the recap section)

$$\phi^* = \arg\min_\phi \; D_{KL}\left( q_\phi(\mathbf{z}|\mathbf{x}) \;\|\; p(\mathbf{z}|\mathbf{x}) \right)$$

And the objective

$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\big[ \log p_\theta(\mathbf{x}|\mathbf{z}) \big] - D_{KL}\big( q_\phi(\mathbf{z}|\mathbf{x}) \;\|\; p(\mathbf{z}) \big)$$

Then, we choose a specific (Gaussian) form for the inference model

$$q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left( \mathbf{z};\; \mu_\phi(\mathbf{x}),\; \sigma^2_\phi(\mathbf{x})\,\mathbf{I} \right)$$

where $\mu_\phi(\mathbf{x})$ and $\sigma_\phi(\mathbf{x})$ are the outputs of the encoder network.

As usual, $p(\mathbf{z})$ is a chosen distribution whose structure we want $q_\phi(\mathbf{z}|\mathbf{x})$ to match; it is often the Standard Gaussian/Normal (i.e., $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$).

The specific parameterization of $q_\phi(\mathbf{z}|\mathbf{x})$ reveals that we predict a distribution in the forward pass just by predicting its parameters $\mu_\phi(\mathbf{x})$ and $\sigma_\phi(\mathbf{x})$.
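As an illustration, here is a minimal PyTorch sketch of such an inference model: a small network that outputs the parameters $\mu_\phi(\mathbf{x})$ and $\log \sigma^2_\phi(\mathbf{x})$ instead of a single code. The architecture, dimensions, and the log-variance convention are my own assumptions, not something prescribed by the paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """A minimal sketch of the inference model q_phi(z|x): it predicts the
    parameters (mu, log sigma^2) of a diagonal Gaussian over z."""
    def __init__(self, x_dim=784, h_dim=128, z_dim=16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mean vector of q_phi(z|x)
        self.log_var = nn.Linear(h_dim, z_dim)   # log sigma^2 (keeps sigma positive)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

# usage: the "forward pass" predicts a distribution by predicting its parameters
enc = Encoder()
mu, log_var = enc(torch.rand(8, 784))   # a batch of 8 samples
print(mu.shape, log_var.shape)          # torch.Size([8, 16]) torch.Size([8, 16])
```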
The first term of $\mathcal{L}(\theta, \phi; \mathbf{x})$ is relatively easy; it’s a loss function that we have used a lot in machine learning - the log-likelihood. Very often it is just the MSE loss between the predicted $\hat{\mathbf{x}}$ and the original data $\mathbf{x}$. What about the second term? It turns out that we can have a closed-form solution for that. Because I don’t want unnecessary maths to clutter this post, I am just putting the formula for the readers to look at. But I would highly recommend looking at the proof in Appendix B of the original VAE paper. It’s not hard, believe me. So, putting the proper values of $q_\phi(\mathbf{z}|\mathbf{x})$ and $p(\mathbf{z})$ into the KL term, we get

$$D_{KL}\left( q_\phi(\mathbf{z}|\mathbf{x}) \;\|\; p(\mathbf{z}) \right) = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)$$

Please note that $\mu_j$ and $\sigma_j$ are the individual dimensions of the predicted mean and std vectors, and $J$ is the dimensionality of the latent space. We can easily compute this in the forward pass and add it to the log-likelihood (first term) to get the full (ELBO) loss.
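A minimal sketch of that computation in PyTorch, assuming the encoder predicts `log_var` $= \log\sigma^2$ as above (a common numerical convenience, not something mandated by the formula itself):

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent
    dimensions and averaged over the batch, following the formula above."""
    kl_per_sample = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
    return kl_per_sample.mean()

# sanity check: the KL is zero when q_phi(z|x) equals the standard normal prior
mu = torch.zeros(4, 16)
log_var = torch.zeros(4, 16)                # sigma = 1
print(kl_to_standard_normal(mu, log_var))   # tensor(0.)
```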
Okay. Let’s talk about the forward pass in a bit more detail. Believe me, it’s not as easy as it looks. You may have noticed (Fig.1: Subfig.3) that the forward pass contains a sampling operation (sampling from $q_\phi(\mathbf{z}|\mathbf{x})$) which is NOT differentiable. What do we do now?
The reparameterization trick
Fig.2. Subfig.1: The full forward pass. Subfig.2: The full forward pass with reparameterized sampling.
I showed before that in the forward pass, we get the latent sample $\mathbf{z}$ by sampling from our parameterized inference model. Now that we know the exact form of the inference model, the sampling will look something like this

$$\mathbf{z} \sim \mathcal{N}\left( \mu_\phi(\mathbf{x}),\; \sigma^2_\phi(\mathbf{x})\,\mathbf{I} \right)$$

The idea is basically to make this sampling operation differentiable w.r.t. $\mu_\phi(\mathbf{x})$ and $\sigma_\phi(\mathbf{x})$. In order to do this, we pull a trick like this

$$\mathbf{z} = \mu_\phi(\mathbf{x}) + \sigma_\phi(\mathbf{x}) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

This is known as the “reparameterization”. We basically rewrite the sampling operation in a way that separates the source of randomness (i.e., $\epsilon$) from the deterministic quantities (i.e., $\mu_\phi(\mathbf{x})$ and $\sigma_\phi(\mathbf{x})$). This allows the backpropagation algorithm to flow derivatives into $\mu_\phi(\mathbf{x})$ and $\sigma_\phi(\mathbf{x})$. However, please note that it is still not differentiable w.r.t. $\epsilon$, but .. guess what .. we don’t need it! Just having derivatives w.r.t. $\mu_\phi(\mathbf{x})$ and $\sigma_\phi(\mathbf{x})$ is enough to flow them backwards and pass them to the parameters of the inference model (i.e., $\phi$). Fig.2 should make everything clear if not already.
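Here is a minimal PyTorch sketch of the reparameterized sampling; the function name and the `log_var` convention are my assumptions, consistent with the encoder sketch above.

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps, with eps ~ N(0, I). The randomness lives entirely
    in eps, so gradients can flow into mu and sigma (and hence into phi)."""
    sigma = torch.exp(0.5 * log_var)     # sigma = exp(log sigma^2 / 2)
    eps = torch.randn_like(sigma)        # no gradient flows through eps (and none is needed)
    return mu + sigma * eps

# check that derivatives reach mu and log_var through the sample
mu = torch.zeros(1, 16, requires_grad=True)
log_var = torch.zeros(1, 16, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()
print(mu.grad is not None, log_var.grad is not None)   # True True
```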
Wrap up
That’s pretty much it. To wrap up, here is the full forward-backward algorithm for training a VAE (a code sketch follows the list):
Given $\mathbf{x}$ from the dataset, compute $\mu_\phi(\mathbf{x})$ and $\sigma_\phi(\mathbf{x})$ using the inference model (encoder).
Compute a latent sample as $\mathbf{z} = \mu_\phi(\mathbf{x}) + \sigma_\phi(\mathbf{x}) \odot \epsilon$, with $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
Compute the full loss as $\mathcal{L}(\theta, \phi; \mathbf{x}) = \log p_\theta(\mathbf{x}|\mathbf{z}) - D_{KL}\left( q_\phi(\mathbf{z}|\mathbf{x}) \;\|\; p(\mathbf{z}) \right)$.
Update parameters as $\theta \leftarrow \theta + \eta \nabla_\theta \mathcal{L}$ and $\phi \leftarrow \phi + \eta \nabla_\phi \mathcal{L}$ (gradient ascent on the ELBO).
Repeat.
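Here is a minimal end-to-end sketch of one such training iteration in PyTorch. It assumes the `Encoder`, `reparameterize` and `kl_to_standard_normal` sketches from above are in scope; the decoder architecture, the MSE reconstruction term and the hyper-parameters are illustrative choices, not the paper’s exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """A sketch of the generative model p_theta(x|z): maps a latent sample back to data space."""
    def __init__(self, z_dim=16, h_dim=128, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)

encoder, decoder = Encoder(), Decoder()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(x):
    mu, log_var = encoder(x)                     # 1. q_phi(z|x): predict mu and log sigma^2
    z = reparameterize(mu, log_var)              # 2. sample z with the reparameterization trick
    x_hat = decoder(z)                           # 3. p_theta(x|z): reconstruct x
    recon = F.mse_loss(x_hat, x, reduction='sum') / x.size(0)   # MSE stands in for the (negative) log-likelihood term
    kl = kl_to_standard_normal(mu, log_var)      # KL( q_phi(z|x) || N(0, I) )
    loss = recon + kl                            # negative ELBO, to be minimized
    optimizer.zero_grad()
    loss.backward()                              # 4. backpropagate through the whole graph
    optimizer.step()                             # 5. update theta and phi
    return loss.item()

loss = train_step(torch.rand(8, 784))            # one iteration on a (random) batch; repeat over the dataset
```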
Citation
BibTeX citation:
@online{das2020,
  author = {Das, Ayan},
  title = {Foundation of {Variational} {Autoencoder} {(VAE)}},
  date = {2020-01-01},
  url = {https://ayandas.me/blogs/2020-01-01-variational-autoencoder.html},
  langid = {en}
}