Energy Based Models (EBMs): A comprehensive introduction

We talked extensively about Directed PGMs in my earlier article and also described one particular model following the principles of Variational Inference (VI). There exists another class of models, conveniently represented by Undirected Graphical Models, which are practiced relatively less than modern Deep Learning (DL) methods in the research community. They are also characterized as Energy Based Models (EBMs) because, as we shall see, they rely on something called Energy Functions. In the early days of this Deep Learning renaissance, we discovered a few extremely powerful models which helped DL gain momentum. The class of models we are going to discuss has far more theoretical support than modern day Deep Learning which, as we know, has largely relied on intuition and trial-and-error. In this article, I will introduce you to the general concept of Energy Based Models (EBMs), their difficulties and how we can get over them. We will also look at a specific family of EBMs known as Boltzmann Machines (BMs), which are very well known in the literature.

Undirected Graphical Models

The story starts when we try to model a number of Random Variables (RVs) as a graph but only have weak knowledge about them: we know which variables are related, but not the direction of influence. Direction is a necessary requirement for Directed PGMs. For example, let’s consider a lattice of atoms (Fig.1(a)) where only neighbouring atoms influence each other’s spins, but it is unclear what the direction of the influence is. For simplicity, we will use a simpler model (Fig.1(b)) for demonstration purposes.

Fig.1: (a) An atom lattice model. (b) An arbitrary undirected model.

We model a set of random variables $X$ (in our example, $\{A, B, C, D\}$ ) whose connections are defined by a graph $G$ and have “potential functions” defined on each of its maximal cliques $Q \in Cliques (G)$ . The total potential of the graph is defined as

$Φ (x) = \prod_{Q \in Cliques (G)} ϕ_{Q} (q)$

$q$ is an arbitrary instantiation of the set of RVs denoted by $Q$ . The potential functions $ϕ_{Q} (q) \in R_{> 0}$ are basically “affinity” functions on the state space of the cliques, e.g. given a state $q$ of a clique $Q$ , the corresponding potential function $ϕ_{Q} (q)$ returns the viability of that state, i.e. how likely that state is. Potential functions are somewhat analogous to conditional densities from Directed PGMs, except for the fact that potentials are arbitrary positive values. They don’t necessarily sum to one. For a concrete example, the graph in Fig.1(b) can be factorized as $Φ (a, b, c, d) = ϕ_{\{A, B, C\}} (a, b, c) \cdot ϕ_{\{A, D\}} (a, d)$ . If we assume the variables $\{A, D\}$ are binary RVs, the potential function corresponding to that clique, at its simplest, could be a table like this:

$\begin{array}{l} ϕ_{\{A, D\}} (a = 0, d = 0) = + 4.00 \\ ϕ_{\{A, D\}} (a = 0, d = 1) = + 0.23 \\ ϕ_{\{A, D\}} (a = 1, d = 0) = + 5.00 \\ ϕ_{\{A, D\}} (a = 1, d = 1) = + 9.45 \end{array}$

Just like every other model in machine learning, the potential functions can be parameterized, leading to

$\begin{matrix} (1) & Φ (x; Θ) = \prod_{Q \in Cliques (G)} ϕ_{Q} (q; Θ_{Q}) \end{matrix}$

Semantically, a potential denotes how likely a given state is: the higher the potential, the more likely the state.

Reparameterizing in terms of Energy

When we are defining a model, however, it is inconvenient to choose a constrained (non-negative valued) parameterized function. We can easily reparameterize each potential function in terms of energy functions $E_{Q}$ where

$\begin{matrix} (2) & ϕ_{Q} (q; Θ_{Q}) = \exp (- E_{Q} (q; Θ_{Q})) \end{matrix}$

The $\exp (\cdot)$ forces the potentials to always be positive, and thus we are free to choose an unconstrained energy function. One question you might ask - “why the negative sign ?”. Frankly, there is no functional purpose to that negative sign. All it does is reverse the semantic meaning of the parameterized function. When we were dealing in terms of potentials, a state which is more likely had a higher potential. Now it’s the opposite - states that are more likely have less energy. Do these semantics sound familiar ? They actually come from Physics, where we deal with “energies” (potential, kinetic etc.) which are “bad”, i.e. less energy means a more stable system.

Such reparameterization affects the semantics of Eq.1 as well. Putting Eq.2 into Eq.1 yields

$\begin{aligned} Φ (x; Θ) & = \prod_{Q \in Cliques (G)} \exp (- E_{Q} (q; Θ_{Q})) \\ (3) & = \exp (- \sum_{Q \in Cliques (G)} E_{Q} (q; Θ_{Q})) = \exp (- E_{G} (x; Θ)) \end{aligned}$

Here we defined $E_{G} (x; Θ) ≜ \sum_{Q \in Cliques (G)} E_{Q} (q; Θ_{Q})$ to be the energy of the whole model. Please note that the reparameterization helped us convert the relationship between the individual cliques and the whole graph from multiplicative (Eq.1) to additive (Eq.3). This implies that when we design energy functions for such undirected models, we design an energy function for each individual clique and just add them.
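As a quick numerical sanity check of Eq.3, here is a minimal sketch in Python (NumPy) for the graph of Fig.1(b). It reuses the $ϕ_{\{A, D\}}$ table given earlier; the table for $ϕ_{\{A, B, C\}}$ is invented purely for illustration, and names such as `phi_ad` and `energy_total` are my own, not part of any library.

```python
import itertools
import numpy as np

# Potential table for clique {A, D} from the text, indexed as phi_ad[a, d].
phi_ad = np.array([[4.00, 0.23],
                   [5.00, 9.45]])

# Hypothetical potential table for clique {A, B, C}, indexed as phi_abc[a, b, c].
# These numbers are made up; any strictly positive values would do.
rng = np.random.default_rng(0)
phi_abc = rng.uniform(0.1, 5.0, size=(2, 2, 2))

# Eq.2 read backwards: energies are negative log-potentials.
e_ad = -np.log(phi_ad)
e_abc = -np.log(phi_abc)

def potential_total(a, b, c, d):
    """Total potential of the graph: product over the cliques (Eq.1)."""
    return phi_abc[a, b, c] * phi_ad[a, d]

def energy_total(a, b, c, d):
    """Total energy of the graph: sum over the cliques (Eq.3)."""
    return e_abc[a, b, c] + e_ad[a, d]

# The two views agree on every joint state: Phi(x) = exp(-E_G(x)).
for a, b, c, d in itertools.product([0, 1], repeat=4):
    assert np.isclose(potential_total(a, b, c, d),
                      np.exp(-energy_total(a, b, c, d)))
```

The point of the sketch is only that multiplying potentials and adding energies describe the same model; nothing here is normalized yet.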

All this is fine .. well .. unless we need to do things like sampling, computing log-likelihood etc. Then our energy-based parameterization fails, because it’s not easy to incorporate an un-normalized function into probabilistic frameworks. So we need a way to get back to probabilities.

Back to Probabilities

The obvious way to convert the un-normalized potentials of the model into a normalized distribution is to explicitly normalize Eq.3 over its domain

$\begin{aligned} p (x; Θ) & = \frac{Φ (x; Θ)}{\sum_{x^{'} \in Dom (X)} Φ (x^{'}; Θ)} \\ (4) & = \frac{\exp (- E_{G} (x; Θ) / τ)}{\sum_{x^{'} \in Dom (X)} \exp (- E_{G} (x^{'}; Θ) / τ)} \quad \text{(using Eq.3)} \end{aligned}$

This is the probabilistic model implicitly defined by the energy functions over the whole state-space. [We will discuss $τ$ shortly. Consider it to be 1 for now]. If the reader is familiar with Statistical Mechanics at all, they might find it extremely similar to the Boltzmann Distribution. Here’s what Wikipedia says:

In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution) is a probability distribution or probability measure that gives the probability that a system will be in a certain state as a function of that state’s energy …

From now on, Eq.4 will be the sole connection between energy-space and probability-space. We can now forget about potential functions. A 1-D example of an energy function and the corresponding probability distribution is shown below:

Fig.2: An energy function and its corresponding probability distribution.

The denominator of Eq.4 is often known as the “Partition Function” (denoted as $Z$ ). Whatever the name may be, it is quite difficult to compute in general, because the number of terms in the summation grows exponentially with the number of variables in $X$ .

A hyper-parameter called “temperature” (denoted as $τ$ ) is often introduced in Eq.4 which also has its roots in the original Boltzmann Distribution from Physics. A decrease in temperature gathers the probability mass near the lowest energy regions. If not specified, consider $τ = 1$ for the rest of the article.
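To make the role of $τ$ concrete, here is a small sketch of Eq.4 on a finite state space, assuming a made-up 1-D energy over five states. The function name `boltzmann` and the energy values are mine, chosen only for illustration.

```python
import numpy as np

def boltzmann(energies, tau=1.0):
    """Eq.4 on a finite state space: p(x) = exp(-E(x) / tau) / Z."""
    logits = -np.asarray(energies, dtype=float) / tau
    logits -= logits.max()      # stabilise exp(); the shift cancels in the ratio
    unnorm = np.exp(logits)
    return unnorm / unnorm.sum()  # the denominator is the partition function Z

# Hypothetical energy values for 5 states (invented for illustration).
energies = np.array([2.0, 0.5, 1.0, 3.0, 0.7])

for tau in (5.0, 1.0, 0.1):
    print(tau, np.round(boltzmann(energies, tau), 3))
```

Running it should show the mass spreading out towards uniform at high temperature and concentrating on the lowest-energy state (index 1 here) as $τ$ shrinks. Here $Z$ is a sum over only five states; the difficulty discussed above appears when the state space is exponentially large.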

A general learning algorithm

The question now is - how do I learn the model given a dataset ? Let’s say my dataset has $N$ samples: $D = \{x^{(i)}\}_{i = 1}^{N}$ . An obvious way to derive a learning algorithm is to minimize the Negative Log-Likelihood (NLL) loss of the model over our dataset

$\begin{aligned} L (Θ; D) & = - \log \prod_{i = 1}^{N} p (x^{(i)}; Θ) = \sum_{i = 1}^{N} - \log p (x^{(i)}; Θ) \\ & = \underbrace{\frac{1}{N} \sum_{i = 1}^{N}}_{\text{expectation}} [E_{G} (x^{(i)}; Θ)] + \log Z \\ & \quad \text{(putting Eq.4 followed by trivial calculations, and} \\ & \quad \text{dividing the loss by the constant } N \text{, which doesn’t affect the optima)} \\ & = E_{x \sim p_{D}} [E_{G} (x; Θ)] + \log Z \end{aligned}$

Computing gradient w.r.t. parameters yields

$\begin{aligned} \frac{\partial L}{\partial Θ} & = E_{x \sim p_{D}} [\frac{\partial E_{G}}{\partial Θ}] + \frac{\partial}{\partial Θ} \log Z \\ & = E_{x \sim p_{D}} [\frac{\partial E_{G}}{\partial Θ}] + \frac{1}{Z} \frac{\partial}{\partial Θ} [\sum_{x^{'} \in Dom (X)} \exp (- E_{G} (x^{'}; Θ))] \quad \text{(using the definition of } Z \text{)} \\ & = E_{x \sim p_{D}} [\frac{\partial E_{G}}{\partial Θ}] + \sum_{x^{'} \in Dom (X)} \underbrace{\frac{1}{Z} \exp (- E_{G} (x^{'}; Θ))}_{\text{RHS of Eq.4}} \cdot \frac{\partial (- E_{G} (x^{'}; Θ))}{\partial Θ} \\ & \quad \text{(both } Z \text{ and the partial operator are independent} \\ & \quad \text{of } x^{'} \text{ and can be pushed inside the summation)} \\ & = E_{x \sim p_{D}} [\frac{\partial E_{G}}{\partial Θ}] - \underbrace{\sum_{x^{'} \in Dom (X)} p (x^{'}; Θ)}_{\text{expectation}} \cdot \frac{\partial E_{G} (x^{'}; Θ)}{\partial Θ} \\ (5) & = E_{x \sim p_{D}} [\frac{\partial E_{G}}{\partial Θ}] - E_{x \sim p_{Θ}} [\frac{\partial E_{G}}{\partial Θ}] \end{aligned}$

Take a few minutes to digest Eq.5. That’s a very important result. It would be worth discussing it a bit further. The first term in Eq.5 is often known as the “Positive Phase” and the second term as “Negative Phase”. The only difference between them, as you can see, is in the distributions on which the expectations are taken. The first expectation is on the data distribution - essentially picking up data from our dataset. The second expectation is on the model distribution - sampling from the model with current parameters. To understand their semantic interpretation, we need to see them in isolation. For the sake of explanation, consider both terms separately yielding a parameter update rule

$Θ := Θ - λ \cdot E_{x \sim p_{D}} [\frac{\partial E_{G}}{\partial Θ}], \quad \text{and} \quad Θ := Θ + λ \cdot E_{x \sim p_{Θ}} [\frac{\partial E_{G}}{\partial Θ}]$

The first update rule basically tries to change the parameters in a way that minimizes the energy function at points coming from the data. The second one tries to maximize (notice the difference in sign) the energy function at points coming from the model. The original update rule (combining both of them) has both of these effects working simultaneously. The minimum of the loss landscape occurs when our model discovers the data distribution, i.e. $p_{Θ} \approx p_{D}$ . At this point, the positive and negative phases are approximately the same and the gradient becomes zero (i.e., no more progress). Below is a clear picture of the update process. The algorithm pushes the energy down at places where the original data lies; it also pulls the energy up at places where the model thinks the original data lies.

Fig.3: (a) Model is being optimized. The arrows depict the “pulling up” and “pushing down” of energy landscape. (b) Model has converged to an optimum.

Whatever the interpretation may be, there is a catch. As I mentioned before, the denominator of $p (x; Θ)$ (see Eq.4) is intractable in the general case, so computing the expectation in the negative phase is extremely hard. In fact, that is the only difficulty that makes this algorithm practically challenging.
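Before worrying about intractability, it may help to see Eq.5 at work on a state space small enough to enumerate. The sketch below assumes the simplest possible parameterization, one free energy value per state ( $E (x = s) = θ_{s}$ ), which is my own toy choice and not a model from this article. Under that assumption $\frac{\partial E}{\partial θ}$ is a one-hot vector, so the positive phase reduces to $p_{D}$ and the negative phase to $p_{Θ}$ .

```python
import numpy as np

def model_probs(theta):
    """Eq.4 with one free energy per state: E(x = s) = theta[s]."""
    logits = -theta - (-theta).max()   # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def nll(theta, p_data):
    """E_{x ~ p_D}[E(x)] + log Z, computed exactly by enumeration."""
    log_z = np.log(np.sum(np.exp(-theta)))
    return np.dot(p_data, theta) + log_z

def nll_grad(theta, p_data):
    """Eq.5: positive phase minus negative phase.

    Since dE(x = s)/dtheta is one-hot at s, the two expectations are
    simply the data probabilities and the model probabilities.
    """
    return p_data - model_probs(theta)

# Hypothetical data distribution over 4 states (invented for illustration).
p_data = np.array([0.1, 0.6, 0.2, 0.1])
theta = np.zeros(4)          # start from a flat energy landscape

lr = 0.5
for step in range(2000):
    theta -= lr * nll_grad(theta, p_data)

print(np.round(model_probs(theta), 3))
```

Each step lowers the energy of states that the data visits more often than the model does and raises it where the model over-commits, and the updates stop once the two distributions match. In real models the negative phase cannot be enumerated like this, which is exactly the difficulty described above.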

Gibbs Sampling

As we saw in the last section, the only difficulty we have in implementing Eq.5 is not being able to sample from an intractable density (Eq.4). It turns out, however, that the conditional density of a small subset of variables given the others is indeed tractable in most cases. That is because, for conditionals, the $Z$ cancels out. The conditional density of one variable (say $X_{j}$ ) given the others (let’s denote them by $X_{- j}$ ) is:

$\begin{matrix} (6) & p (x_{j} | x_{- j}) = \frac{p (x)}{p (x_{- j})} = \frac{\exp (- E_{G} (x_{j}, x_{- j}))}{\sum_{x_{j}^{'}} \exp (- E_{G} (x_{j}^{'}, x_{- j}))} \quad \text{(using Eq.4)} \end{matrix}$

I excluded the parameter symbols for notational brevity. That summation in the denominator is not as scary as the one in Eq.4 - it’s only over one variable. We take advantage of this and wisely choose a sampling algorithm that uses conditional densities. It’s called Gibbs Sampling. Well, I am not going to prove it. You either have to take my word for it OR read about it in the link provided. For the sake of this article, just believe that the following works.

To sample $x \sim p_{Θ} (x)$ , we iteratively execute the following for $T$ iterations

We have a sample from last iteration $t - 1$ as $x^{(t - 1)}$
We then pick one variable $X_{j}$ (in some order) at a time and sample from its conditional given the others: $x_{j}^{(t)} \sim p (x_{j} | \underbrace{x_{1}^{(t)}, \dots, x_{j - 1}^{(t)}}_{\text{current iteration}}, \underbrace{x_{j + 1}^{(t - 1)}, \dots, x_{D}^{(t - 1)}}_{\text{previous iteration}})$ . Please note that once we have sampled a variable, we fix its value to the latest one; otherwise we keep the value from the previous iteration.

We can start this process by setting $x^{(0)}$ to anything. If $T$ is sufficiently large, the samples towards the end are (approximately) true samples from the density $p_{Θ}$ . To know it a bit more rigorously, I highly recommend going through this. You might be curious as to why this algorithm has an iterative process. That’s because Gibbs sampling is an MCMC family algorithm, which has something called a “Burn-in period”: the early samples still depend on the arbitrary starting point and are usually discarded.
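The procedure above can be sketched in a few lines for binary variables. This is a minimal illustration under my own assumptions: the helper names `gibbs_chain` and `toy_energy` are hypothetical, the energy function is invented, and the burn-in length of 1000 sweeps is an arbitrary choice. Because the toy model has only three variables, the samples can be compared against the exact distribution obtained by enumerating Eq.4.

```python
import itertools
import numpy as np

def gibbs_chain(energy, dim, n_sweeps, rng):
    """Systematic-scan Gibbs sampling over binary vectors, using Eq.6.

    `energy` maps a 0/1 vector of length `dim` to a scalar energy.
    Returns an array of shape (n_sweeps, dim): the state after each sweep.
    """
    x = rng.integers(0, 2, size=dim)          # arbitrary starting point x^(0)
    samples = np.empty((n_sweeps, dim), dtype=int)
    for t in range(n_sweeps):
        for j in range(dim):
            # Energy with x_j = 0 and x_j = 1, all other variables held fixed.
            x[j] = 0
            e0 = energy(x)
            x[j] = 1
            e1 = energy(x)
            # Eq.6: p(x_j = 1 | x_-j) = exp(-e1) / (exp(-e0) + exp(-e1)).
            p1 = 1.0 / (1.0 + np.exp(e1 - e0))
            x[j] = int(rng.random() < p1)
        samples[t] = x
    return samples

def toy_energy(x):
    """A made-up energy over 3 binary variables, for illustration only."""
    return -1.5 * x[0] * x[1] + 0.5 * x[2] - 0.3 * x[0]

rng = np.random.default_rng(0)
samples = gibbs_chain(toy_energy, dim=3, n_sweeps=20000, rng=rng)
kept = samples[1000:]                          # discard the burn-in period

# Exact distribution by enumeration (only feasible because dim is tiny).
states = np.array(list(itertools.product([0, 1], repeat=3)))
exact = np.exp([-toy_energy(s) for s in states])
exact /= exact.sum()

# Empirical frequencies of the 8 states among the kept samples.
codes = kept @ np.array([4, 2, 1])
empirical = np.bincount(codes, minlength=8) / len(kept)
print(np.round(exact, 3))
print(np.round(empirical, 3))
```

Note that the sampler never touches $Z$ : it only needs energy differences between two configurations that differ in one variable.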

Now that we have pretty much everything needed, let’s explore some popular models based on the general principles of EBMs.

Boltzmann Machine

The Boltzmann Machine (BM) is one particular model that has been in the literature for a long time. BM is the simplest one in its family and is used for modelling a binary random vector $X \in \{0, 1\}^{D}$ with $D$ components $[X_{1}, X_{2}, \dots, X_{D}]$ . All $D$ RVs are connected to all others by an undirected graph $G$ .

By design, BM has a fully connected graph and hence only one maximal clique (containing all RVs). The energy function used in BM is possibly the simplest one you can imagine:

$\begin{matrix} (7) & E_{G} (x; W) = - \frac{1}{2} x^{T} W x \end{matrix}$

Upon expanding the vectorized form (the reader is encouraged to try), and taking $W$ to be symmetric with a zero diagonal so that the factor $\frac{1}{2}$ cancels the double counting, we can see each term $x_{i} \cdot W_{i j} \cdot x_{j}$ for all $i < j$ as the contribution of the pair of RVs $(X_{i}, X_{j})$ to the whole energy function. $W_{i j}$ is the “connection strength” between them. If a pair of RVs $(X_{i}, X_{j})$ turn on together more often, a high value of $W_{i j}$ would encourage reducing the total energy. So by means of learning, we expect to see $W_{i j}$ going up if $(X_{i}, X_{j})$ fire together. This phenomenon is the founding idea of a closely related learning strategy called Hebbian Learning, proposed by Donald Hebb. Hebbian theory basically says:

If fire together, then wire together

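How do we learn this model then ? We have already seen the general way of computing the gradient. We have $\frac{\partial E_{G}}{\partial W} = - x x^{T}$ (counting each symmetric pair $W_{i j} = W_{j i}$ as a single parameter; otherwise a constant factor $\frac{1}{2}$ appears, which can be absorbed into the learning rate). So let’s use Eq.5 to derive a learning rule: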

$W := W - λ \cdot (E_{x \sim p_{D}} [- x x^{T}] - E_{x \sim Gibbs (p_{W})} [- x x^{T}])$
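To see this learning rule in action without any sampling, here is a sketch for a tiny BM where the negative phase is computed exactly by enumerating all $2^{D}$ states. That is only feasible for very small $D$ and is my own simplification for illustration; the function names, the synthetic dataset and the zero-diagonal constraint on $W$ are assumptions, not something prescribed above.

```python
import itertools
import numpy as np

def all_states(D):
    """All 2^D binary vectors as rows of a (2^D, D) array."""
    return np.array(list(itertools.product([0.0, 1.0], repeat=D)))

def bm_probs(W, states):
    """Exact Eq.4 for the BM energy E = -1/2 x^T W x, by enumeration."""
    neg_energy = 0.5 * np.einsum("si,ij,sj->s", states, W, states)
    neg_energy -= neg_energy.max()             # numerical stability
    p = np.exp(neg_energy)
    return p / p.sum()

def bm_nll(W, data, states):
    """Average negative log-likelihood: E_data[E(x)] + log Z."""
    e_data = -0.5 * np.einsum("ni,ij,nj->n", data, W, data)
    neg_e_all = 0.5 * np.einsum("si,ij,sj->s", states, W, states)
    return e_data.mean() + np.log(np.sum(np.exp(neg_e_all)))

def bm_update(W, data, states, lr):
    """One step of W := W + lr * (E_data[x x^T] - E_model[x x^T])."""
    pos = data.T @ data / len(data)                     # positive phase
    p = bm_probs(W, states)
    neg = np.einsum("s,si,sj->ij", p, states, states)   # negative phase (exact)
    W = W + lr * (pos - neg)
    np.fill_diagonal(W, 0.0)   # assumption: no self-connections
    return W

# Synthetic data (invented): units 0 and 1 always fire together,
# units 2 and 3 are independent coin flips.
rng = np.random.default_rng(0)
N, D = 500, 4
x0 = rng.integers(0, 2, size=N)
data = np.column_stack([x0, x0,
                        rng.integers(0, 2, size=N),
                        rng.integers(0, 2, size=N)]).astype(float)

states = all_states(D)
W = np.zeros((D, D))
nll_before = bm_nll(W, data, states)
for _ in range(200):
    W = bm_update(W, data, states, lr=0.2)
nll_after = bm_nll(W, data, states)
print(nll_before, nll_after)
print(np.round(W, 2))
```

The printed NLL should go down over training, and inspecting `W` shows how the connection between the two co-firing units compares with the others. For realistic $D$ the exact `neg` term is replaced by an average over Gibbs samples.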

Equipped with Gibbs sampling, it is pretty easy to implement now. But my description of the Gibbs sampling algorithm was very general. We have to specialize it for implementing BM. Remember that conditional density we talked about (Eq.6) ? For the specific energy function of BM (Eq.7), that has a very convenient and tractable form:

$p (x_{j} = 1 | x_{- j}; W) = σ (W_{- j}^{T} \cdot x_{- j})$

where $σ (\cdot)$ is the Sigmoid function and $W_{- j} \in R^{D - 1}$ denotes the vector of parameters connecting $x_{j}$ with the rest of the variables $x_{- j} \in \{0, 1\}^{D - 1}$ . I am leaving the proof for the readers; it’s not hard, maybe a bit lengthy [Hint: Just put the BM energy function in Eq.6 and keep simplifying]. This particular form makes the nodes behave somewhat like computation units (i.e., neurons) as shown in Fig.5 below:

Fig.5: The computational view of BM showing its dependencies by arrows.
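If you would rather check the sigmoid form numerically before proving it, the following sketch compares the brute-force conditional from Eq.6 against the closed form for a random BM. It assumes $W$ is symmetric with a zero diagonal (no self-connections); the sizes and the random seed are arbitrary choices of mine.

```python
import numpy as np

def bm_energy(x, W):
    """Eq.7: E(x; W) = -1/2 x^T W x."""
    return -0.5 * x @ W @ x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D = 5

# Assumption: W is symmetric with a zero diagonal.
A = rng.normal(size=(D, D))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)

x = rng.integers(0, 2, size=D).astype(float)
j = 2                                   # the variable we condition out

# Brute-force conditional from Eq.6: only two terms in the denominator.
x_on, x_off = x.copy(), x.copy()
x_on[j], x_off[j] = 1.0, 0.0
w_on = np.exp(-bm_energy(x_on, W))
w_off = np.exp(-bm_energy(x_off, W))
p_eq6 = w_on / (w_on + w_off)

# Closed form: sigmoid of the weighted input coming from the other units.
others = np.arange(D) != j
p_sigmoid = sigmoid(W[j, others] @ x[others])

print(p_eq6, p_sigmoid)
```

The two numbers should agree up to floating point error. With a non-zero diagonal an extra bias-like term would appear, which is why the zero-diagonal assumption is stated explicitly.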

Boltzmann Machine with latent variables

To add more expressiveness to the model, we can introduce latent/hidden variables. They are not observed, but help explain the visible variables (see my Directed PGM article). However, all variables are still fully connected to each other (shown below in Fig.6(a)).

[ A little disclaimer: as we have already covered a lot of the founding ideas, I am going to go over this a bit faster. You may have to look back and find analogies with our previous formulations ]

Fig.6: (a) Undirected graph of BM with hidden units (shaded ones are visible). (b) Computational view of the model while computing conditionals.

Suppose we have $K$ hidden units and $D$ visible ones. The energy function is defined very similarly to that of the normal BM. Now it contains separate terms for visible-hidden ( $W \in R^{D \times K}$ ), visible-visible ( $V \in R^{D \times D}$ ) and hidden-hidden ( $U \in R^{K \times K}$ ) interactions. We compactly represent them as $Θ ≜ \{W, U, V\}$ .

$E_{G} (x, h; Θ) = - x^{T} W h - \frac{1}{2} x^{T} V x - \frac{1}{2} h^{T} U h$

The motivation for such energy function is very similar to original BM. However, our probabilistic form of the model is no longer Eq.4, but the marginalized joint distribution.

$p (x; Θ) = \sum_{h \in Dom (H)} p (x, h; Θ) = \sum_{h \in Dom (H)} \frac{\exp (- E_{G} (x, h))}{\sum_{x^{'}, h^{'} \in Dom (X, H)} \exp (- E_{G} (x^{'}, h^{'}))}$

It might look a bit scary, but it’s just marginalized over the hidden state-space. Very surprisingly though, the conditionals have pretty similar forms to the original BM:

$\begin{array}{r} p (h_{k} = 1 | x, h_{- k}) = σ (W_{[:, k]}^{T} \cdot x + U_{- k}^{T} \cdot h_{- k}) \\ p (x_{j} = 1 | h, x_{- j}) = σ (W_{[j, :]} \cdot h + V_{- j}^{T} \cdot x_{- j}) \end{array}$

Here $W_{[:, k]}$ and $W_{[j, :]}$ denote the $k$ -th column and $j$ -th row of $W$ , while $U_{- k}$ and $V_{- j}$ play the same role as $W_{- j}$ did before. Hopefully the notations are clear. If they are not, try comparing with the ones we used before. I recommend that the reader try proving it as an exercise. The diagram in Fig.6(b) hopefully adds a bit more clarity. It shows a computation graph for these conditionals, similar to the one we saw before in Fig.5.

Coming to the gradients, they also take similar forms to the original BM .. the only difference is that now we have more parameters

$\begin{aligned} W & := W - λ \cdot (E_{x, h \sim p_{D}} [- x h^{T}] - E_{x, h \sim Gibbs (p_{Θ})} [- x h^{T}]) \\ V & := V - λ \cdot (E_{x \sim p_{D}} [- x x^{T}] - E_{x \sim Gibbs (p_{Θ})} [- x x^{T}]) \\ U & := U - λ \cdot (E_{h \sim p_{D}} [- h h^{T}] - E_{h \sim Gibbs (p_{Θ})} [- h h^{T}]) \end{aligned}$

If you are paying attention, you might notice something strange .. how do we compute the terms $E_{h \sim p_{D}}$ (in the positive phase) ? We don’t have hidden vectors in our dataset, right ? Actually, we do have visible vectors $x^{(i)}$ in our dataset, and we can get an approximate complete-data (visible plus hidden) density as

$p_{D} (x^{(i)}, h) = p_{D} (x^{(i)}) \cdot p_{Θ} (h | x^{(i)})$

Basically, we sample a visible vector from our dataset and use the conditional to sample a hidden vector. We fix the visible vector and then sample the hidden vector one component at a time (using Gibbs sampling).

For jointly sampling a visible and hidden vector from the model (for the negative phase), we also use Gibbs sampling just as before. We sample all the visible and hidden RVs component by component, starting the iteration from any random values. There is a clever hack though. We can start the Gibbs iteration by fixing the visible vector to a real data point from our dataset (and not anything random). It turns out that this is extremely useful and efficient for getting samples quickly from the model distribution. This algorithm is famously known as “Contrastive Divergence” and has long been used in practical implementations.

“Restricted” Boltzmann Machine (RBM)

Here comes the all important RBM, which is probably one of the most famous energy based models of all time. But, guess what, I am not going to describe it bit by bit. We have already covered enough that we can quickly build on top of them.

RBM is basically the same as the Boltzmann Machine with hidden units, but with one big difference - it doesn’t have visible-visible AND hidden-hidden interactions, i.e.

$U = 0, V = 0$

If you do just that, Boooom ! You get RBMs (see its graphical diagram in Fig.7(a)). It makes the formulation much simpler. I am leaving it entirely for the reader to do majority of the math. Just get rid of $U$ and $V$ from all our formulations in last section and you are done. Fig.7(b) shows the computational view of RBM while computing conditionals.

Fig.7: (a) Graphical diagram of RBM. (b) Computational view; the arrows just show computational dependencies.

Let me point out one nice consequence of this model: given the hidden nodes, the conditional for each visible node is independent of the other visible nodes, and the same is true for the hidden nodes given the visible ones.

$\begin{array}{r} p (h_{k} = 1 | x) = σ (W_{[:, k]}^{T} \cdot x) \\ p (x_{j} = 1 | h) = σ (W_{[j, :]} \cdot h) \end{array}$

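That means they can be computed in parallel (below, $σ$ is applied elementwise and yields the vector of per-unit probabilities of being 1)

$\begin{array}{r} p (h | x) = \prod_{k = 1}^{K} p (h_{k} | x), \quad \text{with} \quad p (h = 1 | x) = σ (W^{T} \cdot x) \\ p (x | h) = \prod_{j = 1}^{D} p (x_{j} | h), \quad \text{with} \quad p (x = 1 | h) = σ (W \cdot h) \end{array}$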

Moreover, the Gibbs sampling steps become super easy to compute. We just have to iterate the following steps:

Sample a hidden vector $h^{(t)} \sim p (h | x^{(t - 1)})$
Sample a visible vector $x^{(t)} \sim p (x | h^{(t)})$

This makes RBM an attractive choice for practical implementation.
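The two alternating steps above, combined with the idea of starting the chain at real data, can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the bias-free energy $E (x, h) = - x^{T} W h$ used in this article, a single Gibbs step per update (a variant commonly referred to as CD-1), and hypothetical helper names and sizes of my own choosing.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_x(x, W, rng):
    """All hidden units at once: p(h_k = 1 | x) = sigmoid(W[:, k] . x)."""
    p = sigmoid(x @ W)                       # shape (N, K) for x of shape (N, D)
    return (rng.random(p.shape) < p).astype(float), p

def sample_x_given_h(h, W, rng):
    """All visible units at once: p(x_j = 1 | h) = sigmoid(W[j, :] . h)."""
    p = sigmoid(h @ W.T)                     # shape (N, D) for h of shape (N, K)
    return (rng.random(p.shape) < p).astype(float), p

def cd1_update(W, x_data, lr, rng):
    """One contrastive-divergence step with a single Gibbs sweep (assumption)."""
    # Positive phase: visible vectors from the data, hidden ones from p(h | x).
    h_data, _ = sample_h_given_x(x_data, W, rng)
    # Negative phase: one Gibbs step, with the chain started at the data.
    x_model, _ = sample_x_given_h(h_data, W, rng)
    h_model, _ = sample_h_given_x(x_model, W, rng)
    n = len(x_data)
    pos = x_data.T @ h_data / n              # estimate of E_data[x h^T]
    neg = x_model.T @ h_model / n            # estimate of E_model[x h^T]
    return W + lr * (pos - neg)

# Hypothetical sizes and random binary "data", purely for illustration.
rng = np.random.default_rng(0)
D, K, N = 6, 3, 32
x_data = rng.integers(0, 2, size=(N, D)).astype(float)
W = 0.01 * rng.normal(size=(D, K))

for _ in range(100):
    W = cd1_update(W, x_data, lr=0.1, rng=rng)
print(W.shape)
```

Because every unit in a layer is sampled with one matrix product, a whole Gibbs sweep costs two matrix multiplications, which is what makes RBMs so convenient in practice. Running more Gibbs steps per update would give samples closer to the model distribution at a higher cost.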

Whoahh ! That was a heck of an article. I encourage everyone to try working out the RBM math more rigorously by themselves and also implement it in a familiar framework. Alright, that’s all for this article.

Citation

BibTeX citation:

@online{das2020,
  author = {Das, Ayan},
  title = {Energy {Based} {Models} {(EBMs):} {A} Comprehensive
    Introduction},
  date = {2020-08-13},
  url = {https://ayandas.me/blogs/2020-08-13-energy-based-models-one.html},
  langid = {en}
}

For attribution, please cite this work as:

Das, Ayan. 2020. “Energy Based Models (EBMs): A Comprehensive Introduction.” August 13, 2020. https://ayandas.me/blogs/2020-08-13-energy-based-models-one.html.