While a plethora of high-quality text-to-image diffusion models (Ramesh et al. 2022; Rombach et al. 2022; Podell et al. 2024) emerged in the last few years, credit mostly goes to the tremendous engineering efforts put into them. The fundamental theory behind them however, remained largly untouched – well, untill recently. The traditional “markov-chain” (Ho, Jain, and Abbeel 2020) or “SDE” (Song et al. 2021) perspective is now being replaced by an arguably simpler and flexible alternative. Under the new approach, we ditch the usual notion of a stochastic mapping as the model and adopt a deterministic mapping instead. Turns out that such model, although existed, never took off due the absense of a scalable learning algorithm. In this article, we provide a relatively easy and visually guided tour of “Flow Matching”, followed by emerging ideas like “path straightening” & “rectification”. It is worth mentioning that this very idea powered the most recent version of Stable Diffusion (i.e. SD3 by Esser et al. (2024)).
Every scalable generative model follows a “sampling first” philosophy, i.e. they are defined in terms of a (learnable) mapping (say \(F_{\theta}\)), which maps (known) pure noise \(z\) to desired data \(x\)
\[ x = F_{\theta}(z), \text{ where } z \sim \mathcal{N}(0, I) \tag{1}\]
The learning objective is crafted in a way that the \(F_{\theta}\) induces a model distribution on \(x\), i.e. \(p_{\theta}(x)\) that matches a given \(q_{data}(x)\) as closely as possible. That is, of course, from the samples of \(q_{data}(x)\) only. Different generative model families learn the (parameters of the) mapping in different ways.
Brief overview of Diffusion Models
The generative process
Although there are mutiple formalisms to describe the underlying theory of Diffusion Models, the one that gained traction recently is the “Differential Equations” view, mostly due to Song et al. (2021). Under this formalism, Diffusion’s generative mapping (in Equation 1) can be realized by integrating a differential equation in time \(t\). Song et al. (2021) also showed that there are two equivalent generative processes, one stochastic and another deterministic, that leads to exact same noise-to-data mapping.
\[ dx_t = \Big[f(t) x_t - g^2(t) \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\approx s_{\theta}(x_t, t)} \Big] dt + g(t) d\bar{w} \tag{2}\] The hyper-parameters \(f(t)\) and \(g(t)\) are scalar functions of time.
\[ dx_t = \Big[f(t) x_t - {\color{blue}0.5}\ \cdot g^2(t) \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\approx s_{\theta}(x_t, t)} \Big] dt \cancel{+ g(t) d\bar{w}} \tag{3}\] The hyper-parameters \(f(t)\) and \(g(t)\) are scalar functions of time.
where \(p_t(x_t)\) is a noisy version of the data density1 induced by a (known & fixed) forward process over time \(t\)
1 Knowns as the forward process marginal
\[ dx_t = f(t)x_t dt + g(t) dw \]
\[ x_t = \alpha(t) x_0 + \sigma(t) \epsilon \tag{4}\]
\[ p_t(x_t | x_0) = \mathcal{N}(x_t; \alpha(t) x_0, \sigma(t)^2 \mathrm{I}) \]
The relationship between \(\{ f(t), g(t) \}\) and \(\{ \alpha(t), \sigma(t) \}\) is as follows.
\[ f(t) = \frac{d}{dt} \log \alpha(t), \text{ and } g(t)^2 = - 2 \sigma(t)^2 \frac{d}{dt} \log \frac{\alpha(t)}{\sigma(t)} \tag{5}\]
Despite being able sample \(x_t\) using Equation 4, the quantity \(\nabla_{x_t} \log p_t(x)\)2 in Equation 2 or Equation 3 is not analytically computable and therefore learned using a neural function.
2 The ‘score’ of the marginal
The learning objective
A parametric neural function \(s_{\theta}(x_t, t)\) is regressed using a different regression target than the ‘true’ target
\[ \mathcal{L}(\theta) = \mathbb{E}_{t, x_0 \sim q_{data}(x), x_t \sim p_t(x_t | x_0)} \Big[ || s_{\theta}(x_t, t) - \nabla_{x_t} \log p_t(x_t | x_0) ||^2 \Big] \tag{6}\] This objective is known as “Denoising Score Matching” (Vincent 2011).
\[ \mathcal{L}(\theta) = \mathbb{E}_{t, x_0 \sim q_{data}(x), x_t \sim p_t(x_t | x_0)} \Big[ || s_{\theta}(x_t, t) - \nabla_{x_t} \log p_t(x_t) ||^2 \Big] \tag{7}\] This is the true objective that is not directly computable.
It was proved initially by Vincent (2011) and later re-established by Song et al. (2021) that using \(\nabla_{x_t} \log p_t(x_t | x_0)\) (in Equation 6) as an alternative target for regression can still lead to an ubiased estimate of the true \(\nabla_{x_t} \log p_t(x_t)\).
Matching flows, not scores
The idea of re-interpreting reverse diffusion as “flow”3 stems from a holistic observation. Note that given Equation 3, an ODE dynamics is guaranteed to exists which induces a deterministic mapping from noise to any given data distribution. In diffusion model’s framework, we were only learning a part of the dynamics – not the dynamics itself.
3 This term comes from Continuous Normalizing Flows (CNFs)
\[ dx_t = \Bigg[ \underbrace{f(t) x_t - 0.5\ \cdot g^2(t) \overbrace{s_{\theta}(x_t, t)}^{\text{Instead of this ..}}}_{\text{.. why not learn this ?}} \Bigg] dt \tag{8}\]
A deterministic model
Following the observation above, we can assume a generative model realized by a deterministic ODE simulation, whose parametric dynamics subsumes \(f(t), g(t)\) and the parametric scores \(s_{\theta}(x_t, t)\)
\[ dx_t = v_{\theta}(x_t, t) dt, \text{ where } x_1 := z \sim \mathcal{N}(0, I) \tag{9}\]
Turns out, models like Equation 9 have already been investigated in generative modelling literature under the name of Continuous Normalizing Flows (Chen et al. 2018). However, these models never made it to the scalable realm due to their “simulation based”4 learning objective. The dynamics is often called “velocity”5 or “velocity field” and denoted with \(v\).
4 One must integrate or simulate the ODE during training.
5 It is a time derivative of position.
6 .. or as it is now called, the ‘Flow Matching’ loss
Upon inspecting the pair of Equation 8 and Equation 6, it is not particularly hard to sense the existance of an equivalent ‘velocity matching’ loss6 for the new flow model in Equation 9.
An equivalent objective
To see that exact form of the flow matching loss, simply try recreating the ODE dynamics in Equation 8 within Equation 6 by appending some extra terms that cancel out
\[ \begin{split} \mathcal{L}(\theta) = \mathbb{E}_{t, x_0 \sim q_{data}(x), x_t \sim p_t(x_t | x_0)} &\Bigg[ \left|\left| \frac{2}{g^2(t)} \left\{ \left( f(t) x_t - \frac{1}{2} g^2(t) s_{\theta}(x_t, t) \right) \right. \right. \right. \Bigg. \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \Bigg. \left. \left. \left. - \left( f(t) x_t - \frac{1}{2} g^2(t) \nabla_{x_t} \log p_t(x_t | x_0) \right) \right\} \right|\right|^2 \Bigg] \end{split} \]
The expression within the first set of parantheses ( .. )
is equivalent to what now call the parametric velocity/flow or \(v_{\theta}(x_t, t)\). The expression in second set of parantheses is a proxy regression target (let’s call it \(v(x_t, t)\)), equivalent to \(\nabla_{x_t} \log p_t(x_t | x_0)\) in score matching (see Equation 6). With the help of \(\{ f, g \} \rightleftarrows \{ \alpha, \sigma \}\) conversion (see Equation 5), it’s relatively easy to show that \(v(x_t, t)\) can be written as the time-derivative of the forward sampling process
\[ \begin{aligned} v(x_t, t) &\triangleq f(t) x_t - \frac{1}{2} g^2(t) \nabla_{x_t} \log p_t(x_t | x_0) \\ &= f(t)(\alpha(t) x_0 + \sigma(t) \epsilon) - \left[- \sigma(t) \frac{g(t)^2}{2 \sigma(t) } \frac{\epsilon}{\sigma(t)} \right] \\ &= \frac{\dot{\alpha}(t)}{\alpha(t)}(\alpha(t) x_0 + \sigma(t) \epsilon) - \left[ \sigma(t) \left( \frac{\dot{\alpha}(t)}{\alpha(t)} - \frac{\dot{\sigma}(t)}{\sigma(t)} \right) \epsilon \right] \\ &= \dot{\alpha}(t) x_0 + \cancel{\frac{\dot{\alpha}(t)}{\alpha(t)} \sigma(t) \epsilon} - \cancel{\sigma(t) \frac{\dot{\alpha}(t)}{\alpha(t)} \epsilon} + \dot{\sigma}(t) \epsilon \end{aligned} \]
\[ v(x_t, t) = \dot{\alpha}(t) x_0 + \dot{\sigma}(t) \epsilon = \dot{x}_t \]
To summarize, the following is the general form of flow matching loss
\[ \mathcal{L}_{FM}(\theta) = \mathbb{E}_{t, x_0 \sim q_{data}(x), x_t \sim p_t(x_t | x_0)} \Bigg[ \Bigg|\Bigg| v_{\theta}(x_t, t) - (\underbrace{\dot{\alpha}(t) x_0 + \dot{\sigma}(t) \epsilon}_{\dot{x}_t}) \Bigg|\Bigg|^2 \Bigg] \tag{10}\]
In practice, as proposed by many (Lipman et al. 2022; Liu, Gong, and Liu 2022), we discard the weightning term just like Diffusion Model’s simple loss popularized by Ho, Jain, and Abbeel (2020). We may think of \(\dot{x}_t\) as a stochastic velocity induced by the forward process. The model, when minimized with Equation 10, tries to mimic the stochastic velocity, but without having access to \(x_0\).
The minima & its interpretation
The loss in Equation 10 can be shown to be equivalent to
\[ \mathcal{L}_{FM}(\theta) = \mathbb{E}_{t, x_t \sim p_t(x_t)} \Bigg[ \Bigg|\Bigg| v_{\theta}(x_t, t) - \mathbb{E}_{x_0 \sim p(x_0|x_t)} \Big[\dot{x}_t\Big] \Bigg|\Bigg|^2 \Bigg] \]
which implies that the loss reaches its minima when the model perfectly learns
\[ v_*(x_t, t) \triangleq \mathbb{E}_{x_0 \sim p(x_0|x_t)} \left[\dot{x}_t\right] \]
References
Citation
@online{das2024,
author = {Das, Ayan},
title = {Flow {Matching,} {Straightening} \& {Rectification}},
date = {2024-04-26},
url = {https://ayandas.me//blogs/2024-04-26-flow-matching-strightning-sd3.html},
langid = {en}
}