While a plethora of high-quality text-to-image diffusion models (Ramesh et al. 2022; Rombach et al. 2022; Podell et al. 2024) have emerged in the last few years, the credit mostly goes to the tremendous engineering effort put into them. The fundamental theory behind them, however, remained largely untouched until recently. The traditional “Markov chain” (Ho, Jain, and Abbeel 2020) or “SDE” (Song et al. 2021) perspective is now being replaced by an arguably simpler and more flexible alternative. Under the new approach, we drop the usual notion of a stochastic mapping as the model and adopt a deterministic mapping instead. It turns out that such models, although they already existed, never took off due to the absence of a scalable learning algorithm. In this article, we provide a relatively easy and visually guided tour of “Flow Matching”, followed by ideas like “path straightening” and “reflow”. It is worth mentioning that this very idea powered the most recent version of Stable Diffusion (i.e. SD3 by Esser et al. (2024)).

Some prerequisites

Note that this article assumes you are familiar with the core ideas of Diffusion Models. So if you haven’t already, please visit my previous blogs (Blog #1, Blog #2 & Blog #3) on Diffusion/Score models.

Every scalable generative model follows a “sampling first” philosophy, i.e. it is defined in terms of a (learnable) mapping (say $F_{θ}$ ), which maps (known) pure noise $ϵ$ to desired data $x$

$\begin{matrix} (1) & x = F_{θ} (ϵ), \quad \text{where } ϵ \sim N (0, I) \end{matrix}$

The learning objective is crafted in a way that $F_{θ}$ induces a model distribution on $x$ , i.e. $p_{θ} (x)$ , that matches a given $q_{data} (x)$ as closely as possible. This has to be done, of course, using only samples from $q_{data} (x)$ . Different generative model families learn (the parameters of) the mapping in different ways.

Brief overview of Diffusion Models

The generative process

Although there are multiple formalisms to describe the underlying theory of Diffusion Models, the one that has gained traction recently is the “Differential Equations” view, mostly due to Song et al. (2021). Under this formalism, Diffusion’s generative mapping (in Equation 1) can be realized by integrating a differential equation in time $t$ . Song et al. (2021) also showed that there are two equivalent generative processes, one stochastic and another deterministic, that share the same marginals $p_{t} (x_{t})$ and therefore lead to the same data distribution.

Stochastic (reverse-time SDE):

$\begin{matrix} (2) & d x_{t} = [f (t) x_{t} - g^{2} (t) \underbrace{\nabla_{x_{t}} \log p_{t} (x_{t})}_{\approx s_{θ} (x_{t}, t)}] d t + g (t) d \bar{w} \end{matrix}$

Deterministic (probability flow ODE):

$\begin{matrix} (3) & d x_{t} = [f (t) x_{t} - 0.5 \cdot g^{2} (t) \underbrace{\nabla_{x_{t}} \log p_{t} (x_{t})}_{\approx s_{θ} (x_{t}, t)}] d t \end{matrix}$

The hyper-parameters $f (t)$ and $g (t)$ are scalar functions of time.

where $p_{t} (x_{t})$ is a noisy version of the data density¹ induced by a (known & fixed) forward process over time $t$

¹ Known as the forward process marginal

$d x_{t} = f (t) x_{t} d t + g (t) d w$

$\begin{matrix} (4) & x_{t} = α (t) x_{0} + σ (t) ϵ \end{matrix}$

$p_{t} (x_{t} | x_{0}) = N (x_{t}; α (t) x_{0}, σ (t)^{2} I)$
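The sampling form in Equation 4 is the one we use in practice, so here is a minimal NumPy sketch of it. The cosine/sine schedule and the helper name `forward_sample` are assumptions made purely for illustration; any differentiable pair $α (t), σ (t)$ would do.

```python
import numpy as np

def forward_sample(x0, t, alpha, sigma, rng):
    """Draw x_t ~ N(alpha(t) * x0, sigma(t)^2 I) using Equation 4."""
    eps = rng.standard_normal(x0.shape)
    return alpha(t) * x0 + sigma(t) * eps, eps

# Assumed schedule, for illustration only.
alpha = lambda t: np.cos(0.5 * np.pi * t)
sigma = lambda t: np.sin(0.5 * np.pi * t)

rng = np.random.default_rng(0)
x0 = np.full((100_000, 2), 3.0)      # every "data point" equals 3
xt, eps = forward_sample(x0, 0.5, alpha, sigma, rng)

# Empirical moments of x_t should match the conditional Gaussian.
emp_mean = xt.mean(axis=0)
emp_std = xt.std(axis=0)
```

With a constant $x_{0}$ the empirical mean and standard deviation of `xt` should land close to $α (t) x_{0}$ and $σ (t)$ respectively.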

Conversion between $f, g$ and $α, σ$

The relationship between $\{f (t), g (t)\}$ and $\{α (t), σ (t)\}$ is as follows.

$\begin{matrix} (5) & f (t) = \frac{d}{d t} \log α (t), \quad \text{and} \quad g (t)^{2} = - 2 σ (t)^{2} \frac{d}{d t} \log \frac{α (t)}{σ (t)} \end{matrix}$
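To make Equation 5 concrete, the sketch below evaluates it numerically with central finite differences and compares against the closed form for the linear schedule $α (t) = 1 - t, σ (t) = t$ . The schedule is chosen here only as an example, and `f_and_g2` is a hypothetical helper name.

```python
import numpy as np

def f_and_g2(alpha, sigma, t, h=1e-5):
    """Evaluate Equation 5 with central finite differences."""
    dlog_alpha = (np.log(alpha(t + h)) - np.log(alpha(t - h))) / (2 * h)
    log_ratio = lambda s: np.log(alpha(s) / sigma(s))
    dlog_ratio = (log_ratio(t + h) - log_ratio(t - h)) / (2 * h)
    return dlog_alpha, -2.0 * sigma(t) ** 2 * dlog_ratio

# Example schedule (an assumption for this sketch).
alpha = lambda t: 1.0 - t
sigma = lambda t: t

t = 0.3
f_num, g2_num = f_and_g2(alpha, sigma, t)

# Closed forms obtained by differentiating by hand.
f_exact = -1.0 / (1.0 - t)
g2_exact = 2.0 * t / (1.0 - t)
```

For this schedule $g (t)^{2}$ stays positive on $(0, 1)$ , as a diffusion coefficient should.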

Despite being able to sample $x_{t}$ using Equation 4, the quantity $\nabla_{x_{t}} \log p_{t} (x_{t})$ ² in Equation 2 or Equation 3 is not analytically computable and is therefore learned using a neural function.

² The ‘score’ of the marginal

The learning objective

A parametric neural function $s_{θ} (x_{t}, t)$ is trained by regressing against a surrogate target instead of the ‘true’ target

$\begin{matrix} (6) & L (θ) = E_{t, x_{0} \sim q_{data} (x), x_{t} \sim p_{t} (x_{t} | x_{0})} [\| s_{θ} (x_{t}, t) - \nabla_{x_{t}} \log p_{t} (x_{t} | x_{0}) \|^{2}] \end{matrix}$

This objective is known as “Denoising Score Matching” (Vincent 2011).

$\begin{matrix} (7) & L (θ) = E_{t, x_{0} \sim q_{data} (x), x_{t} \sim p_{t} (x_{t} | x_{0})} [\| s_{θ} (x_{t}, t) - \nabla_{x_{t}} \log p_{t} (x_{t}) \|^{2}] \end{matrix}$

This is the true objective that is not directly computable.

It was proved initially by Vincent (2011) and later re-established by Song et al. (2021) that using $\nabla_{x_{t}} \log p_{t} (x_{t} | x_{0})$ ³ (in Equation 6) as an alternative regression target still leads to an unbiased estimate of the true $\nabla_{x_{t}} \log p_{t} (x_{t})$ . The figure below shows graphically the quantity (red arrows) our parametric score model regresses against at an arbitrary timestep $t$ .

³ Please note that $\nabla_{x_{t}} \log p_{t} (x_{t} | x_{0}) = - \frac{x_{t} - α_{t} x_{0}}{σ_{t}^{2}} = - \frac{ϵ}{σ_{t}}$

The parametric model regresses against the vector that points towards the original data point.
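The regression target of Equation 6 is cheap to compute because the conditional is Gaussian with mean $α (t) x_{0}$ and standard deviation $σ (t)$ . As a sanity check, the sketch below differentiates the conditional log-density numerically and compares the result with $- ϵ / σ (t)$ . The specific values of `alpha_t` and `sigma_t` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_t, sigma_t = 0.8, 0.6           # assumed values of alpha(t), sigma(t)
x0 = rng.standard_normal((4, 3))
eps = rng.standard_normal((4, 3))
xt = alpha_t * x0 + sigma_t * eps     # Equation 4

def log_p_cond(x):
    """log N(x; alpha_t * x0, sigma_t^2 I), up to an additive constant."""
    return -0.5 * np.sum((x - alpha_t * x0) ** 2) / sigma_t ** 2

# Central-difference gradient of the log-density with respect to x_t.
h = 1e-5
grad_num = np.zeros_like(xt)
for idx in np.ndindex(xt.shape):
    e = np.zeros_like(xt)
    e[idx] = h
    grad_num[idx] = (log_p_cond(xt + e) - log_p_cond(xt - e)) / (2 * h)

# Denoising score matching target in terms of the sampled noise.
target = -eps / sigma_t
```

So during training we never need the intractable marginal score; knowing the noise $ϵ$ that produced $x_{t}$ is enough.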

Matching flows, not scores

The idea of re-interpreting reverse diffusion as “flow”⁴ stems from a holistic observation. Note that given Equation 3, an ODE dynamics is guaranteed to exist that induces a deterministic mapping from noise to any given data distribution. In the diffusion model framework, we were only learning a part of the dynamics, not the dynamics itself.

⁴ This term comes from Continuous Normalizing Flows (CNFs)

$\begin{matrix} (8) & d x_{t} = [\underbrace{f (t) x_{t} - 0.5 \cdot g^{2} (t) \overbrace{s_{θ} (x_{t}, t)}^{\text{Instead of this ..}}}_{\text{.. why not learn this ?}}] d t \end{matrix}$

A deterministic model

Following the observation above, we can assume a generative model realized by a deterministic ODE simulation, whose parametric dynamics subsumes $f (t), g (t)$ and the parametric scores $s_{θ} (x_{t}, t)$

$\begin{matrix} (9) & d x_{t} = v_{θ} (x_{t}, t) d t, \quad \text{where } x_{1} := ϵ \sim N (0, I) \end{matrix}$

Turns out, models like Equation 9 have already been investigated in generative modelling literature under the name of Continuous Normalizing Flows (Chen et al. 2018). However, these models never made it to the scalable realm due to their “simulation based”⁵ learning objective. The dynamics is often called “velocity”⁶ or “velocity field” and denoted with $v$ .

⁵ One must integrate or simulate the ODE during training.

⁶ It is a time derivative of position.

Upon inspecting the pair of Equation 8 and Equation 6, it is not particularly hard to sense the existence of an equivalent ‘velocity matching’ loss⁷ for the new flow model in Equation 9.

⁷ .. or as it is now called, the ‘Flow Matching’ loss

An equivalent objective

To see the exact form of the flow matching loss, simply try recreating the ODE dynamics in Equation 8 within Equation 6 by adding some extra terms that cancel out

$\begin{aligned} L (θ) = E_{t, x_{0} \sim q_{data} (x), x_{t} \sim p_{t} (x_{t} | x_{0})} \Big[ \Big\| \frac{2}{g^{2} (t)} \Big\{ & \big( f (t) x_{t} - \frac{1}{2} g^{2} (t) s_{θ} (x_{t}, t) \big) \\ & - \big( f (t) x_{t} - \frac{1}{2} g^{2} (t) \nabla_{x_{t}} \log p_{t} (x_{t} | x_{0}) \big) \Big\} \Big\|^{2} \Big] \end{aligned}$

The expression within the first set of parentheses ( .. ) is what we now call the parametric velocity/flow $v_{θ} (x_{t}, t)$ . The expression in the second set of parentheses is a proxy regression target (let’s call it $v (x_{t}, t)$ ), playing the same role as $\nabla_{x_{t}} \log p_{t} (x_{t} | x_{0})$ in score matching (see Equation 6). With the help of the $\{f, g\} ⇄ \{α, σ\}$ conversion (see Equation 5), it’s relatively easy to show that $v (x_{t}, t)$ can be written as the time-derivative of the forward sampling process

Proof

$\begin{aligned} v (x_{t}, t) & ≜ f (t) x_{t} - \frac{1}{2} g^{2} (t) \nabla_{x_{t}} \log p_{t} (x_{t} | x_{0}) \\ & = f (t) (α (t) x_{0} + σ (t) ϵ) - [- σ (t) \frac{g (t)^{2}}{2 σ (t)} \frac{ϵ}{σ (t)}] \\ & = \frac{\dot{α} (t)}{α (t)} (α (t) x_{0} + σ (t) ϵ) - [σ (t) (\frac{\dot{α} (t)}{α (t)} - \frac{\dot{σ} (t)}{σ (t)}) ϵ] \\ & = \dot{α} (t) x_{0} + \frac{\dot{α} (t)}{α (t)} σ (t) ϵ - σ (t) \frac{\dot{α} (t)}{α (t)} ϵ + \dot{σ} (t) ϵ \end{aligned}$

$v (x_{t}, t) = \dot{α} (t) x_{0} + \dot{σ} (t) ϵ = {\dot{x}}_{t}$
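If you would rather check this identity numerically than trust the algebra, here is a small sketch. It assumes a cosine/sine schedule (any differentiable schedule should work) and compares $f (t) x_{t} - \frac{1}{2} g^{2} (t) \nabla_{x_{t}} \log p_{t} (x_{t} | x_{0})$ against $\dot{α} (t) x_{0} + \dot{σ} (t) ϵ$ at one arbitrary time.

```python
import numpy as np

# Assumed schedule and its analytic time derivatives.
alpha  = lambda t: np.cos(0.5 * np.pi * t)
sigma  = lambda t: np.sin(0.5 * np.pi * t)
dalpha = lambda t: -0.5 * np.pi * np.sin(0.5 * np.pi * t)
dsigma = lambda t:  0.5 * np.pi * np.cos(0.5 * np.pi * t)

rng = np.random.default_rng(0)
x0, eps, t = rng.standard_normal(5), rng.standard_normal(5), 0.4

# f and g^2 from Equation 5.
f = dalpha(t) / alpha(t)
g2 = -2.0 * sigma(t) ** 2 * (dalpha(t) / alpha(t) - dsigma(t) / sigma(t))

xt = alpha(t) * x0 + sigma(t) * eps    # Equation 4
score = -eps / sigma(t)                # conditional score

lhs = f * xt - 0.5 * g2 * score        # proxy target v(x_t, t)
rhs = dalpha(t) * x0 + dsigma(t) * eps # time derivative of x_t
```

The two sides agree to floating-point precision, which is exactly the statement that the proxy target is $\dot{x}_{t}$ .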

To summarize, the following is the general form of flow matching loss

$\begin{matrix} (10) & L_{FM} (θ) = E_{t, x_{0} \sim q_{data} (x), x_{t} \sim p_{t} (x_{t} | x_{0})} [\| v_{θ} (x_{t}, t) - (\underbrace{\dot{α} (t) x_{0} + \dot{σ} (t) ϵ}_{\dot{x}_{t}}) \|^{2}] \end{matrix}$

In practice, as proposed by many (Lipman et al. 2022; Liu, Gong, and Liu 2022), we discard the weighting term (the factor $2 / g^{2} (t)$ inside the norm above), just like the Diffusion Model’s simple loss popularized by Ho, Jain, and Abbeel (2020). We may think of ${\dot{x}}_{t}$ as a stochastic velocity induced by the forward process. The model, when trained by minimizing Equation 10, tries to mimic the stochastic velocity, but without having access to $x_{0}$ .
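Here is a minimal Monte Carlo sketch of Equation 10. To keep it short it assumes the linear schedule $α (t) = 1 - t, σ (t) = t$ and a toy dataset in which every data point equals the same vector `c`, so that the ideal velocity is known in closed form as $(x_{t} - c) / t$ . The function name `flow_matching_loss` and the cut-off `t_min` are assumptions of this sketch, not part of any library.

```python
import numpy as np

def flow_matching_loss(v_fn, x0, rng, t_min=1e-3):
    """Monte Carlo estimate of Equation 10 for alpha(t) = 1 - t, sigma(t) = t."""
    n = x0.shape[0]
    t = rng.uniform(t_min, 1.0, size=(n, 1))   # avoid t = 0 for the toy v below
    eps = rng.standard_normal(x0.shape)
    xt = (1.0 - t) * x0 + t * eps              # forward sample (Equation 4)
    target = -x0 + eps                         # stochastic velocity d/dt x_t
    return np.mean(np.sum((v_fn(xt, t) - target) ** 2, axis=1))

# Toy data: a point mass at c, so the optimal velocity is (x - c) / t.
c = np.array([2.0, -1.0])
x0 = np.tile(c, (4096, 1))
rng = np.random.default_rng(0)

loss_opt = flow_matching_loss(lambda x, t: (x - c) / t, x0, rng)
loss_zero = flow_matching_loss(lambda x, t: np.zeros_like(x), x0, rng)
```

In a real setup `v_fn` would be a neural network and the estimate would be minimized with stochastic gradient descent; here the loss of the ideal velocity is numerically zero while the all-zeros model pays a large penalty.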

Be careful with time direction

While the learning objective regresses against ${\dot{x}}_{t}$ , we need $- {\dot{x}}_{t}$ for sampling. The extra negative sign is induced automatically while simulating Equation 9 in reverse time ( $d t$ being negative). It is therefore equivalent to think of the regression target as $- {\dot{x}}_{t}$ .
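As a concrete instance of this sign convention, consider a single explicit Euler step of Equation 9 taken backwards in time with a step size $Δ t > 0$ (Euler is assumed here only for simplicity; any ODE solver could be used):

$\begin{aligned} x_{t - Δ t} & = x_{t} + v_{θ} (x_{t}, t) \cdot (- Δ t) \\ & = x_{t} - Δ t \cdot v_{θ} (x_{t}, t) \end{aligned}$

So a model trained to match ${\dot{x}}_{t}$ is applied with a minus sign when stepping from $t = 1$ towards $t = 0$ .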

The parametric model regresses against the instantaneous *stochastic velocity* at any point $x_{t}$ on the stochastic path.

The minima & its interpretation

The loss in Equation 10 can be shown to be equivalent⁸ to

⁸ Their gradients are equal, but the losses are not.

$\begin{matrix} (11) & L_{FM} (θ) = E_{t, x_{t} \sim p_{t} (x_{t})} [\| v_{θ} (x_{t}, t) - E_{x_{0} \sim p (x_{0} | x_{t})} [{\dot{x}}_{t}] \|^{2}] \end{matrix}$

which implies that the loss reaches its minimum⁹ when the model perfectly learns

⁹ This is a typical MMSE estimator.

$v_{*} (x_{t}, t) ≜ E_{x_{0} \sim p (x_{0} | x_{t})} [{\dot{x}}_{t}]$

Hence, $v_{θ}$ can be regarded as the variational approximation of the posterior velocity $v_{*}$ . Note that $v_{*}$ is non-causal, i.e. it has access to the true data distribution $q_{data} (x_{0})$ . On the other hand, the model $v_{θ}$ must be causal. Hence, the learning process “causalizes” (Liu, Gong, and Liu 2022) the stochastic path.

The forward stochastic paths $x_{t}$ also cross each other, whereas the model is a function¹⁰ and must assign a single velocity to each point. The expectation $E_{x_{0} \sim p (x_{0} | x_{t})} [\cdot]$ averages the stochastic velocity over all real data points $x_{0}$ that could have produced $x_{t}$ , leading to smooth velocity fields learned by the model.

¹⁰ It cannot have multiple values at one given point.
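To see the averaging at work, here is a tiny one-dimensional sketch. It assumes a dataset with just two points, $-1$ and $+1$ , and the linear schedule $x_{t} = (1 - t) x_{0} + t ϵ$ ; the helper `v_star` is a hypothetical name that computes the posterior-weighted velocity exactly.

```python
import numpy as np

def v_star(x, t, data):
    """E[x_dot | x_t = x] for x_t = (1 - t) x0 + t eps, x0 uniform over `data`."""
    # Posterior weights p(x0 | x_t) are proportional to N(x; (1 - t) x0, t^2).
    logw = -0.5 * ((x - (1.0 - t) * data) / t) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Stochastic velocity of the path through x that ends at x0: eps - x0 = (x - x0) / t.
    return np.sum(w * (x - data) / t)

data = np.array([-1.0, 1.0])
t = 0.5

per_path = (0.0 - data) / t      # velocities of the two paths crossing at x_t = 0
v_mid = v_star(0.0, t, data)     # what the model is asked to learn there
v_right = v_star(0.8, t, data)   # a point much more likely to come from x0 = +1
```

At the crossing point $x_{t} = 0$ the two stochastic paths carry opposite velocities, and the posterior average cancels them out. Away from the crossing, the average is dominated by the more plausible data point.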

The optimally learned generative process (Equation 9) can therefore be expressed as

$d x_{t} = v_{*} (x_{t}, t) d t, \quad \text{where } x_{1} := ϵ \sim N (0, I)$
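The sketch below simulates this generative process for a toy case where $v_{*}$ is available in closed form. It assumes one-dimensional Gaussian data $x_{0} \sim N (m, s^{2})$ , the linear schedule $x_{t} = (1 - t) x_{0} + t x_{1}$ , and a plain Euler solver; the expression inside `v_star` follows from standard Gaussian conditioning and is specific to this toy setup.

```python
import numpy as np

m, s = 1.0, 0.5   # assumed toy data distribution N(m, s^2)

def v_star(x, t):
    """Closed-form E[x1 - x0 | x_t = x] for x0 ~ N(m, s^2), x1 ~ N(0, 1)."""
    var_t = (1.0 - t) ** 2 * s ** 2 + t ** 2   # Var[x_t]
    cov = t - (1.0 - t) * s ** 2               # Cov[x1 - x0, x_t]
    return -m + cov / var_t * (x - (1.0 - t) * m)

def sample(x1, n_steps=1000):
    """Euler integration of dx = v*(x, t) dt from t = 1 down to t = 0."""
    x, dt = x1.copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        x = x - dt * v_star(x, t)   # negative step: time runs backwards
    return x

rng = np.random.default_rng(0)
x1 = rng.standard_normal(50_000)    # pure noise at t = 1
x0_gen = sample(x1)                 # generated "data" at t = 0
```

The generated samples should have mean close to `m` and standard deviation close to `s`, even though no randomness is injected after drawing `x1`.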

Straightening & ReFlow

$t$-independent stochastic velocity

With the general theory understood, it is now easy to conceive the idea of straight flows. It simply refers to the following special case

$α (t) = 1 - t, σ (t) = t$

which implies the forward process and its time derivative to be

$\begin{aligned} x_{t} & = (1 - t) \cdot x_{0} + t \cdot x_{1} \\ ⟹ {\dot{x}}_{t} & = - x_{0} + x_{1} \end{aligned}$

What is important is that the stochastic velocity is independent of time¹¹ and the $x_{t}$ trajectory itself is a straight line.

¹¹ Not “constant” – it still depends on data $x_{0}$ and noise sample $ϵ$ , which are random variables.

This, however, does not mean that the learned model will produce straight paths; it only means we’re supervising the model to follow a path as straight as possible. An analogous illustration was provided in Liu, Gong, and Liu (2022), which is a bit more descriptive.

The stochastic straight paths (left) used for training had crossovers. However, the learned model (right) resolves them. (Taken from Liu, Gong, and Liu (2022))

This learning problem (Equation 11) effectively turns an independent stochastic ‘coupling’ $q_{data} (x_{0}) N (x_{1}; 0, I)$ into a deterministic coupling $p (z_{0}, z_{1})$ with some dependence. This process $(x_{0}, x_{1}) \to (z_{0}, z_{1})$ has been termed “Rectification” by Liu, Gong, and Liu (2022). Samples from this deterministic coupling can be drawn by drawing a noise sample $z_{1} := ϵ \sim N (0, I)$ and then simulating the flow in Equation 9 with the learned model

$z_{0} = z_{1} + \int_{t = 1}^{t = 0} v_{θ} (z_{t}, t) d t .$

It can be proved that the samples $(z_{0}, z_{1})$ are, on average, no farther from each other than those of $(x_{0}, x_{1})$ .

Proof

$\begin{aligned} E_{p (z_{0}, z_{1})} [\| z_{0} - z_{1} \|] & = E_{p (z_{0}, z_{1})} [\| \int_{1}^{0} v_{*} (z_{t}, t) d t \|] \\ & \leq E_{p (z_{0}, z_{1})} [\int_{0}^{1} \| v_{*} (z_{t}, t) \| d t] \\ & = E_{p (x_{0}, x_{1})} [\int_{0}^{1} \| v_{*} (x_{t}, t) \| d t] \\ & = E_{p (x_{0}, x_{1})} [\int_{0}^{1} \| E [{\dot{x}}_{t} \mid x_{t}] \| d t] \\ & \leq E_{p (x_{0}, x_{1})} [\int_{0}^{1} E [\| {\dot{x}}_{t} \| \mid x_{t}] d t] \\ & = \int_{0}^{1} E_{p (x_{0}, x_{1})} [E [\| x_{0} - x_{1} \| \mid x_{t}]] d t \\ & = E_{p (x_{0}, x_{1})} [\| x_{0} - x_{1} \|] \end{aligned}$

Flipping the integration limits only changes the sign of the integral, which the norm ignores, so the bounds are written over $[0, 1]$ .

The proof uses the following

The fact that $\| \cdot \|$ is a convex cost function.
Convex functions can be exchanged with $E$ according to Jensen’s inequality.
$z_{t}$ and $x_{t}$ have the same marginal distribution, so one can be exchanged for the other inside the expectation.
Assumes our model learns the perfect $v_{*}$ .
Law of iterated expectation.

Please see section 3.2 of Liu, Gong, and Liu (2022) for more details on the proof.
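The inequality can also be observed empirically on a toy problem. The sketch below assumes one-dimensional Gaussian data $N (m, s^{2})$ and an ideally learned $v_{*}$ ; for that specific case the flow map works out to the affine map $z_{0} = m + s \cdot z_{1}$ , which is used directly here instead of an ODE solver.

```python
import numpy as np

m, s = 1.0, 0.5                     # assumed toy data distribution N(m, s^2)
rng = np.random.default_rng(0)
n = 200_000

# Independent coupling (x0, x1) used for training.
x0 = m + s * rng.standard_normal(n)
x1 = rng.standard_normal(n)

# Rectified coupling (z0, z1): noise pushed through the ideal flow map
# of this Gaussian toy case (an assumption of the sketch).
z1 = rng.standard_normal(n)
z0 = m + s * z1

cost_independent = np.mean(np.abs(x0 - x1))
cost_rectified = np.mean(np.abs(z0 - z1))
```

Both couplings share the same marginals, yet the rectified pairs are closer to each other on average, as the proof predicts.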

Reflow

The process of rectification, however, does not guarantee (as you can see in the above figure) that the new coupling has straight paths between each pair. Liu, Gong, and Liu (2022) suggested the “Reflow” procedure, which is nothing but learning a new model using samples from $p (z_{0}, z_{1})$ .

$L_{FM} (ϕ) = E_{t, (z_{0}, ϵ) \sim p (z_{0}, ϵ)} [\| v_{ϕ} (z_{t}, t) - (- z_{0} + ϵ) \|^{2}]$

The ‘reflow’-ed coupling $(z_{2}, ϵ)$ is guaranteed to have paths straighter than $(z_{1}, ϵ)$ (here the index counts rectification rounds, not time). One can, in fact, repeat this procedure as many times as one wants. Another figure from Liu, Gong, and Liu (2022) perfectly demonstrates successive reflows

Successive reflows produce more and more straight paths.
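Why straight paths matter becomes clear in a toy case. The sketch below assumes the rectified coupling is the deterministic affine map $z_{0} = m + s \cdot ϵ$ (what one obtains for one-dimensional Gaussian data under an ideal model) and writes down the ideal minimizer of the reflow loss in closed form as a stand-in for a trained $v_{ϕ}$ . With that velocity, a single Euler step from $t = 1$ to $t = 0$ already lands on the exact sample.

```python
import numpy as np

m, s = 1.0, 0.5   # assumed toy rectified coupling: z0 = m + s * eps

def v_reflow(z, t):
    """Ideal minimizer of the reflow loss for the coupling z0 = m + s * eps."""
    # z_t = (1 - t) z0 + t eps is invertible in eps because z0 is a function of eps.
    eps = (z - (1.0 - t) * m) / ((1.0 - t) * s + t)
    return -(m + s * eps) + eps          # regression target: -z0 + eps

rng = np.random.default_rng(0)
eps = rng.standard_normal(8)
z0_exact = m + s * eps

# One Euler step of size 1, going from t = 1 straight to t = 0.
z0_one_step = eps - 1.0 * v_reflow(eps, 1.0)

# The velocity along an interpolated path is the same at any intermediate time.
z_mid = 0.7 * z0_exact + 0.3 * eps
v_mid = v_reflow(z_mid, 0.3)
```

In realistic, high-dimensional settings the paths are only approximately straight, so a few steps are typically still needed; the sketch only illustrates the limiting case.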

In this article, we talked about Flow Matching, Rectification and Reflow, some of the emerging new ideas in Diffusion Model literature. Specifically, we looked into the theoretical definitions and justifications behind the ideas. Despite its appeal, some researchers remain skeptical and regard it as merely a special case of good old Diffusion Models. Whatever the case may be, it did deliver one of the best text-to-image models so far (Esser et al. 2024), perhaps with a bit of clever engineering, which is a topic for another day.

References

Chen, Tian Qi, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. 2018. “Neural Ordinary Differential Equations.” In NeurIPS.

Esser, Patrick, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, et al. 2024. “Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.” arXiv Preprint arXiv:2403.03206.

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” In NeurIPS.

Lipman, Yaron, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. 2022. “Flow Matching for Generative Modeling.” In The Eleventh International Conference on Learning Representations.

Liu, Xingchao, Chengyue Gong, and Qiang Liu. 2022. “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow.” arXiv Preprint arXiv:2209.03003.

Podell, Dustin, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2024. “SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.” In The Twelfth International Conference on Learning Representations.

Ramesh, Aditya, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. “Hierarchical Text-Conditional Image Generation with Clip Latents.” arXiv Preprint arXiv:2204.06125.

Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. “High-Resolution Image Synthesis with Latent Diffusion Models.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–95.

Song, Yang, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. “Score-Based Generative Modeling Through Stochastic Differential Equations.” In ICLR.

Vincent, Pascal. 2011. “A Connection Between Score Matching and Denoising Autoencoders.” Neural Computation.

Citation

BibTeX citation:

@online{das2024,
  author = {Das, Ayan},
  title = {Match Flows, Not Scores},
  date = {2024-04-26},
  url = {https://ayandas.me/blogs/2024-04-26-flow-matching-strightning-sd3.html},
  langid = {en}
}

For attribution, please cite this work as:

Das, Ayan. 2024. “Match Flows, Not Scores.” April 26, 2024. https://ayandas.me/blogs/2024-04-26-flow-matching-strightning-sd3.html.