
FMMI: Estimating Mutual Information with Flow Matching

Introduction

The paper “FMMI: Flow Matching Mutual Information Estimation” (Butakov et al. 2025) proposes a new way to estimate MI by learning a continuous normalizing flow that transports the product of marginals $P_X \otimes P_Y$ into the joint distribution $P_{X,Y}$, and then reading MI off from the divergence of the learned velocity field.

In this post, we’ll walk through:

  1. The core idea: MI as a KL divergence and as a difference of entropies
  2. Continuous flows and the identity $\frac{d}{dt} \log p_t(x_t) = -\operatorname{div} v(x_t,t)$
  3. A key lemma: entropy difference as an expectation of divergence
  4. How Flow Matching learns the velocity field from samples
  5. How all of this specializes to a mutual information estimator (jFMMI)
  6. The Monte-Carlo / Hutchinson’s trick implementation

1. Mutual Information and the product of marginals

Let $X$ and $Y$ be random vectors with joint distribution $P_{X,Y}$ and marginals $P_X$, $P_Y$. The mutual information is

$$I(X;Y) = \mathrm{KL}\big(P_{X,Y} \,\|\, P_X \otimes P_Y\big)$$

where $P_X \otimes P_Y$ is the product measure (the joint distribution you’d get if $X$ and $Y$ were independent).

If densities exist, we also have the familiar entropy identities

$$I(X;Y) = h(X) + h(Y) - h(X,Y) = h(X) - h(X|Y) = h(Y) - h(Y|X)$$

with differential entropy $h(X) = -\mathbb{E}[\log p_X(X)]$.
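For completeness, the one-line derivation connecting the KL form to the first entropy identity (assuming densities exist):

$$
I(X;Y) = \mathbb{E}_{P_{X,Y}}\!\left[\log \frac{p_{X,Y}(X,Y)}{p_X(X)\,p_Y(Y)}\right]
= -h(X,Y) + h(X) + h(Y).
$$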

The FMMI paper will express entropy differences via flow matching, and then plug those into these identities to get MI.

2. Continuous flows and divergence

Continuous normalizing flows describe a time-dependent random variable XtX_t via an ODE

$$\frac{d X_t}{dt} = v(X_t, t), \qquad t \in [0,1]$$

where $v(x,t)$ is a velocity field. Let $p_t$ be the density of $X_t$. Under mild regularity, continuous flows satisfy the classic identity

$$\frac{d}{dt} \log p_t(X_t) = -\operatorname{div} v(X_t, t) \tag{1}$$

This is equivalent to the continuity equation

$$\frac{\partial p_t(x)}{\partial t} + \operatorname{div}\big(p_t(x)\, v(x,t)\big) = 0$$

and can be seen as the continuous-time version of the change-of-variables formula.
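To see the link between the two, expand the divergence and divide by $p_t$:

$$
\frac{\partial p_t}{\partial t} + v\cdot\nabla p_t + p_t\operatorname{div} v = 0
\quad\Longrightarrow\quad
\frac{\partial}{\partial t}\log p_t + v\cdot\nabla \log p_t = -\operatorname{div} v,
$$

and the left-hand side, evaluated at $x = X_t$ with $\dot X_t = v(X_t,t)$, is exactly the total derivative $\frac{d}{dt}\log p_t(X_t)$, i.e. identity (1).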

Integrating (1) along the trajectory XtX_t from t=0t=0 to t=1t=1 gives

$$\log p_1(X_1) - \log p_0(X_0) = -\int_0^1 \operatorname{div} v(X_t,t)\,dt$$

Taking expectations will connect divergence to entropy; that’s the next step.

3. Entropy difference = expectation of divergence (Lemma 4.1)

Define differential entropy

$$h(X_t) = -\mathbb{E}\big[\log p_t(X_t)\big]$$

Differentiate w.r.t. time using (1):

$$\frac{d}{dt} h(X_t) = -\frac{d}{dt}\mathbb{E}\big[\log p_t(X_t)\big] = -\mathbb{E}\Big[\frac{d}{dt}\log p_t(X_t)\Big] = \mathbb{E}\big[\operatorname{div} v(X_t,t)\big]$$

Integrate from t=0t=0 to t=1t=1:

$$h(X_1) - h(X_0) = \int_0^1 \mathbb{E}\big[\operatorname{div} v(X_t,t)\big]\,dt$$

If we now draw $T \sim \mathrm{Uniform}[0,1]$ and consider the joint distribution of $(X_T, T)$, this integral can be written as a single expectation:

$$h(X_1) - h(X_0) = \mathbb{E}_{T,X_T}\big[\operatorname{div} v(X_T, T)\big]. \tag{2}$$
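A quick sanity check (not from the paper): for the scaling flow $v(x,t) = a x$ on $\mathbb{R}^d$ we have $X_1 = e^a X_0$, hence

$$
h(X_1) - h(X_0) = d\,a = \int_0^1 \mathbb{E}\big[\operatorname{div} v(X_t,t)\big]\,dt,
$$

since $\operatorname{div}(a x) = a\,d$ everywhere, so both sides of (2) agree.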

4. Learning the velocity with Flow Matching

Of course, in practice we don’t know the true velocity $v$. We only have samples from $P_0$ and $P_1$.

Flow Matching (Lipman et al.) gives a way to learn $v_\theta(x,t)$ from samples alone, by turning the CNF problem into a regression problem on velocities.

The trick is to fix, for each target sample $x_1$, a simple conditional path from a source point to $x_1$ (e.g. the linear interpolation $X_t = (1-t)X_0 + tX_1$), whose conditional velocity $v(x,t \mid x_1)$ is known in closed form (here, simply $X_1 - X_0$).

Flow Matching then says: instead of integrating the ODE during training, just regress a neural network $v_\theta(x,t)$ on these conditional velocities. Concretely, the FM objective is:

$$\mathcal{L}_\mathrm{FM}(\theta) = \mathbb{E}\big[\|v_\theta(X_T,T) - v(X_T,T \mid X_1)\|^2\big]. \tag{3}$$

When this regression succeeds, the marginal flow induced by $v_\theta$ transports $P_0$ to $P_1$ in the same way the “true” flow would.

This is what the paper calls FMDoE (Flow Matching Difference of Entropy): first learn $v_\theta$ by FM, then use (2) to estimate entropy differences via divergence.
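To make this concrete, here is a minimal PyTorch sketch of FM training with linear interpolation paths. The `VelocityNet` architecture and the function names are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Illustrative velocity network: an MLP mapping (z, t) to a velocity in R^d."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z, t):
        return self.net(torch.cat([z, t], dim=-1))

def fm_training_step(velocity_net, optimizer, z0, z1):
    """One Flow Matching step on linear paths: regress v_theta onto Z_1 - Z_0, as in (3)."""
    t = torch.rand(z0.shape[0], 1)      # T ~ Uniform[0, 1]
    zt = (1 - t) * z0 + t * z1          # point on the conditional path
    target_v = z1 - z0                  # conditional velocity of the linear path
    loss = ((velocity_net(zt, t) - target_v) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```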

5. From entropy differences to mutual information

Now we connect this to mutual information.

Recall one of the entropy forms of MI:

$$I(X;Y) = h(X) + h(Y) - h(X,Y)$$

The idea in FMMI is:

  1. Build flows between carefully chosen distributions, so that entropy differences recover MI.
  2. Use FMDoE to estimate those entropy differences via divergence expectations.

jFMMI: joint-based estimator

The simplest variant, jFMMI, chooses the endpoint distributions as follows.

Let $Z=(X,Y)$. Under $P_0$, $Z_0 \sim P_X \otimes P_Y$, while under $P_1$, $Z_1 \sim P_{X,Y}$. Then $h(Z_0) = h(X) + h(Y)$ (the components are independent under $P_0$) and $h(Z_1) = h(X,Y)$.

So the entropy difference is

$$h(Z_1) - h(Z_0) = h(X,Y) - [h(X) + h(Y)] = -I(X;Y)$$

Combine this with the DoE identity (2):

$$h(Z_1) - h(Z_0) = \mathbb{E}_{T,Z_T}\big[\operatorname{div} v(Z_T,T)\big]$$

where $v$ is any velocity field whose flow maps $P_X \otimes P_Y$ into $P_{X,Y}$. Therefore,

$$I(X;Y) = -\mathbb{E}_{T,Z_T}\big[\operatorname{div} v(Z_T,T)\big] \tag{4}$$

This is the key theoretical identity behind jFMMI: mutual information is minus the expected divergence of any flow that transports the product of marginals into the joint.

In practice we don’t know $v$, so we plug in the learned $v_\theta$:

$$\widehat{I(X;Y)} = -\mathbb{E}_{T,Z_T}\big[\operatorname{div} v_\theta(Z_T,T)\big] \tag{5}$$

That’s jFMMI.

6. Estimating divergence in practice: Hutchinson’s trick

To use (5), we need $\operatorname{div} v_\theta(z,t)$ without computing a full Jacobian.

Let

$$J(z,t) = \frac{\partial v_\theta(z,t)}{\partial z} \in \mathbb{R}^{d\times d}, \qquad \operatorname{div} v_\theta(z,t) = \mathrm{Tr}\big(J(z,t)\big).$$

Hutchinson’s estimator says: if $\varepsilon \sim \mathcal{N}(0,I_d)$, then

$$\mathrm{Tr}(J) = \mathbb{E}_\varepsilon\big[\varepsilon^\top J \varepsilon\big]$$

So an unbiased estimate is:

  1. Sample $\varepsilon \in \mathbb{R}^d$ from $\mathcal{N}(0, I_d)$,
  2. Compute the scalar $s = v_\theta(z,t) \cdot \varepsilon$,
  3. Backpropagate to get $\nabla_z s = J^\top \varepsilon$,
  4. Form $\widehat{\operatorname{div} v_\theta}(z,t) = \varepsilon^\top J \varepsilon = \sum_i \varepsilon_i (\nabla_z s)_i$.

This requires only one backward pass per point, instead of dd passes.
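A minimal PyTorch sketch of this procedure with a single Gaussian probe vector per point (the helper name is mine):

```python
import torch

def hutchinson_divergence(velocity_net, z, t):
    """Single-probe Hutchinson estimate of div v_theta(z, t), one scalar per batch element."""
    z = z.detach().requires_grad_(True)
    eps = torch.randn_like(z)              # probe vector epsilon ~ N(0, I_d)
    v = velocity_net(z, t)                 # (B, d) velocities
    s = (v * eps).sum()                    # sum over the batch of the scalars eps^T v
    grad = torch.autograd.grad(s, z)[0]    # each row is J^T eps for that sample
    return (grad * eps).sum(dim=-1)        # eps^T J eps, an unbiased estimate of Tr(J)
```

Averaging over several probe vectors reduces the variance, at the cost of extra backward passes.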

7. Monte-Carlo implementation of jFMMI

Putting everything together, FMMI implements (5) via simple Monte-Carlo:

  1. Sample a batch of pairs $(z_0, z_1)$ where
    • $z_1 \sim P_{X,Y}$ (true joint),
    • $z_0 \sim P_X \otimes P_Y$ (shuffle $Y$ independently to break dependence).
  2. Sample $t \sim \mathrm{Uniform}[0,1]$.
  3. Build the interpolated points $z_t = (1-t) z_0 + t z_1$.
  4. Evaluate the learned velocity field $v_\theta(z_t,t)$.
  5. Estimate the divergence $\widehat{\operatorname{div} v_\theta}(z_t,t)$ via Hutchinson’s trick.
  6. Each sample gives a stochastic MI contribution $\hat I_i = -\widehat{\operatorname{div} v_\theta}(z_t^{(i)}, t^{(i)})$.
  7. Average over many samples: $\widehat{I(X;Y)} = \frac{1}{N}\sum_{i=1}^N \hat I_i$ (see the sketch after this list).
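A sketch of this loop, reusing the `hutchinson_divergence` helper from the previous section (the batching scheme and function name are my own, not the paper's):

```python
import torch

def estimate_mi_jfmmi(velocity_net, x, y, n_batches=100, batch_size=512):
    """Monte-Carlo estimate of (5) from paired samples x: (N, dx), y: (N, dy)."""
    n = x.shape[0]
    estimates = []
    for _ in range(n_batches):
        idx = torch.randint(0, n, (batch_size,))
        z1 = torch.cat([x[idx], y[idx]], dim=-1)                             # joint samples (X, Y)
        z0 = torch.cat([x[idx], y[torch.randperm(n)[:batch_size]]], dim=-1)  # shuffled Y -> product of marginals
        t = torch.rand(batch_size, 1)                                        # T ~ Uniform[0, 1]
        zt = (1 - t) * z0 + t * z1                                           # interpolated points Z_t
        div_hat = hutchinson_divergence(velocity_net, zt, t)                 # estimated divergence
        estimates.append(-div_hat.mean())                                    # minus divergence = MI contribution
    return torch.stack(estimates).mean().item()
```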

8. Why this is interesting

Compared to classic discriminative MI estimators, FMMI needs no critic or density-ratio model: training reduces to a plain velocity regression (3), and estimation to an expectation of a divergence (5).

Conceptually, it reframes MI estimation as:

Learn a path that morphs independence into the true joint, and read MI as the total log-volume change along the way.

Once you see MI as entropy difference, and entropy difference as expected divergence of a flow, the whole construction becomes very natural.

Experiments

To validate the jFMMI estimator introduced in the paper, we designed a controlled numerical experiment where the true mutual information (MI) is known analytically. This allows us to measure estimation accuracy, bias, and convergence of the proposed method.

Synthetic Data Setup

We consider a simple bivariate Gaussian model:

$$X \sim \mathcal N(0,1), \qquad Y = \rho X + \sqrt{1-\rho^2}\,\varepsilon, \qquad \varepsilon \sim \mathcal N(0,1)$$

With this construction, the mutual information has a closed-form expression:

$$I(X;Y) = -\tfrac12 \log(1 - \rho^2)$$

We set $\rho = 0.9$, giving a ground-truth MI of:

$$I_{\text{true}} \approx 0.8304$$

We generate 30,000 samples from the joint distribution $P_{X,Y}$. To obtain samples from the product of marginals $P_X \otimes P_Y$, we shuffle the $Y$ values independently from $X$. This gives endpoints $(Z_0, Z_1)$ for flow matching.
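A short sketch of this setup (shapes and constants as above):

```python
import torch

rho, n = 0.9, 30_000
x = torch.randn(n, 1)
y = rho * x + (1 - rho**2) ** 0.5 * torch.randn(n, 1)

z1 = torch.cat([x, y], dim=-1)                         # samples from the joint P_{X,Y}
z0 = torch.cat([x, y[torch.randperm(n)]], dim=-1)      # shuffled Y -> product of marginals

true_mi = -0.5 * torch.log(torch.tensor(1 - rho**2))   # ~= 0.8304
```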

Flow Matching and Velocity Network

We train a neural velocity field $v_\theta(z,t)$ with 512 hidden units and Tanh activations to match the true conditional velocity along the linear path:

$$Z_t = (1-t)Z_0 + tZ_1, \qquad v_*(Z_t,t) = Z_1 - Z_0$$

The Flow Matching loss is:

$$\mathcal L_{FM} = \mathbb E\,\|v_\theta(Z_t,t) - (Z_1 - Z_0)\|^2$$

The model is trained for 5,000 gradient steps with AdamW and cosine annealing.

Although the FM loss exhibits mild oscillation, the flow nonetheless converges well: the learned transport visually matches the ground-truth Gaussian interpolation when animated.
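Reusing the `VelocityNet` and `fm_training_step` sketches from Section 4, the training loop could look roughly like this; only the width, activation, optimizer, scheduler, and step count come from the text, while the depth, learning rate, and batch size are assumptions:

```python
model = VelocityNet(dim=2, hidden=512)                  # Tanh MLP, 512 hidden units
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)    # lr is an assumption
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=5_000)

for step in range(5_000):                               # 5,000 gradient steps
    idx = torch.randint(0, n, (512,))                   # batch size is an assumption
    fm_training_step(model, opt, z0[idx], z1[idx])
    sched.step()
```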

Mutual Information Estimation

After training, we estimate MI using the jFMMI identity:

$$I(X;Y) = -\mathbb E\big[\operatorname{div} v_\theta(Z_t,t)\big]$$

with divergence estimated via Hutchinson’s estimator:

$$\operatorname{div} v_\theta(z,t) \approx \varepsilon^\top J_{v_\theta}(z,t)\,\varepsilon, \qquad \varepsilon \sim \mathcal N(0, I)$$

We draw 50,000 Monte-Carlo samples $(Z_t,t)$ to approximate the expectation.

Results

The estimator converges towards the true MI:

The curve $N \mapsto \hat I_N$ shows fast convergence but a small positive bias, expected due to imperfect learning of the velocity field. Increasing network depth significantly reduces this bias, confirming the role of expressivity.

Figure: convergence of the Monte-Carlo estimate $\hat I_N$ toward the ground-truth MI.

Flow Visualization

We additionally animate the learned flow and the ground-truth flow simultaneously. The trajectories of particles under both flows almost perfectly overlap, demonstrating that the learned $v_\theta$ captures the correct transport from $P_X \otimes P_Y$ to $P_{X,Y}$.

This visual inspection confirms that Flow Matching learned the intended probability path, validating the theoretical assumptions behind jFMMI.

Animation: particle trajectories under the learned and ground-truth flows.

