The Hutchinson trick: a cheap way to evaluate the trace of a Jacobian, without computing the Jacobian itself!
Continuous normalizing flows (more in a blog post to come) heavily rely on the computation of the divergence of a network \(f: \bbR^d \to \bbR^d\), aka the trace of its Jacobian:
\[\begin{equation} \DIV f (x) = \tr \jac_f(x) \triangleq \sum_{i=1}^d \frac{\partial f_i}{\partial x_i} (x) \end{equation}\]Computing this quantity would require evaluating the full Jacobian, which can be done at the prohibitive cost of \(d\) backpropagations (more details below). It turns out that the trace/divergence can be approximated reasonably well with a single call to backpropagation.
For that, let’s forget about Jacobians for a second and take a generic matrix \(A \in \bbR^{d \times d}\). The Hutchinson trick states that for any random variable \(z \in \bbR^d\) such that \(\bbE[zz^\top] = \Id_d\),
\[\begin{equation}\label{eq:hutchinson} \tr(A) = \bbE_z [z^\top A z] \end{equation}\]It is typically used with \(z\) having i.i.d. entries of mean zero and unit variance, classically standard Gaussian or Rademacher. The proof is very simple:
\[\begin{align*} \bbE_z [z^\top A z] &= \bbE_z [\tr(z^\top A z)] \quad &\text{(a scalar equals its trace)} \\ &= \bbE_z [\tr (A z z^\top)] \quad &(\tr(MN) = \tr(NM)) \\ &= \tr \left(\bbE_z [A z z^\top]\right) \quad &\text{(trace and expectation commute)} \\ &= \tr \left(A \bbE_z [z z^\top] \right) \quad &\text{(linearity of expectation)} \\ &= \tr \left(A \Id_d \right) \quad &\text{(assumption on $z$)} \\ &= \tr \left(A \right) \quad & \end{align*}\]et voilà!
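To convince ourselves numerically, here is a quick NumPy sanity check of \eqref{eq:hutchinson} (a small sketch of mine, not code from this post), on a random dense matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d))

# Monte-Carlo estimate of E[z^T A z] with standard Gaussian probes
n_samples = 100_000
zs = rng.standard_normal((n_samples, d))
estimates = np.einsum("ij,jk,ik->i", zs, A, zs)  # z^T A z for each probe z

print(np.trace(A))       # exact trace
print(estimates.mean())  # Hutchinson estimate, close for many samples
```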
The numerical benefit is not obvious at first: if one has access to \(A \in \bbR^{d \times d}\), computing its trace costs \(\mathcal{O}(d)\), while the above formula requires \(\mathcal{O}(d^2)\) to compute \(z^\top A z\), not to mention multiple Monte-Carlo samples for the expectation. But there are cases in which one wants to compute the trace of a matrix, without having explicit access to said matrix!
The flagship example is the full Jacobian of a neural network \(f\). Explicitly computing it is out of reach: with backpropagation one can only compute vector-Jacobian products, i.e. \(v^\top \jac_f(x)\) for \(v \in \bbR^d\). To compute the full Jacobian means computing \(e_i^\top \jac_f(x)\) for all canonical vectors \(e_i\), hence calling backpropagation \(d\) times (though it can be done in a vectorized fashion). But, to compute the trace of the Jacobian, one can sample a single \(z\), and then approximate the expectation in \eqref{eq:hutchinson} by a (single) Monte-Carlo estimate:
\[\tr \jac_f(x) \approx \left(z^\top \jac_f(x)\right) z\]where the vector-Jacobian product \(z^\top \jac_f(x)\) is computed with a single backpropagation pass. This is what is done in continuous normalizing flows, inside the ordinary differential equation required to evaluate the log-likelihood loss. Very clever and elegant!
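Here is a minimal PyTorch sketch of this estimator (my own illustration, with a hypothetical helper name, not code from the post): `torch.autograd.grad` with `grad_outputs=z` returns the vector-Jacobian product \(z^\top \jac_f(x)\), which is then dotted with \(z\).

```python
import torch

def hutchinson_trace_jac(f, x, n_samples=1, rademacher=True):
    """Estimate tr(J_f(x)) with one backprop (one VJP) per probe vector z."""
    x = x.detach().requires_grad_(True)
    y = f(x)
    estimates = []
    for _ in range(n_samples):
        # Rademacher probe (sign of a Gaussian) or standard Gaussian probe
        z = torch.randn_like(y)
        if rademacher:
            z = z.sign()
        # backpropagation computes the vector-Jacobian product z^T J_f(x)
        (vjp,) = torch.autograd.grad(y, x, grad_outputs=z, retain_graph=True)
        estimates.append(torch.dot(vjp, z))
    return torch.stack(estimates).mean()

# sanity check on a small toy network (hypothetical example)
torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(10, 10), torch.nn.Tanh(), torch.nn.Linear(10, 10)
)
x = torch.randn(10)
exact = torch.trace(torch.autograd.functional.jacobian(net, x))
approx = hutchinson_trace_jac(net, x, n_samples=1000)
print(exact.item(), approx.item())
```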
The Hutchinson estimator is a very elegant trick with a short proof, but it has a drawback: its variance is terrible.
Proposition 1: For a standard Gaussian \(z\), the variance of the estimator \(z^\top A z\) is \(2 \norm{A}_F^2\).
This result is quite disappointing: for a generic matrix, the variance scales like \(d^2\) (it involves all \(d^2\) entries of \(A\)), while the trace only involves the \(d\) diagonal entries. Can we improve this with other random variables? Yes, but not by much!
Proposition 2: For uniform sign \(z\) (aka Rademacher variables, taking values \(+1\) or \(-1\) each with probability \(1/2\)), the variance of the estimator is \(2 \sum_{i \neq j} A_{ij}^2\). It is the law of \(z\) that minimizes the variance of the Hutchinson estimator.
In hindsight, this optimality result makes sense: if \(A\) is diagonal, then for any draw of a Rademacher \(z\), \(z^\top A z = \tr A\)!
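Writing out the quadratic form makes this explicit (a small added step): \[z^\top A z = \sum_{i,j} A_{ij} z_i z_j = \sum_{i} A_{ii} z_i^2 + \sum_{i \neq j} A_{ij} z_i z_j = \tr A + \sum_{i \neq j} A_{ij} z_i z_j,\]since \(z_i^2 = 1\) for Rademacher entries. Only the off-diagonal entries of \(A\) contribute noise, and they vanish when \(A\) is diagonal.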
With a small numerical experiment (entries of \(A \in \bbR^{20 \times 20}\) i.i.d. standard Gaussian) we can check that Rademacher gives a slightly smaller variance.
If we increase the diagonal entries of \(A\) (by 5 here), we see the variance for Gaussian \(z\) increase, as predicted by Proposition 1, while the performance for Rademacher \(z\) is unaffected (consistent with Proposition 2).
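A minimal NumPy sketch of these two experiments (my own reconstruction of the setup described above, not the post's original code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 20, 100_000

def empirical_variance(A, sampler):
    zs = sampler((n_samples, d))
    estimates = np.einsum("ij,jk,ik->i", zs, A, zs)  # z^T A z for each probe
    return estimates.var()

gaussian = rng.standard_normal
rademacher = lambda size: rng.choice([-1.0, 1.0], size=size)

A = rng.standard_normal((d, d))
print("Gaussian   :", empirical_variance(A, gaussian))
print("Rademacher :", empirical_variance(A, rademacher))

# boost the diagonal: the Gaussian variance increases, the Rademacher one does not
A_shifted = A + 5 * np.eye(d)
print("Gaussian   :", empirical_variance(A_shifted, gaussian))
print("Rademacher :", empirical_variance(A_shifted, rademacher))
```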
As my colleague Antoine Gonon pointed out to me, the Hutchinson trick can also be used to compute all the diagonal entries:
\[\bbE_z [(Az) \odot z] = \diag(A)\]The proof is very similar to the first one. This technique allows evaluating the diagonal of the Hessian of a neural network \(g: \bbR^d \to \bbR\) using Hessian-vector products (see this great blog post), which has applications in neural network pruning, for example. See more in Section 9.8 of
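As an illustration of this diagonal variant, here is a minimal PyTorch sketch (my own, with hypothetical names, assuming \(g\) is a scalar-valued function of \(d\) variables): Hessian-vector products \(\nabla^2 g(x)\, z\) are obtained by a double backward pass, and averaging \((\nabla^2 g(x)\, z) \odot z\) over probes estimates \(\diag(\nabla^2 g(x))\).

```python
import torch

def diag_hessian_estimate(g, x, n_samples=100):
    """Estimate diag(H), H the Hessian of the scalar function g at x,
    using E[(H z) * z] = diag(H) with Rademacher probes z."""
    x = x.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(g(x), x, create_graph=True)
    acc = torch.zeros_like(x)
    for _ in range(n_samples):
        z = torch.randn_like(x).sign()  # Rademacher probe
        # Hessian-vector product H z via a second backward pass
        (hvp,) = torch.autograd.grad(grad @ z, x, retain_graph=True)
        acc += hvp * z
    return acc / n_samples

# sanity check on a toy scalar function (hypothetical example)
torch.manual_seed(0)
g = lambda x: (x ** 3).sum() + x[0] * x[1]
x = torch.randn(5)
exact = torch.diagonal(torch.autograd.functional.hessian(g, x))
print(exact)
print(diag_hessian_estimate(g, x))
```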