This blog post clarifies the link between maximizing the likelihood and minimizing the Kullback-Leibler divergence between a parametric distribution \(p_\theta\) and the true distribution \(\pdata\).
TLDR: the negative log-likelihood is not the KL between \(\pdatahat\) and \(p_\theta\). Instead, it is an empirical estimate of the KL between \(\pdata\) and \(p_\theta\). Those two are not the same.
Suppose that you have access to i.i.d. samples \(\data{1}, \ldots, \data{n}\) from the true data distribution \(\pdata\), the latter admitting a density with respect to the Lebesgue measure. Define the empirical data distribution as \(\pdatahat := \frac{1}{n}\sum_{i=1}^n \delta_{\data{i}}\), where \(\delta\) is the Dirac distribution. Among a family of parametric distributions \(\{p_\theta\}_\theta\), a natural approach to finding the one closest to \(\pdata\), i.e. the one that best fits the data, is to maximize the likelihood of the observed samples, that is, to solve
\[\max_\theta \sum_{i=1}^n \log p_\theta(\data{i})\]You can read online (e.g. here) that maximizing the likelihood amounts to making \(p_\theta\) close to \(\pdatahat\), because it amounts to solving \(\min_\theta \KL(\pdatahat, p_\theta)\).
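As a concrete instance, here is a minimal numerical sketch of this maximization, assuming (purely for illustration, this family is not in the post) a Gaussian family \(p_\theta = \mathcal{N}(\theta, 1)\), for which the maximizer of the likelihood is known in closed form to be the sample mean:

```python
import math
import random

random.seed(0)

# Hypothetical setting: the family is N(theta, 1) and the true
# distribution pdata is N(2, 1); theta is a scalar mean parameter.
data = [random.gauss(2.0, 1.0) for _ in range(10_000)]

def log_likelihood(theta, xs):
    # sum_i log p_theta(x_i) for the N(theta, 1) density
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2 for x in xs)

# Crude grid search over theta; for this family the maximum-likelihood
# estimate is the sample mean, so the two should agree to grid resolution.
grid = [i / 100 for i in range(0, 401)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, data))
sample_mean = sum(data) / len(data)
print(theta_hat, sample_mean)
```

The grid search is of course only for the sketch; in this family the argmax is available analytically, and in general one would use gradient-based optimization.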
The reasoning is as follows:
\[\begin{align} \KL(\pdatahat, p_\theta) &= \int \pdatahat(x) \log \frac{\pdatahat(x)}{p_\theta(x)} dx \\ &= \int \pdatahat(x) \log \pdatahat(x) dx - \int \pdatahat(x) \log p_\theta(x) dx \\ &= constant - \int \pdatahat(x) \log p_\theta(x) dx \\ &= constant - \frac{1}{n} \sum_{i=1}^n \log p_\theta(\data{i}) \end{align}\]but these manipulations are not correct, as \(\pdatahat\) does not admit a density: \(\int \pdatahat(x) \log \pdatahat(x) dx\) is not defined (worse, the KL itself is not defined unless \(p_\theta\) is supported only on the \(\data{i}\)’s, which is unlikely).
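One way to see the problem with the \(\int \pdatahat(x) \log \pdatahat(x) dx\) term is to smooth the Diracs: the following toy sketch (a hypothetical smoothing, not in the post) replaces a Dirac mass by a Gaussian of width \(\sigma\) and watches its differential entropy diverge as \(\sigma \to 0\):

```python
import math

# Toy illustration: smooth a single Dirac mass into N(0, sigma^2).
# The differential entropy of N(0, sigma^2) is 0.5 * log(2*pi*e*sigma^2),
# which diverges to -infinity as sigma -> 0. This is why the "constant"
# term int phat(x) log phat(x) dx has no finite Dirac limit:
# phat does not admit a density.
sigmas = [1.0, 1e-2, 1e-4, 1e-8]
entropies = [0.5 * math.log(2 * math.pi * math.e * s ** 2) for s in sigmas]
for s, h in zip(sigmas, entropies):
    print(f"sigma={s:g}  entropy={h:.2f}")
```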
The correct interpretation is: maximizing the likelihood amounts to minimizing an empirical estimate of the KL between \(\pdata\) (not \(\pdatahat\)) and \(p_\theta\). Indeed, the latter is:
\[\begin{align} \KL(\pdata, p_\theta) &= \int \pdata(x) \log \frac{\pdata(x)}{p_\theta(x)} dx \\ &= \int \pdata(x) \log \pdata(x) dx - \int \pdata(x) \log p_\theta(x) dx \\ &= constant - \int \pdata(x) \log p_\theta(x) dx \end{align}\]Up to a constant, an empirical estimate of the KL divergence is therefore:
\[\begin{align} \widehat{\KL}(\pdata, p_\theta) := -\int \log p_\theta(x) \, \pdatahat(dx) = -\frac{1}{n} \sum_{i=1}^n \log p_\theta(\data{i}) \end{align}\]which is exactly the negative log-likelihood, divided by \(n\). Note that integrating \(\log p_\theta\) against the measure \(\pdatahat\) (rather than against a nonexistent density) is perfectly well defined.
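To illustrate this numerically, here is a sketch (again with hypothetical Gaussian choices for \(\pdata\) and \(p_\theta\), not from the post) checking that \(-\frac{1}{n}\sum_{i=1}^n \log p_\theta(\data{i})\) approaches \(\KL(\pdata, p_\theta)\) plus the constant, which is the differential entropy of \(\pdata\):

```python
import math
import random

random.seed(0)

# Hypothetical instance: pdata = N(0, 1) and p_theta = N(1, 1).
# Analytically, KL(pdata, p_theta) = 0.5 and the differential entropy of
# pdata is H = 0.5 * log(2*pi*e), so the cross-entropy
# -E[log p_theta(X)] = KL + H, which the empirical average should match.
n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

def log_p_theta(x):
    # log density of N(1, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - 1.0) ** 2

empirical_nll = -sum(log_p_theta(x) for x in xs) / n  # the MLE objective, per sample
analytic_cross_entropy = 0.5 + 0.5 * math.log(2 * math.pi * math.e)
print(empirical_nll, analytic_cross_entropy)
```

The two printed values agree up to Monte Carlo error of order \(1/\sqrt{n}\): the empirical negative log-likelihood estimates the KL only up to the entropy constant, which does not depend on \(\theta\) and therefore does not affect the minimization.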