# Maximum Likelihood or KL Divergence

## 2019/11/24

Let’s assume we have a random variable $X$ that is distributed according to the true probability density function $p$. We write $x \sim p$, or $x \sim p(x)$, for a realization of $X$ obtained by sampling from $p$, and we use $p(x)$ as shorthand for $p(X = x)$.

Given a dataset of independent and identically distributed (i.i.d.) samples $D = \{ x_i \}_{i=1 \ldots N}$, with $x_i \sim p$, we often try to fit a parameterized distribution $p(x | \theta)$ to the data. To do so, we maximize the log-likelihood of the parameters $\theta$:

\begin{align} & \max_\theta \log L(D | \theta) = \max_\theta \log \prod_{i=1}^N p(x_i | \theta) = \max_\theta \sum_{i=1}^N \log p(x_i | \theta) \end{align}
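As a concrete sketch of this maximization, we can fit a Gaussian $p(x | \theta)$, with $\theta = (\mu, \sigma)$, by numerically minimizing the negative log-likelihood. The Gaussian family, the sample size, and the true parameters below are illustrative choices, not part of the original text:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
# Samples x_i ~ p, where the (normally unknown) true p is N(2.0, 1.5^2).
data = rng.normal(loc=2.0, scale=1.5, size=1000)

def neg_log_likelihood(theta):
    # -sum_i log p(x_i | theta); we parameterize sigma = exp(log_sigma)
    # so the optimization is unconstrained.
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
```

With enough samples, $\hat{\mu}$ and $\hat{\sigma}$ land close to the true values $2.0$ and $1.5$; for the Gaussian this optimum also matches the closed-form MLE (sample mean and standard deviation).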

Let’s see how this is equivalent to minimizing the Kullback-Leibler (KL) divergence between the parameterized distribution and the true one. We minimize the forward version of the KL, so that the parameterized distribution covers the regions where the true one has density. (To decide between the forward and reverse KL, I found this post to be enlightening.)

\begin{align} & \min_\theta \textrm{D}_{\textrm{KL}} \left[ p(x) || p(x | \theta) \right] \\ = & \min_\theta \mathbb{E}_{x \sim p(x)}\left[ -\log p(x | \theta) \right] - \mathbb{E}_{x \sim p(x)}\left[ -\log p(x) \right] \\ = & \min_\theta -\int p(x)\log p(x|\theta) \textrm{d}x \\ = & \max_\theta \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N \log p(x_i|\theta) \textrm{, with } x_i \sim p \end{align}

In the third line the entropy term $\mathbb{E}_{x \sim p(x)}\left[ -\log p(x) \right]$ is dropped because it does not depend on $\theta$, and the last line replaces the expectation with an empirical average, which converges to it by the law of large numbers.
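The last equality can be checked numerically: for two Gaussians (chosen here purely for illustration, since their cross-entropy has a closed form), the empirical average $-\frac{1}{N}\sum_i \log p(x_i|\theta)$ approaches the integral $-\int p(x)\log p(x|\theta)\,\textrm{d}x$ as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, s0 = 0.0, 1.0   # true p = N(mu0, s0^2)
mu1, s1 = 0.5, 1.2   # model p(.|theta) = N(mu1, s1^2)

# Monte Carlo estimate of the cross-entropy E_{x~p}[-log p(x|theta)].
x = rng.normal(mu0, s0, size=200_000)
log_q = -0.5 * np.log(2 * np.pi * s1**2) - (x - mu1) ** 2 / (2 * s1**2)
mc_estimate = -log_q.mean()

# Closed-form cross-entropy between two Gaussians.
analytic = 0.5 * np.log(2 * np.pi * s1**2) + (s0**2 + (mu0 - mu1) ** 2) / (2 * s1**2)
```

The two quantities agree up to Monte Carlo error, which shrinks as $1/\sqrt{N}$.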

Hence both formulations yield the same parameters as the number of samples grows to infinity. But why do maximum likelihood estimation instead of minimizing the KL divergence directly? The problem is that we cannot evaluate the integral $\int p(x)\log p(x|\theta) \textrm{d}x$ analytically, because we do not have access to the true distribution $p(x)$… after all, that is exactly what we are trying to find.

For the sake of intuition, writing the KL divergence as $\textrm{D}_{\textrm{KL}} \left[ p(x) || p(x| \theta) \right] = \mathbb{E}_{x \sim p(x)}\left[ -\log p(x | \theta) \right] - \mathbb{E}_{x \sim p(x)}\left[ -\log p(x) \right]$, and noting that the second term is just the entropy of the true distribution, helps us understand that the parameters $\theta$ are chosen to minimize the extra bits (when using $\log_2$) needed to code a symbol from $X$ when the code is built from $p(x | \theta)$ instead of $p(x)$.