Understanding KL Divergence: Measuring One Distribution Against Another

Kullback–Leibler (KL) divergence quantifies how one probability distribution departs from another. It shows up everywhere—from maximum-likelihood training to variational inference—yet it can feel abstract. This post builds intuition, links KL to coding length, and walks through concrete examples you can compute by hand.

Core Definition

For two distributions $P$ (“true”) and $Q$ (“approximate”) over the same support:

  • Discrete case: \(D_{\mathrm{KL}}(P \Vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\)
  • Continuous case: \(D_{\mathrm{KL}}(P \Vert Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx\)

Units depend on the logarithm base: natural log → nats, base 2 → bits.
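The discrete formula translates directly into a few lines of plain Python. This is a minimal sketch (the function name, signature, and example values are my own, not a standard API):

```python
import math

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for discrete distributions given as aligned probability lists.

    Terms with P(x) = 0 contribute nothing, so they are skipped
    (the convention 0 * log 0 = 0)."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions diverge by zero.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0

# Pass base=2 to measure in bits instead of nats.
print(kl_divergence([0.9, 0.1], [0.5, 0.5], base=2))
```

The `base` argument mirrors the units note above: natural log for nats, base 2 for bits.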

How to Interpret the Formula

  1. Expected surprise penalty: $\log \frac{P(x)}{Q(x)}$ is the extra surprise of seeing $x$ if you assume $Q$ when reality is $P$. KL takes the expectation under $P$, so it penalizes mismatches where $P$ assigns high probability.
  2. Coding length gap: If you compress data generated by $P$ using a code optimized for $Q$, your average codelength increases by $D_{\mathrm{KL}}(P \Vert Q)$ bits/nats per symbol.
  3. Asymmetry is the point: Swapping $P$ and $Q$ changes the expectation and thus the penalty pattern. The direction encodes which distribution sets the weighting.

Connection to Entropy and Cross-Entropy

KL is the difference between cross-entropy and entropy:

\[D_{\mathrm{KL}}(P \Vert Q) = H(P, Q) - H(P)\]

where $H(P) = -\sum_x P(x) \log P(x)$ and $H(P, Q) = -\sum_x P(x) \log Q(x)$. Minimizing KL w.r.t. $Q$ is equivalent to minimizing cross-entropy, because $H(P)$ is constant with respect to $Q$.
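The identity is easy to check numerically. A quick sketch with made-up distributions (the variable names and values are illustrative only):

```python
import math

P = [0.5, 0.5]
Q = [0.8, 0.2]

entropy = -sum(p * math.log(p) for p in P)                    # H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))   # H(P, Q)
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))           # D_KL(P || Q)

# KL equals the cross-entropy/entropy gap (up to float rounding).
print(abs(kl - (cross_entropy - entropy)) < 1e-12)  # True
```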

Discrete Example: Two Coins

Let the true coin be $P(\text{H}) = 0.7$, $P(\text{T}) = 0.3$. Suppose our model coin is $Q(\text{H}) = 0.4$, $Q(\text{T}) = 0.6$.

\[\begin{aligned} D_{\mathrm{KL}}(P \Vert Q) &= 0.7 \log \frac{0.7}{0.4} + 0.3 \log \frac{0.3}{0.6} \\ &\approx 0.7 \times 0.5596 + 0.3 \times (-0.6931) \\ &\approx 0.1838 \text{ nats} \approx 0.27 \text{ bits}. \end{aligned}\]

Interpretation: coding coin flips from $P$ with a code for $Q$ wastes about 0.27 extra bits per flip.
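The hand computation above can be reproduced in a few lines (a small sketch; the dictionary layout is just one convenient encoding):

```python
import math

P = {"H": 0.7, "T": 0.3}  # true coin
Q = {"H": 0.4, "T": 0.6}  # model coin

kl_nats = sum(P[x] * math.log(P[x] / Q[x]) for x in P)
kl_bits = kl_nats / math.log(2)  # convert nats -> bits
print(f"{kl_nats:.4f} nats, {kl_bits:.2f} bits")
```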

Continuous Case

For Gaussian distributions, KL has a closed form. For one-dimensional normals $P = \mathcal{N}(\mu_p, \sigma_p^2)$ and $Q = \mathcal{N}(\mu_q, \sigma_q^2)$:

\[D_{\mathrm{KL}}(P \Vert Q) = \log \frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2 \sigma_q^2} - \frac{1}{2}.\]

Notice how mean and variance mismatches both contribute. When both match, KL is zero.
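The closed form is a one-liner to implement. A sketch (function name is my own) that also illustrates a handy sanity check: a mean shift of one standard deviation costs exactly half a nat.

```python
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form D_KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ) in nats."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

# Identical normals: zero divergence.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # 0.0

# Shift the mean by one standard deviation: 0.5 nats.
print(gaussian_kl(1.0, 1.0, 0.0, 1.0))  # 0.5
```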

Why KL Appears in Machine Learning

  • Maximum likelihood: Fitting $Q_\theta$ to data from $P$ by minimizing empirical cross-entropy is equivalent to minimizing $D_{\mathrm{KL}}(P \Vert Q_\theta)$.
  • Variational inference: The evidence lower bound (ELBO) rearranges to $\log p(x) = \text{ELBO} + D_{\mathrm{KL}}(q(z \mid x) \Vert p(z \mid x))$. Maximizing ELBO means making the approximate posterior $q$ close to the true posterior in KL.
  • Regularization and distillation: KL can act as a soft constraint, e.g., knowledge distillation minimizes $D_{\mathrm{KL}}(P_{\text{teacher}} \Vert P_{\text{student}})$ over predicted class distributions.
  • Reinforcement learning: Trust region and proximal policy optimization constrain $D_{\mathrm{KL}}(\pi_{\text{old}} \Vert \pi_{\text{new}})$ to keep policy updates stable.
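The distillation bullet can be sketched end to end with a toy softmax. Everything here is hypothetical (the logits, the helper names, and the temperature-free setup are simplifications for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """D_KL(P || Q) in nats over aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([2.0, 1.0, 0.1])  # hypothetical teacher logits
student = softmax([1.5, 1.2, 0.3])  # hypothetical student logits

# The distillation loss term: teacher sets the weighting (P), student is Q.
loss = kl(teacher, student)
print(loss > 0)  # True: the student has not yet matched the teacher
```

Gradient descent on `loss` with respect to the student logits pushes the student's distribution toward the teacher's.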

Common Pitfalls and Practical Tips

  1. Support mismatch is catastrophic: If $Q(x) = 0$ where $P(x) > 0$, KL becomes infinite. Add smoothing or ensure overlapping support.
  2. Direction matters: Minimizing $D_{\mathrm{KL}}(P \Vert Q)$ (mode-covering) punishes missing mass more than extra mass. Minimizing $D_{\mathrm{KL}}(Q \Vert P)$ (mode-seeking) does the opposite.
  3. Scaling with units: Choose log base consistent with your application (nats for calculus-friendly gradients, bits for coding interpretations).
  4. Numerical stability: Clip probabilities away from 0, use log-sum-exp tricks, and compute in log-space when possible.
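Pitfalls 1 and 4 combine naturally in practice: smooth $Q$ before taking logs. A minimal sketch of additive smoothing (the function name and `eps` choice are my own):

```python
import math

def smoothed_kl(p, q, eps=1e-9):
    """D_KL(P || Q) with additive smoothing on Q, so Q(x) = 0 where
    P(x) > 0 yields a large but finite value instead of infinity."""
    q = [qi + eps for qi in q]
    total = sum(q)
    q = [qi / total for qi in q]  # renormalize after smoothing
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5, 0.0]
Q = [1.0, 0.0, 0.0]  # Q assigns zero mass to an outcome P can produce

# Unsmoothed KL would be infinite; the smoothed value is finite (and large).
print(math.isfinite(smoothed_kl(P, Q)))  # True
```

The trade-off: smoothing biases the estimate, so keep `eps` small relative to the true probabilities involved.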

Key Takeaways

  • KL divergence is an expected extra surprise or extra codelength when using the wrong distribution.
  • It is non-negative and asymmetric; zero only when the distributions match exactly.
  • Many training objectives secretly minimize a KL, making it the connective tissue between likelihood, variational methods, distillation, and stable policy updates.
