Understanding KL Divergence: Measuring One Distribution Against Another

Kullback–Leibler (KL) divergence quantifies how one probability distribution departs from another. It shows up everywhere—from maximum-likelihood training to variational inference—yet it can feel abstract. This post builds intuition, links KL to coding length, and walks through concrete examples you can compute by hand.

Core Definition

For two distributions $P$ (“true”) and $Q$ (“approximate”) over the same support:

  • Discrete case: \(D_{\mathrm{KL}}(P \Vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\)
  • Continuous case: \(D_{\mathrm{KL}}(P \Vert Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx\)

Units depend on the logarithm base: natural log → nats, base 2 → bits.
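The discrete formula translates directly into a few lines of plain Python. This is a minimal sketch (the function name, signature, and example values are my own, not a standard API):

```python
import math

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for discrete distributions given as aligned probability lists.

    Terms with P(x) = 0 contribute nothing, so they are skipped
    (the convention 0 * log 0 = 0)."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions diverge by zero.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0

# Pass base=2 to measure in bits instead of nats.
print(kl_divergence([0.9, 0.1], [0.5, 0.5], base=2))
```

The `base` argument mirrors the units note above: natural log for nats, base 2 for bits.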

How to Interpret the Formula

  1. Expected surprise penalty: $\log \frac{P(x)}{Q(x)}$ is the extra surprise of seeing $x$ if you assume $Q$ when reality is $P$. KL takes the expectation under $P$, so it penalizes mismatches where $P$ assigns high probability.
  2. Coding length gap: If you compress data generated by $P$ using a code optimized for $Q$, your average codelength increases by $D_{\mathrm{KL}}(P \Vert Q)$ bits/nats per symbol.
  3. Asymmetry is the point: Swapping $P$ and $Q$ changes the expectation and thus the penalty pattern. The direction encodes which distribution sets the weighting.

Connection to Entropy and Cross-Entropy

KL is the difference between cross-entropy and entropy:

\[D_{\mathrm{KL}}(P \Vert Q) = H(P, Q) - H(P)\]

where $H(P) = -\sum_x P(x) \log P(x)$ and $H(P, Q) = -\sum_x P(x) \log Q(x)$. Minimizing KL w.r.t. $Q$ is equivalent to minimizing cross-entropy, because $H(P)$ is constant with respect to $Q$.
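The identity is easy to check numerically. A quick sketch with made-up distributions (the variable names and values are illustrative only):

```python
import math

P = [0.5, 0.5]
Q = [0.8, 0.2]

entropy = -sum(p * math.log(p) for p in P)                    # H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))   # H(P, Q)
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))           # D_KL(P || Q)

# KL equals the cross-entropy/entropy gap (up to float rounding).
print(abs(kl - (cross_entropy - entropy)) < 1e-12)  # True
```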

Discrete Example: Two Coins

Let the true coin be $P(\text{H}) = 0.7$, $P(\text{T}) = 0.3$. Suppose our model coin is $Q(\text{H}) = 0.4$, $Q(\text{T}) = 0.6$.

\[\begin{aligned} D_{\mathrm{KL}}(P \Vert Q) &= 0.7 \log \frac{0.7}{0.4} + 0.3 \log \frac{0.3}{0.6} \\ &\approx 0.7 \times 0.5596 + 0.3 \times (-0.6931) \\ &\approx 0.1838 \text{ nats} \approx 0.27 \text{ bits}. \end{aligned}\]

Interpretation: coding coin flips from $P$ with a code for $Q$ wastes about 0.27 extra bits per flip.
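The hand computation above can be reproduced in a few lines (a small sketch; the dictionary layout is just one convenient encoding):

```python
import math

P = {"H": 0.7, "T": 0.3}  # true coin
Q = {"H": 0.4, "T": 0.6}  # model coin

kl_nats = sum(P[x] * math.log(P[x] / Q[x]) for x in P)
kl_bits = kl_nats / math.log(2)  # convert nats -> bits
print(f"{kl_nats:.4f} nats, {kl_bits:.2f} bits")
```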

Continuous Case

For Gaussian distributions, KL has a closed form. For one-dimensional normals $P = \mathcal{N}(\mu_p, \sigma_p^2)$ and $Q = \mathcal{N}(\mu_q, \sigma_q^2)$:

\[D_{\mathrm{KL}}(P \Vert Q) = \log \frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2 \sigma_q^2} - \frac{1}{2}.\]

Notice how mean and variance mismatches both contribute. When both match, KL is zero.
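The closed form is a one-liner to implement. A sketch (function name is my own) that also illustrates a handy sanity check: a mean shift of one standard deviation costs exactly half a nat.

```python
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form D_KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ) in nats."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

# Identical normals: zero divergence.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # 0.0

# Shift the mean by one standard deviation: 0.5 nats.
print(gaussian_kl(1.0, 1.0, 0.0, 1.0))  # 0.5
```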

Why KL Appears in Machine Learning

  • Maximum likelihood: Fitting $Q_\theta$ to data from $P$ by minimizing empirical cross-entropy is equivalent to minimizing $D_{\mathrm{KL}}(P \Vert Q_\theta)$.
  • Variational inference: The evidence lower bound (ELBO) rearranges to $\log p(x) = \text{ELBO} + D_{\mathrm{KL}}(q(z \mid x) \Vert p(z \mid x))$. Maximizing ELBO means making the approximate posterior $q$ close to the true posterior in KL.
  • Regularization and distillation: KL can act as a soft constraint, e.g., knowledge distillation minimizes $D_{\mathrm{KL}}(P_{\text{teacher}} \Vert P_{\text{student}})$ over predicted class distributions.
  • Reinforcement learning: Trust region and proximal policy optimization constrain $D_{\mathrm{KL}}(\pi_{\text{old}} \Vert \pi_{\text{new}})$ to keep policy updates stable.
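The distillation bullet can be sketched end to end with a toy softmax. Everything here is hypothetical (the logits, the helper names, and the temperature-free setup are simplifications for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """D_KL(P || Q) in nats over aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([2.0, 1.0, 0.1])  # hypothetical teacher logits
student = softmax([1.5, 1.2, 0.3])  # hypothetical student logits

# The distillation loss term: teacher sets the weighting (P), student is Q.
loss = kl(teacher, student)
print(loss > 0)  # True: the student has not yet matched the teacher
```

Gradient descent on `loss` with respect to the student logits pushes the student's distribution toward the teacher's.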

Common Pitfalls and Practical Tips

  1. Support mismatch is catastrophic: If $Q(x) = 0$ where $P(x) > 0$, KL becomes infinite. Add smoothing or ensure overlapping support.
  2. Direction matters: Minimizing $D_{\mathrm{KL}}(P \Vert Q)$ (mode-covering) punishes missing mass more than extra mass. Minimizing $D_{\mathrm{KL}}(Q \Vert P)$ (mode-seeking) does the opposite.
  3. Scaling with units: Choose log base consistent with your application (nats for calculus-friendly gradients, bits for coding interpretations).
  4. Numerical stability: Clip probabilities away from 0, use log-sum-exp tricks, and compute in log-space when possible.
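Pitfalls 1 and 4 combine naturally in practice: smooth $Q$ before taking logs. A minimal sketch of additive smoothing (the function name and `eps` choice are my own):

```python
import math

def smoothed_kl(p, q, eps=1e-9):
    """D_KL(P || Q) with additive smoothing on Q, so Q(x) = 0 where
    P(x) > 0 yields a large but finite value instead of infinity."""
    q = [qi + eps for qi in q]
    total = sum(q)
    q = [qi / total for qi in q]  # renormalize after smoothing
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5, 0.0]
Q = [1.0, 0.0, 0.0]  # Q assigns zero mass to an outcome P can produce

# Unsmoothed KL would be infinite; the smoothed value is finite (and large).
print(math.isfinite(smoothed_kl(P, Q)))  # True
```

The trade-off: smoothing biases the estimate, so keep `eps` small relative to the true probabilities involved.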

Key Takeaways

  • KL divergence is an expected extra surprise or extra codelength when using the wrong distribution.
  • It is non-negative and asymmetric; zero only when the distributions match exactly.
  • Many training objectives secretly minimize a KL, making it the connective tissue between likelihood, variational methods, distillation, and stable policy updates.
