Assume there are two hypotheses $H_1$ and $H_2$, r.v. $X$ ranged in alphabets $\{a_1,\ldots\,a_k\}$. Under hypothesis $H_i$, $X$ has pdf $p(X=a_j|H_i)=p_i(a_j)$. According to Law of Total Probability, we have:

$$ p(H_i|a_k) = \frac{p(H_i)p_i(a_k)}{p_1(a_k)p(H_1)+p_2(a_k)p(H_2)} $$The formula can be transformed into:

$$ \log \frac{p_2(a_k)}{p_1(a_k)} = \log \frac{p(H_2|a_k)}{p(H_1|a_k)} - \log \frac{p(H_2)}{p(H_1)} $$which implies that, $\log \frac{p_2(a_k)}{p_1(a_k)}$ equals the difference of log likelihood ratio before and after conditioning $X=a_k$. We define $\log \frac{p_2(a_k)}{p_1(a_k)}$ be the discrimination information for $H_2$ over $H_1$, when $X=a_k$. The expectation of discrimination information is KL divergence, denoted as:

$$D_{KL}(P_2||P_1) = \sum_k p_2(a_k) \log \frac{p_2(a_k)}{p_1(a_k)} $$which sometimes denoted as $I(p2,p1;X)$, or simply $I(p2,p1)$ if without ambiguity.

KL Divergence can be interpreted as a measure of expected information for $X$ gained after distribution shifted from $p_1$ to $p_2$, where $p_1$ and $p_2$ regarded as prior and post-prior distributions.

Further we may define the (symmetrised) divergence between $p_2$ and $p_1$:

$$J(p_2,p_1;X)=D_{KL}(p_2||p_1) + D_{KL}(p_1||p_2) = J(p_1,p_2;X)$$Divergence $J(p_2,p_1)$ is a measure of difference between two distributions satisfying

- $J(p_1,p_2) \geq 0$ and equality holds iff $p_1=p_2$;
- $J(p_1,p_2) = J(p_2,p_1)$.

but not triangle inequality, and thus $J(\cdot,\cdot)$ is not a distance function.

## Generalization

We may generalize the definition to continuous variables as:

$$D_{KL}(p_2||p_1) = \int p_2(x) \log \frac{p_2(x)}{p_1(x)} dx$$or multivariate cases:

$$D_{KL}(p_2(\vec{x})||p_1(\vec{x})) = \int p_2(\vec{x}) \log \frac{p_2(\vec{x})}{p_1(\vec{x})} d\vec{x}$$or further, conditional cases:

$$D_{KL}(p_2(X|Y)||p_1(X|Y)) = \int p_2(y) p_2(x|y) \log \frac{p_2(x|y)}{p_1(x|y)} dxdy$$## Relation to Shannon Entropy

For a discrete variable $X$, with distribution $p$, we have:

$$H(X)=\log n - D_{KL}(p||v)$$where $n$ is number of values $X$ can take on, and $v$ is the uniform distribution on those values. The lemma suggests that $D_{KL}(p||v)$ measures the information difference between $p$ and a totally chaotic “bottom distribution”. From the perspective of transmission, KL divergence equals the length of additional bits required to encode a message, when the distribution of source alphabets shifts from $p$ to $v$.

For a continuous variable $X$ ranged within a finite set of volume $L$, we have:

$$H(X) = \log L - D_{KL}(p||v)$$where $v$ is a uniform distribution over the range of $X$. The interpretation is similar to discrete cases.

## Relation to Mutual Information

We can observe that

$$I(X;Y)=D_{KL}(P(XY)||P(X)P(Y))$$which suggests mutual information $I(X;Y)$ is the information difference when distribution of $(X,Y)$ shifted from $P(X)P(Y)$ to $P(X,Y)$, i.e., from independent to dependent.

## Additional Properties of KL Divergence

**Addictivity** Assume $X,Y$ are independent, and $P(x,y)=P_1(x)P_2(y)$, $Q$ likewisely. We have

**Convexity** Assume $p_1,p_2,q_1,q_2,p,q$ are distributions, $0 \leq \lambda \leq 1$. We have

which can be derived from $\log x \leq x - 1$ when $0 < x < 1$. Similarly we have

$$D_{KL}(p || \lambda q_1 + (1 - \lambda) q_2) \leq \lambda D_{KL}(p||q_1) + (1 - \lambda) D_{KL}(p||q_2)$$which can be derived from the convexity of $\log(\cdot)$. Further we have

$$D_{KL}(\lambda p_1 + (1 - \lambda) p_2 || \lambda q_1 + (1 - \lambda) q_2) \leq \lambda D_{KL}(p_1 || q_1) + (1 - \lambda) D_{KL}(p_2|| q_2)$$which can be proved using the Log Sum Inequality.

**Invariance** Assume $U=g(X)$, then $D_{KL}(p(X)||q(X))=D_{KL}(p(U)||q(U))$, i.e., the form of KL divergence remains invariant under transforms. Mutual information, however, does not have such property.

## OOPS!

A comment box should be right here...But it was gone due to network issues :-(If you want to leave comments, make sure you have access to disqus.com.