Information Theory: KL Divergence

Assume there are two hypotheses $H_1$ and $H_2$, and a r.v. $X$ ranging over an alphabet $\mathscr{X}$. Under hypothesis $H_i$, $X$ has pdf $f_i(x)$. According to Bayes' rule and the Law of Total Probability, we have:

$$\Pr\{H_i \mid X = x\} = \frac{f_i(x)\,\Pr\{H_i\}}{f_1(x)\,\Pr\{H_1\} + f_2(x)\,\Pr\{H_2\}}, \qquad i = 1, 2.$$

Dividing the $i = 1$ case by the $i = 2$ case and taking logarithms, the formula can be transformed into:

$$\log\frac{f_1(x)}{f_2(x)} = \log\frac{\Pr\{H_1 \mid X = x\}}{\Pr\{H_2 \mid X = x\}} - \log\frac{\Pr\{H_1\}}{\Pr\{H_2\}},$$

which implies that $\log\frac{f_1(x)}{f_2(x)}$ equals the difference between the log-likelihood ratios of $H_1$ over $H_2$ after and before conditioning on $X = x$. We define $\log\frac{f_1(x)}{f_2(x)}$ to be the discrimination information for $H_1$ over $H_2$ when $X = x$. The expectation of the discrimination information under $f_1$ is the KL divergence, denoted as:

$$D(f_1 \| f_2) = \sum_{x \in \mathscr{X}} f_1(x) \log\frac{f_1(x)}{f_2(x)},$$

which is sometimes denoted as $D_{\mathrm{KL}}(f_1 \| f_2)$ or $\mathrm{KL}(f_1 \| f_2)$, or simply $D$ if there is no ambiguity.
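
As a quick numerical illustration of the definition above, here is a minimal Python sketch (the helper name `kl_divergence` and the example pmfs are arbitrary, not from the text); it uses natural logarithms, so the result is in nats.

```python
import numpy as np

def kl_divergence(f1, f2):
    """Discrete KL divergence D(f1 || f2) in nats.

    f1, f2: probability vectors over the same alphabet.
    Terms with f1(x) == 0 contribute 0 by convention.
    """
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    mask = f1 > 0
    return float(np.sum(f1[mask] * np.log(f1[mask] / f2[mask])))

# Two arbitrary pmfs over a 3-letter alphabet.
f1 = [0.5, 0.3, 0.2]
f2 = [0.2, 0.5, 0.3]
print(kl_divergence(f1, f2))  # strictly positive, since f1 != f2
print(kl_divergence(f1, f1))  # 0.0
```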

KL divergence can be interpreted as a measure of the expected information, for discriminating $H_1$ over $H_2$, gained after the distribution of $X$ shifts from $f_2$ to $f_1$, where $f_2$ and $f_1$ are regarded as the prior and posterior distributions respectively.

Further, we may define the (symmetrised) divergence between $f_1$ and $f_2$:

$$J(f_1, f_2) = D(f_1 \| f_2) + D(f_2 \| f_1).$$

The divergence is a measure of the difference between two distributions, satisfying

  • $J(f_1, f_2) \ge 0$, and equality holds iff $f_1 = f_2$;
  • $J(f_1, f_2) = J(f_2, f_1)$,

but not the triangle inequality, and thus it is not a distance function.
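
The failure of the triangle inequality is easy to exhibit numerically. The following sketch (with three arbitrarily chosen Bernoulli pmfs, not taken from the text) checks symmetry and non-negativity of $J$, and shows a case where $J(p, r) > J(p, q) + J(q, r)$.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def J(p, q):
    """Symmetrised divergence J(p, q) = D(p || q) + D(q || p)."""
    return kl(p, q) + kl(q, p)

# Three Bernoulli pmfs, written as [P(X=0), P(X=1)].
p, q, r = [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]

print(J(p, q), J(q, p))            # equal: symmetry
print(J(p, p))                     # 0.0: equality iff the pmfs coincide
print(J(p, r), J(p, q) + J(q, r))  # ~3.52 > ~1.76: triangle inequality fails
```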

#Generalization

We may generalize the definition to continuous variables as:

$$D(f_1 \| f_2) = \int_{\mathscr{X}} f_1(x) \log\frac{f_1(x)}{f_2(x)}\,\mathrm{d}x,$$

or to multivariate cases:

$$D(f_1 \| f_2) = \int f_1(x_1, \ldots, x_n) \log\frac{f_1(x_1, \ldots, x_n)}{f_2(x_1, \ldots, x_n)}\,\mathrm{d}x_1 \cdots \mathrm{d}x_n,$$

or further, to conditional cases:

$$D(f_1(y \mid x) \,\|\, f_2(y \mid x)) = \int f_1(x) \int f_1(y \mid x) \log\frac{f_1(y \mid x)}{f_2(y \mid x)}\,\mathrm{d}y\,\mathrm{d}x.$$
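
For continuous variables the integral can be evaluated numerically. As a sanity check, the sketch below (assuming SciPy is available; the parameters are arbitrary) integrates $f_1 \log(f_1 / f_2)$ for two Gaussians and compares the result with the well-known closed form $\log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# D(f1 || f2) between two univariate Gaussians: once by numerically
# integrating f1(x) * log(f1(x) / f2(x)), once via the closed form.
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

def integrand(x):
    f1, f2 = norm.pdf(x, mu1, s1), norm.pdf(x, mu2, s2)
    return f1 * np.log(f1 / f2)

numeric, _ = quad(integrand, -20, 20)
closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(numeric, closed)  # both ~0.4431 nats
```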

#Relation to Shannon Entropy

For a discrete variable $X$ with distribution $p$, we have:

$$D(p \| u) = \sum_{x} p(x) \log\frac{p(x)}{1/N} = \log N - H(X),$$

where $N$ is the number of values $X$ can take on, and $u$ is the uniform distribution on those values. The lemma suggests that $D(p \| u)$ measures the information difference between $p$ and a totally chaotic “bottom distribution”. From the perspective of transmission, the KL divergence equals the number of additional bits (taking logarithms to base 2) required to encode a message when the distribution of the source alphabet shifts from $u$ to $p$ while the code remains designed for $u$.
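
A small sketch verifying the identity $D(p \| u) = \log N - H(X)$ for an arbitrary pmf (in nats here; use base-2 logarithms for bits):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # an arbitrary pmf
N = len(p)
u = np.full(N, 1.0 / N)                  # uniform distribution on the same values

H = -np.sum(p * np.log(p))               # Shannon entropy H(X), in nats
D = np.sum(p * np.log(p / u))            # D(p || u)
print(D, np.log(N) - H)                  # identical
```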

For a continuous variable $X$ ranged within a finite set of volume $V$, we have:

$$D(f \| u) = \int f(x) \log\frac{f(x)}{1/V}\,\mathrm{d}x = \log V - h(X),$$

where $u$ is the uniform distribution over the range of $X$ and $h(X)$ is the differential entropy. The interpretation is similar to the discrete case.

#Relation to Mutual Information

We can observe that

$$I(X; Y) = \sum_{x, y} p(x, y) \log\frac{p(x, y)}{p(x)\,p(y)} = D(p(x, y) \,\|\, p(x)\,p(y)),$$

which suggests that mutual information is the information difference when the distribution of $(X, Y)$ shifts from $p(x)\,p(y)$ to $p(x, y)$, i.e., from independent to dependent.
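
The identity is easy to check numerically; the sketch below uses a hypothetical $2 \times 3$ joint pmf (chosen arbitrarily) and computes $I(X; Y)$ directly as the KL divergence from the product of the marginals.

```python
import numpy as np

# A hypothetical joint pmf p(x, y) over a 2x3 alphabet.
pxy = np.array([[0.10, 0.20, 0.15],
                [0.25, 0.10, 0.20]])
px = pxy.sum(axis=1, keepdims=True)  # marginal of X, shape (2, 1)
py = pxy.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, 3)

# I(X; Y) = D(p(x, y) || p(x) p(y))
I = np.sum(pxy * np.log(pxy / (px * py)))
print(I)  # non-negative; zero iff X and Y are independent
```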

#Additional Properties of KL Divergence

Additivity. Assume $X_1$ and $X_2$ are independent, with $f(x_1, x_2) = f_1(x_1)\,f_2(x_2)$ and $g(x_1, x_2) = g_1(x_1)\,g_2(x_2)$ likewise. We have

$$D(f \| g) = D(f_1 \| g_1) + D(f_2 \| g_2).$$
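
Additivity can be checked numerically by building the joint pmfs as outer products of the marginals (the pmfs below are arbitrary):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Arbitrary marginal pmfs of the two independent components.
f1, g1 = np.array([0.6, 0.4]), np.array([0.3, 0.7])
f2, g2 = np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.4, 0.2])

# Joint pmfs under independence are outer products of the marginals.
f = np.outer(f1, f2).ravel()
g = np.outer(g1, g2).ravel()

print(kl(f, g), kl(f1, g1) + kl(f2, g2))  # equal
```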

Convexity. Assume $p, p_1, p_2, q, q_1, q_2$ are distributions and $0 \le \lambda \le 1$. We have

$$D(\lambda p_1 + (1 - \lambda) p_2 \,\|\, q) \le \lambda D(p_1 \| q) + (1 - \lambda) D(p_2 \| q),$$

which can be derived from the convexity of $t \log t$ when $q$ is held fixed. Similarly we have

$$D(p \,\|\, \lambda q_1 + (1 - \lambda) q_2) \le \lambda D(p \| q_1) + (1 - \lambda) D(p \| q_2),$$

which can be derived from the convexity of $-\log t$. Further we have

$$D(\lambda p_1 + (1 - \lambda) p_2 \,\|\, \lambda q_1 + (1 - \lambda) q_2) \le \lambda D(p_1 \| q_1) + (1 - \lambda) D(p_2 \| q_2),$$

which can be proved using the Log Sum Inequality.
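
The joint convexity (the last inequality) can likewise be spot-checked numerically with arbitrary pmfs and some $\lambda \in [0, 1]$:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Arbitrary pmfs and mixing weight.
p1, p2 = np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.3, 0.6])
q1, q2 = np.array([0.3, 0.3, 0.4]), np.array([0.5, 0.4, 0.1])
lam = 0.3

lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
print(lhs, rhs, lhs <= rhs)  # joint convexity: lhs never exceeds rhs
```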

Invariance. Assume $Y = T(X)$ where $T$ is an invertible transform; then $D(f_Y \| g_Y) = D(f_X \| g_X)$, i.e., the KL divergence remains invariant under such transforms. Differential entropy, however, does not have such a property.
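
For a concrete check of the invariance, one can use the closed-form KL divergence between univariate Gaussians and apply the same invertible affine map $Y = aX + b$ to both distributions (the particular parameters below are arbitrary):

```python
import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form D(N(mu1, s1^2) || N(mu2, s2^2)), in nats."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# KL divergence between two arbitrary Gaussians.
d_before = kl_normal(0.0, 1.0, 1.0, 2.0)

# Apply the same invertible affine map T(x) = a*x + b to both distributions:
# N(mu, s^2) is mapped to N(a*mu + b, (a*s)^2).
a, b = 3.0, 5.0
d_after = kl_normal(a * 0.0 + b, abs(a) * 1.0, a * 1.0 + b, abs(a) * 2.0)

print(d_before, d_after)  # equal: the divergence is unchanged by the transform
```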


Author: hsfzxjy.
Link: https://i.hsfzxjy.site/information-theory-kl-divergence/.
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.
