Assume there are two hypotheses $H_1$ and $H_2$, and a r.v. $X$ ranging over an alphabet $\mathcal{X}$. Under hypothesis $H_i$, $X$ has pdf $f_i(x) = \Pr[X = x \mid H_i]$. According to the Law of Total Probability (combined with Bayes' theorem), we have:

$$\Pr[H_i \mid X = x] = \frac{f_i(x)\,\Pr[H_i]}{f_1(x)\,\Pr[H_1] + f_2(x)\,\Pr[H_2]}, \quad i = 1, 2.$$
The formula can be transformed into:

$$\log\frac{f_1(x)}{f_2(x)} = \log\frac{\Pr[H_1 \mid X = x]}{\Pr[H_2 \mid X = x]} - \log\frac{\Pr[H_1]}{\Pr[H_2]},$$
which implies that $\log\frac{f_1(x)}{f_2(x)}$ equals the difference of the log likelihood ratio of the two hypotheses before and after conditioning on $X = x$. We define $\log\frac{f_1(x)}{f_2(x)}$ to be the discrimination information for $H_1$ over $H_2$ when $X = x$. The expectation of the discrimination information (taken under $f_1$) is the KL divergence, denoted as:

$$D(f_1 \,\|\, f_2) = \sum_{x \in \mathcal{X}} f_1(x) \log\frac{f_1(x)}{f_2(x)},$$
which is sometimes denoted as $D_{\mathrm{KL}}(f_1 \,\|\, f_2)$, or simply $D$ if there is no ambiguity.
KL divergence can be interpreted as a measure of the expected information gained after the distribution shifts from $f_2$ to $f_1$, where $f_2$ and $f_1$ are regarded as the prior and posterior distributions respectively.
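As a quick illustration of the discrete definition above, here is a minimal NumPy sketch; the two distributions are made up for the example:

```python
import numpy as np

def kl_divergence(f1, f2):
    """D(f1 || f2) = sum_x f1(x) * log(f1(x) / f2(x)).

    Natural log is used, so the result is in nats (use log2 for bits).
    Terms with f1(x) == 0 contribute 0 by the usual convention.
    """
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    mask = f1 > 0
    return float(np.sum(f1[mask] * np.log(f1[mask] / f2[mask])))

f1 = [0.5, 0.3, 0.2]   # hypothetical pdf under H1
f2 = [0.2, 0.4, 0.4]   # hypothetical pdf under H2
print(kl_divergence(f1, f2))   # strictly positive since f1 != f2
print(kl_divergence(f1, f1))   # 0.0: no information is gained
```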
Further, we may define the (symmetrised) divergence between $f_1$ and $f_2$:

$$J(f_1, f_2) = D(f_1 \,\|\, f_2) + D(f_2 \,\|\, f_1).$$
The divergence is a measure of the difference between two distributions, satisfying
- $J(f_1, f_2) \ge 0$, and equality holds iff $f_1 = f_2$;
- $J(f_1, f_2) = J(f_2, f_1)$;

but not the triangle inequality, and thus it is not a distance function.
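Numerically, SciPy's `entropy(p, q)` computes $D(p \,\|\, q)$ directly, so the asymmetry of $D$ and the symmetry of $J$ are easy to check (the distributions are again made up):

```python
import numpy as np
from scipy.stats import entropy   # entropy(p, q) computes D(p || q)

f1 = np.array([0.5, 0.3, 0.2])    # made-up example distributions
f2 = np.array([0.2, 0.4, 0.4])

d12, d21 = entropy(f1, f2), entropy(f2, f1)
print(d12, d21)       # generally different: D itself is asymmetric
print(d12 + d21)      # J(f1, f2) = D(f1||f2) + D(f2||f1), symmetric by construction
```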
#Generalization
We may generalize the definition to continuous variables as:

$$D(f_1 \,\|\, f_2) = \int f_1(x) \log\frac{f_1(x)}{f_2(x)} \,\mathrm{d}x,$$

or multivariate cases:

$$D(f_1 \,\|\, f_2) = \int \cdots \int f_1(x_1, \dots, x_n) \log\frac{f_1(x_1, \dots, x_n)}{f_2(x_1, \dots, x_n)} \,\mathrm{d}x_1 \cdots \mathrm{d}x_n,$$

or further, conditional cases:

$$D\big(f_1(y \mid x) \,\|\, f_2(y \mid x)\big) = \int\!\!\int f_1(x, y) \log\frac{f_1(y \mid x)}{f_2(y \mid x)} \,\mathrm{d}x \,\mathrm{d}y.$$
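As a sanity check on the continuous form, the sketch below (with arbitrarily chosen parameters) integrates the definition numerically for two univariate Gaussians and compares it against the well-known closed-form expression:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

mu1, s1 = 0.0, 1.0   # made-up parameters of f1 = N(mu1, s1^2)
mu2, s2 = 1.0, 2.0   # made-up parameters of f2 = N(mu2, s2^2)

def integrand(x):
    f1, f2 = norm.pdf(x, mu1, s1), norm.pdf(x, mu2, s2)
    return f1 * np.log(f1 / f2)

numeric, _ = integrate.quad(integrand, -30, 30)   # D(f1 || f2) by quadrature
closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(numeric, closed)   # the two values should agree closely
```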
#Relation to Shannon Entropy
For a discrete variable $X$ with distribution $p$, we have:

$$H(X) = \log n - D(p \,\|\, u),$$

where $n$ is the number of values $X$ can take on, and $u$ is the uniform distribution on those values. The lemma suggests that $D(p \,\|\, u)$ measures the information difference between $p$ and a totally chaotic “bottom distribution”. From the perspective of transmission, the KL divergence $D(f_1 \,\|\, f_2)$ equals the expected number of additional bits required to encode a message when the distribution of the source alphabet shifts from $f_2$ to $f_1$, i.e., when a code designed for $f_2$ is used while the symbols actually follow $f_1$.
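A quick numerical check of the lemma, using natural logarithms so that both sides are in nats (the distribution $p$ is arbitrary):

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.25, 0.15, 0.1])   # made-up distribution over n = 4 symbols
n = len(p)
u = np.full(n, 1.0 / n)                # the uniform "bottom distribution"

print(entropy(p))                      # H(X)
print(np.log(n) - entropy(p, u))       # log n - D(p || u): identical value
```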
For a continuous variable $X$ ranging within a finite set of volume $V$, we have:

$$h(X) = \log V - D(f \,\|\, u),$$

where $f$ is the pdf of $X$, $h(X)$ is its differential entropy, and $u$ is the uniform distribution (with density $1/V$) over the range of $X$. The interpretation is similar to the discrete case.
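The same identity can be probed numerically in the continuous case; the sketch below uses a made-up Beta(2, 2) density on $[0, 1]$, so $V = 1$ and $u \equiv 1$:

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

f = lambda x: beta.pdf(x, 2, 2)   # made-up density supported on [0, 1], so V = 1

h, _ = integrate.quad(lambda x: -f(x) * np.log(f(x)), 0, 1)       # differential entropy h(X)
d, _ = integrate.quad(lambda x: f(x) * np.log(f(x) / 1.0), 0, 1)  # D(f || u), u = 1 on [0, 1]
print(h, np.log(1.0) - d)   # both sides of h(X) = log V - D(f || u)
```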
We can observe that

$$I(X; Y) = D\big(p(x, y) \,\|\, p(x)\,p(y)\big),$$

which suggests that mutual information is the information difference when the distribution of $(X, Y)$ shifts from $p(x)\,p(y)$ to $p(x, y)$, i.e., from independent to dependent.
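The sketch below (with a made-up joint table) verifies that this KL form of mutual information agrees with the usual entropy decomposition $I(X; Y) = H(X) + H(Y) - H(X, Y)$:

```python
import numpy as np
from scipy.stats import entropy

pxy = np.array([[0.10, 0.20, 0.10],    # made-up joint distribution p(x, y)
                [0.25, 0.05, 0.30]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginals of X and Y
prod = np.outer(px, py)                     # independent product p(x) p(y)

print(entropy(pxy.ravel(), prod.ravel()))                # I(X;Y) as D(p(x,y) || p(x)p(y))
print(entropy(px) + entropy(py) - entropy(pxy.ravel()))  # H(X) + H(Y) - H(X,Y)
```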
#Additional Properties of KL Divergence
**Additivity** Assume $X$ and $Y$ are independent under both hypotheses, i.e., $f_1(x, y) = f_1(x)\,f_1(y)$, and $f_2(x, y) = f_2(x)\,f_2(y)$ likewise. We have

$$D\big(f_1(x, y) \,\|\, f_2(x, y)\big) = D\big(f_1(x) \,\|\, f_2(x)\big) + D\big(f_1(y) \,\|\, f_2(y)\big).$$
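A quick numerical check of additivity with made-up marginals, building the joint distributions as outer products:

```python
import numpy as np
from scipy.stats import entropy

f1x, f2x = np.array([0.6, 0.4]), np.array([0.3, 0.7])            # made-up marginals of X
f1y, f2y = np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.4, 0.2])  # made-up marginals of Y

f1xy = np.outer(f1x, f1y)   # f1(x, y) = f1(x) f1(y) under independence
f2xy = np.outer(f2x, f2y)   # f2(x, y) = f2(x) f2(y)

print(entropy(f1xy.ravel(), f2xy.ravel()))      # D(f1(x,y) || f2(x,y))
print(entropy(f1x, f2x) + entropy(f1y, f2y))    # sum of the marginal divergences
```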
**Convexity** Assume $f_1, f_2, g_1, g_2$ are distributions, and $0 \le \lambda \le 1$. We have

$$D\big(\lambda f_1 + (1 - \lambda) f_2 \,\|\, g_1\big) \le \lambda D(f_1 \,\|\, g_1) + (1 - \lambda) D(f_2 \,\|\, g_1),$$

which can be derived from the convexity of $x \log x$ when $x \ge 0$. Similarly we have

$$D\big(f_1 \,\|\, \lambda g_1 + (1 - \lambda) g_2\big) \le \lambda D(f_1 \,\|\, g_1) + (1 - \lambda) D(f_1 \,\|\, g_2),$$

which can be derived from the convexity of $-\log x$. Further we have

$$D\big(\lambda f_1 + (1 - \lambda) f_2 \,\|\, \lambda g_1 + (1 - \lambda) g_2\big) \le \lambda D(f_1 \,\|\, g_1) + (1 - \lambda) D(f_2 \,\|\, g_2),$$

which can be proved using the Log Sum Inequality.
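Joint convexity can be probed numerically by drawing random distributions and checking the inequality; a small sketch:

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)

def random_dist(n):
    p = rng.random(n)
    return p / p.sum()

n, lam = 5, 0.3
f1, f2, g1, g2 = (random_dist(n) for _ in range(4))

lhs = entropy(lam * f1 + (1 - lam) * f2, lam * g1 + (1 - lam) * g2)
rhs = lam * entropy(f1, g1) + (1 - lam) * entropy(f2, g2)
print(lhs <= rhs)   # joint convexity: should always print True
```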
**Invariance** Assume $Y = T(X)$ for an invertible transform $T$, and let $g_1, g_2$ be the pdfs of $Y$ under the two hypotheses; then $D(g_1 \,\|\, g_2) = D(f_1 \,\|\, f_2)$, i.e., the form of KL divergence remains invariant under (invertible) transforms. Mutual information, however, does not have such a property in general.
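For instance, for Gaussians under an affine (hence invertible) transform $Y = aX + b$, the closed-form divergence is unchanged; a minimal sketch with made-up parameters:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form D(N(mu1, s1^2) || N(mu2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.5   # made-up parameters under H1 and H2
a, b = 3.0, -2.0                        # an arbitrary invertible affine transform (a > 0)

# If X ~ N(mu, s^2), then Y = aX + b ~ N(a*mu + b, (a*s)^2): the divergence is unchanged.
print(kl_gauss(mu1, s1, mu2, s2))
print(kl_gauss(a * mu1 + b, a * s1, a * mu2 + b, a * s2))
```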
Author: hsfzxjy.
Link: https://i.hsfzxjy.site/information-theory-kl-divergence/.
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.