Given a discrete r.v. $X$, where $X$ ranged in $\{a_1, \ldots, a_n\}$, $\mathbb{P}(X=a_k)=p_k$. Entropy $H(X)$ is defined as:

$$H(X)= - \sum_k p_k \log p_k$$When regarded as a function of $\{p_k\}$, entropy satisfies the following properties:

1. $H(p_1,\ldots,p_n)$ is continuous, and non-negative;
2. $H(p_1,\ldots,p_n)$ is convex w.r.t. $(p_1,\ldots,p_n)$;
3. $H(p_1,\ldots,p_n)$ has a unique maxima $(\frac{1}{n},\ldots,\frac{1}{n})$;
4. $H(n):=H(\frac{1}{n},\ldots,\frac{1}{n})$ increases along with $n$;
5. $H(p_1,\ldots,p_n)=H(p_1+\ldots+p_k,p_{k+1},\ldots,p_n)+(p_1+\ldots+p_k)H(p_{k+1}',\ldots,p_n')$.

Property 5 is so-called addictivity. That is, if we observe $X$ in two steps, firstly obtaining a value from $\{\hat{a},a_{k+1},\ldots,a_n\}$ and then another value from $\{a_1,\ldots,a_k\}$ if $\hat{a}$ selected, the entropy of the whole system should be sum of these two subsystems.

Note that a function satisfying property 1, 4, 5 must have a form of $H(\vec{p})= - C \sum_k p_k \log p_k$, which reveals that entropy function is unique.

Entropy measures the uncertainty of a random value. Intuitively, entropy reaches its maximum $\log n$ when all alphabets occur with same probability, and likewise has a minimum of $0$ if $p_k=1$ for some $k$.

Entropy also represents the smallest average length to encode a message. Say we have a message consisting of alphabets $a_1,\ldots,a_n$, occurring with probability $p_1,\ldots,p_n$. Now we want to assign a code (an $N$-ary string) to each alphabet, with no two codes sharing a same prefix. The length of the codes are denoted as $l_1,\ldots,l_n$. Shannon’s source coding theroem states that the average code length $\sum_k p_k l_k$ could not be less than $H(p_1,\ldots,p_n)$ (taking $N$ as logarithm base).

We may generalize the definition to 2D random variables likewisely. Suppose $\mathbb{P}((X,Y)=(a_k,b_j))$ $=p(a_k,b_j)$. We define: $$H(XY)=\sum_{k,j} - p(a_k,b_j) \log p(a_k,b_j)$$. Rewrite the formula as:

\begin{align} H(XY) &= \sum_{k,j} - p(a_k,b_j) \log p(a_k,b_j) \\ &= \sum_k\sum_j - p(a_k,b_j) \log \left[ p(b_j) p(a_k|b_j) \right] \\ &= \sum_j -p(b_j) \log p(b_j) - \sum_j p(b_j) \sum_k p(a_k|b_j) \log p(a_k|b_j) \\ &= H(Y) - \sum_j p(b_j) H(X|b_j) \\ &= H(Y) + H(X|Y) \\ \end{align}where $H(X|Y)$ is weighted average of $H(X|\cdot)$, called conditional entropy. Symmetrically, we have $H(X)+H(Y|X)=H(Y)+H(X|Y)$.

Further we have $H(XY) \leq H(X)+H(Y)$, since:

\begin{align} H(XY)-H(X)-H(Y) &= \sum_{k,j} p(a_k,b_j) \log \frac{p(a_k)p(b_j)}{p(a_k,b_j)} \\ &\leq \sum_{k,j} p(a_k,b_j) \left[ \frac{p(a_k)p(b_j)}{p(a_k,b_j)} - 1 \right] \\ &=0 \end{align}the equality holds $\iff X, Y$ are independent.

In summary, we have:

\begin{align} H(XY) &\leq H(X)+H(Y) \\ H(Y|X) &\leq H(Y) \\ H(X|Y) &\leq H(X) \\ \end{align}The latter two are intuitive — entropy (uncertainty) decreases if provided some prior.

## Discrete Mutual Information

The mutual information of $X,Y$ is defined as:

\begin{align} I(X|Y) &= H(X)-H(X|Y) \\ &= - \sum_k p(a_k) \log p(a_k) + \sum_j \sum_k p(a_k,b_j) \log p(a_k|b_j) \\ &= \sum_{j,k} p(a_k,b_j) \log \frac{p(a_k, b_j)}{p(a_k)p(b_j)} \end{align}Intuitively, $I(X;Y)$ is the reduced information after prior $Y$ provided. It’s worth noting that $I(X;Y)=I(Y;X)$ and thus $I(\cdot;\cdot)$ is symmetric.

We can define conditional mutual information likewisely:

\begin{align} I(X;Y|Z) &= \sum_i p(z_i) I(X|z_i;Y|z_i) \\ &= \sum_i p(z_i) \sum_{k,j} p(a_k,b_j|z_i) \log \frac{p(a_k, b_j|z_i)}{p(a_k|z_i)p(b_j|z_i)} \\ &= \sum_{i,j,k} p(a_k,b_j,z_i) \log \frac{p(a_k, b_j,z_i)p(z_i)}{p(a_k,z_i)p(b_j,z_i)} \end{align}

Some facts can be derived using the definition of entropy and mutual information:

\begin{align} I(X;Y) &= H(X) - H(X|Y) \\ &= H(X) + H(Y) - H(XY) \\ I(X;YZ) &= H(X) + H(YZ) - H(XYZ) \\ I(X;Y|Z) &= H(X|Z) - H(X|YZ) \\ &= H(XZ) + H(YZ) - H(Z) - H(XYZ) \end{align}

We can further define mutual information among three (or more) variables as:

\begin{align} I(X;Y;Z) &= I(X;Y)-I(X;Y|Z) \\ &= H(X)+H(Y)+H(Z)-H(XY)-H(YZ)-H(XZ)+H(XYZ) \\ \end{align}

We may observse some similarity between the above formula and Inclusion-Exclusion Principle. In fact, we can build a formal correspondence between entropy and set measure, as:

$$\begin{matrix} &H(X)=\mu(A) &H(XY)=\mu(A \cup B) \\ &H(X|Y)=\mu(A-B) &I(X;Y)=\mu(A \cap B) \\ &H(X;Y|Z)=\mu((A \cap B)-C) &I(X;Y;Z)=\mu(A \cap B \cap C) \end{matrix}$$where $\mu$ is some set measure. The correspondence may be used as a helper for memorizing.

## Continuous Cases

We first try to generalize entropy to continuous r.v.. Suppose $X$ has pdf $p(x)$, and $\pi=\{x_i\}_{-\infty}^\infty$ is a partition of $\mathbb{R}$, $|\pi|=\Delta x$. We may resemble discrete cases and write down the formula of “entropy”:

$$\sum_i - p(x_i) \log (p(x_i) \Delta x)$$Taking $\Delta x \rightarrow 0$, we have:

\begin{align} &\lim_{\Delta x \rightarrow 0}\sum_i - p(x_i) \log (p(x_i) \Delta x) \\ =&\int -p(x) \log p(x) dx - \lim_{\Delta x \rightarrow 0}\sum_i - p(x_i) \log \Delta x \end{align}In practice, we omit the second term (since it’s infinite) and define the first one as differential entropy:

$$h(X):=\int_\mathbb{R} -p(x) \log p(x) dx$$We can define $h(X|Y)$ and $I(X;Y)$ likewisely.