Mutual Information

In general, the channel itself can add noise, which means that it serves as an additional layer of uncertainty in our transmissions. Consider a channel with input symbols $A=\{a_{1},a_{2},\ldots ,a_{n}\}$ and output symbols $B=\{b_{1},b_{2},\ldots ,b_{m}\}$. Note that the input and output alphabets do not need to have the same number of symbols. Because of the noise in the channel, if we observe the output symbol $b_{j}$, we cannot be sure which $a_{i}$ was the input symbol. We can then characterize the channel as a set of probabilities $\{P\left(a_{i}\mid b_{j}\right)\}$. Let us consider the information we get from observing a symbol $b_{j}$.

Definition

Figure 1: A noisy channel.

Given a probability model of the source, we have an a priori estimate $P\left(a_{i}\right)$ that symbol $a_{i}$ will be sent next. Upon observing $b_{j}$ , we can revise our estimate to $P\left(a_{i}\mid b_{j}\right)$ , as shown in Fig. 1. The change in information, or mutual information, is given by:

I\left(a_{i};b_{j}\right)=\log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)-\log _{2}\left({\frac {1}{P\left(a_{i}\mid b_{j}\right)}}\right)=\log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)

(1)
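As a minimal sketch of Eq. (1), the snippet below evaluates the mutual information of a single input/output pair. The function name and the probability values are assumptions chosen for illustration only.

```python
import math

def pointwise_mutual_information(p_a: float, p_a_given_b: float) -> float:
    """Return I(a; b) = log2(P(a | b) / P(a)) in bits, as in Eq. (1)."""
    if p_a <= 0.0 or p_a_given_b <= 0.0:
        raise ValueError("both probabilities must be positive")
    return math.log2(p_a_given_b / p_a)

# Hypothetical numbers: the prior P(a) = 0.25 is revised to P(a | b) = 0.5.
gain = pointwise_mutual_information(0.25, 0.5)
print(gain)  # 1.0 (observing b made a twice as likely, a gain of one bit)
```

If the observation instead makes the symbol less likely, for example a prior of 0.5 revised to 0.25, the same function returns -1.0, so the value for a single pair can be negative.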

Let us look at a few properties of mutual information. Expressing the equation above in terms of the self-information $I\left(a_{i}\right)=\log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)$:

I\left(a_{i};b_{j}\right)=I\left(a_{i}\right)+\log _{2}\left(P\left(a_{i}\mid b_{j}\right)\right)

(2)

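Since $P\left(a_{i}\mid b_{j}\right)\leq 1$, the logarithm on the right-hand side is never positive. Thus, we can say: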

I\left(a_{i};b_{j}\right)\leq I\left(a_{i}\right)

(3)

Figure 2: An information channel.

This is expected since observing $b_{j}$ can remove at most all of the uncertainty about $a_{i}$. The largest change in information occurs when $a_{i}$ and $b_{j}$ are perfectly correlated, i.e. $P\left(a_{i}\mid b_{j}\right)=1$, in which case $I\left(a_{i};b_{j}\right)=I\left(a_{i}\right)$. Thus, we can think of mutual information as the information conveyed across the channel, as shown in Fig. 2. From Bayes' Theorem, we have the property:

I\left(a_{i};b_{j}\right)=\log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)=\log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)=I\left(b_{j};a_{i}\right)

(4)

Note that if $a_{i}$ and $b_{j}$ are independent, where $P\left(a_{i}\mid b_{j}\right)=P\left(a_{i}\right)$ and $P\left(b_{j}\mid a_{i}\right)=P\left(b_{j}\right)$ , then:

I\left(a_{i};b_{j}\right)=\log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)=\log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)=\log _{2}\left(1\right)=0

(5)
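The symmetry in Eq. (4) and the independent case in Eq. (5) can be checked numerically. The sketch below assumes a small joint distribution table; the tables and helper names are hypothetical and only serve to illustrate the two properties.

```python
import math

# Hypothetical joint distribution P(a_i, b_j): rows index a_i, columns index b_j.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

def marginals(table):
    """Return the row sums P(a_i) and the column sums P(b_j)."""
    return [sum(row) for row in table], [sum(col) for col in zip(*table)]

def info_about_a(table, i, j):
    """log2(P(a_i | b_j) / P(a_i)), the left-hand form of Eq. (4)."""
    p_a, p_b = marginals(table)
    return math.log2((table[i][j] / p_b[j]) / p_a[i])

def info_about_b(table, i, j):
    """log2(P(b_j | a_i) / P(b_j)), the right-hand form of Eq. (4)."""
    p_a, p_b = marginals(table)
    return math.log2((table[i][j] / p_a[i]) / p_b[j])

pairs = [(i, j) for i in range(2) for j in range(2)]
symmetric = all(
    math.isclose(info_about_a(joint, i, j), info_about_b(joint, i, j))
    for i, j in pairs
)

# Independent case: every entry is the product P(a_i) * P(b_j).
independent = [[0.3 * 0.6, 0.3 * 0.4],
               [0.7 * 0.6, 0.7 * 0.4]]
zero_everywhere = all(
    math.isclose(info_about_a(independent, i, j), 0.0, abs_tol=1e-12)
    for i, j in pairs
)
print(symmetric, zero_everywhere)
```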

We can get the average mutual information over all the input symbols as:

I\left(A;b_{j}\right)=\sum _{i=1}^{n}P\left(a_{i}\mid b_{j}\right)\cdot I\left(a_{i};b_{j}\right)=\sum _{i=1}^{n}P\left(a_{i}\mid b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)

(6)

Similarly, for all the output symbols:

I\left(a_{i};B\right)=\sum _{j=1}^{m}P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)

(7)

For both input and output symbols, we get:

{\begin{aligned}I\left(A;B\right)&=\sum _{i=1}^{n}P\left(a_{i}\right)\cdot I\left(a_{i};B\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i}\right)P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i},b_{j}\right)}{P\left(a_{i}\right)\cdot P\left(b_{j}\right)}}\right)\\&=I\left(B;A\right)\end{aligned}}

(8)
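To make Eq. (8) concrete, the following sketch computes $I\left(A;B\right)$ from the input probabilities and the forward probabilities $P\left(b_{j}\mid a_{i}\right)$, using the second line of the equation. The binary symmetric channel and its 10% error rate are assumed purely as an example; they do not come from the text above.

```python
import math

def mutual_information(p_a, channel):
    """Average mutual information I(A;B) in bits.

    p_a[i] is P(a_i) and channel[i][j] is P(b_j | a_i).
    """
    n, m = len(p_a), len(channel[0])
    # Output marginal: P(b_j) = sum_i P(a_i) * P(b_j | a_i)
    p_b = [sum(p_a[i] * channel[i][j] for i in range(n)) for j in range(m)]
    total = 0.0
    for i in range(n):
        for j in range(m):
            p_joint = p_a[i] * channel[i][j]
            if p_joint > 0.0:  # zero-probability pairs contribute nothing
                total += p_joint * math.log2(channel[i][j] / p_b[j])
    return total

uniform = [0.5, 0.5]
noisy = [[0.9, 0.1], [0.1, 0.9]]      # assumed example: 10% of symbols flipped
noiseless = [[1.0, 0.0], [0.0, 1.0]]  # output always equals input
useless = [[0.5, 0.5], [0.5, 0.5]]    # output independent of input

print(round(mutual_information(uniform, noisy), 3))  # 0.531
print(mutual_information(uniform, noiseless))        # 1.0
print(mutual_information(uniform, useless))          # 0.0
```

The three channels span the range of behaviour: the noiseless channel conveys the full bit of the source, the useless channel conveys nothing, and the noisy channel falls in between.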

Non-Negativity of Mutual Information

To show the non-negativity of mutual information, let us use Jensen's Inequality, which states that for a convex function $f\left(x\right)$, with $\langle \cdot \rangle$ denoting the expected value:

\langle f\left(x\right)\rangle \geq f\left(\langle x\rangle \right)

(9)

Using the fact that $f\left(x\right)=-\log _{2}\left(x\right)$ is convex, and applying this to our expression for mutual information, we get:

{\begin{aligned}I\left(A;B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i},b_{j}\right)}{P\left(a_{i}\right)\cdot P\left(b_{j}\right)}}\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)\\&=\left\langle -\log _{2}\left({\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)\right\rangle \\&\geq -\log _{2}\left(\left\langle {\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right\rangle \right)=-\log _{2}\left(\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot {\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)\\&\geq -\log _{2}\left(\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i}\right)\cdot P\left(b_{j}\right)\right)=-\log _{2}\left(\sum _{i=1}^{n}P\left(a_{i}\right)\sum _{j=1}^{m}P\left(b_{j}\right)\right)=-\log _{2}\left(1\right)\\&\geq 0\\\end{aligned}}

(10)

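Note that $I\left(A;B\right)=0$ when $A$ and $B$ are independent, since then $P\left(a_{i},b_{j}\right)=P\left(a_{i}\right)\cdot P\left(b_{j}\right)$ and every logarithm in the sum is $\log _{2}\left(1\right)=0$.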
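The non-negativity result can also be spot-checked numerically. The sketch below draws random joint distributions and evaluates the joint form of $I\left(A;B\right)$ for each; the table sizes, the seed and the number of trials are arbitrary assumptions, and such a check illustrates the inequality rather than proving it.

```python
import math
import random

def mutual_information_from_joint(joint):
    """I(A;B) in bits, where joint[i][j] = P(a_i, b_j)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (p_a[i] * p_b[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0.0  # zero-probability pairs contribute nothing
    )

def random_joint(n, m, rng):
    """Random n-by-m table of non-negative weights, normalised to sum to 1."""
    weights = [[rng.random() for _ in range(m)] for _ in range(n)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

rng = random.Random(0)  # fixed seed so the run is repeatable
smallest = min(
    mutual_information_from_joint(random_joint(3, 4, rng)) for _ in range(1000)
)
print(smallest >= -1e-12)  # small tolerance for floating-point rounding
```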

Conditional and Joint Entropy

Consider $A$ and $B$, with their respective entropies:

H\left(A\right)=\sum _{i=1}^{n}P\left(a_{i}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)

(11)

H\left(B\right)=\sum _{j=1}^{m}P\left(b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)

(12)

Conditional Entropy

The conditional entropy is a measure of the average uncertainty about $B$ when $A$ is known, and we can define it as:

{\begin{aligned}H\left(B\mid A\right)&=\sum _{i=1}^{n}P\left(a_{i}\right)\sum _{j=1}^{m}P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\mid a_{i}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i}\right)P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\mid a_{i}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(b_{j},a_{i}\right)\cdot \log _{2}\left({\frac {P\left(a_{i}\right)}{P\left(b_{j},a_{i}\right)}}\right)\\\end{aligned}}

(13)

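Similarly, the average uncertainty about $A$ when $B$ is known, which in general differs from $H\left(B\mid A\right)$, is: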

{\begin{aligned}H\left(A\mid B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(b_{j},a_{i}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\right)}{P\left(b_{j},a_{i}\right)}}\right)\\&\neq H\left(B\mid A\right)\\\end{aligned}}

(14)
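As an illustration of Eqs. (13) and (14), the sketch below evaluates both conditional entropies from a joint distribution, using the last line of each equation. The joint table is a hypothetical example chosen so that the two values differ.

```python
import math

def conditional_entropies(joint):
    """Return (H(B | A), H(A | B)) in bits for joint[i][j] = P(a_i, b_j)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    h_b_given_a = 0.0
    h_a_given_b = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability pairs contribute nothing
                h_b_given_a += p * math.log2(p_a[i] / p)
                h_a_given_b += p * math.log2(p_b[j] / p)
    return h_b_given_a, h_a_given_b

# Hypothetical joint distribution: a_1 always produces b_1,
# while a_2 produces b_1 or b_2 with equal probability.
joint = [[0.5, 0.0],
         [0.25, 0.25]]
h_b_given_a, h_a_given_b = conditional_entropies(joint)
print(round(h_b_given_a, 4), round(h_a_given_b, 4))  # 0.5 0.6887
```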

Joint Entropy

If we extend the definition of entropy to two (or more) random variables, $A$ and $B$ , we can define the joint entropy of $A$ and $B$ as:

H\left(A,B\right)=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i},b_{j}\right)}}\right)

(15)

Expanding the expression for joint entropy and using $P\left(a_{i},b_{j}\right)=P\left(a_{i}\mid b_{j}\right)P\left(b_{j}\right)$, we get:

{\begin{aligned}H\left(A,B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\mid b_{j}\right)P\left(b_{j}\right)}}\right)=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left(P\left(a_{i}\mid b_{j}\right)P\left(b_{j}\right)\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\left(\log _{2}\left(P\left(a_{i}\mid b_{j}\right)\right)+\log _{2}\left(P\left(b_{j}\right)\right)\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left(P\left(a_{i}\mid b_{j}\right)\right)-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left(P\left(b_{j}\right)\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\mid b_{j}\right)}}\right)+\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)+\sum _{j=1}^{m}\left(\sum _{i=1}^{n}P\left(a_{i},b_{j}\right)\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=H\left(A\mid B\right)+\sum _{j=1}^{m}P\left(b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=H\left(A\mid B\right)+H\left(B\right)\end{aligned}}

(16)

If we instead used $P\left(a_{i},b_{j}\right)=P\left(b_{j}\mid a_{i}\right)P\left(a_{i}\right)$ , we would get the alternative expression:

H\left(A,B\right)=H\left(B\mid A\right)+H\left(A\right)

(17)
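Both decompositions of the joint entropy can be verified on a small example. In the sketch below the conditional entropies are computed as weighted averages of per-symbol entropies, i.e. the first line of Eq. (13); the joint table and helper names are assumptions for illustration.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities; zero entries are skipped."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)

# Hypothetical joint distribution P(a_i, b_j): rows index a_i, columns index b_j.
joint = [[0.5, 0.0],
         [0.25, 0.25]]
p_a = [sum(row) for row in joint]
p_b = [sum(col) for col in zip(*joint)]

h_joint = entropy([p for row in joint for p in row])

# H(A | B) = sum_j P(b_j) * H(A | B = b_j)
h_a_given_b = sum(
    p_b[j] * entropy([joint[i][j] / p_b[j] for i in range(len(p_a))])
    for j in range(len(p_b))
    if p_b[j] > 0.0
)
# H(B | A) = sum_i P(a_i) * H(B | A = a_i)
h_b_given_a = sum(
    p_a[i] * entropy([p / p_a[i] for p in joint[i]])
    for i in range(len(p_a))
    if p_a[i] > 0.0
)

print(math.isclose(h_joint, h_a_given_b + entropy(p_b)))  # Eq. (16)
print(math.isclose(h_joint, h_b_given_a + entropy(p_a)))  # Eq. (17)
```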

We can then expand our expression for $I\left(A;B\right)$ as:

{\begin{aligned}I\left(A;B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i},b_{j}\right)}{P\left(a_{i}\right)\cdot P\left(b_{j}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \left(\log _{2}\left(P\left(a_{i},b_{j}\right)\right)-\log _{2}\left(P\left(a_{i}\right)\right)-\log _{2}\left(P\left(b_{j}\right)\right)\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i},b_{j}\right)}}\right)+\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)+\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i},b_{j}\right)}}\right)+\sum _{i=1}^{n}\left(\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)+\sum _{j=1}^{m}\left(\sum _{i=1}^{n}P\left(a_{i},b_{j}\right)\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=-H\left(A,B\right)+H\left(A\right)+H\left(B\right)\\&=H\left(A\right)-H\left(A\mid B\right)\\&=H\left(B\right)-H\left(B\mid A\right)\\\end{aligned}}

(18)
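The entropy forms in Eq. (18) give a convenient way to evaluate $I\left(A;B\right)$. The sketch below assumes a binary symmetric channel, for which every input symbol sees the same error probability, so that $H\left(B\mid A\right)$ is simply the entropy of the error distribution. The parameter names `q` and `eps` and their values are hypothetical.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities; zero entries are skipped."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)

def bsc_joint(q, eps):
    """Joint P(a_i, b_j) for a binary symmetric channel (assumed example).

    q is P(a_1) and eps is the probability that a symbol is flipped.
    """
    return [[q * (1 - eps), q * eps],
            [(1 - q) * eps, (1 - q) * (1 - eps)]]

q, eps = 0.5, 0.25
joint = bsc_joint(q, eps)
p_a = [sum(row) for row in joint]
p_b = [sum(col) for col in zip(*joint)]

h_a = entropy(p_a)
h_b = entropy(p_b)
h_ab = entropy([p for row in joint for p in row])

i_from_joint = h_a + h_b - h_ab              # H(A) + H(B) - H(A,B)
i_from_noise = h_b - entropy([eps, 1 - eps])  # H(B) - H(B | A)
print(round(i_from_joint, 4), round(i_from_noise, 4))  # 0.1887 0.1887
```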

Sources

Tom Carter's notes on Information Theory
Dan Hirschberg's notes on Data Compression
