Difference between revisions of "Mutual Information"

Latest revision as of 16:19, 29 September 2020

In general, the channel itself can add noise. This means that the channel adds an additional layer of uncertainty to our transmissions. Consider a channel with input symbols $A=\{a_{1},a_{2},\ldots ,a_{n}\}$ , and output symbols $B=\{b_{1},b_{2},\ldots ,b_{m}\}$ . Note that the input and output alphabets do not need to have the same number of symbols. Given the noise in the channel, if we observe the output symbol $b_{j}$ , we are not sure which $a_{i}$ was the input symbol.

We can then characterize the discrete channel as a set of probabilities $\{P\left(a_{i}\mid b_{j}\right)\}$ . If the probability distribution of the outputs depends only on the current input, and not on earlier inputs or outputs, then the channel is memoryless. Let us consider the information we get from observing a symbol $b_{j}$ at the output of a discrete memoryless channel (DMC).

Definition

Figure 1: A noisy channel.

Given a probability model of the source, we have an a priori estimate $P\left(a_{i}\right)$ that symbol $a_{i}$ will be sent next. Upon observing $b_{j}$ , we can revise our estimate to $P\left(a_{i}\mid b_{j}\right)$ , as shown in Fig. 1. The change in information, or mutual information, is given by:

I\left(a_{i};b_{j}\right)=\log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)-\log _{2}\left({\frac {1}{P\left(a_{i}\mid b_{j}\right)}}\right)=\log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)

(1)
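
As a small numerical illustration of Eq. (1), the sketch below evaluates the change in information for a single input/output pair. The function name and the probabilities are hypothetical values chosen for this example; they do not come from any particular channel.

```python
import math

def pointwise_mutual_information(p_a: float, p_a_given_b: float) -> float:
    """Eq. (1): log2( P(a_i | b_j) / P(a_i) ), in bits."""
    return math.log2(p_a_given_b / p_a)

# Hypothetical numbers: the prior P(a_i) = 0.25 is revised to P(a_i | b_j) = 0.5.
gain = pointwise_mutual_information(0.25, 0.5)
print(gain)  # 1.0

# If observing b_j makes a_i less likely, the value for that pair is negative.
loss = pointwise_mutual_information(0.5, 0.25)
print(loss)  # -1.0
```

Doubling the probability of $a_{i}$ corresponds to a gain of one bit, and halving it corresponds to a loss of one bit for that particular pair.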

Let's look at a few properties of mutual information. Expressing the equation above in terms of the self-information $I\left(a_{i}\right)=\log _{2}\left(1/P\left(a_{i}\right)\right)$ :

I\left(a_{i};b_{j}\right)=I\left(a_{i}\right)+\log _{2}\left(P\left(a_{i}\mid b_{j}\right)\right)

(2)

Since $P\left(a_{i}\mid b_{j}\right)\leq 1$ , the logarithm in Eq. (2) is never positive, so:

I\left(a_{i};b_{j}\right)\leq I\left(a_{i}\right)

(3)

Figure 2: An information channel.

This is expected: observing $b_{j}$ cannot convey more information about $a_{i}$ than $a_{i}$ itself carries. The largest change in information occurs when $a_{i}$ and $b_{j}$ are perfectly correlated, i.e. $P\left(a_{i}\mid b_{j}\right)=1$ , in which case $I\left(a_{i};b_{j}\right)=I\left(a_{i}\right)$ . Thus, we can think of mutual information as the average information conveyed across the channel, as shown in Fig. 2. From Bayes' Theorem, we have the property:

I\left(a_{i};b_{j}\right)=\log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)=\log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)=I\left(b_{j};a_{i}\right)

(4)
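
The symmetry in Eq. (4) can be checked numerically. The following is a minimal sketch that assumes a hypothetical 2x2 joint distribution (the table `joint` is made up for illustration) and computes the same quantity from the posterior and from the likelihood.

```python
import math

# Hypothetical joint distribution P(a_i, b_j); rows index a_i, columns index b_j.
joint = [[0.3, 0.1],
         [0.2, 0.4]]

p_a = [sum(row) for row in joint]        # marginals P(a_i)
p_b = [sum(col) for col in zip(*joint)]  # marginals P(b_j)

def pmi_via_posterior(i: int, j: int) -> float:
    """log2( P(a_i | b_j) / P(a_i) ), the first form in Eq. (4)."""
    return math.log2((joint[i][j] / p_b[j]) / p_a[i])

def pmi_via_likelihood(i: int, j: int) -> float:
    """log2( P(b_j | a_i) / P(b_j) ), the second form in Eq. (4)."""
    return math.log2((joint[i][j] / p_a[i]) / p_b[j])

for i in range(2):
    for j in range(2):
        print(i, j, pmi_via_posterior(i, j), pmi_via_likelihood(i, j))
```

Both columns of the printout agree up to floating-point rounding, since both forms reduce to $\log _{2}\left(P\left(a_{i},b_{j}\right)/\left(P\left(a_{i}\right)P\left(b_{j}\right)\right)\right)$ .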

Note that if $a_{i}$ and $b_{j}$ are independent, where $P\left(a_{i}\mid b_{j}\right)=P\left(a_{i}\right)$ and $P\left(b_{j}\mid a_{i}\right)=P\left(b_{j}\right)$ , then:

I\left(a_{i};b_{j}\right)=\log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)=\log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)=\log _{2}\left(1\right)=0

(5)

We can get the average mutual information over all the input symbols as:

I\left(A;b_{j}\right)=\sum _{i=1}^{n}P\left(a_{i}\mid b_{j}\right)\cdot I\left(a_{i};b_{j}\right)=\sum _{i=1}^{n}P\left(a_{i}\mid b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)

(6)

Similarly, for all the output symbols:

I\left(a_{i};B\right)=\sum _{j=1}^{m}P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)

(7)

For both input and output symbols, we get:

{\begin{aligned}I\left(A;B\right)&=\sum _{i=1}^{n}P\left(a_{i}\right)\cdot I\left(a_{i};B\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i}\right)P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i},b_{j}\right)}{P\left(a_{i}\right)\cdot P\left(b_{j}\right)}}\right)\\&=I\left(B;A\right)\end{aligned}}

(8)
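
The second line of Eq. (8) translates directly into code. The sketch below assumes the channel is given as a transition matrix `channel[i][j]` holding $P\left(b_{j}\mid a_{i}\right)$ ; the function name and the two example channels are hypothetical and chosen only to show the extreme cases.

```python
import math

def mutual_information(p_a, channel):
    """I(A;B) in bits from P(a_i) and channel[i][j] = P(b_j | a_i), as in Eq. (8)."""
    n, m = len(channel), len(channel[0])
    # Output distribution: P(b_j) = sum_i P(a_i) P(b_j | a_i).
    p_b = [sum(p_a[i] * channel[i][j] for i in range(n)) for j in range(m)]
    total = 0.0
    for i in range(n):
        for j in range(m):
            # Pairs with zero probability contribute nothing to the sum.
            if p_a[i] > 0 and channel[i][j] > 0:
                total += p_a[i] * channel[i][j] * math.log2(channel[i][j] / p_b[j])
    return total

# Hypothetical noiseless binary channel: the output always equals the input.
noiseless = [[1.0, 0.0], [0.0, 1.0]]
print(mutual_information([0.5, 0.5], noiseless))  # 1.0

# Hypothetical useless channel: the output is independent of the input.
useless = [[0.5, 0.5], [0.5, 0.5]]
print(mutual_information([0.5, 0.5], useless))    # 0.0
```

With a uniform binary source, the noiseless channel conveys the full one bit per use, while the channel whose output ignores its input conveys nothing.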

Non-Negativity of Mutual Information

To show the non-negativity of mutual information, let us use Jensen's Inequality, which states that for a convex function, $f\left(x\right)$ :

\langle f\left(x\right)\rangle \geq f\left(\langle x\rangle \right)

(9)

Using the fact that $f\left(x\right)=-\log _{2}\left(x\right)$ is convex, and applying this to our expression for mutual information, we get:

{\begin{aligned}I\left(A;B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i},b_{j}\right)}{P\left(a_{i}\right)\cdot P\left(b_{j}\right)}}\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)\\&=\left\langle -\log _{2}\left({\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)\right\rangle \\&\geq -\log _{2}\left(\left\langle {\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right\rangle \right)=-\log _{2}\left(\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot {\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)\\&\geq -\log _{2}\left(\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i}\right)\cdot P\left(b_{j}\right)\right)=-\log _{2}\left(\sum _{i=1}^{n}P\left(a_{i}\right)\sum _{j=1}^{m}P\left(b_{j}\right)\right)=-\log _{2}\left(1\right)\\&\geq 0\\\end{aligned}}

(10)

Note that equality holds, $I\left(A;B\right)=0$ , if and only if $A$ and $B$ are independent.
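
A quick numerical sanity check of the non-negativity result is sketched below. It is not a proof: it only samples randomly generated joint distributions (the table sizes, the seed, and the helper name are arbitrary choices for this example) and records the smallest value of $I\left(A;B\right)$ seen.

```python
import math
import random

def mutual_information_from_joint(joint):
    """I(A;B) in bits from joint[i][j] = P(a_i, b_j), third line of Eq. (8)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (p_a[i] * p_b[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0  # zero-probability pairs contribute nothing
    )

random.seed(0)  # fixed seed so that the run is repeatable
smallest = float("inf")
for _ in range(1000):
    weights = [[random.random() for _ in range(3)] for _ in range(4)]
    total = sum(sum(row) for row in weights)
    joint = [[w / total for w in row] for row in weights]
    smallest = min(smallest, mutual_information_from_joint(joint))

# An independent pair: the joint distribution is the product of its marginals.
p_a, p_b = [0.2, 0.8], [0.5, 0.3, 0.2]
independent = [[pa * pb for pb in p_b] for pa in p_a]
zero_case = mutual_information_from_joint(independent)
```

Here `smallest` stays non-negative, and `zero_case` is zero up to floating-point rounding.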

Conditional and Joint Entropy

Given $A$ and $B$ , and their entropies:

H\left(A\right)=\sum _{i=1}^{n}P\left(a_{i}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)

(11)

H\left(B\right)=\sum _{j=1}^{m}P\left(b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)

(12)

Conditional Entropy

The conditional entropy is a measure of the average uncertainty about $B$ when $A$ is known, and we can define it as:

{\begin{aligned}H\left(B\mid A\right)&=\sum _{i=1}^{n}P\left(a_{i}\right)\sum _{j=1}^{m}P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\mid a_{i}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i}\right)P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\mid a_{i}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(b_{j},a_{i}\right)\cdot \log _{2}\left({\frac {P\left(a_{i}\right)}{P\left(b_{j},a_{i}\right)}}\right)\\\end{aligned}}

(13)

Similarly, conditioning on $B$ instead gives a quantity that is, in general, different:

{\begin{aligned}H\left(A\mid B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(b_{j},a_{i}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\right)}{P\left(b_{j},a_{i}\right)}}\right)\\&\neq H\left(B\mid A\right)\\\end{aligned}}

(14)
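
The asymmetry between the two conditional entropies can be seen on a small example. The sketch below uses the last line of Eq. (13) and a hypothetical joint distribution; $H\left(A\mid B\right)$ is obtained by transposing the table so that the roles of $A$ and $B$ are swapped.

```python
import math

def conditional_entropy_b_given_a(joint):
    """H(B|A) in bits, last line of Eq. (13); joint[i][j] = P(a_i, b_j)."""
    p_a = [sum(row) for row in joint]
    return sum(
        p * math.log2(p_a[i] / p)
        for i, row in enumerate(joint)
        for p in row
        if p > 0  # zero-probability pairs contribute nothing
    )

def transpose(joint):
    """Swap the roles of A and B in the joint table."""
    return [list(col) for col in zip(*joint)]

# Hypothetical joint distribution: a_1 always produces b_1, a_2 produces either output.
joint = [[0.5, 0.0],
         [0.25, 0.25]]

h_b_given_a = conditional_entropy_b_given_a(joint)             # H(B|A)
h_a_given_b = conditional_entropy_b_given_a(transpose(joint))  # H(A|B), Eq. (14)
print(h_b_given_a, h_a_given_b)
```

For this table $H\left(B\mid A\right)$ is 0.5 bits, while $H\left(A\mid B\right)$ is larger.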

Joint Entropy

If we extend the definition of entropy to two (or more) random variables, $A$ and $B$ , we can define the joint entropy of $A$ and $B$ as:

H\left(A,B\right)=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i},b_{j}\right)}}\right)

(15)

Expanding the expression for the joint entropy, and using $P\left(a_{i},b_{j}\right)=P\left(a_{i}\mid b_{j}\right)P\left(b_{j}\right)$ , we get:

{\begin{aligned}H\left(A,B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\mid b_{j}\right)P\left(b_{j}\right)}}\right)=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left(P\left(a_{i}\mid b_{j}\right)P\left(b_{j}\right)\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\left(\log _{2}\left(P\left(a_{i}\mid b_{j}\right)\right)+\log _{2}\left(P\left(b_{j}\right)\right)\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left(P\left(a_{i}\mid b_{j}\right)\right)-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left(P\left(b_{j}\right)\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\mid b_{j}\right)}}\right)+\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)+\sum _{j=1}^{m}\left(\sum _{i=1}^{n}P\left(a_{i},b_{j}\right)\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=H\left(A\mid B\right)+\sum _{j=1}^{m}P\left(b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=H\left(A\mid B\right)+H\left(B\right)\end{aligned}}

(16)

If we instead used $P\left(a_{i},b_{j}\right)=P\left(b_{j}\mid a_{i}\right)P\left(a_{i}\right)$ , we would get the alternative expression:

H\left(A,B\right)=H\left(B\mid A\right)+H\left(A\right)

(17)
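
Both decompositions of the joint entropy, Eqs. (16) and (17), can be verified numerically. This is a minimal sketch on a hypothetical joint distribution; each term is computed independently from its definition rather than from the identity being checked.

```python
import math

def entropy(dist):
    """Entropy in bits of a list of probabilities, as in Eqs. (11) and (12)."""
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

# Hypothetical joint distribution P(a_i, b_j); rows index a_i, columns index b_j.
joint = [[0.3, 0.1],
         [0.2, 0.4]]
p_a = [sum(row) for row in joint]
p_b = [sum(col) for col in zip(*joint)]

# Joint entropy, Eq. (15): entropy of the flattened table.
h_joint = entropy([p for row in joint for p in row])

# Conditional entropies from Eqs. (13) and (14).
h_b_given_a = sum(p * math.log2(p_a[i] / p)
                  for i, row in enumerate(joint) for p in row if p > 0)
h_a_given_b = sum(p * math.log2(p_b[j] / p)
                  for row in joint for j, p in enumerate(row) if p > 0)

print(h_joint, h_a_given_b + entropy(p_b), h_b_given_a + entropy(p_a))
```

The three printed values coincide up to floating-point rounding.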

We can then expand our expression for $I\left(A;B\right)$ as:

{\begin{aligned}I\left(A;B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i},b_{j}\right)}{P\left(a_{i}\right)\cdot P\left(b_{j}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \left(\log _{2}\left(P\left(a_{i},b_{j}\right)\right)-\log _{2}\left(P\left(a_{i}\right)\right)-\log _{2}\left(P\left(b_{j}\right)\right)\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i},b_{j}\right)}}\right)+\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)+\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i},b_{j}\right)}}\right)+\sum _{i=1}^{n}\left(\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\right)\cdot \log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)+\sum _{j=1}^{m}\left(\sum _{i=1}^{n}P\left(a_{i},b_{j}\right)\right)\cdot \log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)\\&=-H\left(A,B\right)+H\left(A\right)+H\left(B\right)\\&=H\left(A\right)-H\left(A\mid B\right)\\&=H\left(B\right)-H\left(B\mid A\right)\\\end{aligned}}

(18)

We can then think of mutual information as the reduction in uncertainty due to another random variable. The above relationships between mutual information and the entropies are illustrated in Fig. 2. Note that $H\left(A\mid A\right)=0$ since $P\left(a_{i}\mid a_{i}\right)=1$ . We can then write:

I\left(A;A\right)=H\left(A\right)-H\left(A\mid A\right)=H\left(A\right)

(19)

Thus, we can think of entropy as self-information.
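
Eqs. (18) and (19) can be checked in the same spirit. The sketch below, using hypothetical distributions, compares the direct definition of $I\left(A;B\right)$ with $H\left(A\right)+H\left(B\right)-H\left(A,B\right)$ , and then builds the joint distribution of $A$ with itself (all probability on the diagonal) to confirm that $I\left(A;A\right)=H\left(A\right)$ .

```python
import math

def entropy(dist):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

def mutual_information_from_joint(joint):
    """I(A;B) in bits from joint[i][j] = P(a_i, b_j), first line of Eq. (18)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (p_a[i] * p_b[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

# Hypothetical joint distribution P(a_i, b_j).
joint = [[0.3, 0.1],
         [0.2, 0.4]]
p_a = [sum(row) for row in joint]
p_b = [sum(col) for col in zip(*joint)]
h_joint = entropy([p for row in joint for p in row])

mi_direct = mutual_information_from_joint(joint)
mi_from_entropies = entropy(p_a) + entropy(p_b) - h_joint

# Joint distribution of A with itself: P(a_i, a_k) = P(a_i) if i == k, else 0.
source = [0.5, 0.25, 0.25]
self_joint = [[p if i == k else 0.0 for k in range(len(source))]
              for i, p in enumerate(source)]
self_information = mutual_information_from_joint(self_joint)
print(mi_direct, mi_from_entropies, self_information, entropy(source))
```

The first two printed values agree, and so do the last two; for the source used here the entropy is 1.5 bits.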

Channel Capacity

The maximum amount of information that can be transmitted through a discrete memoryless channel, or the channel capacity, with units bits per channel use, can then be thought of as the maximum mutual information over all possible input probability distributions:

C=\max _{P\left(A\right)}I\left(A;B\right)

(20)

Equivalently, we need to choose the input distribution $\{P\left(a_{i}\right)\}$ that maximizes $I\left(A;B\right)$ . Recall from Eq. (8) that:

I\left(A;B\right)=\sum _{i=1}^{n}P\left(a_{i}\right)\cdot I\left(a_{i};B\right)

(21)

If we are using the channel at its capacity, then for every $a_{i}$ that is sent with nonzero probability:

I\left(a_{i};B\right)=C

(22)
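
To make Eq. (22) concrete, the sketch below evaluates $I\left(a_{i};B\right)$ from Eq. (7) for a binary symmetric channel, which is an example introduced here and not taken from the text above. The crossover probability `eps` and both input distributions are hypothetical. For this channel the uniform input distribution is the one that achieves capacity.

```python
import math

def per_symbol_information(i, p_a, channel):
    """I(a_i; B) in bits from Eq. (7), with channel[i][j] = P(b_j | a_i)."""
    n, m = len(channel), len(channel[0])
    # Output distribution: P(b_j) = sum_k P(a_k) P(b_j | a_k).
    p_b = [sum(p_a[k] * channel[k][j] for k in range(n)) for j in range(m)]
    return sum(
        channel[i][j] * math.log2(channel[i][j] / p_b[j])
        for j in range(m)
        if channel[i][j] > 0
    )

eps = 0.1  # hypothetical crossover probability
bsc = [[1 - eps, eps],
       [eps, 1 - eps]]

uniform = [0.5, 0.5]  # capacity-achieving input for this symmetric channel
skewed = [0.9, 0.1]   # an arbitrary input that does not achieve capacity

at_capacity = [per_symbol_information(i, uniform, bsc) for i in range(2)]
off_capacity = [per_symbol_information(i, skewed, bsc) for i in range(2)]
print(at_capacity)   # both entries are equal
print(off_capacity)  # the entries differ
```

With the uniform input every symbol conveys the same amount of information, as Eq. (22) requires; with the skewed input the rarely used symbol conveys much more than the common one, a sign that the input distribution can still be improved.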

Thus, we can maximize channel use by maximizing the use for each symbol independently. From the definition of mutual information and from the Gibbs inequality, we can see that:

C\leq H\left(A\right)\leq \log _{2}\left(n\right),\qquad C\leq H\left(B\right)\leq \log _{2}\left(m\right)

(23)

where $n$ and $m$ are the number of symbols in $A$ and $B$ respectively. Thus, the capacity of a channel is limited by the logarithm of the number of distinguishable symbols at its input or at its output, whichever is smaller.
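
The maximization in Eq. (20) can be approximated by brute force for a binary-input channel. The sketch below is one simple way to do this, not an optimized algorithm: it scans a grid of input distributions for a binary symmetric channel (an example chosen here, with a hypothetical crossover probability `eps`) and compares the result with the well-known closed form $1-H_{b}\left(\epsilon \right)$ for that channel, where $H_{b}$ is the binary entropy function.

```python
import math

def mutual_information(p_a, channel):
    """I(A;B) in bits from P(a_i) and channel[i][j] = P(b_j | a_i), as in Eq. (8)."""
    n, m = len(channel), len(channel[0])
    p_b = [sum(p_a[i] * channel[i][j] for i in range(n)) for j in range(m)]
    return sum(
        p_a[i] * channel[i][j] * math.log2(channel[i][j] / p_b[j])
        for i in range(n)
        for j in range(m)
        if p_a[i] > 0 and channel[i][j] > 0
    )

def binary_entropy(p):
    """Entropy in bits of a binary source with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def capacity_by_grid_search(channel, steps=1000):
    """Approximate Eq. (20) for a two-input channel by scanning P(a_1) = q."""
    best_rate, best_q = 0.0, 0.0
    for k in range(steps + 1):
        q = k / steps
        rate = mutual_information([q, 1 - q], channel)
        if rate > best_rate:
            best_rate, best_q = rate, q
    return best_rate, best_q

eps = 0.1  # hypothetical crossover probability
bsc = [[1 - eps, eps],
       [eps, 1 - eps]]

capacity, best_q = capacity_by_grid_search(bsc)
print(capacity, best_q)         # about 0.531 bits per channel use, at q = 0.5
print(1 - binary_entropy(eps))  # closed form for the binary symmetric channel
```

The capacity found is below $\log _{2}\left(2\right)=1$ bit, consistent with the bound in Eq. (23). A grid search only scales to very small input alphabets; it is used here because it follows the definition literally.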

Activity A3.1 Channel Capacity -- This activity introduces the concept of mutual information and channel capacity in noisy channels.

Sources

Tom Carter's notes on Information Theory
Dan Hirschberg's notes on Data Compression
Lance Williams' notes on Geometric and Probabilistic Methods in Computer Science

@@ Line 1: / Line 1: @@
-In general, the channel is itself can add noise. This means that the channel itself serves as an additional layer of uncertainty to our transmissions. Consider a channel with input symbols <math>A=\{a_1, a_2, \ldots, a_n\}</math>, and output symbols <math>B=\{b_1, b_2, \ldots, b_m\}</math>. Note that the input and output alphabets do not need to have the same number of symbols. Given the noise in the channel, if we observe the output symbol <math>b_j</math>, we are not sure which <math>a_i</math> was the input symbol. We can then characterize the channel as a set of probabilities <math>\{P\left(a_i\mid b_j\right)\}</math>. Let us consider the information we get from observing a symbol <math>b_j</math>.
+In general, the channel itself can add noise. This means that the channel adds an additional layer of uncertainty to our transmissions. Consider a channel with input symbols <math>A=\{a_1, a_2, \ldots, a_n\}</math>, and output symbols <math>B=\{b_1, b_2, \ldots, b_m\}</math>. Note that the input and output alphabets do not need to have the same number of symbols. Given the noise in the channel, if we observe the output symbol <math>b_j</math>, we are not sure which <math>a_i</math> was the input symbol.
-==== Definition ====
+We can then characterize the ''discrete'' channel as a set of probabilities <math>\{P\left(a_i\mid b_j\right)\}</math>. If the probability distribution of the outputs depend on the current input, then the channel is ''memoryless''. Let us consider the information we get from observing a symbol <math>b_j</math> at the output of a ''discrete memoryless channel'' (DMC).
-Given a probability model of the source, we have an ''a priori'' estimate <math>P\left(a_i\right)</math> that symbol <math>a_i</math> will be sent next. Upon observing <math>b_j</math>, we can revise our estimate to <math>P\left(a_i\mid b_j\right)</math>. The change in information, or ''mutual information'', is given by:
-{{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{1}{P\left(a_i\right)}\right)-\log_2\left(\frac{1}{P\left(a_i \mid b_j\right)}\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right)</math>|{{EquationRef|13}}}}
+== Definition ==
+[[File:Noisy channel.png|thumb|500px|Figure 1: A noisy channel.]]
+Given a probability model of the source, we have an ''a priori'' estimate <math>P\left(a_i\right)</math> that symbol <math>a_i</math> will be sent next. Upon observing <math>b_j</math>, we can revise our estimate to <math>P\left(a_i\mid b_j\right)</math>, as shown in Fig. 1. The change in information, or ''mutual information'', is given by:
+{{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{1}{P\left(a_i\right)}\right)-\log_2\left(\frac{1}{P\left(a_i \mid b_j\right)}\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right)</math>|{{EquationRef|1}}}}
 Let's look at a few properties of mutual information. Expressing the equation above in terms of <math>I\left(a_i\right)</math>:
-{{NumBlk|::|<math>I\left(a_i ; b_j\right)=I\left(a_i\right) + \log_2\left(P\left(a_i \mid b_j\right)\right)</math>|{{EquationRef|14}}}}
+{{NumBlk|::|<math>I\left(a_i ; b_j\right)=I\left(a_i\right) + \log_2\left(P\left(a_i \mid b_j\right)\right)</math>|{{EquationRef|2}}}}
 Thus, we can say:
-{{NumBlk|::|<math>I\left(a_i ; b_j\right)\leq I\left(a_i\right)</math>|{{EquationRef|15}}}}
+{{NumBlk|::|<math>I\left(a_i ; b_j\right)\leq I\left(a_i\right)</math>|{{EquationRef|3}}}}
-This is expected since, after observing <math>b_j</math>, the amount of uncertainty is reduced, i.e. we know a bit more about <math>a_i</math>, and the most change in information we can get is when <math>a_i</math> and <math>b_j</math> are perfectly correlated, with <math>I\left(a_i ; b_j\right)= I\left(a_i\right)</math>. From Bayes' Theorem, we have the property:
+[[File:Information channel.png|thumb|500px|Figure 2: An information channel.]]
+This is expected since, after observing <math>b_j</math>, the amount of uncertainty is reduced, i.e. we know a bit more about <math>a_i</math>, and the most change in information we can get is when <math>a_i</math> and <math>b_j</math> are perfectly correlated, with <math>I\left(a_i ; b_j\right)= I\left(a_i\right)</math>. Thus, we can think of mutual information as the average information conveyed across the channel, as shown in Fig. 2. From Bayes' Theorem, we have the property:
-{{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right)=\log_2\left(\frac{P\left(b_j \mid a_i\right)}{P\left(b_j\right)}\right)=I\left(b_j ; a_i\right)</math>|{{EquationRef|16}}}}
+{{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right)=\log_2\left(\frac{P\left(b_j \mid a_i\right)}{P\left(b_j\right)}\right)=I\left(b_j ; a_i\right)</math>|{{EquationRef|4}}}}
 Note that if <math>a_i</math> and <math>b_j</math> are independent, where <math>P\left(a_i\mid b_j\right) = P\left(a_i\right)</math> and <math>P\left(b_j\mid a_i\right) = P\left(b_j\right)</math>, then:
-{{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right) = \log_2\left(\frac{P\left(b_j \mid a_i\right)}{P\left(b_j\right)}\right) = \log_2\left(1\right)= 0</math>|{{EquationRef|17}}}}
+{{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right) = \log_2\left(\frac{P\left(b_j \mid a_i\right)}{P\left(b_j\right)}\right) = \log_2\left(1\right)= 0</math>|{{EquationRef|5}}}}
 We can get the average mutual information over all the input symbols as:
-{{NumBlk|::|<math>I\left(A ; b_j\right)= \sum_{i=1}^n P\left(a_i\mid b_j\right)\cdot I\left(a_i;b_j\right)=\sum_{i=1}^n P\left(a_i\mid b_j\right)\cdot \log_2\left(\frac{P\left(a_i\mid b_j\right)}{P\left(a_i\right)}\right)</math>|{{EquationRef|18}}}}
+{{NumBlk|::|<math>I\left(A ; b_j\right)= \sum_{i=1}^n P\left(a_i\mid b_j\right)\cdot I\left(a_i;b_j\right)=\sum_{i=1}^n P\left(a_i\mid b_j\right)\cdot \log_2\left(\frac{P\left(a_i\mid b_j\right)}{P\left(a_i\right)}\right)</math>|{{EquationRef|6}}}}
 Similarly, for all the output symbols:
-{{NumBlk|::|<math>I\left(a_i ; B\right)= \sum_{j=1}^m P\left(b_j\mid a_i\right)\cdot  \log_2\left(\frac{P\left(b_j\mid a_i\right)}{P\left(b_j\right)}\right)</math>|{{EquationRef|19}}}}
+{{NumBlk|::|<math>I\left(a_i ; B\right)= \sum_{j=1}^m P\left(b_j\mid a_i\right)\cdot  \log_2\left(\frac{P\left(b_j\mid a_i\right)}{P\left(b_j\right)}\right)</math>|{{EquationRef|7}}}}
 For both input and output symbols, we get:
@@ Line 36: / Line 40: @@
 & = \sum_{i=1}^n \sum_{j=1}^m P\left(a_i, b_j\right)\cdot \log_2\left(\frac{P\left( a_i, b_j\right)}{P\left(a_i\right)\cdot P\left(b_j\right)}\right) \\
 & = I\left(B ; A\right)
-\end{align}</math>|{{EquationRef|20}}}}
+\end{align}</math>|{{EquationRef|8}}}}
-==== Non-Negativity of Mutual Information ====
+== Non-Negativity of Mutual Information ==
 To show the non-negativity of mutual information, let us use ''Jensen's Inequality'', which states that for a convex function, <math>f\left(x\right)</math>:
-{{NumBlk|::|<math>\langle f\left(x\right)\rangle \ge f\left(\langle x\rangle\right)</math>|{{EquationRef|21}}}}
+{{NumBlk|::|<math>\langle f\left(x\right)\rangle \ge f\left(\langle x\rangle\right)</math>|{{EquationRef|9}}}}
 Using the fact that <math>f\left(x\right)=-\log_2\left( x\right)</math> is convex, and applying this to our expression for mutual information, we get:
@@ Line 54: / Line 58: @@
 = -\log_2\left(\sum_{i=1}^n P\left(a_i\right) \sum_{j=1}^m P\left(b_j\right)\right) = -\log_2\left(1\right) \\
 & \ge 0\\
-\end{align}</math>|{{EquationRef|22}}}}
+\end{align}</math>|{{EquationRef|10}}}}
 Note that <math>I\left(A ; B\right) =0</math> when <math>A</math> and <math>B</math> are independent.
-==== Conditional and Joint Entropy ====
+== Conditional and Joint Entropy ==
 Given <math>A</math> and <math>B</math>, and their entropies:
-{{NumBlk|::|<math>H\left(A\right)=\sum_{i=1}^n P\left(a_i\right)\cdot\log_2\left(\frac{1}{P\left(a_i\right)}\right)</math>|{{EquationRef|23}}}}
+{{NumBlk|::|<math>H\left(A\right)=\sum_{i=1}^n P\left(a_i\right)\cdot\log_2\left(\frac{1}{P\left(a_i\right)}\right)</math>|{{EquationRef|11}}}}
-{{NumBlk|::|<math>H\left(B\right)=\sum_{j=1}^m P\left(b_j\right)\cdot\log_2\left(\frac{1}{P\left(b_j\right)}\right)</math>|{{EquationRef|24}}}}
+{{NumBlk|::|<math>H\left(B\right)=\sum_{j=1}^m P\left(b_j\right)\cdot\log_2\left(\frac{1}{P\left(b_j\right)}\right)</math>|{{EquationRef|12}}}}
+=== Conditional Entropy ===
 The '''conditional entropy''' is a measure of the average uncertainty about <math>B</math> when <math>A</math> is known, and we can define it as:
@@ Line 69: / Line 74: @@
 & =\sum_{i=1}^n \sum_{j=1}^m P\left(a_i\right) P\left(b_j\mid a_i\right)\cdot\log_2\left(\frac{1}{P\left(b_j\mid a_i\right)}\right) \\
 & =\sum_{i=1}^n \sum_{j=1}^m P\left(b_j, a_i\right)\cdot\log_2\left(\frac{P\left(a_i\right)}{P\left(b_j, a_i\right)}\right)\\
-\end{align}</math>|{{EquationRef|25}}}}
+\end{align}</math>|{{EquationRef|13}}}}
 And similarly,
@@ Line 76: / Line 81: @@
 & =\sum_{i=1}^n \sum_{j=1}^m P\left(b_j, a_i\right)\cdot\log_2\left(\frac{P\left(b_j\right)}{P\left(b_j, a_i\right)}\right)\\
 & \neq  H\left(B\mid A\right) \\
-\end{align}</math>|{{EquationRef|26}}}}
+\end{align}</math>|{{EquationRef|14}}}}
+=== Joint Entropy ===
 If we extend the definition of entropy to two (or more) random variables, <math>A</math> and <math>B</math>, we can define the '''joint entropy''' of <math>A</math> and <math>B</math> as:
-{{NumBlk|::|<math>H\left(A, B\right)=\sum_{i=1}^n \sum_{j=1}^m P\left(a_i, b_j\right)\cdot\log_2\left(\frac{1}{P\left(a_i, b_j\right)}\right)</math>|{{EquationRef|27}}}}
+{{NumBlk|::|<math>H\left(A, B\right)=\sum_{i=1}^n \sum_{j=1}^m P\left(a_i, b_j\right)\cdot\log_2\left(\frac{1}{P\left(a_i, b_j\right)}\right)</math>|{{EquationRef|15}}}}
 Expanding expression for joint entropy, and using <math>P\left(a_i, b_j\right) = P\left(a_i\mid b_j\right)P\left(b_j\right)</math> we get:
@@ Line 93: / Line 99: @@
 & = H\left(A\mid B\right) + \sum_{j=1}^m P\left(b_j\right)\cdot \log_2\left(\frac{1}{P\left(b_j\right)}\right)\\
 & = H\left(A\mid B\right) + H\left(B\right)
-\end{align}</math>|{{EquationRef|28}}}}
+\end{align}</math>|{{EquationRef|16}}}}
 If we instead used <math>P\left(a_i, b_j\right) = P\left(b_j\mid a_i\right)P\left(a_i\right)</math>, we would get the alternative expression:
-{{NumBlk|::|<math>H\left(A, B\right)=H\left(B\mid A\right) + H\left(A\right)</math>|{{EquationRef|29}}}}
+{{NumBlk|::|<math>H\left(A, B\right)=H\left(B\mid A\right) + H\left(A\right)</math>|{{EquationRef|17}}}}
 We can then expand our expression for <math>I\left(A;B\right)</math> as:
@@ Line 109: / Line 115: @@
 & = H\left(A\right) - H\left(A\mid B\right)\\
 & = H\left(B\right) - H\left(B\mid A\right)\\
-\end{align}</math>|{{EquationRef|30}}}}
+\end{align}</math>|{{EquationRef|18}}}}
+We can then think of '''mutual information''' as the reduction in uncertainty due to another random variable. The above relationships between mutual information and the entropies are illustrated in Fig. 2. Note that <math>H\left(A\mid A\right) = 0</math> since <math>P\left(a_i\mid a_i\right)=1</math>. We can then write:
+{{NumBlk|::|<math>I\left(A; A\right)=H\left(A\right) -H\left(A\mid A\right) =H\left(A\right)  </math>|{{EquationRef|19}}}}
+Thus, we can think of entropy as ''self-information''.
+== Channel Capacity ==
+The maximum amount of information that can be transmitted through a discrete memoryless channel, or the '''channel capacity''', with units ''bits per channel use'', can then be thought of as the maximum mutual information over all possible input probability distributions:
+{{NumBlk|::|<math>C=\max_{P\left(A\right)} I\left(A;B\right)</math>|{{EquationRef|20}}}}
+Or equivalently, we need to choose <math>\{P\left(a_i\right)\}</math> such that we maximize <math>I\left(A;B\right)</math>. Since:
+{{NumBlk|::|<math>I\left(A ; B\right) = \sum_{i=1}^n P\left(a_i\right)\cdot I\left(a_i;B\right)</math>|{{EquationRef|21}}}}
+And if we are using the channel at its capacity, then for every <math>a_i</math>:
+{{NumBlk|::|<math>I\left(a_i;B\right) = C</math>|{{EquationRef|22}}}}
+Thus, we can maximize channel use by maximizing the use for each symbol independently. From the definition of mutual information and from the Gibbs inequality, we can see that:
+{{NumBlk|::|<math>C \leq H\left(A\right), H\left(B\right) \leq \log_2\left(n\right), \log_2\left(m\right)</math>|{{EquationRef|23}}}}
+Where <math>n</math> and <math>m</math> are the number of symbols in <math>A</math> and <math>B</math> respectively. Thus, the channel capacity of a channel is limited by the logarithm of the number of distinguishable symbols at its input (or output).
+{{Note|[[161-A3.1 | '''Activity A3.1''' Channel Capacity]] -- This activity introduces the concept of mutual information and channel capacity in noisy channels.|reminder}}
 == Sources ==
 * Tom Carter's [http://astarte.csustan.edu/~tom/SFI-CSSS/info-theory/info-lec.pdf notes] on Information Theory
 * Dan Hirschberg's [https://www.ics.uci.edu/~dan/pubs/DC-Sec1.html notes] on Data Compression
+* Lance Williams' [https://www.cs.unm.edu/~williams/cs530/mutual2.pdf notes] on Geometric and Probabilistic Methods in Computer Science
 == References ==
 <references />
