Difference between revisions of "Mutual Information"
(Created page with "In general, the channel is itself can add noise. This means that the channel itself serves as an additional layer of uncertainty to our transmissions. Consider a channel with...") |
|||
Line 1: | Line 1: | ||
In general, the channel is itself can add noise. This means that the channel itself serves as an additional layer of uncertainty to our transmissions. Consider a channel with input symbols <math>A=\{a_1, a_2, \ldots, a_n\}</math>, and output symbols <math>B=\{b_1, b_2, \ldots, b_m\}</math>. Note that the input and output alphabets do not need to have the same number of symbols. Given the noise in the channel, if we observe the output symbol <math>b_j</math>, we are not sure which <math>a_i</math> was the input symbol. We can then characterize the channel as a set of probabilities <math>\{P\left(a_i\mid b_j\right)\}</math>. Let us consider the information we get from observing a symbol <math>b_j</math>. | In general, the channel is itself can add noise. This means that the channel itself serves as an additional layer of uncertainty to our transmissions. Consider a channel with input symbols <math>A=\{a_1, a_2, \ldots, a_n\}</math>, and output symbols <math>B=\{b_1, b_2, \ldots, b_m\}</math>. Note that the input and output alphabets do not need to have the same number of symbols. Given the noise in the channel, if we observe the output symbol <math>b_j</math>, we are not sure which <math>a_i</math> was the input symbol. We can then characterize the channel as a set of probabilities <math>\{P\left(a_i\mid b_j\right)\}</math>. Let us consider the information we get from observing a symbol <math>b_j</math>. | ||
− | + | == Definition == | |
Given a probability model of the source, we have an ''a priori'' estimate <math>P\left(a_i\right)</math> that symbol <math>a_i</math> will be sent next. Upon observing <math>b_j</math>, we can revise our estimate to <math>P\left(a_i\mid b_j\right)</math>. The change in information, or ''mutual information'', is given by: | Given a probability model of the source, we have an ''a priori'' estimate <math>P\left(a_i\right)</math> that symbol <math>a_i</math> will be sent next. Upon observing <math>b_j</math>, we can revise our estimate to <math>P\left(a_i\mid b_j\right)</math>. The change in information, or ''mutual information'', is given by: | ||
− | {{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{1}{P\left(a_i\right)}\right)-\log_2\left(\frac{1}{P\left(a_i \mid b_j\right)}\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{1}{P\left(a_i\right)}\right)-\log_2\left(\frac{1}{P\left(a_i \mid b_j\right)}\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right)</math>|{{EquationRef|1}}}} |
Let's look at a few properties of mutual information. Expressing the equation above in terms of <math>I\left(a_i\right)</math>: | Let's look at a few properties of mutual information. Expressing the equation above in terms of <math>I\left(a_i\right)</math>: | ||
− | {{NumBlk|::|<math>I\left(a_i ; b_j\right)=I\left(a_i\right) + \log_2\left(P\left(a_i \mid b_j\right)\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>I\left(a_i ; b_j\right)=I\left(a_i\right) + \log_2\left(P\left(a_i \mid b_j\right)\right)</math>|{{EquationRef|2}}}} |
Thus, we can say: | Thus, we can say: | ||
− | {{NumBlk|::|<math>I\left(a_i ; b_j\right)\leq I\left(a_i\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>I\left(a_i ; b_j\right)\leq I\left(a_i\right)</math>|{{EquationRef|3}}}} |
This is expected since, after observing <math>b_j</math>, the amount of uncertainty is reduced, i.e. we know a bit more about <math>a_i</math>, and the most change in information we can get is when <math>a_i</math> and <math>b_j</math> are perfectly correlated, with <math>I\left(a_i ; b_j\right)= I\left(a_i\right)</math>. From Bayes' Theorem, we have the property: | This is expected since, after observing <math>b_j</math>, the amount of uncertainty is reduced, i.e. we know a bit more about <math>a_i</math>, and the most change in information we can get is when <math>a_i</math> and <math>b_j</math> are perfectly correlated, with <math>I\left(a_i ; b_j\right)= I\left(a_i\right)</math>. From Bayes' Theorem, we have the property: | ||
− | {{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right)=\log_2\left(\frac{P\left(b_j \mid a_i\right)}{P\left(b_j\right)}\right)=I\left(b_j ; a_i\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right)=\log_2\left(\frac{P\left(b_j \mid a_i\right)}{P\left(b_j\right)}\right)=I\left(b_j ; a_i\right)</math>|{{EquationRef|4}}}} |
Note that if <math>a_i</math> and <math>b_j</math> are independent, where <math>P\left(a_i\mid b_j\right) = P\left(a_i\right)</math> and <math>P\left(b_j\mid a_i\right) = P\left(b_j\right)</math>, then: | Note that if <math>a_i</math> and <math>b_j</math> are independent, where <math>P\left(a_i\mid b_j\right) = P\left(a_i\right)</math> and <math>P\left(b_j\mid a_i\right) = P\left(b_j\right)</math>, then: | ||
− | {{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right) = \log_2\left(\frac{P\left(b_j \mid a_i\right)}{P\left(b_j\right)}\right) = \log_2\left(1\right)= 0</math>|{{EquationRef| | + | {{NumBlk|::|<math>I\left(a_i ; b_j\right)=\log_2\left(\frac{P\left(a_i \mid b_j\right)}{P\left(a_i\right)}\right) = \log_2\left(\frac{P\left(b_j \mid a_i\right)}{P\left(b_j\right)}\right) = \log_2\left(1\right)= 0</math>|{{EquationRef|5}}}} |
We can get the average mutual information over all the input symbols as: | We can get the average mutual information over all the input symbols as: | ||
− | {{NumBlk|::|<math>I\left(A ; b_j\right)= \sum_{i=1}^n P\left(a_i\mid b_j\right)\cdot I\left(a_i;b_j\right)=\sum_{i=1}^n P\left(a_i\mid b_j\right)\cdot \log_2\left(\frac{P\left(a_i\mid b_j\right)}{P\left(a_i\right)}\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>I\left(A ; b_j\right)= \sum_{i=1}^n P\left(a_i\mid b_j\right)\cdot I\left(a_i;b_j\right)=\sum_{i=1}^n P\left(a_i\mid b_j\right)\cdot \log_2\left(\frac{P\left(a_i\mid b_j\right)}{P\left(a_i\right)}\right)</math>|{{EquationRef|6}}}} |
Similarly, for all the output symbols: | Similarly, for all the output symbols: | ||
− | {{NumBlk|::|<math>I\left(a_i ; B\right)= \sum_{j=1}^m P\left(b_j\mid a_i\right)\cdot \log_2\left(\frac{P\left(b_j\mid a_i\right)}{P\left(b_j\right)}\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>I\left(a_i ; B\right)= \sum_{j=1}^m P\left(b_j\mid a_i\right)\cdot \log_2\left(\frac{P\left(b_j\mid a_i\right)}{P\left(b_j\right)}\right)</math>|{{EquationRef|7}}}} |
For both input and output symbols, we get: | For both input and output symbols, we get: | ||
Line 36: | Line 36: | ||
& = \sum_{i=1}^n \sum_{j=1}^m P\left(a_i, b_j\right)\cdot \log_2\left(\frac{P\left( a_i, b_j\right)}{P\left(a_i\right)\cdot P\left(b_j\right)}\right) \\ | & = \sum_{i=1}^n \sum_{j=1}^m P\left(a_i, b_j\right)\cdot \log_2\left(\frac{P\left( a_i, b_j\right)}{P\left(a_i\right)\cdot P\left(b_j\right)}\right) \\ | ||
& = I\left(B ; A\right) | & = I\left(B ; A\right) | ||
− | \end{align}</math>|{{EquationRef| | + | \end{align}</math>|{{EquationRef|8}}}} |
− | + | == Non-Negativity of Mutual Information == | |
To show the non-negativity of mutual information, let us use ''Jensen's Inequality'', which states that for a convex function, <math>f\left(x\right)</math>: | To show the non-negativity of mutual information, let us use ''Jensen's Inequality'', which states that for a convex function, <math>f\left(x\right)</math>: | ||
− | {{NumBlk|::|<math>\langle f\left(x\right)\rangle \ge f\left(\langle x\rangle\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>\langle f\left(x\right)\rangle \ge f\left(\langle x\rangle\right)</math>|{{EquationRef|9}}}} |
Using the fact that <math>f\left(x\right)=-\log_2\left( x\right)</math> is convex, and applying this to our expression for mutual information, we get: | Using the fact that <math>f\left(x\right)=-\log_2\left( x\right)</math> is convex, and applying this to our expression for mutual information, we get: | ||
Line 54: | Line 54: | ||
= -\log_2\left(\sum_{i=1}^n P\left(a_i\right) \sum_{j=1}^m P\left(b_j\right)\right) = -\log_2\left(1\right) \\ | = -\log_2\left(\sum_{i=1}^n P\left(a_i\right) \sum_{j=1}^m P\left(b_j\right)\right) = -\log_2\left(1\right) \\ | ||
& \ge 0\\ | & \ge 0\\ | ||
− | \end{align}</math>|{{EquationRef| | + | \end{align}</math>|{{EquationRef|10}}}} |
Note that <math>I\left(A ; B\right) =0</math> when <math>A</math> and <math>B</math> are independent. | Note that <math>I\left(A ; B\right) =0</math> when <math>A</math> and <math>B</math> are independent. | ||
− | + | == Conditional and Joint Entropy == | |
Given <math>A</math> and <math>B</math>, and their entropies: | Given <math>A</math> and <math>B</math>, and their entropies: | ||
− | {{NumBlk|::|<math>H\left(A\right)=\sum_{i=1}^n P\left(a_i\right)\cdot\log_2\left(\frac{1}{P\left(a_i\right)}\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>H\left(A\right)=\sum_{i=1}^n P\left(a_i\right)\cdot\log_2\left(\frac{1}{P\left(a_i\right)}\right)</math>|{{EquationRef|11}}}} |
− | {{NumBlk|::|<math>H\left(B\right)=\sum_{j=1}^m P\left(b_j\right)\cdot\log_2\left(\frac{1}{P\left(b_j\right)}\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>H\left(B\right)=\sum_{j=1}^m P\left(b_j\right)\cdot\log_2\left(\frac{1}{P\left(b_j\right)}\right)</math>|{{EquationRef|12}}}} |
+ | === Conditional Entropy === | ||
The '''conditional entropy''' is a measure of the average uncertainty about <math>B</math> when <math>A</math> is known, and we can define it as: | The '''conditional entropy''' is a measure of the average uncertainty about <math>B</math> when <math>A</math> is known, and we can define it as: | ||
Line 69: | Line 70: | ||
& =\sum_{i=1}^n \sum_{j=1}^m P\left(a_i\right) P\left(b_j\mid a_i\right)\cdot\log_2\left(\frac{1}{P\left(b_j\mid a_i\right)}\right) \\ | & =\sum_{i=1}^n \sum_{j=1}^m P\left(a_i\right) P\left(b_j\mid a_i\right)\cdot\log_2\left(\frac{1}{P\left(b_j\mid a_i\right)}\right) \\ | ||
& =\sum_{i=1}^n \sum_{j=1}^m P\left(b_j, a_i\right)\cdot\log_2\left(\frac{P\left(a_i\right)}{P\left(b_j, a_i\right)}\right)\\ | & =\sum_{i=1}^n \sum_{j=1}^m P\left(b_j, a_i\right)\cdot\log_2\left(\frac{P\left(a_i\right)}{P\left(b_j, a_i\right)}\right)\\ | ||
− | \end{align}</math>|{{EquationRef| | + | \end{align}</math>|{{EquationRef|13}}}} |
And similarly, | And similarly, | ||
Line 76: | Line 77: | ||
& =\sum_{i=1}^n \sum_{j=1}^m P\left(b_j, a_i\right)\cdot\log_2\left(\frac{P\left(b_j\right)}{P\left(b_j, a_i\right)}\right)\\ | & =\sum_{i=1}^n \sum_{j=1}^m P\left(b_j, a_i\right)\cdot\log_2\left(\frac{P\left(b_j\right)}{P\left(b_j, a_i\right)}\right)\\ | ||
& \neq H\left(B\mid A\right) \\ | & \neq H\left(B\mid A\right) \\ | ||
− | \end{align}</math>|{{EquationRef| | + | \end{align}</math>|{{EquationRef|14}}}} |
+ | === Joint Entropy === | ||
If we extend the definition of entropy to two (or more) random variables, <math>A</math> and <math>B</math>, we can define the '''joint entropy''' of <math>A</math> and <math>B</math> as: | If we extend the definition of entropy to two (or more) random variables, <math>A</math> and <math>B</math>, we can define the '''joint entropy''' of <math>A</math> and <math>B</math> as: | ||
− | {{NumBlk|::|<math>H\left(A, B\right)=\sum_{i=1}^n \sum_{j=1}^m P\left(a_i, b_j\right)\cdot\log_2\left(\frac{1}{P\left(a_i, b_j\right)}\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>H\left(A, B\right)=\sum_{i=1}^n \sum_{j=1}^m P\left(a_i, b_j\right)\cdot\log_2\left(\frac{1}{P\left(a_i, b_j\right)}\right)</math>|{{EquationRef|15}}}} |
Expanding expression for joint entropy, and using <math>P\left(a_i, b_j\right) = P\left(a_i\mid b_j\right)P\left(b_j\right)</math> we get: | Expanding expression for joint entropy, and using <math>P\left(a_i, b_j\right) = P\left(a_i\mid b_j\right)P\left(b_j\right)</math> we get: | ||
Line 93: | Line 95: | ||
& = H\left(A\mid B\right) + \sum_{j=1}^m P\left(b_j\right)\cdot \log_2\left(\frac{1}{P\left(b_j\right)}\right)\\ | & = H\left(A\mid B\right) + \sum_{j=1}^m P\left(b_j\right)\cdot \log_2\left(\frac{1}{P\left(b_j\right)}\right)\\ | ||
& = H\left(A\mid B\right) + H\left(B\right) | & = H\left(A\mid B\right) + H\left(B\right) | ||
− | \end{align}</math>|{{EquationRef| | + | \end{align}</math>|{{EquationRef|16}}}} |
If we instead used <math>P\left(a_i, b_j\right) = P\left(b_j\mid a_i\right)P\left(a_i\right)</math>, we would get the alternative expression: | If we instead used <math>P\left(a_i, b_j\right) = P\left(b_j\mid a_i\right)P\left(a_i\right)</math>, we would get the alternative expression: | ||
− | {{NumBlk|::|<math>H\left(A, B\right)=H\left(B\mid A\right) + H\left(A\right)</math>|{{EquationRef| | + | {{NumBlk|::|<math>H\left(A, B\right)=H\left(B\mid A\right) + H\left(A\right)</math>|{{EquationRef|17}}}} |
We can then expand our expression for <math>I\left(A;B\right)</math> as: | We can then expand our expression for <math>I\left(A;B\right)</math> as: | ||
Line 109: | Line 111: | ||
& = H\left(A\right) - H\left(A\mid B\right)\\ | & = H\left(A\right) - H\left(A\mid B\right)\\ | ||
& = H\left(B\right) - H\left(B\mid A\right)\\ | & = H\left(B\right) - H\left(B\mid A\right)\\ | ||
− | \end{align}</math>|{{EquationRef| | + | \end{align}</math>|{{EquationRef|18}}}} |
== Sources == | == Sources == |
Revision as of 16:01, 17 September 2020
In general, the channel is itself can add noise. This means that the channel itself serves as an additional layer of uncertainty to our transmissions. Consider a channel with input symbols , and output symbols . Note that the input and output alphabets do not need to have the same number of symbols. Given the noise in the channel, if we observe the output symbol , we are not sure which was the input symbol. We can then characterize the channel as a set of probabilities . Let us consider the information we get from observing a symbol .
Contents
Definition
Given a probability model of the source, we have an a priori estimate that symbol will be sent next. Upon observing , we can revise our estimate to . The change in information, or mutual information, is given by:
-
(1)
-
Let's look at a few properties of mutual information. Expressing the equation above in terms of :
-
(2)
-
Thus, we can say:
-
(3)
-
This is expected since, after observing , the amount of uncertainty is reduced, i.e. we know a bit more about , and the most change in information we can get is when and are perfectly correlated, with . From Bayes' Theorem, we have the property:
-
(4)
-
Note that if and are independent, where and , then:
-
(5)
-
We can get the average mutual information over all the input symbols as:
-
(6)
-
Similarly, for all the output symbols:
-
(7)
-
For both input and output symbols, we get:
-
(8)
-
Non-Negativity of Mutual Information
To show the non-negativity of mutual information, let us use Jensen's Inequality, which states that for a convex function, :
-
(9)
-
Using the fact that is convex, and applying this to our expression for mutual information, we get:
-
(10)
-
Note that when and are independent.
Conditional and Joint Entropy
Given and , and their entropies:
-
(11)
-
-
(12)
-
Conditional Entropy
The conditional entropy is a measure of the average uncertainty about when is known, and we can define it as:
-
(13)
-
And similarly,
-
(14)
-
Joint Entropy
If we extend the definition of entropy to two (or more) random variables, and , we can define the joint entropy of and as:
-
(15)
-
Expanding expression for joint entropy, and using we get:
-
(16)
-
If we instead used , we would get the alternative expression:
-
(17)
-
We can then expand our expression for as:
-
(18)
-