Joint entropy, conditional entropy, and mutual information

== Joint Entropies ==
In this module, we'll discuss several extensions of entropy. Let's begin with joint entropy. Suppose we have a random variable <math> X </math> with elements <math> \{x_1, x_2, ..., x_i, ..., x_n \} </math> and a random variable <math> Y </math> with elements <math> \{y_1, y_2, ..., y_j, ..., y_m \}</math>. We define the joint entropy <math> H(X,Y) </math> as:

{{NumBlk|::|<math> H(X,Y) = -\sum_{i=1}^n \sum_{j=1}^m P(x_i,y_j) \log_2 \left(P(x_i, y_j) \right) </math>|{{EquationRef|1}}}}
Take note that <math> H(X,Y) = H(Y,X) </math>; this follows immediately from the fact that <math> P(x_i,y_j) = P(y_j,x_i) </math>. From our previous discussion about the decision trees of a game, we used joint entropies to calculate the overall entropy.

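If you want to check joint entropy calculations numerically, a minimal Python sketch is shown below. The function name and the probability table are ours, not part of any standard library; the table uses the joint probabilities of the binary tree example later in this module.

<syntaxhighlight lang="python">
import math

def joint_entropy(p_xy):
    """Joint entropy H(X,Y) in bits, given a 2D table of joint probabilities P(x_i, y_j)."""
    return -sum(p * math.log2(p) for row in p_xy for p in row if p > 0)

# Joint probabilities taken from the binary tree example below.
p_xy = [[1/3, 1/6, 0, 0],    # P(a,w), P(a,x), P(a,y), P(a,z)
        [0, 0, 1/4, 1/4]]    # P(b,w), P(b,x), P(b,y), P(b,z)
print(round(joint_entropy(p_xy), 3))  # 1.959 bits
</syntaxhighlight>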
== Conditional Entropies ==
There are two steps to understanding conditional entropies. The first is the uncertainty of a random variable given that a single outcome has occurred. Suppose we have the same random variables <math> X </math> and <math> Y </math> defined earlier for joint entropies. Let's denote <math> P(y_j|x_i) </math> as the conditional probability of <math> y_j </math> given that the event <math> x_i </math> happened. We define <math> H(Y | x_i) </math> as the entropy of the random variable <math> Y </math> given that the single outcome <math> x_i </math> occurred. Take note that we're only interested in the uncertainty of <math> Y </math> when the particular outcome <math> x_i </math> happened. Mathematically, this is:

{{NumBlk|::|<math> H(Y|x_i) = -\sum_{j=1}^m P(y_j|x_i) \log_2 \left(P(y_j| x_i) \right) </math>|{{EquationRef|2}}}}
Just to repeat, because it may be confusing: equation 2 pertains only to the uncertainty about <math> Y </math> when a ''single event'' <math> x_i </math> happened. We can extend this to the uncertainty about <math> Y </math> when ''any'' of the outcomes in <math> X </math> happens. If we treat <math> H(Y|\cdot) </math> as a quantity that takes on the values <math> \{H(Y|x_1), H(Y|x_2), H(Y|x_3), ..., H(Y|x_n) \} </math> with the associated probabilities <math> \{P(x_1), P(x_2), P(x_3), ..., P(x_n) \} </math>, then <math> H(Y|\cdot) </math> is a function of <math> X </math>, and we define <math> H(Y|X) </math>, the conditional entropy of <math> Y </math> given <math> X </math>, as its expected value:

{{NumBlk|::|<math> H(Y|X) = E(H(Y|\cdot)) = \sum_{i=1}^n P(x_i) H(Y|x_i) </math>|{{EquationRef|3}}}}
Substituting equation 2 into equation 3 and using the fact that <math> P(y_j|x_i)P(x_i) = P(x_i,y_j) </math>, we can rewrite this as:

{{NumBlk|::|<math> H(Y|X) = -\sum_{i=1}^n \sum_{j=1}^m P(x_i,y_j) \log_2 (P(y_j|x_i)) </math>|{{EquationRef|4}}}}

Note that it is straightforward to prove that <math> H(Y|X) = H(Y) </math> if <math> Y </math> and <math> X </math> are independent. We'll leave it up to you to prove this. A few hints: <math> P(y_j|x_i) = P(y_j) </math> and <math> P(x_i,y_j) = P(x_i)P(y_j) </math> if <math> Y </math> and <math> X </math> are independent.

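As a sanity check on equations 3 and 4, here is a minimal Python sketch that computes <math> H(Y|X) </math> both ways from a table of joint probabilities. The helper names and the example table are ours; the table again uses the joint probabilities of the binary tree example below.

<syntaxhighlight lang="python">
import math

def conditional_entropy_eq3(p_xy):
    """H(Y|X) in bits via equation 3: the expected value of H(Y|x_i)."""
    h = 0.0
    for row in p_xy:
        p_x = sum(row)  # marginal P(x_i)
        h_y_given_xi = -sum((p / p_x) * math.log2(p / p_x) for p in row if p > 0)
        h += p_x * h_y_given_xi
    return h

def conditional_entropy_eq4(p_xy):
    """H(Y|X) in bits via equation 4: the double sum over joint probabilities."""
    return -sum(p * math.log2(p / sum(row)) for row in p_xy for p in row if p > 0)

p_xy = [[1/3, 1/6, 0, 0],    # P(a,w), P(a,x), P(a,y), P(a,z)
        [0, 0, 1/4, 1/4]]    # P(b,w), P(b,x), P(b,y), P(b,z)
print(round(conditional_entropy_eq3(p_xy), 3))  # 0.959 bits
print(round(conditional_entropy_eq4(p_xy), 3))  # 0.959 bits
</syntaxhighlight>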
== Binary Tree Example ==

[[File:Binary tree example.PNG|300px|thumb|right|Figure 1: Binary tree example.]]

Let's apply the first two concepts to a simple binary tree example. Suppose we let <math> X </math> be a random variable with outcomes <math>\{a,b\} </math> and probabilities <math> P(a) = P(b) = \frac{1}{2} </math>. Let <math> Y </math> be the random variable with outcomes <math> \{ w,x,y,z \} </math>. We are also given the conditional probabilities <math> P(w|a) = \frac{2}{3} </math>, <math> P(x|a) = \frac{1}{3} </math>, <math> P(y|b) = \frac{1}{2} </math>, and <math> P(z|b) = \frac{1}{2} </math>. Figure 1 shows the binary tree. Calculate:

(a) <math> H(Y|a) </math>

(b) <math> H(Y|b) </math>

(c) <math> H(Y|X) </math>

(d) <math> H(X,Y) </math>

'''Solution'''

(a) Use equation 2 to solve for <math> H(Y|a) </math>:

<math>
\begin{align}
  H(Y|a) &= -P(w|a) \log_2 (P(w|a)) - P(x|a) \log_2 (P(x|a)) \\
         &= -\frac{2}{3} \log_2 \left(\frac{2}{3}\right) - \frac{1}{3} \log_2 \left(\frac{1}{3}\right) \\
         &= 0.918 \ \textrm{bits}
\end{align}
</math>

(b) Same as in (a), use equation 2 to solve for <math> H(Y|b) </math>:

<math>
\begin{align}
  H(Y|b) &= -P(y|b) \log_2 (P(y|b)) - P(z|b) \log_2 (P(z|b)) \\
         &= -\frac{1}{2} \log_2 \left(\frac{1}{2}\right) - \frac{1}{2} \log_2 \left(\frac{1}{2}\right) \\
         &= 1.000 \ \textrm{bits}
\end{align}
</math>

(c) This can be solved in two ways. The first is to use equation 3:

<math>
\begin{align}
  H(Y|X) &= E(H(Y|\cdot)) \\
         &= P(a)H(Y|a) + P(b)H(Y|b) \\
         &= \frac{1}{2} (0.918) + \frac{1}{2} (1.000) \\
         &= 0.959 \ \textrm{bits}
\end{align}
</math>

Or, we can solve it using equation 4, but for this we also need the joint probabilities:

* <math> P(a,w) = \frac{1}{3} </math>
* <math> P(a,x) = \frac{1}{6} </math>
* <math> P(b,y) = \frac{1}{4} </math>
* <math> P(b,z) = \frac{1}{4} </math>

<math>
\begin{align}
  H(Y|X) &= -\sum_{i=1}^n \sum_{j=1}^m P(x_i,y_j) \log_2 (P(y_j|x_i)) \\
         &= -P(a,w) \log_2(P(w|a)) - P(a,x) \log_2(P(x|a)) - P(b,y) \log_2(P(y|b)) - P(b,z) \log_2(P(z|b)) \\
         &= -\frac{1}{3} \log_2 \left(\frac{2}{3}\right) - \frac{1}{6} \log_2 \left(\frac{1}{3}\right) - \frac{1}{4} \log_2 \left(\frac{1}{2}\right) - \frac{1}{4} \log_2 \left(\frac{1}{2}\right) \\
         &= 0.959 \ \textrm{bits}
\end{align}
</math>

(d) We already listed the joint probabilities in (c). We simply use equation 1 for this:

<math>
\begin{align}
  H(X,Y) &= -\sum_{i=1}^n \sum_{j=1}^m P(x_i,y_j) \log_2 \left(P(x_i, y_j) \right) \\
         &= -P(a,w) \log_2(P(a,w)) - P(a,x) \log_2(P(a,x)) - P(b,y) \log_2(P(b,y)) - P(b,z) \log_2(P(b,z)) \\
         &= -\frac{1}{3} \log_2 \left(\frac{1}{3}\right) - \frac{1}{6} \log_2 \left(\frac{1}{6}\right) - \frac{1}{4} \log_2 \left(\frac{1}{4}\right) - \frac{1}{4} \log_2 \left(\frac{1}{4}\right) \\
         &= 1.959 \ \textrm{bits}
\end{align}
</math>

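The four answers can also be reproduced numerically. The short Python sketch below (the helper name is ours) starts from the conditional probabilities in Figure 1:

<syntaxhighlight lang="python">
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

P_a = P_b = 1/2
P_w_a, P_x_a = 2/3, 1/3    # P(w|a), P(x|a)
P_y_b, P_z_b = 1/2, 1/2    # P(y|b), P(z|b)

H_Y_a = entropy([P_w_a, P_x_a])                 # (a) 0.918 bits
H_Y_b = entropy([P_y_b, P_z_b])                 # (b) 1.000 bits
H_Y_X = P_a * H_Y_a + P_b * H_Y_b               # (c) 0.959 bits, via equation 3
H_XY  = entropy([P_a * P_w_a, P_a * P_x_a,      # (d) 1.959 bits, via equation 1
                 P_b * P_y_b, P_b * P_z_b])
print(round(H_Y_a, 3), round(H_Y_b, 3), round(H_Y_X, 3), round(H_XY, 3))
</syntaxhighlight>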
'''Important Note!'''

This example shows us an important observation:

{{NumBlk|::|<math> H(X,Y) = H(X) + H(Y|X) </math>|{{EquationRef|5}}}}

We can easily verify that this holds for the binary tree example: <math> H(X) = 1 </math> bit, <math> H(Y|X) = 0.959 </math> bits, and <math> H(X,Y) = 1.959 </math> bits. This has an important interpretation: ''' the combined uncertainty in <math> X </math> and <math> Y </math> (i.e., <math> H(X,Y) </math>) is the sum of the uncertainty which is totally due to <math> X </math> (i.e., <math> H(X) </math>), and the uncertainty which is still due to <math> Y </math> once <math> X </math> has been accounted for (i.e., <math> H(Y|X) </math>).'''

Since <math> H(X,Y) = H(Y,X) </math>, it also follows that:

{{NumBlk|::|<math> H(Y,X) = H(Y) + H(X|Y) </math>|{{EquationRef|6}}}}

If <math> X </math> and <math> Y </math> are independent, then:

{{NumBlk|::|<math> H(X,Y) = H(X) + H(Y) </math>|{{EquationRef|7}}}}

''' An alternative interpretation of <math> H(Y|X) </math> is that it is the information content of <math> Y </math> which is NOT contained in <math> X </math>. '''

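Equation 7 is also easy to check numerically. The sketch below builds a hypothetical independent pair by multiplying two made-up marginal distributions, so the specific numbers are only for illustration:

<syntaxhighlight lang="python">
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical independent pair: P(x_i, y_j) = P(x_i) * P(y_j)
p_x = [1/2, 1/2]
p_y = [1/3, 1/6, 1/4, 1/4]
p_xy = [[px * py for py in p_y] for px in p_x]

H_XY = entropy([p for row in p_xy for p in row])
print(round(H_XY, 3), round(entropy(p_x) + entropy(p_y), 3))  # both 2.959 bits
</syntaxhighlight>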
  
 
== Mutual Information ==
With the above properties of joint and conditional entropies, we can now define the ''mutual information'' as the information content of <math> Y </math> that is contained within <math> X </math>, written as:

{{NumBlk|::|<math> I(X,Y) = H(Y) - H(Y|X) </math>|{{EquationRef|8}}}}

Or:

{{NumBlk|::|<math> I(X,Y) = H(X) - H(X|Y) </math>|{{EquationRef|9}}}}

If we plug in the definitions of entropy and conditional entropy, mutual information in expanded form is:

{{NumBlk|::|<math> I(X,Y) = \sum_{i=1}^n \sum_{j=1}^m P(x_i,y_j) \log_2 \left(\frac{P(x_i,y_j)}{P(x_i)P(y_j)} \right) </math>|{{EquationRef|10}}}}

It also follows that:

* <math> I(X,Y) = I(Y,X) </math>. This is clear from the symmetry of equation 10.
* If <math> X </math> and <math> Y </math> are independent, then <math> I(X,Y) = 0 </math>. This also follows from equation 10, since <math> P(x_i,y_j) = P(x_i)P(y_j) </math> makes every logarithm term zero.
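Equation 10 also translates directly into a few lines of Python. The function name and the probability table are ours; the table is the joint distribution of the binary tree example, so the result should match <math> H(Y) - H(Y|X) </math>:

<syntaxhighlight lang="python">
import math

def mutual_information(p_xy):
    """I(X,Y) in bits from a table p_xy[i][j] = P(x_i, y_j), using equation 10."""
    p_x = [sum(row) for row in p_xy]         # marginals P(x_i)
    p_y = [sum(col) for col in zip(*p_xy)]   # marginals P(y_j)
    return sum(p * math.log2(p / (p_x[i] * p_y[j]))
               for i, row in enumerate(p_xy)
               for j, p in enumerate(row) if p > 0)

p_xy = [[1/3, 1/6, 0, 0],    # P(a,w), P(a,x), P(a,y), P(a,z)
        [0, 0, 1/4, 1/4]]    # P(b,w), P(b,x), P(b,y), P(b,z)
print(round(mutual_information(p_xy), 3))  # 1.0 bit
</syntaxhighlight>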
As a short example, we can use the binary tree problem again. Let's compute <math> I(X,Y) </math>. We simply need to use equation 8: <math> I(X,Y) = H(Y) - H(Y|X) </math>. We already know that <math> H(Y|X) = 0.959 </math> bits. <math> H(Y) </math> is computed by first finding <math>\{P(w), P(x), P(y), P(z) \} </math>. Recall that:

<math> P(y_j) = \sum_{i=1}^n P(y_j|x_i)P(x_i) </math>

In this example, each outcome of <math> Y </math> has the same probability as the corresponding joint probability, because each outcome of <math> Y </math> can be reached through only one outcome of <math> X </math>. For example, event <math> w </math> only occurs if event <math> a </math> occurred first, and event <math> x </math> likewise occurs only if event <math> a </math> occurred first. So it means that:

* <math> P(w) = P(w|a)P(a) = P(w,a) = \frac{1}{3} </math>
* <math> P(x) = P(x|a)P(a) = P(x,a) = \frac{1}{6} </math>
* <math> P(y) = P(y|b)P(b) = P(y,b) = \frac{1}{4} </math>
* <math> P(z) = P(z|b)P(b) = P(z,b) = \frac{1}{4} </math>

Therefore, <math> H(Y) = 1.959 </math> bits, which in this case is numerically equal to <math> H(X,Y) </math> since each outcome of <math> Y </math> completely determines <math> X </math> (i.e., <math> H(X|Y) = 0 </math>). The mutual information is then <math> I(X,Y) = H(Y) - H(Y|X) = 1.959 - 0.959 = 1.000 </math> bits.

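The same result can be checked by building the marginals of <math> Y </math> first, as in the sketch below (same assumptions as the earlier snippets, with <math> H(Y|X) </math> taken from part (c) of the example):

<syntaxhighlight lang="python">
import math

# Marginals of Y from the tree: P(y_j) = sum over i of P(y_j|x_i) P(x_i)
P_Y = [2/3 * 1/2, 1/3 * 1/2, 1/2 * 1/2, 1/2 * 1/2]   # P(w), P(x), P(y), P(z)

H_Y  = -sum(p * math.log2(p) for p in P_Y)   # 1.959 bits
I_XY = H_Y - 0.959                           # equation 8, with H(Y|X) from part (c)
print(round(H_Y, 3), round(I_XY, 3))         # 1.959 1.0
</syntaxhighlight>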
  
  
