Conditional Entropy and Mutual Information

When we work with multiple information sources (random variables) <math>X</math> and <math>Y</math>, the following details are useful:

  • How much information overlap is there between <math>X</math> and <math>Y</math>?
  • How much information is contained in one but not present in the other?

In information theory, we formalize the first question using the concept of mutual information, while the second question is addressed using the concept of conditional entropy.

Definition

For two random variables <math>X</math> and <math>Y</math>, the mutual information between them, denoted by <math>I(X;Y)</math>, is given by

<math>I(X;Y) = \sum_{x\in \mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}</math>,

and the conditional entropy of <math>X</math> given <math>Y</math> is given by

<math>H(X|Y) = -\sum_{x\in \mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y) \log p(x|y)</math>.

Observe that the fraction in the definition of <math>I(X;Y)</math> consists of (1) the joint distribution <math>p(x,y)</math>, and (2) the product of marginal distributions <math>p(x)p(y)</math>. <math>I(X;Y)</math> is a measure of how far <math>X</math> and <math>Y</math> are from being independent, in which case <math>p(x,y) = p(x)p(y)</math> and the entire summation vanishes since <math>\log 1 = 0</math>. In our discussions, we simplify the notation so that <math>p(x,y)</math> and <math>p(y,x)</math> both refer to the probability that <math>X=x, Y=y</math>. With this remark, <math>I(X;Y) = I(Y;X)</math>. However, conditional entropy is not symmetric, i.e. in general <math>H(X|Y)</math> is not equal to <math>H(Y|X)</math>.
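
To make the definitions concrete, here is a minimal Python sketch that evaluates both formulas on a small, made-up joint distribution (the probabilities below are purely illustrative). It also checks the symmetry of <math>I(X;Y)</math> and the asymmetry of conditional entropy numerically.

<pre>
from math import log2

# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal distributions p(x) and p(y).
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

def mutual_information(p_xy, p_x, p_y):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    return sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

def conditional_entropy(p_xy, p_y):
    """H(X|Y) = -sum_{x,y} p(x,y) * log2 p(x|y), with p(x|y) = p(x,y) / p(y)."""
    return -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

# I(X;Y) equals I(Y;X): swapping the roles of the two variables gives the same value.
p_yx = {(y, x): p for (x, y), p in p_xy.items()}
print(mutual_information(p_xy, p_x, p_y), mutual_information(p_yx, p_y, p_x))

# H(X|Y) and H(Y|X) are generally different.
print(conditional_entropy(p_xy, p_y), conditional_entropy(p_yx, p_x))
</pre>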

When <math>X</math> and <math>Y</math> are independent, all conditional probabilities <math>p(x|y)</math> reduce to <math>p(x)</math>, i.e., conditioning on <math>Y</math> does not change the probability of <math>X</math>. Hence,

<math>H(X|Y) = -\sum_{x\in \mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y) \log p(x) = -\sum_{x\in \mathcal{X}} p(x) \log p(x) = H(X).</math>

Thus, <math>H(X|Y) = H(X)</math> if <math>X</math> and <math>Y</math> are independent.

Checkpoint: Why did <math>Y</math> disappear in the last step?
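
As a quick numerical sanity check of this special case, the short Python sketch below builds a joint distribution that is independent by construction (the marginals are made up) and confirms that <math>H(X|Y) = H(X)</math>.

<pre>
from math import log2

# Hypothetical marginals; the joint pmf is independent by construction: p(x, y) = p(x) p(y).
p_x = {0: 0.25, 1: 0.75}
p_y = {0: 0.5, 1: 0.5}
p_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

H_X = -sum(p * log2(p) for p in p_x.values())
H_X_given_Y = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items())

print(H_X, H_X_given_Y)  # equal: conditioning on an independent Y changes nothing
</pre>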

Even if we do not require <math>X</math> and <math>Y</math> to be independent, rearranging terms reveals yet another intuitive property:

<math>H(X|Y) = -\sum_{y\in\mathcal{Y}} \sum_{x\in \mathcal{X}} p(y) p(x|y) \log p(x|y) = \sum_{y\in\mathcal{Y}} p(y) \left( -\sum_{x\in \mathcal{X}} p(x|y) \log p(x|y) \right).</math>

The summation inside the parenthesis should be very familiar. Indeed, if we ignore the conditioning on <math>y</math>, we see that the summation is essentially a calculation of entropy. Together with the weights <math>p(y)</math>, <math>H(X|Y)</math> is simply a weighted average of multiple entropy values! For a detailed discussion of this property with a numerical example, please see Activity 2.
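
The following Python sketch, again with an illustrative joint distribution, evaluates <math>H(X|Y)</math> both ways: once from the double-sum definition and once as a <math>p(y)</math>-weighted average of the entropies of the conditional distributions <math>p(x|y)</math>.

<pre>
from math import log2

# Hypothetical joint distribution and its Y-marginal.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_y = {0: 0.6, 1: 0.4}

# Direct definition: H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y).
direct = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items())

# Weighted average: for each y, compute the entropy of p(x|y), then weight by p(y).
weighted = 0.0
for y, py in p_y.items():
    p_x_given_y = [p / py for (x, yy), p in p_xy.items() if yy == y]
    weighted += py * -sum(q * log2(q) for q in p_x_given_y if q > 0)

print(direct, weighted)  # same value
</pre>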

Entropy, conditional entropy, and mutual information

Below are some properties that relate all information measures we discussed so far (entropy, conditional entropy, mutual information):

  • <math>I(X;Y) \geq 0</math>, where equality holds if and only if <math>X</math> and <math>Y</math> are independent.
  • For all random variables <math>X</math> and <math>Y</math>, <math>I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)</math>.
  • For all random variables <math>X</math> and <math>Y</math>, <math>H(X,Y) = H(X) + H(Y) - I(X;Y)</math>.

The first property is crucial in understanding the implications of mutual information. Wherever <math>I(X;Y)</math> occurs in an important property, it greatly helps to check whether the property still makes sense when <math>I(X;Y) = 0</math>, knowing that <math>X</math> and <math>Y</math> would be independent under such a constraint.

Checkpoint: Do the second and third properties make sense when <math>I(X;Y) = 0</math>?

In many cases, however, <math>I(X;Y)</math> would be strictly positive. The second property then implies that <math>H(X|Y)</math> would be strictly less than <math>H(X)</math>, i.e., conditioning reduces entropy, and <math>I(X;Y)</math> is precisely the amount of reduction in entropy/information about <math>X</math> when we know <math>Y</math>.
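
Here is a small numerical check of these identities in Python, using a hypothetical joint distribution; it verifies that <math>I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)</math> and that <math>H(X,Y) = H(X) + H(Y) - I(X;Y)</math>.

<pre>
from math import log2

# Hypothetical joint distribution and its marginals.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.6, 1: 0.4}

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy)
H_X_given_Y = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)
H_Y_given_X = -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)
I_XY = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

print(I_XY, H_X - H_X_given_Y, H_Y - H_Y_given_X)  # all equal
print(H_XY, H_X + H_Y - I_XY)                      # also equal
</pre>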

The Venn diagram shown below serves as a visual aid to remember the relationships between entropy, conditional entropy and mutual information.

(Figures: Venn diagram.svg, Venn diagram joint entropy.svg)

Extensions

What happens when we deal with more than two random variables? To facilitate the discussion, let us recall the chain rule for joint distributions.

Let <math>X_1, X_2, \ldots, X_n</math> be a sequence of discrete random variables. Then, their joint distribution can be factored as follows:

<math>p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2|x_1)\, p(x_3|x_1, x_2) \cdots p(x_n|x_1, \ldots, x_{n-1})</math>.
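
As a short illustration, the Python sketch below checks this factorization term by term for a hypothetical three-variable pmf, computing the required marginals and conditionals from the joint distribution.

<pre>
from itertools import product

# Hypothetical joint pmf over three binary random variables X1, X2, X3.
p = {(0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
     (1, 0, 0): 0.20, (1, 0, 1): 0.10, (1, 1, 0): 0.15, (1, 1, 1): 0.05}

def marginal(keep):
    """Marginal pmf over the coordinates listed in keep, e.g. [0] -> p(x1)."""
    out = {}
    for xs, pr in p.items():
        key = tuple(xs[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

p1, p12 = marginal([0]), marginal([0, 1])

for x1, x2, x3 in product((0, 1), repeat=3):
    # p(x1) * p(x2|x1) * p(x3|x1,x2)
    rhs = p1[(x1,)] * (p12[(x1, x2)] / p1[(x1,)]) * (p[(x1, x2, x3)] / p12[(x1, x2)])
    assert abs(p[(x1, x2, x3)] - rhs) < 1e-12
</pre>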

Chain rule for entropy

The chain rule for (joint) entropy is very similar to the above expansion, but we use additions instead of multiplications:

<math>H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + \cdots + H(X_n|X_1, \ldots, X_{n-1})</math>.

Checkpoint: Using the properties in the previous section, show that <math>H(X,Y) = H(X) + H(Y|X)</math>.

Although we do not supply a complete proof here, this fact should not be too surprising since entropy operates on logarithms of probabilities and logarithms of product terms expand to sums of logarithms.
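
For a concrete check, the Python sketch below computes each conditional entropy directly from its definition (using a hypothetical three-variable pmf) and confirms that the terms add up to the joint entropy.

<pre>
from math import log2

# Hypothetical joint pmf over three binary random variables X1, X2, X3.
p = {(0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
     (1, 0, 0): 0.20, (1, 0, 1): 0.10, (1, 1, 0): 0.15, (1, 1, 1): 0.05}

def marginal(keep):
    """Marginal pmf over the coordinates listed in keep."""
    out = {}
    for xs, pr in p.items():
        key = tuple(xs[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

p1, p12 = marginal([0]), marginal([0, 1])

H_1 = -sum(pr * log2(pr) for pr in p1.values())                                # H(X1)
H_2_given_1 = -sum(pr * log2(pr / p1[(k[0],)]) for k, pr in p12.items())       # H(X2|X1)
H_3_given_12 = -sum(pr * log2(pr / p12[(k[0], k[1])]) for k, pr in p.items())  # H(X3|X1,X2)
H_123 = -sum(pr * log2(pr) for pr in p.values())                               # H(X1,X2,X3)

print(H_123, H_1 + H_2_given_1 + H_3_given_12)  # equal, as the chain rule predicts
</pre>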

Proof of chain rule for <math>H(X_1, X_2, \ldots, X_n)</math>

Let us see that the statement is true for <math>n = 3</math>. Let <math>X, Y, Z</math> be three discrete random variables. The idea of the proof is that we operate on two random variables at a time, since prior to the chain rule we only know that <math>H(X,Y) = H(X) + H(Y|X)</math>. We can write <math>H(X,Y,Z) = H((X,Y), Z)</math>, where we bundle <math>X</math> and <math>Y</math> into one random variable <math>(X,Y)</math>.

The proof now proceeds as follows:

<math>H(X,Y,Z) = H((X,Y), Z) = H(X,Y) + H(Z|X,Y) = H(X) + H(Y|X) + H(Z|X,Y)</math>.

One interpretation of the chain rule shown is that to obtain the total information (joint entropy) of <math>(X, Y, Z)</math> as a whole, we can

  • obtain information about <math>X</math> first, without any prior knowledge: <math>H(X)</math>, then
  • obtain information about <math>Y</math> with knowledge of <math>X</math>: <math>H(Y|X)</math>, then
  • obtain information about <math>Z</math> with knowledge of <math>X</math> and <math>Y</math>: <math>H(Z|X,Y)</math>.

For general <math>n</math>, we can proceed by induction. Below is a sketch of the proof; a short numerical check of the final result follows the list:

  • (Induction hypothesis) For a fixed <math>n = k</math>, assume that any collection of <math>k</math> random variables satisfies the chain rule.
  • (Induction step) When <math>n = k + 1</math>, write the joint entropy of the <math>k + 1</math> rvs as <math>H(X_1, \ldots, X_{k+1}) = H((X_1, \ldots, X_k), X_{k+1})</math>, where the first <math>k</math> rvs are bundled together into one random variable.
  • Use the two-variable chain rule and the induction hypothesis to show that <math>H(X_1, \ldots, X_{k+1}) = H(X_1) + H(X_2|X_1) + \cdots + H(X_{k+1}|X_1, \ldots, X_k)</math>.
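
The Python sketch below is one such numerical check of the final result, using a hypothetical four-variable pmf: each conditional entropy is computed directly from its definition, and their sum matches the joint entropy.

<pre>
from math import log2
from itertools import product

# Hypothetical joint pmf over four binary rvs, built from arbitrary positive weights.
weights = {xs: (3 * i) % 7 + 1 for i, xs in enumerate(product((0, 1), repeat=4))}
total = sum(weights.values())
p = {xs: w / total for xs, w in weights.items()}

def marginal(k):
    """Marginal pmf of the first k coordinates (k = 0 gives the empty tuple with mass 1)."""
    out = {}
    for xs, pr in p.items():
        out[xs[:k]] = out.get(xs[:k], 0.0) + pr
    return out

def joint_entropy(dist):
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

def cond_entropy(k):
    """H(X_k | X_1, ..., X_{k-1}) computed directly from the definition."""
    pk, prev = marginal(k), marginal(k - 1)
    return -sum(pr * log2(pr / prev[key[:-1]]) for key, pr in pk.items() if pr > 0)

n = 4
print(joint_entropy(p), sum(cond_entropy(k) for k in range(1, n + 1)))  # equal
</pre>
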
Checkpoint: The order of "obtaining information" is irrelevant in calculating the joint entropy of multiple rvs. Write the chain rule if we proceed by obtaining information in the following order: <math>Z</math> first, then <math>Y</math>, then <math>X</math>.
Checkpoint: The chain rule we have now works for any collection of random variables. Can you figure out how to simplify the chain rule for Markov chains? (The answer will be discussed in Module 3.)