Nonnegativity of Information Measures

Information Divergence is Nonnegative

Let <math>p(x)</math> and <math>q(x)</math> be two different probability distributions over a discrete alphabet <math>\mathcal{X}</math>. The relative entropy or Kullback-Leibler divergence between <math>p</math> and <math>q</math> is defined as

<math> D(p\|q) = \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}</math>.
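As a quick illustration, the definition translates directly into code. The following minimal Python sketch (the function name kl_divergence and the example distributions are our own, purely illustrative choices) evaluates <math>D(p\|q)</math> in nats for strictly positive distributions over the same finite alphabet:

<syntaxhighlight lang="python">
import math

def kl_divergence(p, q):
    """D(p || q) in nats, assuming p and q are strictly positive
    distributions over the same finite alphabet."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

# Two hypothetical distributions over a 3-symbol alphabet.
p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))  # > 0, since p and q differ
print(kl_divergence(p, p))  # exactly 0, since the two distributions are identical
</syntaxhighlight>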

We will prove the following properties of the KL divergence:

  • KL divergence is always nonnegative.
  • KL divergence is zero if and only if <math>p(x) = q(x)</math> for all <math>x</math>.

Proof

Without loss of generality, assume that the base of the logarithm is <math>e</math>. This assumption will allow us to use the property that <math>\ln r \le r - 1</math> for all <math>r > 0</math>. To keep the proof a bit simpler, we further assume that <math>p(x)</math> and <math>q(x)</math> are strictly positive. Rest assured that the result still holds if either distribution is zero for some values of <math>x</math>.

<math> -D(p\|q) = \sum_{x\in\mathcal{X}} p(x) \ln\frac{q(x)}{p(x)} \le \sum_{x\in\mathcal{X}} p(x)\left(\frac{q(x)}{p(x)} - 1\right) = \sum_{x\in\mathcal{X}} q(x) - \sum_{x\in\mathcal{X}} p(x) = 1 - 1 = 0,</math>

where we used the fact that discrete probability distributions add up to 1. Multiplying both sides by <math>-1</math> gives <math>D(p\|q) \ge 0</math>.

Whenever we encounter mathematical results expressed as an inequality, it always makes sense to ask if the lower/upper bound is tight. By that, we mean to ask: when does equality hold? In the proof, we only produce an inequality by using <math>\ln r \le r - 1</math>. Hence, if we know the value of <math>r</math> for which equality holds, we will understand better the tightness of our KL divergence bound. Indeed, our logarithmic inequality will become an equation if and only if <math>r = 1</math>. In the context of our proof, equality will hold if the expression <math>\frac{q(x)}{p(x)}</math> equals 1 for all <math>x</math>, which is just another way of saying that <math>p(x) = q(x)</math> for all <math>x</math>.

Remarks

We have just shown that KL divergence is a nonnegative number obtained from comparing two discrete distributions over the same alphabet, and that it will be exactly zero precisely when the two distributions are identical. At this point, it might be tempting to assume that KL divergence is a "distance" between discrete distributions. A careful look at the expression, however, should reveal that relative entropy is not symmetric with respect to <math>p</math> and <math>q</math>, i.e., <math>D(p\|q) \ne D(q\|p)</math> in general.
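To make the asymmetry concrete, here is a small self-contained numerical check. The two distributions below are hypothetical, chosen only for illustration:

<syntaxhighlight lang="python">
import math

def kl(p, q):
    # D(p || q) in nats for strictly positive finite distributions
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, q))  # ~0.511 nats
print(kl(q, p))  # ~0.368 nats, a different value: D(p||q) != D(q||p) in general
</syntaxhighlight>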

Checkpoint: Let <math>p</math> and <math>q</math> be distributions over an alphabet of size 5 such that and . Calculate <math>D(p\|q)</math> and <math>D(q\|p)</math>.

Conditional Mutual Information is Nonnegative

In this section, we show that for any three random variables <math>X, Y, Z</math>, the conditional mutual information <math>I(X;Y|Z)</math> is nonnegative. The important consequences of this statement are the following:

  • Entropy is nonnegative.
  • Mutual information is nonnegative.
  • Conditional entropy is nonnegative.

These consequences are immediate by considering special cases for <math>X, Y, Z</math>. The entropy of <math>X</math> can be obtained as follows: <math>H(X) = I(X;X|Z)</math>, where <math>Z\perp X</math>. The conditional entropy of <math>X</math> given <math>Z</math> is similarly obtained by removing the independence restriction: <math>H(X|Z) = I(X;X|Z)</math>. Lastly, (unconditional) mutual information can be obtained by requiring that <math>Z\perp (X,Y)</math>, so that <math>I(X;Y) = I(X;Y|Z)</math>.
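These special cases can also be checked numerically. In the sketch below, the helper cond_mutual_information and the example pmf are our own illustrative choices; it computes <math>I(X;Y|Z)</math> directly from a joint pmf and verifies that setting <math>Y = X</math> with <math>Z\perp X</math> recovers <math>H(X)</math>:

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def cond_mutual_information(pxyz):
    """I(X;Y|Z) in nats from a dict {(x, y, z): probability}, using
    I(X;Y|Z) = sum p(x,y,z) * log( p(z) p(x,y,z) / (p(x,z) p(y,z)) )."""
    pz, pxz, pyz = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), p in pxyz.items():
        pz[z] += p
        pxz[(x, z)] += p
        pyz[(y, z)] += p
    return sum(p * math.log(pz[z] * p / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), p in pxyz.items() if p > 0)

# Hypothetical X with pmf [0.5, 0.25, 0.25]; set Y = X and let Z be a fair
# coin independent of X, so that I(X;X|Z) should recover H(X).
px = {0: 0.5, 1: 0.25, 2: 0.25}
pxyz = {(x, x, z): p * 0.5 for x, p in px.items() for z in (0, 1)}
h_x = -sum(p * math.log(p) for p in px.values())
print(cond_mutual_information(pxyz), h_x)  # both ~1.04 nats
</syntaxhighlight>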

Proof

The proof is outlined as follows:

  • We express the mutual information between <math>X</math> and <math>Y</math> as the KL divergence between two distributions over the joint alphabet <math>\mathcal{X}\times\mathcal{Y}</math>.
  • We use the nonnegativity of relative entropy to argue that <math>I(X;Y) \ge 0</math>.
  • We show that conditioning will preserve nonnegativity, i.e., <math>I(X;Y|Z) \ge 0</math>.

The first part of the proof becomes apparent after writing down the definition of mutual information:

<math> I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}</math>.

The two distributions over <math>\mathcal{X}\times\mathcal{Y}</math> that we need to set up the KL divergence are as follows: (1) the joint distribution <math>p(x,y)</math>, and (2) the distribution obtained by simply multiplying the marginal probabilities <math>p(x)</math> and <math>p(y)</math>. Observe that the latter distribution corresponds to the joint distribution if <math>X</math> and <math>Y</math> were independent. With a slight abuse of notation, we now write

<math> I(X;Y) = D\left(p(x,y)\,\|\,p(x)p(y)\right) \ge 0</math>,

where we have used the nonnegativity of relative entropy. This small result is quite satisfying on its own because it reinforces the notion that mutual information is related to the degree of non-independence between <math>X</math> and <math>Y</math>. Indeed, equality will hold if and only if <math>p(x,y) = p(x)p(y)</math> for all <math>(x,y)</math>, which is equivalent to <math>X</math> and <math>Y</math> being independent.
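Here is a brief numerical illustration of this identity. The joint pmf below is a made-up example, and the code simply evaluates the KL divergence between <math>p(x,y)</math> and <math>p(x)p(y)</math> term by term:

<syntaxhighlight lang="python">
import math

# Hypothetical joint pmf of (X, Y), indexed as pxy[x][y]
pxy = [[0.30, 0.10],
       [0.10, 0.50]]
px = [sum(row) for row in pxy]        # marginal of X
py = [sum(col) for col in zip(*pxy)]  # marginal of Y

# I(X;Y) = D( p(x,y) || p(x)p(y) ), accumulated term by term in nats
mi = sum(pxy[x][y] * math.log(pxy[x][y] / (px[x] * py[y]))
         for x in range(2) for y in range(2) if pxy[x][y] > 0)
print(mi)  # ~0.18 nats: positive, since this X and Y are not independent
</syntaxhighlight>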

Arguing for the nonnegativity of <math>I(X;Y|Z)</math> follows from the following expression: <math>I(X;Y|Z) = \sum_{z} p(z) I(X;Y|Z=z)</math>. Observe the similarity between this equation and conditional entropy. Since <math>p(z)</math> and <math>I(X;Y|Z=z)</math> are nonnegative values, the linear combination must also be nonnegative and the conclusion follows.
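The decomposition can be mirrored directly in code. In this sketch, the helper names mi and cond_mi and the example pmf are our own; cond_mi literally forms the weighted sum <math>\sum_{z} p(z) I(X;Y|Z=z)</math>, where every weight and every term is nonnegative:

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def mi(pxy):
    """I(X;Y) in nats from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in pxy.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

def cond_mi(pxyz):
    """I(X;Y|Z) as the weighted sum of per-z mutual informations."""
    pz = defaultdict(float)
    for (x, y, z), p in pxyz.items():
        pz[z] += p
    total = 0.0
    for z, w in pz.items():
        slice_xy = {(x, y): p / w for (x, y, zz), p in pxyz.items() if zz == z}
        total += w * mi(slice_xy)  # nonnegative weight times nonnegative I(X;Y|Z=z)
    return total

# Usage with a small hypothetical joint pmf over (x, y, z):
pxyz = {(0, 0, 0): 0.25, (1, 1, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25}
print(cond_mi(pxyz))  # = ln 2: given Z, knowing X pins down Y exactly
</syntaxhighlight>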

Connection to Markov Chains

Conditional mutual information is closely connected with the conditional independence property in Markov chains. In particular, observe that conditioning can never increase entropy: for any random variables <math>X</math>, <math>Y</math>, and <math>Z</math>, we have <math>H(X|Y,Z) \le H(X|Z)</math>.

Checkpoint: Starting with <math>I(X;Y|Z) \ge 0</math>, argue that <math>H(X|Y,Z) \le H(X|Z)</math>.

The conditions under which equality holds, i.e. <math>I(X;Y|Z) = 0</math>, can be seen by setting

<math> \sum_{z} p(z) I(X;Y|Z=z) = 0</math>.

Since <math>p(z) \ge 0</math> for all <math>z</math>, the summation will vanish if and only if each term <math>I(X;Y|Z=z)</math> is forced to be equal to zero, which means that if we know the value taken by <math>Z</math>, then <math>X</math> and <math>Y</math> are conditionally independent, or <math>X\perp Y|Z</math>. This is equivalent to saying that <math>X\rightarrow Z \rightarrow Y</math> forms a Markov chain, wherein knowing <math>Z</math> "breaks" the dependence between <math>X</math> and <math>Y</math>.
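As a closing illustration, the sketch below constructs a hypothetical Markov chain <math>X\rightarrow Z\rightarrow Y</math> (a fair bit passed through two binary symmetric channels with crossover probability 0.1; the construction and helper names are ours) and checks that <math>I(X;Y|Z)</math> vanishes while <math>I(X;Y)</math> does not:

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def flip(a, b, eps):
    """Probability that a binary symmetric channel with crossover eps maps bit a to bit b."""
    return 1 - eps if a == b else eps

# Markov chain X -> Z -> Y: p(x, y, z) = p(x) p(z|x) p(y|z)
pxyz = {(x, y, z): 0.5 * flip(x, z, 0.1) * flip(z, y, 0.1)
        for x in (0, 1) for y in (0, 1) for z in (0, 1)}

def marginal(dist, keep):
    """Sum out all coordinates except the indices listed in keep."""
    out = defaultdict(float)
    for key, p in dist.items():
        out[tuple(key[i] for i in keep)] += p
    return out

pz, pxz, pyz = marginal(pxyz, [2]), marginal(pxyz, [0, 2]), marginal(pxyz, [1, 2])
pxy, px, py = marginal(pxyz, [0, 1]), marginal(pxyz, [0]), marginal(pxyz, [1])

# Knowing Z "breaks" the dependence, so I(X;Y|Z) = 0, while I(X;Y) > 0.
i_xy_given_z = sum(p * math.log(pz[(z,)] * p / (pxz[(x, z)] * pyz[(y, z)]))
                   for (x, y, z), p in pxyz.items() if p > 0)
i_xy = sum(p * math.log(p / (px[(x,)] * py[(y,)]))
           for (x, y), p in pxy.items() if p > 0)
print(i_xy_given_z)  # ~0 (up to floating-point roundoff)
print(i_xy)          # ~0.22 nats > 0
</syntaxhighlight>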