Nonnegativity of Information Measures
Information Divergence is Nonnegative
Let <math>p, q</math> be two probability distributions over a discrete alphabet <math>\mathcal{X}</math>. The relative entropy or Kullback-Leibler (KL) divergence between <math>p</math> and <math>q</math> is defined as

<math>D(p\|q) = \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}.</math>
We will prove the following properties of the KL divergence:
- KL divergence is always nonnegative: <math>D(p\|q) \ge 0</math>.
- KL divergence is zero if and only if <math>p(x) = q(x)</math> for all <math>x\in\mathcal{X}</math>.
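To make the definition concrete, here is a minimal Python sketch (not part of the original article; the helper name and the example distributions are ours) that evaluates the divergence directly from the formula above and illustrates both properties.

<syntaxhighlight lang="python">
import math

def kl_divergence(p, q):
    # D(p || q) = sum_x p(x) log( p(x) / q(x) ), in nats; terms with p(x) = 0 contribute 0.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Two example distributions over the same three-letter alphabet.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # strictly positive (about 0.025 nats)
print(kl_divergence(p, p))  # exactly 0.0, since the two distributions are identical
</syntaxhighlight>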
Proof
Without loss of generality, assume that the base of the logarithm is <math>e</math>. This assumption allows us to use the property that <math>\ln t \ge 1 - 1/t</math> for all <math>t > 0</math>. To keep the proof a bit simpler, we further assume that <math>p</math> and <math>q</math> are strictly positive. Rest assured that the result still holds even if either distribution is zero for some values of <math>x</math>.
<math>D(p\|q) = \sum_{x\in\mathcal{X}} p(x)\ln\frac{p(x)}{q(x)} \ge \sum_{x\in\mathcal{X}} p(x)\left(1 - \frac{q(x)}{p(x)}\right) = \sum_{x\in\mathcal{X}} p(x) - \sum_{x\in\mathcal{X}} q(x) = 1 - 1 = 0,</math>

where we used the fact that discrete probability distributions add up to 1.
Whenever we encounter a mathematical result expressed as an inequality, it always makes sense to ask if the lower/upper bound is tight. By that, we mean to ask: when does equality hold? In the proof, the only step that produces an inequality is the use of <math>\ln t \ge 1 - 1/t</math>. Hence, if we know the values of <math>t</math> for which equality holds, we will understand exactly how tight our KL divergence bound is. Indeed, the logarithmic inequality becomes an equation if and only if <math>t = 1</math>. In the context of our proof, <math>t = p(x)/q(x)</math>, so equality holds if and only if <math>p(x)/q(x) = 1</math> for every <math>x</math>, which is just another way of saying that <math>p(x) = q(x)</math>.
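As a quick numerical sanity check on the lemma driving the proof, the following sketch (ours, not from the original article) evaluates both sides of <math>\ln t \ge 1 - 1/t</math> at a few sample points and shows that the gap closes only at <math>t = 1</math>.

<syntaxhighlight lang="python">
import math

# Spot-check ln(t) >= 1 - 1/t on a handful of points; the gap is zero only at t = 1.
for t in [0.1, 0.5, 0.9, 1.0, 1.1, 2.0, 10.0]:
    lhs, rhs = math.log(t), 1 - 1 / t
    print(f"t = {t:>4}: ln t = {lhs:+.4f}, 1 - 1/t = {rhs:+.4f}, gap = {lhs - rhs:.4f}")
</syntaxhighlight>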
Remarks
We have just shown that KL divergence is a nonnegative number obtained from comparing two discrete distributions over the same alphabet, and that it is exactly zero precisely when the two distributions are identical. At this point, it might be tempting to assume that KL divergence is a "distance" between discrete distributions. A careful look at the expression, however, should reveal that relative entropy is not symmetric with respect to <math>p</math> and <math>q</math>, i.e., <math>D(p\|q) \ne D(q\|p)</math> in general.
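A small example makes the asymmetry visible. The sketch below (again ours, with arbitrarily chosen distributions) repeats the <code>kl_divergence</code> helper from the earlier sketch and evaluates the divergence in both directions.

<syntaxhighlight lang="python">
import math

def kl_divergence(p, q):
    # Same helper as in the earlier sketch: D(p || q) in nats.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # about 0.51 nats
print(kl_divergence(q, p))  # about 0.37 nats -- a different number, so D is not symmetric
</syntaxhighlight>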
Conditional Mutual Information is Nonnegative
In this section, we show that for any three random variables <math>X, Y, Z</math>, the conditional mutual information <math>I(X;Y|Z)</math> is nonnegative. The important consequences of this statement are the following:
- Entropy is nonnegative.
- Mutual information is nonnegative.
- Conditional entropy is nonnegative.
These consequences are immediate by considering special cases for <math>I(X;Y|Z)</math>. The entropy of <math>X</math> can be obtained as follows: <math>H(X) = I(X;X|Z)</math>, where <math>X\perp Z</math>. The conditional entropy of <math>X</math> given <math>Z</math> is similarly obtained by removing the independence restriction: <math>H(X|Z) = I(X;X|Z)</math>. Lastly, (unconditional) mutual information can be obtained by requiring that <math>X,Y\perp Z</math>.
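As an illustration of the first special case in its unconditional form, <math>H(X) = I(X;X)</math>, the following sketch (ours; the example distribution is made up for illustration) computes both quantities directly from their definitions and confirms that they agree.

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def entropy(pmf):
    # H(X) = -sum_x p(x) log2 p(x), in bits.
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), in bits.
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

# Marginal of X taken from an arbitrary example distribution.
px = {0: 0.5, 1: 0.3, 2: 0.2}

# The "joint" pmf of the pair (X, X) puts all of its mass on the diagonal.
p_xx = {(x, x): p for x, p in px.items()}

print(entropy(px), mutual_information(p_xx))  # the two values coincide (about 1.485 bits)
</syntaxhighlight>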
Proof
The proof is outlined as follows:
- We express the mutual information between <math>X</math> and <math>Y</math> as the KL divergence between two distributions over the joint alphabet <math>\mathcal{X}\times\mathcal{Y}</math>.
- We use the nonnegativity of relative entropy to argue that <math>I(X;Y) \ge 0</math>.
- We show that conditioning preserves nonnegativity, i.e., <math>I(X;Y|Z) \ge 0</math>.
The first part of the proof becomes apparent after writing down the definition of mutual information:

<math>I(X;Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}.</math>
The two distributions over <math>\mathcal{X}\times\mathcal{Y}</math> that we need to set up the KL divergence are as follows: (1) the joint distribution <math>p(x,y)</math>, and (2) the distribution obtained by simply multiplying the marginal probabilities <math>p(x)</math> and <math>p(y)</math>. Observe that the latter distribution corresponds to the joint distribution that would arise if <math>X</math> and <math>Y</math> were independent. With a slight abuse of notation, we now write
<math>I(X;Y) = D\big(p(x,y)\,\big\|\,p(x)p(y)\big) \ge 0</math>,
where we have used the nonnegativity of relative entropy. This small result is quite satisfying on its own because it reinforces the notion that mutual information measures the degree of non-independence between <math>X</math> and <math>Y</math>. Indeed, equality will hold if and only if <math>p(x,y) = p(x)p(y)</math> for all <math>(x,y)</math>, which is equivalent to <math>X</math> and <math>Y</math> being independent.
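The identification of mutual information with a KL divergence is easy to check numerically. The sketch below (ours; both joint pmfs are made up for illustration) computes <math>D\big(p(x,y)\,\big\|\,p(x)p(y)\big)</math> for a dependent pair and for an independent pair.

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def kl_divergence(p, q):
    # p and q are dicts over the same finite alphabet; result in bits.
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def product_of_marginals(joint):
    # Build the distribution p(x) * p(y) from the joint pmf of (X, Y).
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), prob in joint.items():
        px[x] += prob
        py[y] += prob
    return {(x, y): px[x] * py[y] for (x, y) in joint}

# A dependent pair: most of the mass sits on the diagonal.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# An independent pair: the joint already equals the product of its marginals.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

print(kl_divergence(dependent, product_of_marginals(dependent)))      # strictly positive
print(kl_divergence(independent, product_of_marginals(independent)))  # exactly 0.0
</syntaxhighlight>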
The nonnegativity of <math>I(X;Y|Z)</math> then follows from the expression <math>I(X;Y|Z) = \sum_{z} p(z) I(X;Y|Z=z)</math>. Observe the similarity between this equation and the definition of conditional entropy. Since the weights <math>p(z)</math> are nonnegative and each term <math>I(X;Y|Z=z)</math> is an ordinary mutual information, which we have just shown to be nonnegative, the linear combination must also be nonnegative, and the conclusion follows.
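The decomposition above translates directly into a computation. The following sketch (ours; the three-variable joint pmf is made up for illustration) evaluates <math>I(X;Y|Z)</math> term by term as <math>\sum_z p(z) I(X;Y|Z=z)</math>.

<syntaxhighlight lang="python">
import math
from collections import defaultdict

# Joint pmf of (X, Y, Z) as {(x, y, z): probability}.
p_xyz = {
    (0, 0, 0): 0.20, (0, 1, 0): 0.05, (1, 0, 0): 0.05, (1, 1, 0): 0.20,
    (0, 0, 1): 0.10, (0, 1, 1): 0.15, (1, 0, 1): 0.15, (1, 1, 1): 0.10,
}

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), in bits.
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

def conditional_mutual_information(joint_xyz):
    # I(X;Y|Z) = sum_z p(z) * I(X;Y | Z = z), computed one term at a time.
    pz = defaultdict(float)
    for (x, y, z), p in joint_xyz.items():
        pz[z] += p
    total = 0.0
    for z, pzv in pz.items():
        conditional_xy = {(x, y): p / pzv for (x, y, zz), p in joint_xyz.items() if zz == z}
        total += pzv * mutual_information(conditional_xy)
    return total

print(conditional_mutual_information(p_xyz))  # nonnegative, as the argument above guarantees
</syntaxhighlight>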
Connection to Markov Chains
Conditional mutual information is closely connected with the conditional independence property in Markov chains. We have just shown that <math>I(X;Y|Z) \ge 0</math> for any random variables <math>X</math>, <math>Y</math>, <math>Z</math>, so the natural follow-up question is when the bound is attained, i.e., when <math>I(X;Y|Z) = 0</math>. (Note that conditioning does not, in general, make mutual information smaller: <math>I(X;Y|Z)</math> may be larger or smaller than <math>I(X;Y)</math>, in contrast with entropy, for which <math>H(X|Z) \le H(X)</math> always holds.)

Checkpoint: Starting with <math>I(X;Y|Z) = \sum_{z} p(z) I(X;Y|Z=z)</math>, work out the conditions on the individual terms under which <math>I(X;Y|Z) = 0</math>.
The conditions under which equality holds, i.e., <math>I(X;Y|Z) = 0</math>, can be seen by setting

<math>\sum_{z} p(z)\, I(X;Y|Z=z) = 0</math>.
Since every term <math>p(z)\, I(X;Y|Z=z)</math> is nonnegative, the summation vanishes if and only if <math>I(X;Y|Z=z) = 0</math> for every <math>z</math> with <math>p(z) > 0</math>. This means that once we know the value taken by <math>Z</math>, the variables <math>X</math> and <math>Y</math> are conditionally independent, written <math>X\perp Y\,|\,Z</math>. This is equivalent to saying that <math>X\rightarrow Z \rightarrow Y</math> forms a Markov chain, wherein knowing <math>Z</math> "breaks" the dependence between <math>X</math> and <math>Y</math>.
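To close the loop, here is a sketch (ours; the chain is a fair bit passed through two binary symmetric channels with arbitrarily chosen crossover probabilities) that builds a joint pmf satisfying the Markov property <math>X\rightarrow Z \rightarrow Y</math> and checks that <math>I(X;Y|Z)</math> vanishes while <math>I(X;Y)</math> does not.

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def bsc(bit, crossover):
    # Output distribution of a binary symmetric channel with the given crossover probability.
    return {bit: 1 - crossover, 1 - bit: crossover}

# Joint pmf of a Markov chain X -> Z -> Y:
# X is a fair bit, Z is X through a BSC(0.1), and Y is Z through another BSC(0.1).
p_xyz = defaultdict(float)
for x in (0, 1):
    for z, pz_given_x in bsc(x, 0.1).items():
        for y, py_given_z in bsc(z, 0.1).items():
            p_xyz[(x, y, z)] += 0.5 * pz_given_x * py_given_z

def mutual_information(joint):
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

def conditional_mutual_information(joint_xyz):
    pz = defaultdict(float)
    for (x, y, z), p in joint_xyz.items():
        pz[z] += p
    return sum(pzv * mutual_information({(x, y): p / pzv
                                         for (x, y, zz), p in joint_xyz.items() if zz == z})
               for z, pzv in pz.items())

p_xy = defaultdict(float)
for (x, y, z), p in p_xyz.items():
    p_xy[(x, y)] += p

print(conditional_mutual_information(p_xyz))  # essentially 0: X and Y are independent given Z
print(mutual_information(p_xy))               # strictly positive: X and Y are still dependent
</syntaxhighlight>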