The Data Processing Inequality
== Entropy Chain Rules ==
As we increase the number of random variables we are dealing with, it is important to understand how this increase affects the entropy. We have previously shown that for two random variables <math>X</math> and <math>Y</math>:
{{NumBlk|::|<math>\begin{align}
H\left(X, Y\right) & = H\left(X\right) + H\left(Y \mid X\right) \\
& = H\left(Y\right) + H\left(X \mid Y\right) \\
& = H\left(Y, X\right)
\end{align}</math>|{{EquationRef|1}}}}
We can use Venn diagrams to visualize these relationships, as seen in Fig. 1. For three random variables <math>X</math>, <math>Y</math>, and <math>Z</math>:
{{NumBlk|::|<math>\begin{align}
H\left(X, Y, Z\right) & = H\left(X\right) + H\left(Y, Z\mid X\right) \\
& = H\left(X\right) + H\left(Y \mid X\right) + H\left(Z \mid Y, X\right)\\
\end{align}</math>|{{EquationRef|2}}}}
In general:
{{NumBlk|::|<math>H\left(X_1, X_2, \ldots, X_n\right) = \sum_{i=1}^n H\left(X_i \mid X_{i-1}, X_{i-2},\ldots, X_1\right)</math>|{{EquationRef|3}}}}
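As a quick sanity check of Eq. 1, the short Python sketch below evaluates <math>H\left(X, Y\right)</math> and <math>H\left(X\right) + H\left(Y \mid X\right)</math> for a small joint distribution. The probability values are arbitrary assumptions chosen purely for illustration and are not part of the article.
<syntaxhighlight lang="python">
import math

# P(x, y) for X, Y in {0, 1}; the numbers are an arbitrary illustrative choice
P_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def H(p):
    """Entropy in bits of a distribution given as a dict of probabilities."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

# Marginal distribution P(x)
P_x = {x: sum(v for (xx, _), v in P_xy.items() if xx == x) for x in (0, 1)}

# H(Y | X) = sum_x P(x) * H(Y | X = x)
H_Y_given_X = sum(
    P_x[x] * H({y: P_xy[(x, y)] / P_x[x] for y in (0, 1)}) for x in (0, 1)
)

print(H(P_xy))               # H(X, Y)
print(H(P_x) + H_Y_given_X)  # H(X) + H(Y | X), equal to H(X, Y) as in Eq. 1
</syntaxhighlight>
The same pattern extends directly to the three-variable and general forms in Eqs. 2 and 3.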
== Conditional Mutual Information ==
Conditional mutual information is the expected value of the mutual information of two random variables given the value of a third. For three random variables <math>X</math>, <math>Y</math>, and <math>Z</math>, it is defined as:
{{NumBlk|::|<math>\begin{align}
I\left(X;Y\mid Z\right) & = \sum_{z\in Z} P\left(z\right) \sum_{y\in Y} \sum_{x\in X} P\left(x,y\mid z\right)\cdot \log_2\frac{P\left(x,y\mid z\right)}{P\left(x\mid z\right)\cdot P\left(y\mid z\right)} \\
& = \sum_{z\in Z} \sum_{y\in Y} \sum_{x\in X} P\left(x, y, z\right)\cdot\log_2 \frac{P\left(x, y, z\right)\cdot P\left(z\right)}{P\left(x, z\right)\cdot P\left(y, z\right)} \\
\end{align}</math>|{{EquationRef|4}}}}
We can rewrite the definition of conditional mutual information as:
{{NumBlk|::|<math>\begin{align}
I\left(X;Y\mid Z\right) & = \sum_{z\in Z} \sum_{y\in Y} \sum_{x\in X} P\left(x, y, z\right)\cdot \left(\log_2\frac{P\left(z\right)}{P\left(x, z\right)} - \log_2\frac{P\left(y, z\right)}{P\left(x, y, z\right)}\right) \\
& = \sum_{z\in Z} \sum_{x\in X} P\left(x, z\right)\cdot \log_2\frac{P\left(z\right)}{P\left(x, z\right)} - \sum_{z\in Z} \sum_{y\in Y} \sum_{x\in X} P\left(x, y, z\right)\cdot \log_2\frac{P\left(y, z\right)}{P\left(x, y, z\right)} \\
& = H\left(X\mid Z\right) - H\left(X\mid Y, Z\right)\\
\end{align}</math>|{{EquationRef|5}}}}
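The sketch below, again with an arbitrarily assumed joint distribution <math>P\left(x, y, z\right)</math>, evaluates Eq. 4 directly and compares it with the entropy form of Eq. 5, using <math>H\left(X \mid Z\right) = H\left(X, Z\right) - H\left(Z\right)</math> and <math>H\left(X \mid Y, Z\right) = H\left(X, Y, Z\right) - H\left(Y, Z\right)</math>.
<syntaxhighlight lang="python">
import math

# P(x, y, z): an arbitrary illustrative joint distribution over {0, 1}^3
P_xyz = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.20, (1, 1, 1): 0.10,
}

def marginal(p, keep):
    """Marginalize the joint dict onto the coordinate indices in `keep`."""
    out = {}
    for k, v in p.items():
        key = tuple(k[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

def H(p):
    """Entropy in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

P_xz = marginal(P_xyz, (0, 2))
P_yz = marginal(P_xyz, (1, 2))
P_z = marginal(P_xyz, (2,))

# Eq. 4: direct definition of I(X; Y | Z)
I_def = sum(
    v * math.log2(v * P_z[(z,)] / (P_xz[(x, z)] * P_yz[(y, z)]))
    for (x, y, z), v in P_xyz.items() if v > 0
)

# Eq. 5: H(X | Z) - H(X | Y, Z), written with joint entropies
I_ent = (H(P_xz) - H(P_z)) - (H(P_xyz) - H(P_yz))

print(I_def, I_ent)  # the two values agree
</syntaxhighlight>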
[[File:Entropy xyz venn.png|thumb|450px|Figure 2: Entropy visualization for three random variables using Venn diagrams.]]
We can visualize this relationship using the Venn diagrams in Fig. 2. Compare this to our expression for the mutual information of two random variables <math>X</math> and <math>Y</math>:
{{NumBlk|::|<math>I\left(X;Y\right) = H\left(X\right) - H\left(X\mid Y\right)</math>|{{EquationRef|6}}}}
=== Chain Rule for Mutual Information ===
For random variables <math>X</math> and <math>Z</math>:
{{NumBlk|::|<math>I\left(X;Z\right) = H\left(X\right) - H\left(X\mid Z\right)</math>|{{EquationRef|7}}}}
And for random variables <math>X</math>, <math>Y</math>, and <math>Z</math>:
{{NumBlk|::|<math>I\left(X;Y,Z\right) = H\left(X\right) - H\left(X\mid Y,Z\right)</math>|{{EquationRef|8}}}}
We can then express the conditional mutual information as:
{{NumBlk|::|<math>\begin{align}
I\left(X;Y\mid Z\right) & = H\left(X\mid Z\right) - H\left(X\mid Y, Z\right) \\
& = H\left(X\right) - I\left(X;Z\right) + I\left(X;Y,Z\right) - H\left(X\right) \\
& = I\left(X;Y,Z\right) - I\left(X;Z\right) \\
\end{align}</math>|{{EquationRef|9}}}}
Rearranging, we then obtain the chain rule for mutual information:
{{NumBlk|::|<math>I\left(X;Y,Z\right) = I\left(X;Y\mid Z\right) + I\left(X;Z\right)</math>|{{EquationRef|10}}}}
Thus, we can extend this to additional random variables:
{{NumBlk|::|<math>\begin{align}
I\left(Y;X_1,X_2,X_3\right) & = I\left(Y;X_3\mid X_2,X_1\right) + I\left(Y;X_2,X_1\right) \\
& = I\left(Y;X_3\mid X_2,X_1\right) + I\left(Y;X_2\mid X_1\right) + I\left(Y;X_1\right)\\
\end{align}</math>|{{EquationRef|11}}}}
In general:
{{NumBlk|::|<math>I\left(Y;X_1, X_2,\ldots ,X_n\right) = \sum_{i=1}^n I\left(Y;X_i\mid X_{i-1}, X_{i-2},\ldots ,X_1\right)</math>|{{EquationRef|12}}}}
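A similar numerical check of the chain rule in Eq. 10, with an arbitrarily assumed joint distribution and all quantities written in terms of joint entropies:
<syntaxhighlight lang="python">
import math

# P(x, y, z): an arbitrary illustrative joint distribution over {0, 1}^3
P_xyz = {
    (0, 0, 0): 0.05, (0, 0, 1): 0.25, (0, 1, 0): 0.10, (0, 1, 1): 0.10,
    (1, 0, 0): 0.15, (1, 0, 1): 0.05, (1, 1, 0): 0.20, (1, 1, 1): 0.10,
}

def marginal(p, keep):
    """Marginalize the joint dict onto the coordinate indices in `keep`."""
    out = {}
    for k, v in p.items():
        key = tuple(k[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

def H(p):
    """Entropy in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

P_x, P_z = marginal(P_xyz, (0,)), marginal(P_xyz, (2,))
P_xz, P_yz = marginal(P_xyz, (0, 2)), marginal(P_xyz, (1, 2))

I_x_yz = H(P_x) + H(P_yz) - H(P_xyz)                   # I(X; Y, Z)
I_x_z = H(P_x) + H(P_z) - H(P_xz)                      # I(X; Z)
I_x_y_given_z = H(P_xz) + H(P_yz) - H(P_z) - H(P_xyz)  # I(X; Y | Z)

print(I_x_yz, I_x_y_given_z + I_x_z)  # both sides of Eq. 10 agree
</syntaxhighlight>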
== Markovity ==
A [https://en.wikipedia.org/wiki/Markov_chain Markov chain] is a random process describing a sequence of possible events in which the probability of each event depends only on the outcome of the previous event. We say that <math>X, Y, Z</math> form a Markov chain in this order, denoted as:
{{NumBlk|::|<math>X \rightarrow Y \rightarrow Z</math>|{{EquationRef|13}}}}
If we can write:
{{NumBlk|::|<math>P\left(X=x, Y=y, Z=z\right) = P\left(Z=z\mid Y=y\right)\cdot P\left(Y=y\mid X=x\right) \cdot P\left(X=x\right)</math>|{{EquationRef|14}}}}
Or in a more compact form:
{{NumBlk|::|<math>P\left(x, y, z\right) = P\left(z\mid y\right)\cdot P\left(y\mid x\right) \cdot P\left(x\right)</math>|{{EquationRef|15}}}}
We can use Markov chains to model how a signal is corrupted when passed through noisy channels. For example, if <math>X</math> is a binary signal, it can be flipped with a certain probability, <math>p</math>, to produce <math>Y</math>, which can again be corrupted to produce <math>Z</math>.
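A minimal sketch of this binary-signal example: assuming a uniform input and a crossover probability of <math>p = 0.1</math> for each stage (both values are illustrative assumptions), the joint distribution is built directly from the factorization in Eq. 15.
<syntaxhighlight lang="python">
import itertools

p = 0.1                  # assumed crossover probability of each channel
P_x = {0: 0.5, 1: 0.5}   # assumed distribution of the input bit X

def channel(inp, out):
    """P(out | inp) for a binary symmetric channel with crossover probability p."""
    return 1 - p if out == inp else p

# Joint distribution from the Markov factorization in Eq. 15:
# P(x, y, z) = P(z | y) * P(y | x) * P(x)
P_xyz = {
    (x, y, z): channel(y, z) * channel(x, y) * P_x[x]
    for x, y, z in itertools.product((0, 1), repeat=3)
}

print(sum(P_xyz.values()))  # sums to 1 (up to floating point): a valid joint distribution
print(P_xyz[(0, 0, 0)])     # P(z=0|y=0) * P(y=0|x=0) * P(x=0) = 0.9 * 0.9 * 0.5
</syntaxhighlight>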
Consider the joint probability of <math>x</math> and <math>z</math> given <math>y</math>, <math>P\left(x, z\mid y\right)</math>. We can express this as:
{{NumBlk|::|<math>P\left(x, z\mid y\right) = \frac{P\left(x, y, z\right)}{P\left(y\right)}</math>|{{EquationRef|16}}}}
And if <math>X \rightarrow Y \rightarrow Z</math>, we get:
{{NumBlk|::|<math>P\left(x, z\mid y\right) = \frac{P\left(z\mid y\right)\cdot P\left(y\mid x\right) \cdot P\left(x\right)}{P\left(y\right)}</math>|{{EquationRef|17}}}}
Since <math>P\left(y,x\right)=P\left(y\mid x\right)\cdot P\left(x\right) = P\left(x\mid y\right)\cdot P\left(y\right)</math>, we can write:
{{NumBlk|::|<math>P\left(x, z\mid y\right) = \frac{P\left(z\mid y\right)\cdot P\left(y, x\right)}{P\left(y\right)}=P\left(z\mid y\right)\cdot P\left(x\mid y\right)</math>|{{EquationRef|18}}}}
Thus, we can say that <math>X</math> and <math>Z</math> are conditionally independent given <math>Y</math>. If we think of <math>X</math> as some past event and <math>Z</math> as some future event, then the past and future events are independent if we know the present event <math>Y</math>. Note that this property is both a good definition of Markovity and a useful tool for checking it.
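The sketch below applies this check to the cascaded-channel joint distribution assumed earlier: for every <math>y</math>, it verifies that <math>P\left(x, z\mid y\right) = P\left(x\mid y\right)\cdot P\left(z\mid y\right)</math>, as in Eq. 18.
<syntaxhighlight lang="python">
import itertools
import math

p = 0.1                  # assumed channel crossover probability
P_x = {0: 0.5, 1: 0.5}   # assumed input distribution

def channel(inp, out):
    """P(out | inp) for a binary symmetric channel."""
    return 1 - p if out == inp else p

# Joint distribution of the cascade X -> Y -> Z, as in Eq. 15
P_xyz = {
    (x, y, z): channel(y, z) * channel(x, y) * P_x[x]
    for x, y, z in itertools.product((0, 1), repeat=3)
}

def marginal(keep):
    """Marginalize P_xyz onto the coordinate indices in `keep`."""
    out = {}
    for k, v in P_xyz.items():
        key = tuple(k[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

P_y, P_xy, P_yz = marginal((1,)), marginal((0, 1)), marginal((1, 2))

# Markovity check of Eq. 18: P(x, z | y) = P(x | y) * P(z | y) for every y
for x, y, z in itertools.product((0, 1), repeat=3):
    lhs = P_xyz[(x, y, z)] / P_y[(y,)]
    rhs = (P_xy[(x, y)] / P_y[(y,)]) * (P_yz[(y, z)] / P_y[(y,)])
    assert math.isclose(lhs, rhs)
print("X and Z are conditionally independent given Y")
</syntaxhighlight>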
We can rewrite the joint probability as:
{{NumBlk|::|<math>\begin{align}
P\left(x, y, z\right) & = P\left(x, z\mid y\right)\cdot P\left(y\right) \\
& = P\left(x\mid y\right)\cdot P\left(z\mid y\right)\cdot P\left(y\right) \\
& = P\left(z, y\right)\cdot P\left(x\mid y\right)\\
& = P\left(x\mid y\right) \cdot P\left(y\mid z\right) \cdot P\left(z\right)\\
\end{align}</math>|{{EquationRef|19}}}}
Therefore, if <math>X \rightarrow Y \rightarrow Z</math>, then it follows that <math>Z \rightarrow Y \rightarrow X</math>.
== The Data Processing Inequality ==
Consider three random variables, <math>X</math>, <math>Y</math>, and <math>Z</math>. Using the chain rule for mutual information, the mutual information <math>I\left(X;Y,Z\right)</math> can be expressed as:
{{NumBlk|::|<math>I\left(X;Y,Z\right) = I\left(X;Z\right) + I\left(X;Y\mid Z\right) = I\left(X;Y\right) + I\left(X;Z\mid Y\right)</math>|{{EquationRef|20}}}}
If <math>X \rightarrow Y \rightarrow Z</math>, i.e. <math>X</math>, <math>Y</math>, and <math>Z</math> form a Markov chain, then <math>X</math> is conditionally independent of <math>Z</math> given <math>Y</math>, resulting in <math>I\left(X;Z\mid Y\right) = 0</math>. Thus,
{{NumBlk|::|<math>I\left(X;Z\right) + I\left(X;Y\mid Z\right) = I\left(X;Y\right)</math>|{{EquationRef|21}}}}
And since <math>I\left(X;Y\mid Z\right) \ge 0</math>, we get:
{{NumBlk|::|<math>I\left(X;Z\right) \le I\left(X;Y\right)</math>|{{EquationRef|22}}}}
This is the data processing inequality: no processing of <math>Y</math>, whether deterministic or random, can increase the information that <math>Y</math> carries about <math>X</math>.
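As a closing illustration, the sketch below evaluates <math>I\left(X;Y\right)</math> and <math>I\left(X;Z\right)</math> for the cascaded binary-symmetric-channel model assumed earlier; the processed signal <math>Z</math> carries less information about <math>X</math> than <math>Y</math> does, as Eq. 22 requires.
<syntaxhighlight lang="python">
import itertools
import math

p = 0.1                  # assumed channel crossover probability
P_x = {0: 0.5, 1: 0.5}   # assumed input distribution

def channel(inp, out):
    """P(out | inp) for a binary symmetric channel."""
    return 1 - p if out == inp else p

# Joint distribution of the cascade X -> Y -> Z, as in Eq. 15
P_xyz = {
    (x, y, z): channel(y, z) * channel(x, y) * P_x[x]
    for x, y, z in itertools.product((0, 1), repeat=3)
}

def marginal(keep):
    """Marginalize P_xyz onto the coordinate indices in `keep`."""
    out = {}
    for k, v in P_xyz.items():
        key = tuple(k[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

def H(dist):
    """Entropy in bits."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def I(a, b):
    """Mutual information between the variables at coordinate indices a and b."""
    return H(marginal((a,))) + H(marginal((b,))) - H(marginal((a, b)))

print(I(0, 1))  # I(X; Y), about 0.53 bits
print(I(0, 2))  # I(X; Z), about 0.32 bits -- smaller, as Eq. 22 requires
</syntaxhighlight>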