Conditional Entropy and Mutual Information


When we work with multiple information sources (random variables) $X$ and $Y$, the following details are useful:

  • How much information overlap is there between $X$ and $Y$?
  • How much information is contained in one but not present in the other?

In information theory, we formalize the first question using the concept of mutual information, while the second question is addressed using the concept of conditional entropy.

Definition

For two random variables $X$ and $Y$, the mutual information between them, denoted by $I(X;Y)$, is given by

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log_2\frac{p(x,y)}{p(x)\,p(y)},$$

and the conditional entropy of $Y$ given $X$ is given by

$$H(Y|X) = -\sum_{x}\sum_{y} p(x,y)\,\log_2 p(y|x).$$

Observe that the fraction in the definition of $I(X;Y)$ consists of (1) the joint distribution $p(x,y)$, and (2) the product of the marginal distributions $p(x)\,p(y)$. $I(X;Y)$ is a measure of how far $X$ and $Y$ are from being independent; when they are independent, $p(x,y) = p(x)\,p(y)$ and the entire summation vanishes since $\log_2 1 = 0$. In our discussions, we simplify the notation so that $p(x)$ and $p_X(x)$ both refer to the probability that $X = x$. With this remark, $I(X;Y) = I(Y;X)$. However, conditional entropy is not symmetric, i.e. in general $H(X|Y)$ is not equal to $H(Y|X)$.
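
To make the definitions concrete, here is a minimal sketch in Python (using NumPy) that computes $I(X;Y)$ and $H(Y|X)$ for a small joint distribution; the specific table of probabilities is made up purely for illustration.

  import numpy as np

  # Hypothetical joint distribution p(x, y) of two binary random variables.
  # Rows index the values of X, columns index the values of Y; entries sum to 1.
  p_xy = np.array([[0.30, 0.20],
                   [0.10, 0.40]])

  p_x = p_xy.sum(axis=1)  # marginal distribution p(x)
  p_y = p_xy.sum(axis=0)  # marginal distribution p(y)

  # I(X;Y) = sum over x, y of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
  I_xy = sum(p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
             for i in range(2) for j in range(2) if p_xy[i, j] > 0)

  # H(Y|X) = -sum over x, y of p(x,y) * log2 p(y|x), with p(y|x) = p(x,y) / p(x)
  H_y_given_x = -sum(p_xy[i, j] * np.log2(p_xy[i, j] / p_x[i])
                     for i in range(2) for j in range(2) if p_xy[i, j] > 0)

  print(I_xy, H_y_given_x)  # both values are in bits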

When $X$ and $Y$ are independent, all conditional probabilities $p(y|x)$ reduce to $p(y)$, i.e., conditioning on $X$ does not change the probability of $Y$. Hence,

$$H(Y|X) = -\sum_{x}\sum_{y} p(x,y)\,\log_2 p(y|x) = -\sum_{x}\sum_{y} p(x)\,p(y)\,\log_2 p(y) = -\sum_{y} p(y)\,\log_2 p(y) = H(Y).$$

Thus, $H(Y|X) = H(Y)$ if $X$ and $Y$ are independent.

Checkpoint: Why did $p(x)$ disappear in the last step?
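
As a quick sanity check of this claim, the sketch below builds an independent joint distribution as the outer product of two made-up marginals and confirms numerically that $H(Y|X)$ equals $H(Y)$.

  import numpy as np

  def entropy(p):
      # Entropy in bits of a probability vector, ignoring zero entries.
      p = p[p > 0]
      return -np.sum(p * np.log2(p))

  # Independent X and Y: the joint distribution is the outer product p(x) * p(y).
  p_x = np.array([0.25, 0.75])
  p_y = np.array([0.60, 0.40])
  p_xy = np.outer(p_x, p_y)

  # H(Y|X) computed from the definition, with p(y|x) = p(x,y) / p(x).
  H_y_given_x = -np.sum(p_xy * np.log2(p_xy / p_x[:, None]))

  print(H_y_given_x, entropy(p_y))  # identical when X and Y are independent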

Even if we do not require $X$ and $Y$ to be independent, rearranging terms reveals yet another intuitive property:

$$H(Y|X) = -\sum_{x}\sum_{y} p(x)\,p(y|x)\,\log_2 p(y|x) = \sum_{x} p(x)\left(-\sum_{y} p(y|x)\,\log_2 p(y|x)\right) = \sum_{x} p(x)\,H(Y|X=x).$$

The summation inside the parentheses should be very familiar. Indeed, if we ignore the conditioning on $X = x$, we see that the summation is essentially a calculation of entropy. Together with the weights $p(x)$, $H(Y|X)$ is simply a weighted average of multiple entropy values! For a detailed discussion of this property with a numerical example, please see Activity 2.
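
The weighted-average view can also be checked numerically. The sketch below reuses the same made-up joint distribution and computes $H(Y|X)$ as $\sum_x p(x)\,H(Y|X=x)$, where each row of the joint table, once normalized, gives the conditional distribution $p(y|x)$.

  import numpy as np

  def entropy(p):
      p = p[p > 0]
      return -np.sum(p * np.log2(p))

  # Hypothetical joint distribution; rows index X, columns index Y.
  p_xy = np.array([[0.30, 0.20],
                   [0.10, 0.40]])
  p_x = p_xy.sum(axis=1)

  # H(Y|X) as a weighted average of the per-row entropies H(Y | X = x).
  H_y_given_x = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))
  print(H_y_given_x)  # matches the direct formula -sum p(x,y) log2 p(y|x)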

Entropy, conditional entropy, and mutual information

Below are some properties that relate all the information measures we have discussed so far (entropy, conditional entropy, and mutual information):

  • $I(X;Y) \ge 0$, where equality holds if and only if $X$ and $Y$ are independent.
  • For all random variables $X$ and $Y$, $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$.
  • For all random variables $X$ and $Y$, $I(X;Y) = H(X) + H(Y) - H(X,Y)$, where $H(X,Y)$ denotes the joint entropy of $X$ and $Y$.

The first property is crucial in understanding the implications of mutual information. Wherever $I(X;Y)$ occurs in important properties, it greatly helps to check whether the property still makes sense when $I(X;Y) = 0$, knowing that $X$ and $Y$ would be independent under such a constraint.

Checkpoint: Do the second and third properties make sense when $I(X;Y) = 0$?

In many cases, however, $I(X;Y)$ would be strictly positive. The second property then implies that $H(X|Y)$ would be strictly less than $H(X)$, i.e., conditioning reduces entropy, and $I(X;Y)$ is precisely the amount of reduction in entropy/information about $X$ when we know $Y$.
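
The following sketch checks these relationships on the same made-up joint distribution: both differences $H(X) - H(X|Y)$ and $H(Y) - H(Y|X)$ give the same nonnegative value, namely $I(X;Y)$.

  import numpy as np

  def entropy(p):
      p = np.asarray(p, dtype=float).ravel()
      p = p[p > 0]
      return -np.sum(p * np.log2(p))

  # Hypothetical joint distribution; rows index X, columns index Y.
  p_xy = np.array([[0.30, 0.20],
                   [0.10, 0.40]])
  p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

  # Conditional entropies straight from their definitions.
  H_x_given_y = -np.sum(p_xy * np.log2(p_xy / p_y[None, :]))  # uses p(x|y) = p(x,y)/p(y)
  H_y_given_x = -np.sum(p_xy * np.log2(p_xy / p_x[:, None]))  # uses p(y|x) = p(x,y)/p(x)

  print(entropy(p_x) - H_x_given_y)  # I(X;Y)
  print(entropy(p_y) - H_y_given_x)  # same value, and nonnegative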

The Venn diagram shown below serves as a visual aid to remember the relationships between entropy, conditional entropy and mutual information.

[Figures: Venn diagram.svg and Venn diagram joint entropy.svg — Venn diagrams relating entropy, conditional entropy, mutual information, and joint entropy.]

Extensions

What happens when we deal with more than two random variables? To facilitate the discussion, let us recall the chain rule for joint distributions.

Let $X_1, X_2, \ldots, X_n$ be a sequence of discrete random variables. Then, their joint distribution can be factored as follows:

$$p(x_1, x_2, \ldots, x_n) = p(x_1)\,p(x_2|x_1)\,p(x_3|x_1, x_2)\cdots p(x_n|x_1, \ldots, x_{n-1}) = \prod_{i=1}^{n} p(x_i\,|\,x_1, \ldots, x_{i-1}).$$
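
As an illustration, the sketch below draws a random, made-up joint distribution over three binary random variables and verifies the factorization entry by entry; the conditionals are obtained by dividing by the appropriate marginals.

  import numpy as np

  rng = np.random.default_rng(0)

  # Hypothetical joint distribution p(x1, x2, x3) over three binary random variables.
  p = rng.random((2, 2, 2))
  p /= p.sum()

  p_1 = p.sum(axis=(1, 2))   # p(x1)
  p_12 = p.sum(axis=2)       # p(x1, x2)

  # Check p(x1, x2, x3) = p(x1) * p(x2|x1) * p(x3|x1, x2) for every outcome.
  for x1 in range(2):
      for x2 in range(2):
          for x3 in range(2):
              product = p_1[x1] * (p_12[x1, x2] / p_1[x1]) * (p[x1, x2, x3] / p_12[x1, x2])
              assert np.isclose(product, p[x1, x2, x3])
  print("chain rule for joint distributions verified")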

Chain rule for entropy

The chain rule for (joint) entropy is very similar to the above expansion, but we use additions instead of multiplications:

$$H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + \cdots + H(X_n|X_1, \ldots, X_{n-1}) = \sum_{i=1}^{n} H(X_i\,|\,X_1, \ldots, X_{i-1}).$$

Checkpoint: Using the properties in the previous section, show that $H(X,Y) = H(X) + H(Y|X)$.

Although we do not supply a complete proof here, this fact should not be too surprising since entropy operates on logarithms of probabilities and logarithms of product terms expand to sums of logarithms.
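
For the two-variable case, the additivity can be confirmed numerically. The sketch below, using the same made-up joint distribution as earlier, compares $H(X,Y)$ against $H(X) + H(Y|X)$.

  import numpy as np

  def entropy(p):
      p = np.asarray(p, dtype=float).ravel()
      p = p[p > 0]
      return -np.sum(p * np.log2(p))

  # Hypothetical joint distribution; rows index X, columns index Y.
  p_xy = np.array([[0.30, 0.20],
                   [0.10, 0.40]])
  p_x = p_xy.sum(axis=1)

  H_y_given_x = -np.sum(p_xy * np.log2(p_xy / p_x[:, None]))

  # Two-variable chain rule: H(X, Y) = H(X) + H(Y|X).
  print(entropy(p_xy), entropy(p_x) + H_y_given_x)  # the two values agree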

Proof of chain rule for $n = 3$

Let us see that the statement is true for $n = 3$. Let $X$, $Y$, and $Z$ be three discrete random variables. The idea of the proof is to operate on two random variables at a time, since prior to the chain rule we only know that $H(X,Y) = H(X) + H(Y|X)$. We can write $H(X,Y,Z) = H(W,Z)$, where we bundle $X$ and $Y$ into one random variable $W = (X,Y)$.

The proof now proceeds as follows:

$$H(X,Y,Z) = H(W,Z) = H(W) + H(Z|W) = H(X,Y) + H(Z|X,Y) = H(X) + H(Y|X) + H(Z|X,Y).$$

One interpretation of the chain rule shown is that to obtain the total information (joint entropy) of $X$, $Y$, and $Z$ as a whole, we can

  • obtain information about $X$ first without any prior knowledge: $H(X)$, then
  • obtain information about $Y$ with knowledge of $X$: $H(Y|X)$, then
  • obtain information about $Z$ with knowledge of $X$ and $Y$: $H(Z|X,Y)$.
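
A numerical check of the three-variable chain rule, again with a made-up joint distribution, is sketched below: the joint entropy $H(X,Y,Z)$ matches the sum $H(X) + H(Y|X) + H(Z|X,Y)$.

  import numpy as np

  def entropy(p):
      p = np.asarray(p, dtype=float).ravel()
      p = p[p > 0]
      return -np.sum(p * np.log2(p))

  rng = np.random.default_rng(1)
  p_xyz = rng.random((2, 2, 2))  # hypothetical joint distribution of (X, Y, Z)
  p_xyz /= p_xyz.sum()

  p_xy = p_xyz.sum(axis=2)  # p(x, y)
  p_x = p_xy.sum(axis=1)    # p(x)

  H_y_given_x = -np.sum(p_xy * np.log2(p_xy / p_x[:, None]))
  H_z_given_xy = -np.sum(p_xyz * np.log2(p_xyz / p_xy[:, :, None]))

  # Chain rule: H(X, Y, Z) = H(X) + H(Y|X) + H(Z|X, Y).
  print(entropy(p_xyz), entropy(p_x) + H_y_given_x + H_z_given_xy)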

For $n > 3$, we can proceed by induction. Below is a sketch of the proof:

  • (Induction hypothesis) For a fixed $n = k$, assume that any collection of $k$ random variables satisfies the chain rule.
  • (Induction step) When $n = k + 1$, write the joint entropy of the $k + 1$ random variables as $H(X_1, \ldots, X_{k+1}) = H(W, X_{k+1})$, where the first $k$ random variables $X_1, \ldots, X_k$ are bundled together into one random variable $W$.
  • Use the two-variable chain rule $H(W, X_{k+1}) = H(W) + H(X_{k+1}\,|\,W)$ and the induction hypothesis to show that $H(X_1, \ldots, X_{k+1}) = \sum_{i=1}^{k+1} H(X_i\,|\,X_1, \ldots, X_{i-1})$.
Checkpoint: The order of "obtaining information" is irrelevant in calculating the joint entropy of multiple random variables. Write the chain rule if we proceed by obtaining information in a different order, e.g., $Z$ first, then $Y$, then $X$.
Checkpoint: The chain rule we have now works for any collection of random variables. Can you figure out how to simplify the chain rule for Markov chains? (The answer will be discussed in Module 3.)