Entropy
Definition
In information theory, we always work with random variables. In his landmark paper, Shannon introduced a function that takes in a random variable $X$,

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$

where $\mathcal{X}$ is called the alphabet of $X$, $p(x)$ denotes the probability that $X = x$, and $H(X)$ is the entropy of $X$. When calculating entropy and other information measures, we use logarithms; unless otherwise specified, we assume that all logarithms are in base 2. Under this assumption, the value $H(X)$ is to be interpreted as having the unit of bits.
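To make the definition concrete, here is a minimal Python sketch (the helper name `entropy` and the dictionary representation of the pmf are our own choices for illustration, not part of the course material) that evaluates $H(X)$ over a finite alphabet:

```python
import math

def entropy(pmf, base=2):
    """Shannon entropy H(X) of a probability mass function, in bits by default.

    `pmf` maps each symbol of the alphabet to its probability.
    Symbols with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log(p, base) for p in pmf.values() if p > 0)

# A fair coin carries exactly 1 bit of information per toss.
print(entropy({"heads": 0.5, "tails": 0.5}))  # 1.0
```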
One often thinks of the letters A, B, C, ... upon hearing the term "alphabet". In information theory, this notion is abstracted to mean "the set of possible symbols." The choice of the letter $H$ is believed to stem from the closely related notion of entropy in statistical mechanics.
Some properties
A first glance at the definition might make you scratch your head, and question how this function is related to information. However bizarre the definition initially seems, it does have the following intuitive properties:
- $H(X)$ is always non-negative.
- For a fixed alphabet $\mathcal{X}$, $H(X)$ is maximized when all symbols are equiprobable.
- If two random variables $X$ and $Y$ are independent, the (paired) random variable $(X, Y)$ has entropy $H(X, Y) = H(X) + H(Y)$.
The first bullet tells us that information is never negative, for any random variable $X$. The second bullet tells us that, given a fixed set of symbols, maximum information is obtained using a uniform distribution, which can be thought of as the "most random" distribution possible.
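Before proving anything, these first two properties can already be spot-checked numerically. The short sketch below (our own illustration, with arbitrarily chosen pmfs over a four-symbol alphabet) shows that the entropy is never negative and that only the uniform pmf reaches the maximum of $\log_2 4 = 2$ bits:

```python
import math

def entropy_bits(pmf):
    """H(X) in bits for a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Several pmfs over the same 4-symbol alphabet; only the uniform one
# attains the maximum value log2(4) = 2 bits, and none is negative.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy_bits([0.70, 0.10, 0.10, 0.10]))  # ~1.357
print(entropy_bits([1.00, 0.00, 0.00, 0.00]))  # 0.0
```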
The third bullet needs to be explained more carefully. The requirement of independence is important in the following sense: if we take $Y = X$, then $(X, Y)$ will essentially just duplicate the random variable (or source) $X$. For $(X, Y)$ to have more information than $X$, the second random variable $Y$ must offer something new. It turns out that $Y$ contributes maximum additional information only when $X$ and $Y$ are independent, i.e., knowing $X$ must not reveal anything about $Y$. Under the independence assumption, all of the information in $Y$ is fully added (without overlap) to produce $H(X, Y) = H(X) + H(Y)$.
The role of proofs in CoE 161
From the previous section, we saw that despite the intuitive properties of entropy, it is unclear how they came about. In this intentionally non-linear "storytelling," we were first introduced to a character, $H(X)$, and then we skipped a bit ahead to where $H(X)$ has already developed some properties. The link between the two can be made by writing mathematical proofs: a series of logically connected statements leading from a premise to a conclusion. In CoE 161, we discuss just enough proofs so that there is enough "character" development to follow, but not so much that you lose sight of the big picture. (We would not want to follow every minute of a character's life, right?)
Proving identities can be daunting to some, but there are small ways that can help you understand a mathematical result.
Verifying the statement for a specific example
One way is to try specific examples. The more varied your examples are, the more you become convinced that there is some generality in the statement. Consider a random variable $X$ over a three-symbol alphabet $\mathcal{X}$, with respective probabilities 1/2, 1/3, and 1/6. Let us verify that the entropy of $X$ is non-negative, as stated in the first property:
$$H(X) = -\tfrac{1}{2}\log_{10}\tfrac{1}{2} - \tfrac{1}{3}\log_{10}\tfrac{1}{3} - \tfrac{1}{6}\log_{10}\tfrac{1}{6} \approx 0.439 \geq 0,$$

where we used base-10 logarithms. This verifies the property for this single, specific example. The power of proofs lies in their ability to make sweeping statements over a large class of objects. Still, with enough discipline, the ability (and initiative!) to verify identities for specific instances can go a long way.
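If you prefer to let the computer do the arithmetic, the following sketch (a quick numerical check we add here) reproduces the calculation above in base 10, and also reports the value in bits for comparison:

```python
import math

# The pmf from the example above: probabilities 1/2, 1/3, and 1/6.
pmf = [1/2, 1/3, 1/6]

# Base-10 logarithms, as in the text; non-negativity holds in any base.
h_base10 = -sum(p * math.log10(p) for p in pmf)
print(h_base10)  # ~0.439 > 0

# The same entropy expressed in bits (base-2 logarithms).
h_bits = -sum(p * math.log2(p) for p in pmf)
print(h_bits)    # ~1.459 bits
```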
Proving the statement for a family of instances
Yet another way is to actually prove the statement, but only in a restricted setting. Let us take the third property and prove it assuming that $X$ and $Y$ are binary random variables over the alphabet $\{0, 1\}$. Note that this will not cover all possible random variables, but it is definitely more general than using a specific (numerical) example. We are now ready to prove the statement for binary random variables.
Let $p$ be the probability that $X = 1$, and let $q$ be the probability that $Y = 1$. Since $X$ and $Y$ are independent, the probability mass function (pmf) of $(X, Y)$ is simply obtained by multiplying the marginal pmfs of $X$ and $Y$:
| $(x, y)$ | $p_{X,Y}(x, y)$ |
|----------|-----------------|
| $(0, 0)$ | $(1-p)(1-q)$    |
| $(0, 1)$ | $(1-p)\,q$      |
| $(1, 0)$ | $p\,(1-q)$      |
| $(1, 1)$ | $p\,q$          |
Using the definition, the entropy of $(X, Y)$ is given by

$$H(X, Y) = -(1-p)(1-q)\log\big[(1-p)(1-q)\big] - (1-p)q\log\big[(1-p)q\big] - p(1-q)\log\big[p(1-q)\big] - pq\log\big[pq\big].$$
We can expand the logarithm of each product and see that the resulting expression can be grouped into four classes of terms, each containing one of the following: $\log p$, $\log(1-p)$, $\log q$, and $\log(1-q)$. For clarity, let us call these partial sums $S_1$, $S_2$, $S_3$, and $S_4$, respectively.
Each partial sum simplifies because the probabilities of the other variable sum to 1; for example, $S_1 = -p(1-q)\log p - pq\log p = -p\log p$, and similarly $S_2 = -(1-p)\log(1-p)$, $S_3 = -q\log q$, and $S_4 = -(1-q)\log(1-q)$. The conclusion follows after noting that $S_1 + S_2 = H(X)$ and $S_3 + S_4 = H(Y)$.
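For the skeptical reader, the identity can also be spot-checked numerically. The sketch below (our own illustration, with arbitrarily chosen values of $p$ and $q$) builds the joint pmf of two independent binary random variables and compares $H(X, Y)$ against $H(X) + H(Y)$:

```python
import math

def h_bits(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(t * math.log2(t) for t in pmf if t > 0)

# Marginal probabilities p = P(X = 1) and q = P(Y = 1); arbitrary test values.
for p, q in [(0.5, 0.5), (0.3, 0.8), (0.05, 0.6)]:
    # Joint pmf of (X, Y) under independence: products of the marginals.
    joint = [(1 - p) * (1 - q), (1 - p) * q, p * (1 - q), p * q]
    lhs = h_bits(joint)                            # H(X, Y)
    rhs = h_bits([1 - p, p]) + h_bits([1 - q, q])  # H(X) + H(Y)
    print(f"p={p}, q={q}: H(X,Y)={lhs:.6f}, H(X)+H(Y)={rhs:.6f}")
    assert math.isclose(lhs, rhs)
```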