Entropy

Definition

In information theory, we always work with random variables. In his landmark paper, Shannon introduced a function that takes in a random variable $X$,

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$

where $\mathcal{X}$ is called the alphabet of $X$, and $H(X)$ is the entropy of $X$. One often thinks of the letters A, B, C, ... upon hearing the term alphabet. In information theory, this notion is abstracted to mean "the set of possible symbols." The choice of the letter $H$ is believed to stem from the closely related notion of entropy in statistical mechanics.
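
To make the definition concrete, here is a minimal Python sketch (the helper name entropy and the use of base-2 logarithms are our own illustrative choices, not part of the definition above) that evaluates $H(X)$ for a pmf given as a list of probabilities:

    import math

    def entropy(pmf, base=2):
        # Shannon entropy of a pmf; zero-probability symbols contribute
        # nothing, following the convention 0 * log(0) = 0.
        return -sum(p * math.log(p, base) for p in pmf if p > 0)

    # A fair coin carries exactly 1 bit of information (base-2 logarithm).
    print(entropy([0.5, 0.5]))        # 1.0
    # A heavily biased coin carries much less.
    print(entropy([0.99, 0.01]))      # about 0.0808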

Some properties

A first glance at the definition might make you scratch your head and question how this function is related to information. However bizarre the definition initially seems, it does have the following intuitive properties:

  • $H(X)$ is always non-negative.
  • For a fixed alphabet $\mathcal{X}$, $H(X)$ is maximum when all symbols are equiprobable.
  • If two random variables $X$ and $Y$ are independent, the (paired) random variable $(X, Y)$ has entropy $H(X, Y) = H(X) + H(Y)$.

The first bullet tells us that information is never negative for any random variable $X$. The second bullet tells us that given a set of symbols, maximum information can be obtained using a uniform distribution, which can be thought of as the "most random" distribution possible.
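
As a quick numerical illustration of the second bullet (a check, not a proof; the four-symbol alphabet and the skewed pmf below are arbitrary choices), compare a uniform pmf with a skewed one over the same alphabet:

    import math

    def entropy(pmf, base=2):
        return -sum(p * math.log(p, base) for p in pmf if p > 0)

    # Four equiprobable symbols: entropy hits its maximum log2(4) = 2 bits.
    print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
    # Same alphabet size, but skewed probabilities: strictly less than 2 bits.
    print(entropy([0.7, 0.1, 0.1, 0.1]))      # about 1.357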

The third bullet needs to be explained more clearly. The requirement of independence is important in the following sense: if we take $Y = X$, then $(X, Y)$ will essentially just duplicate the random variable (or source) $X$. For $(X, Y)$ to have more information than $X$, the second random variable $Y$ must offer something new. It turns out that $Y$ contributes maximum additional information only when $X$ and $Y$ are independent, i.e., knowing $X$ must not reveal anything about $Y$. Under the independence assumption, all of the information in $Y$ is fully added (without overlap) to produce $H(X, Y) = H(X) + H(Y)$.
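
The contrast between independence and duplication can also be checked numerically. The sketch below is only an illustration (the marginal pmfs are arbitrary, and joint pmfs are represented as flat lists of probabilities): for independent sources the joint pmf is the product of the marginals and the entropies add, while taking $Y = X$ adds nothing new.

    import math

    def entropy(pmf, base=2):
        return -sum(p * math.log(p, base) for p in pmf if p > 0)

    px = [0.5, 0.3, 0.2]                  # marginal pmf of X
    py = [0.6, 0.4]                       # marginal pmf of Y

    # Independent case: joint pmf = product of marginals, so entropies add.
    joint_indep = [p * q for p in px for q in py]
    print(entropy(joint_indep), entropy(px) + entropy(py))   # both about 2.456

    # Duplication (Y = X): only the pairs (x, x) can occur, each with
    # probability p(x), so H(X, Y) collapses back to H(X).
    joint_dup = px
    print(entropy(joint_dup), entropy(px))                   # both about 1.485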

Checkpoint: Let $X$ and $Y$ be random variables with alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively. What is the alphabet of the random variable $(X, Y)$?

The role of proofs in CoE 161

From the previous section, we saw that despite the intuitive properties of entropy, it is unclear how they came about. In this intentionally non-linear "storytelling," we were first introduced to a character, $H(X)$, and then we skipped a bit ahead to where $H(X)$ has developed some properties. The link between the two can be made by writing mathematical proofs: a series of logically connected statements leading from a premise to a conclusion. In CoE 161, we discuss just enough proofs so that there is enough "character" development to follow, but not so much that you lose sight of the big picture. (We would not want to follow every minute of a character's life, right?)

Proving identities can be daunting to some, but there are small ways that can help you understand a mathematical result.

Verifying the statement for a specific example

One way is to try specific examples. The more varied your examples are, the more convinced you become that there is some generality in the statement. Consider a random variable $X$ over a three-symbol alphabet, with respective probabilities 1/2, 1/3, and 1/6. Let us verify that the entropy of $X$ is non-negative, as stated in the first property. Using the definition,

$$H(X) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{3}\log\tfrac{1}{3} - \tfrac{1}{6}\log\tfrac{1}{6} \approx 0.439 \geq 0,$$

where we used base-10 logarithms. This verifies the property for this single, specific example. The power of proofs lies in their ability to make sweeping statements over a large class of objects. Even without that level of generality, the ability (and initiative!) to verify identities for specific instances can go a long way.
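
The arithmetic above is easy to check by machine. The snippet below is a throwaway check (using base-10 logarithms to match the computation in the text) that reproduces the value:

    import math

    pmf = [1/2, 1/3, 1/6]
    H = -sum(p * math.log10(p) for p in pmf)
    print(H)          # about 0.439, non-negative as claimed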

Checkpoint: Is the first property still true if we replace the alphabet with a different set of three symbols? What if we use a different base for calculating logarithms?

Proving the statement for a family of instances

Yet another way is to actually prove the statement, but only for a restricted family of instances. Let us take the third property and prove it assuming that $X$ and $Y$ are binary random variables over the alphabet $\{0, 1\}$. Note that this will not cover all possible random variables, but it is definitely more general than working out a specific (numerical) example. We are now ready to prove the statement for binary random variables.

Let $p$ be the probability that $X = 1$, and let $q$ be the probability that $Y = 1$. Since $X$ and $Y$ are independent, the probability mass function (pmf) of $(X, Y)$ is simply obtained by multiplying the marginal pmfs of $X$ and $Y$:

$$P(X = x, Y = y) = P(X = x)\,P(Y = y), \qquad x, y \in \{0, 1\},$$

so that, for example, $P(X = 1, Y = 1) = pq$ and $P(X = 0, Y = 1) = (1 - p)q$.
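
To see the multiplication of marginals in a concrete instance, here is a small Python sketch (the values of p and q below are arbitrary, chosen only for illustration) that builds the joint pmf of the pair and checks that it is a valid pmf:

    # Joint pmf of independent binary X and Y with P(X = 1) = p, P(Y = 1) = q.
    p, q = 0.7, 0.4                       # arbitrary illustrative values
    joint = {
        (0, 0): (1 - p) * (1 - q),
        (0, 1): (1 - p) * q,
        (1, 0): p * (1 - q),
        (1, 1): p * q,
    }
    print(joint)                          # the four products of marginals
    print(sum(joint.values()))            # 1, up to floating-point rounding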

Using the definition, the entropy of $(X, Y)$ is given by

$$H(X, Y) = -\sum_{x \in \{0, 1\}} \sum_{y \in \{0, 1\}} P(X = x)\,P(Y = y) \log\!\big(P(X = x)\,P(Y = y)\big).$$

We can expand the logarithm of each product term and see that the resulting expression can be grouped into four classes, each containing one of the following: $\log p$, $\log(1 - p)$, $\log q$, and $\log(1 - q)$. For clarity, let us call these partial sums $A$, $B$, $C$, and $D$, respectively.
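
For concreteness, collecting the terms that multiply each logarithm (and using the fact that the marginal probabilities of $X$ and of $Y$ each sum to one) gives

$$A = -p\log p, \qquad B = -(1 - p)\log(1 - p), \qquad C = -q\log q, \qquad D = -(1 - q)\log(1 - q).$$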

The conclusion follows after noting that $A + B = H(X)$ and $C + D = H(Y)$, so that $H(X, Y) = A + B + C + D = H(X) + H(Y)$.
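
As a complementary check (a sketch, not part of the proof; it assumes the SymPy library is available), we can ask a computer algebra system to confirm symbolically that the difference $H(X, Y) - H(X) - H(Y)$ vanishes for binary $X$ and $Y$:

    import sympy as sp

    p, q = sp.symbols('p q', positive=True)

    # Binary entropies of X and Y (natural logarithms; the base only rescales).
    H_X = -(p * sp.log(p) + (1 - p) * sp.log(1 - p))
    H_Y = -(q * sp.log(q) + (1 - q) * sp.log(1 - q))

    # Entropy of the pair (X, Y): the joint pmf is the product of the marginals.
    joint = [p * q, p * (1 - q), (1 - p) * q, (1 - p) * (1 - q)]
    H_XY = -sum(prob * sp.log(prob) for prob in joint)

    # Expanding log(uv) = log(u) + log(v) and simplifying gives zero,
    # confirming H(X, Y) = H(X) + H(Y).
    print(sp.simplify(sp.expand_log(H_XY - H_X - H_Y, force=True)))   # 0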

Checkpoint: Verify that the first two properties of entropy are true for binary random variables over the alphabet $\{0, 1\}$.