Asymptotic Equipartition Property

From Microlab Classes

Essentially, the Asymptotic Equipartition Property tells us that "almost all long sequences of fixed length are almost equally likely." The AEP is an important requirement for the validity of Shannon's noisy channel coding theorem.
  
== Typical sets ==

We make the notions of "almost all" and "almost equally likely" precise using a tolerance parameter <math>\epsilon</math>. For every <math>\epsilon>0</math>, define the <math>\epsilon</math>-typical set <math>A^{(n)}_\epsilon</math> to be the set of all <math>n</math>-sequences from <math>\mathcal{X}^n</math> whose empirical entropies are less than <math>\epsilon</math> away from <math>H(X)</math>:
  
<math>A_\epsilon ^{(n)} = \left\{x^n \in \mathcal{X}^n : \left|-\frac{1}{n}\log p(x^n) - H(X)\right| < \epsilon\right\}, </math>
  
where we assume that the elements of <math>x^n</math> are drawn as i.i.d. samples of <math>X</math>, in which case <math>p(x^n)</math> factors as the product of the probabilities of its entries. Observe that the i.i.d. assumption implies that only the relative frequencies of the entries matter.
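The definition translates directly into code. Below is a minimal sketch (the function names and the fair-coin example are my own, not from this page) that computes the empirical entropy of an i.i.d. sequence and checks it for <math>\epsilon</math>-typicality, using base-2 logarithms:

```python
import math

def empirical_entropy(seq, pmf):
    """Return -(1/n) log2 p(x^n) for an i.i.d. sequence with symbol probabilities pmf."""
    n = len(seq)
    # p(x^n) factors as a product, so its log is a sum of per-symbol logs.
    log_p = sum(math.log2(pmf[x]) for x in seq)
    return -log_p / n

def is_typical(seq, pmf, entropy, eps):
    """True if |-(1/n) log2 p(x^n) - H(X)| < eps."""
    return abs(empirical_entropy(seq, pmf) - entropy) < eps

# Example: a fair binary source has H(X) = 1 bit, and every
# sequence has empirical entropy exactly 1, so all are typical.
pmf = {0: 0.5, 1: 0.5}
print(is_typical([0, 1, 1, 0], pmf, 1.0, 0.01))  # True
```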
=== Numerical example ===
Consider the following binary sequence of length <math>n=20</math> where <math>P(X = 1) = 2/3</math>:
<math>x^{20} = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]</math>
This sequence consists of 11 ones and 9 zeros, which corresponds to the probability <math>p(x^{20}) = (2/3)^{11} (1/3)^9 = 5.8736 \times 10^{-7}</math>. This probability does not change if we permute the ones and zeros within the sequence. To check whether the sequence is <math>\epsilon</math>-typical, we first have to set a value for <math>\epsilon</math>, say <math>\epsilon = 0.01</math>.
Using base-2 logarithms, <math>H(X) = 0.9183</math> bits, and

<math>\left|-\frac{1}{n}\log p(x^n) - H(X)\right| = \left|-\frac{1}{20} \log (5.8736 \times 10^{-7}) - 0.9183\right| = 0.1167 > 0.01.</math>
Hence, the sequence we have is not <math>\epsilon</math>-typical for <math>\epsilon = 0.01</math>.
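The numbers in this example can be verified in a few lines of Python (a quick check, not part of the original page):

```python
import math

# The length-20 sequence from the example, drawn with P(X = 1) = 2/3.
x = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
n, k = len(x), sum(x)                    # n = 20, k = 11 ones

p_xn = (2/3)**k * (1/3)**(n - k)         # probability of the sequence
H = -(2/3)*math.log2(2/3) - (1/3)*math.log2(1/3)   # H(X) ≈ 0.9183 bits

empirical = -math.log2(p_xn) / n         # -(1/n) log2 p(x^n)
deviation = abs(empirical - H)

print(f"p(x^20)  = {p_xn:.4e}")          # ≈ 5.8736e-07
print(f"deviation = {deviation:.4f}")    # ≈ 0.1167 > 0.01, so not typical
```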
=== Special case: binary sequences ===
Let us now try to characterize typical sets for binary sequences. Let <math>X</math> be a binary random variable over the alphabet <math>\{0, 1\}</math> such that <math>P(X=1) = p</math>. Let us fix <math>n</math> (the sequence length) and <math>\epsilon>0</math> (the typicality threshold). If we draw an <math>n</math>-sequence <math>x^n</math> containing <math>k</math> ones and <math>n-k</math> zeros, its probability is <math>p(x^n) = p^k (1-p)^{n-k}</math>.
Now define <math>\hat{p} = k/n</math>, the relative frequency of ones in the sequence, so that
<math> -\frac{1}{n} \log p(x^n) = -\frac{1}{n} (k \log p + (n - k) \log (1 - p) ) = -\frac{k}{n} \log p - \left(1 - \frac{k}{n}\right) \log (1 - p) = - \hat{p} \log p - \left(1 - \hat{p}\right)\log (1 - p)</math>.
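This identity is easy to sanity-check numerically, reusing the <math>n=20</math>, <math>k=11</math> example from above (a quick sketch with base-2 logarithms):

```python
import math

p, n, k = 2/3, 20, 11
p_hat = k / n

# Direct form: -(1/n) log2 p(x^n), with p(x^n) = p^k (1-p)^(n-k).
direct = -(k*math.log2(p) + (n - k)*math.log2(1 - p)) / n

# Relative-frequency form: -p̂ log2 p - (1 - p̂) log2 (1 - p).
via_freq = -p_hat*math.log2(p) - (1 - p_hat)*math.log2(1 - p)

print(direct, via_freq)   # both ≈ 1.0350
```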
Using the definition of typical sets and <math>H(X) = -p \log p - (1 - p)\log (1 - p) </math>, the sequence <math>x^n</math> is <math>\epsilon</math>-typical if and only if
<math>
\begin{align}
\left| \left[- \hat{p} \log p - \left(1 - \hat{p}\right)\log (1 - p) \right] - \left[-p \log p - (1 - p)\log (1 - p)\right]\right| &< \epsilon \\
\left| - (\hat{p} - p) \log p + (\hat{p} - p)\log (1 - p) \right| &< \epsilon \\
\left| \hat{p} - p \right| \left|\log \frac{p}{1 - p}\right| &< \epsilon
\end{align}
</math>
Isolating <math>|\hat{p} - p|</math>,
<math>|\hat{p} - p| < \frac{\epsilon}{\left|\log \frac{p}{1 - p}\right|}, </math>

where we assume <math>p \neq 1/2</math> so that the denominator is nonzero.
The above inequality is a satisfying result: for a sequence <math>x^n</math> to be typical, the relative frequencies of zeros and ones must be close to the true probabilities <math>1-p</math> and <math>p</math>. The parameter <math>\epsilon</math> places a well-defined upper bound on the deviation <math>|\hat{p} - p|</math>, beyond which the sequence is no longer <math>\epsilon</math>-typical.
Checkpoint: Use the definition to show that if <math>X</math> is equiprobable over any binary alphabet, then every <math>n</math>-sequence is <math>\epsilon</math>-typical for all <math>\epsilon > 0</math>.
Checkpoint: Use the inequality derived in this section to show that there are no <math>\epsilon</math>-typical sequences for <math>n=20</math> and <math>\epsilon = 0.01</math> if the sequence is drawn i.i.d. from <math>\mathcal{X} = \{0,1\}</math> with <math>P(X = 1) = 2/3</math>.
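A brute-force check of the second checkpoint (an illustration, not the requested proof): with <math>p = 2/3</math> and base-2 logarithms, <math>|\log(p/(1-p))| = \log_2 2 = 1</math>, so a sequence with <math>k</math> ones is typical iff <math>|k/20 - 2/3| < 0.01</math>, and no integer <math>k</math> satisfies this:

```python
import math

p, n, eps = 2/3, 20, 0.01
bound = eps / abs(math.log2(p / (1 - p)))   # = 0.01, since log2(2) = 1

# A binary sequence with k ones is epsilon-typical iff |k/n - p| < bound.
typical_ks = [k for k in range(n + 1) if abs(k/n - p) < bound]
print(typical_ks)   # [] -- no value of k works, so no typical sequences
```

The closest candidate, <math>k = 13</math>, gives <math>|13/20 - 2/3| \approx 0.0167 > 0.01</math>.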
Just how large are these typical sets? Let us fix an <math>\epsilon</math> and examine how likely an <math>n</math>-sequence drawn from a distribution <math>p(x)</math> is to be <math>\epsilon</math>-typical.
=== Behavior for large <math>n</math> ===
In the previous section, you were asked to show that there are no <math>\epsilon</math>-typical sequences for <math>\epsilon = 0.01</math> and <math>n=20</math> when the sequence is drawn from the probability distribution <math>[1/3, 2/3]</math> over a binary alphabet. Although the proof was given as an exercise, we can see (informally) that it is hard to get close to the true probabilities for relatively small <math>n</math>. This phenomenon becomes more apparent for shorter sequences, say <math>n=5</math>. If you draw only five random bits, the relative frequencies that you can get are very limited: <math>\hat{p} \in \{0, 1/5, 2/5, 3/5, 4/5, 1\}</math>.
In information theory, fascinating properties emerge if we consider very long sequences. Indeed, for large <math>n</math>, we get a much richer set of relative frequencies that makes it easier to approach the true probabilities of the underlying distribution. Let us still fix <math>\epsilon = 0.01</math> and draw bits from the distribution <math>[1/3, 2/3]</math>, but now let us inspect what happens if we increase <math>n</math>.
{| class="wikitable"
|-
! scope="col"| <math>n</math>
! scope="col"| <math>\mathbb{P}\left(X^n \in A_\epsilon^{(n)}\right)</math>
|-
| 50
| 0.117828
|-
| 100
| 0.167524
|-
| 500
| 0.364729
|-
| 1000
| 0.497573
|-
| 5000
| 0.866395
|-
| 10000
| 0.966111
|-
| 50000
| 0.999998
|}
The table demonstrates that for <math>n</math> sufficiently large, the probability of an <math>n</math>-sequence being <math>\epsilon</math>-typical comes very close to 1. This is the Asymptotic Equipartition Property. Formally, it states that for every <math>\epsilon > 0</math>, there exists a minimum sequence length <math>n_0</math> such that for all <math>n > n_0</math>, <math>\mathbb{P}\left(X^n \in A_\epsilon^{(n)}\right) > 1 - \epsilon</math>.
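The probabilities in the table can be reproduced by summing binomial probabilities over the typical counts of ones, computed in log-space so the calculation stays numerically stable for large <math>n</math> (a sketch; the page does not say how its values were computed):

```python
import math

def prob_typical(n, p=2/3, eps=0.01):
    """P(X^n in A_eps^(n)) for i.i.d. Bernoulli(p) bits, by direct binomial sum."""
    H = -p*math.log2(p) - (1 - p)*math.log2(1 - p)
    total = 0.0
    for k in range(n + 1):
        # Empirical entropy of any sequence with k ones (base-2 logs).
        emp = -(k*math.log2(p) + (n - k)*math.log2(1 - p)) / n
        if abs(emp - H) < eps:
            # log C(n, k) via lgamma avoids overflow for large n.
            log_pk = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1)
                      + k*math.log(p) + (n - k)*math.log(1 - p))
            total += math.exp(log_pk)
    return total

for n in (50, 100, 500, 1000, 5000, 10000, 50000):
    print(n, round(prob_typical(n), 6))   # matches the table above
```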
== Consequences ==

Revision as of 01:27, 19 April 2021
