Difference between revisions of "Jointly typical sequences"

From Microlab Classes
Jump to navigation Jump to search
(Created page with "== Definition == == Illustration == == Example: Binary symmetric channel ==")
 
 
(2 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Definition ==
+
In this article, we discuss an extension of typical sets to jointly distributed random variables.
  
== Illustration ==
+
==Definition==
 +
Let <math>(x^n,y^n)</math> be an <math>n</math>-sequence of pairs drawn from the set <math>\mathcal{X} \times \mathcal{Y}</math>. For a fixed <math>\epsilon > 0</math>, <math>(x^n, y^n)</math> is said to be jointly typical with respect to random variables <math>X</math> and <math>Y</math> if the three conditions below are all satisfied:
 +
* <math>|-\frac{1}{n} \log p(x^n) - H(X)| < \epsilon </math>
 +
* <math>|-\frac{1}{n} \log p(y^n) - H(Y)| < \epsilon </math>
 +
* <math>|-\frac{1}{n} \log p(x^n,y^n) - H(X,Y)| < \epsilon </math>
  
== Example: Binary symmetric channel ==
+
From the definition, we see that the notion of joint typicality is stronger than the previous formulation. If we only needed the sequence of pairs <math>(x_i, y_i)</math> for <math> i = 1, 2, \cdots, n</math>, we could simply invoke the Asymptotic Equipartition Property with the random variable <math>X</math> replaced by <math>(X,Y)</math>. The first two properties further constrain the component sequences <math>(x_1, x_2, \cdots, x_n)</math> and <math>(y_1, y_2, ..., y_n)</math> beyond the earlier requirement of AEP. Altogether, these three properties guarantee us that sufficiently long sequences behave near their true joint and marginal distributions with very high probability.
 +
 
 +
== Example: binary symmetric channel ==
 +
For our example, we consider the sequence of input-output pairs <math>(x_i, y_i)</math> of a binary symmetric channel (BSC). A BSC takes in inputs from the input alphabet <math>\mathcal{X} = \{0,1\}</math> and produces outputs which also belong to the same alphabet <math>\mathcal{Y} = \{0, 1\}</math>. If we write the joint distribution as <math>p(X,Y) = p(X) p(Y|X)</math>, we can generate the pairs by first generating a value <math>x_i</math> using <math>p(X)</math>, and then drawing a value <math>y_i</math> using the conditional distribution <math>p(Y|X=x_i)</math>. For the BSC, <math>p(Y|X) = p_e</math> if <math>Y\ne X</math> and <math>p(Y|X) = 1 - p_e </math> if <math>Y=X</math>, where <math>0 \le p_e \le 1/2</math> is called the crossover probability.
 +
 
 +
In our calculation, the input <math>X</math> is assumed to be distributed such that <math>P(X=1) = 2/3</math>, and the crossover probability is set to <math>p_e = 1/10</math>. The typicality threshold is set at <math>\epsilon = 0.01</math>. From <math>p(X)</math> and <math>p(Y|X)</math>, we obtain the following joint distribution:
 +
 
 +
{| class="wikitable"
 +
|-
 +
! scope="col"|
 +
! scope="col"| <math>Y=0</math>
 +
! scope="col"| <math>Y=1</math>
 +
|-
 +
! scope="row"| <math>X=0</math>
 +
| 3/10
 +
| 1/30
 +
|-
 +
! scope="row"| <math>X=1</math>
 +
| 1/15
 +
| 3/5
 +
|-
 +
|}
 +
 
 +
The table below shows the probability of producing a jointly typical sequence as the blocklength increases.
 +
{| class="wikitable"
 +
|-
 +
! scope="col"| <math>n</math>
 +
! scope="col"| <math>\mathbb{P}\left(X^n \in A_\epsilon^{(n)}\right)</math>
 +
|-
 +
| 50
 +
| 0
 +
|-
 +
| 100
 +
| 0.003050
 +
|-
 +
| 500
 +
| 0.155243
 +
|-
 +
| 1000
 +
| 0.317690
 +
|-
 +
| 5000
 +
| 0.703099
 +
|-
 +
| 10000
 +
| 0.818271
 +
|-
 +
| 50000
 +
| 0.982649
 +
|-
 +
|}
 +
 
 +
To produce the values in the table, we enumerate the relatively frequencies of pairs that will result in a typical sequence for a fixed <math>n</math>. In our example, you can think of this a "profile" consisting of four numbers, <math>(k_{00}, k_{01}, k_{10}, k_{11})</math>, where <math>k_{ij}</math> is the number of <math>(i,j)</math> pairs in the sequence for <math>i,j\in\{0,1\}</math>. At <math>n=100</math>, there is only one such profile: <math>(k_{00}, k_{01}, k_{10}, k_{11}) = (30, 3, 7, 60)</math>. Using the multinomial theorem, we can calculate the probability of this profile as
 +
 
 +
<math>\frac{100!}{30!3!7!60!}(3/10)^{30} (1/30)^3 (1/15)^7 (3/5)^{60} \approx 0.003050</math>
 +
 
 +
For larger <math>n</math>, the number of profiles increases and concentrates the total probability weight near the underlying joint distribution <math>p(X,Y)</math>.
 +
 
 +
== Important Properties ==
 +
Fix <math>\epsilon > 0</math> can let <math>A_{\epsilon}^{(n)}</math> be the typical set of pairs of <math>n</math>-sequences <math>X^n</math> and <math>Y^n</math>.
 +
 
 +
* The probability of that an <math>n</math>-sequence drawn i.i.d using <math>p(X,Y)</math> is <math>\epsilon</math>-typical approaches 1 for sufficiently large n.
 +
* The number of elements in <math>A^{(n)}_{\epsilon}</math> is at most <math>2^{n(H(X,Y) + \epsilon}</math>.
 +
* If <math>\tilde{X}^n</math> and <math>\tilde{Y}^n</math> are independent but with the marginal probabilities <math>p(x^n)</math> and <math>p(y^n)</math>, then for sufficiently large <math>n</math>, <math>(1 - \epsilon) 2^{-n(I(X;Y) + 3 \epsilon)}\le \mathbb{P}((\tilde{X}^n, \tilde{Y}^n) \in A_{\epsilon}^{(n)}) \le 2^{-n(I(X;Y) - 3\epsilon)}</math>
 +
 
 +
The first two properties are direct extensions of the typical set property for a single random variable <math>X</math>. Here, we simply replaced <math>X</math> by <math>(X,Y)</math>.
 +
 
 +
The third property is more meaningful in the context of data transmission and is in fact used in the proof of Shannon's channel coding theorem. We can interpret <math>Y^n</math> as the channel output corresponding to the channel input <math>X^n</math>. In most practical channels, <math>X^n</math> and <math>Y^n</math> are not independent. Under the lens of data transmission, <math>\tilde{X}^n</math> can be interpreted as a codeword that does not correspond to the received codeword <math>Y^n</math>, i.e., it is a potential erroneous codeword. The bounds in the third property tells us that the "cleaner" a channel is (high <math>I(X;Y)</math>), the less likely we are to make the mistake that <math>\tilde{X}^n</math> was used to transmit <math>Y^n</math>. With a clean channel, essentially we are saying that the transmitted and received codewords are "entangled," and the degree of entanglement gets stronger if we use longer blocklengths.

Latest revision as of 22:27, 27 April 2021

In this article, we discuss an extension of typical sets to jointly distributed random variables.

Definition

Let be an -sequence of pairs drawn from the set . For a fixed , is said to be jointly typical with respect to random variables and if the three conditions below are all satisfied:

From the definition, we see that the notion of joint typicality is stronger than the previous formulation. If we only needed the sequence of pairs for , we could simply invoke the Asymptotic Equipartition Property with the random variable replaced by . The first two properties further constrain the component sequences and beyond the earlier requirement of AEP. Altogether, these three properties guarantee us that sufficiently long sequences behave near their true joint and marginal distributions with very high probability.

Example: binary symmetric channel

For our example, we consider the sequence of input-output pairs of a binary symmetric channel (BSC). A BSC takes in inputs from the input alphabet and produces outputs which also belong to the same alphabet . If we write the joint distribution as , we can generate the pairs by first generating a value using , and then drawing a value using the conditional distribution . For the BSC, if and if , where is called the crossover probability.

In our calculation, the input is assumed to be distributed such that , and the crossover probability is set to . The typicality threshold is set at . From and , we obtain the following joint distribution:

3/10 1/30
1/15 3/5

The table below shows the probability of producing a jointly typical sequence as the blocklength increases.

50 0
100 0.003050
500 0.155243
1000 0.317690
5000 0.703099
10000 0.818271
50000 0.982649

To produce the values in the table, we enumerate the relatively frequencies of pairs that will result in a typical sequence for a fixed . In our example, you can think of this a "profile" consisting of four numbers, , where is the number of pairs in the sequence for . At , there is only one such profile: . Using the multinomial theorem, we can calculate the probability of this profile as

For larger , the number of profiles increases and concentrates the total probability weight near the underlying joint distribution .

Important Properties

Fix can let be the typical set of pairs of -sequences and .

  • The probability of that an -sequence drawn i.i.d using is -typical approaches 1 for sufficiently large n.
  • The number of elements in is at most .
  • If and are independent but with the marginal probabilities and , then for sufficiently large ,

The first two properties are direct extensions of the typical set property for a single random variable . Here, we simply replaced by .

The third property is more meaningful in the context of data transmission and is in fact used in the proof of Shannon's channel coding theorem. We can interpret as the channel output corresponding to the channel input . In most practical channels, and are not independent. Under the lens of data transmission, can be interpreted as a codeword that does not correspond to the received codeword , i.e., it is a potential erroneous codeword. The bounds in the third property tells us that the "cleaner" a channel is (high ), the less likely we are to make the mistake that was used to transmit . With a clean channel, essentially we are saying that the transmitted and received codewords are "entangled," and the degree of entanglement gets stronger if we use longer blocklengths.