==A unified view of random variables==
 
  
In EEE 137, you were first introduced to a ''single'' random variable <math>X</math> and then later taught to deal with ''multiple'' random variables. Fortunately, information theory allows us to detach ourselves from the specific values taken by a random variable and simply examine the "amount" of randomness it contains. In this course, we can afford to be indifferent between a random variable that has a 50-50 chance of being a head or a tail (a coin toss), and another random variable that has a 50-50 chance of being sunny or rainy (a weather forecast). As far as information theory is concerned, "heads"/"tails," and "sunny"/"rainy" are just labels.
  
Under the mindset that labels do not matter (please do not take this too seriously and carelessly drag it out of its current context), the next example aims to illustrate that a pair of random variables can be treated as though it were a single random variable. Consider a discrete random variable <math>X</math> over the alphabet <math>\{1, 2, 3, 4\}</math> with the probability mass function (pmf) <math>p(x) = x/10</math>:
{| class="wikitable"
|-
| <math>x</math>
| 1
| 2
| 3
| 4
|-
| <math>p(x)</math>
| 0.1
| 0.2
| 0.3
| 0.4
|}

Now, consider a pair of random variables <math>(Y,Z)</math>, where <math>Y</math> takes values from <math>\mathcal{Y}=\{-1,1\}</math> and <math>Z</math> takes values from <math>\mathcal{Z}=\{-1,1\}</math>. Suppose that <math>(Y,Z)</math> has the following joint pmf:

{| class="wikitable"
|-
! scope="col"|
! scope="col"| <math>Z=-1</math>
! scope="col"| <math>Z=1</math>
|-
! scope="row"| <math>Y=-1</math>
| 0.3
| 0.1
|-
! scope="row"| <math>Y=1</math>
| 0.4
| 0.2
|}

Our new-found mindset allows us to realize that the array above is simply a rearrangement of the probability weights 0.1, 0.2, 0.3, and 0.4. Indeed, we can go so far as to produce the following stem plot.
  
 
[[File:Stem-plot.svg]]
 
 
==Entropy and conditional entropy==
 
  
With a unified view of random variables, we are now equipped to look at the entropy of ''both'' (single) random variables and jointly distributed random variables.  
  
 
<math>H(X) = -\sum_{x\in\mathcal{X}} p(x) \log p(x)  </math>
 
 
<math>H(X,Y) = -\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} p(x,y) \log p(x,y)  </math>
 
  
For our discussion in this course, we can simply view entropy as a function of the individual probability masses. The only difference is that for jointly distributed random variables, the alphabet can be large in general due to the number of possible pairs between the two alphabets <math>\mathcal{X}</math> and <math>\mathcal{Y}</math>. In essence, working with more random variables simply means that we will have to spend more time on the calculation. The beauty of the underlying mathematics here lies in its power to make simple, succinct statements that hold true for many cases ''even without the need to calculate individual terms''.
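
To make this concrete, here is a minimal sketch (not part of the graded exercise) that evaluates the entropy formula directly on the probability masses of the examples above; the helper name <code>entropy</code> is just an illustrative choice:

<syntaxhighlight lang="python3">
import math

def entropy(masses):
    # Entropy depends only on the list of probability masses,
    # not on the labels (or pairs of labels) they are attached to.
    return -sum(p * math.log2(p) for p in masses if p > 0)

h_x  = entropy([0.1, 0.2, 0.3, 0.4])  # H(X) for p(x) = x/10, about 1.8464 bits
h_yz = entropy([0.3, 0.1, 0.4, 0.2])  # H(Y,Z) for the joint pmf above, also about 1.8464 bits
</syntaxhighlight>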
  
 
Again, we use our unified view in discussing conditional entropy. From the expression <math>H(X|Y)</math>, it should now be straightforward to extend the results to "heavier" expressions like <math>H(X,Y|Z)</math>, <math>H(X|Y,Z)</math>, or <math>H(W,X|Y,Z)</math>.

Consider fixing <math>Y=y</math>. The conditional probabilities <math>p(x|y)</math> form a valid probability mass function in their own right, and hence we can calculate the (regular) entropy <math>H(X|Y=y)</math>. This tells us that by covering all possible values <math>y\in\mathcal{Y}</math>, we will obtain <math>|\mathcal{Y}|</math> such values.

For example, consider the following joint pmf:

{| class="wikitable"
|-
! scope="col"|
! scope="col"| <math>Y=0</math>
! scope="col"| <math>Y=1</math>
|-
! scope="row"| <math>X=0</math>
| 0.1
| 0.2
|-
! scope="row"| <math>X=1</math>
| 0.3
| 0.4
|}

We can simulate sampling, say 20 pairs, from this joint distribution by using the following code:
 
<syntaxhighlight lang="python3">
import random

# draw 20 pairs (x, y) with replacement, according to the joint pmf above
samples = random.choices([(0,0), (0,1), (1,0), (1,1)], weights=[0.1, 0.2, 0.3, 0.4], k=20)
</syntaxhighlight>
 
which might generate the following list:

<syntaxhighlight lang="python3">
[(0, 1), (0, 1), (1, 1), (0, 1), (1, 1), (1, 1), (1, 0), (1, 0), (1, 1), (1, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1), (1, 1), (1, 0), (1, 0), (1, 1), (0, 1)]
</syntaxhighlight>

If we want to inspect <math>H(X|Y)</math>, we need to "categorize" the samples into those that have <math>Y=0</math>,

<syntaxhighlight lang="python3">
y0_samples = [(1, 0), (1, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
</syntaxhighlight>

and those that have <math>Y=1</math>,

<syntaxhighlight lang="python3">
y1_samples = [(0, 1), (0, 1), (1, 1), (0, 1), (1, 1), (1, 1), (1, 1), (1, 1), (0, 1), (1, 1), (1, 1), (1, 1), (1, 1), (0, 1)]
</syntaxhighlight>
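
In code, this categorization is just a filter on the second entry of each pair; a minimal sketch, assuming the <code>samples</code> list drawn above (your own draw will, of course, differ):

<syntaxhighlight lang="python3">
# split the sampled pairs according to the observed value of Y (the second entry of each tuple)
y0_samples = [pair for pair in samples if pair[1] == 0]
y1_samples = [pair for pair in samples if pair[1] == 1]
</syntaxhighlight>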
 
From the observed samples alone, we produce the following estimates:
 
 
* For <math>Y = 0</math>, we estimate the conditional entropy <math>H(X|Y=0)</math> as <math>0</math>.
* For <math>Y = 1</math>, we estimate the conditional entropy <math>H(X|Y=1)</math> as <math>0.9403</math>.

Note that <math>H(X|Y=0)</math> is estimated as zero because when we only look at the samples where <math>Y=0</math>, <math>X</math> does not exhibit any variation at all! In <code>y1_samples</code>, there are five pairs where <math>X=0</math> and nine pairs where <math>X=1</math>, from which we estimate the conditional distribution of <math>X</math> given <math>Y=1</math> to be <math>[5/14, 9/14]</math>, which has an entropy of around 0.9403 (using base-2 logarithms).
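
The 0.9403 figure can be reproduced directly from <code>y1_samples</code>; a minimal sketch:

<syntaxhighlight lang="python3">
import math

n1 = len(y1_samples)                                  # 14 pairs with Y = 1
p0 = sum(1 for x, y in y1_samples if x == 0) / n1     # 5/14
p1 = sum(1 for x, y in y1_samples if x == 1) / n1     # 9/14

# entropy of the estimated conditional distribution of X given Y = 1, in bits
h_x_given_y1 = -(p0 * math.log2(p0) + p1 * math.log2(p1))   # about 0.9403
</syntaxhighlight>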
  
 
The expression <math>H(X|Y)</math> uses the probability distribution of <math>Y</math> to give greater weight to more probable values. In our example, there are six pairs where <math>Y=0</math> and 14 pairs where <math>Y=1</math>, so based on the observed samples, we might estimate that <math>p(Y=0) = 0.3</math> and <math>p(Y=1) = 0.7</math>. By now, conditional entropy should be intuitive in the following sense: if we fix <math>Y</math> to some value, we ''induce'' a (conditional) probability distribution over <math>X</math> which corresponds to some entropy <math>H(X|Y=y)</math>. To get the ''overall'' conditional entropy, we take the weighted average of <math>H(X|Y=y)</math> for different values of <math>y</math>:

<math>H(X|Y) \approx 0.3(0) + 0.7(0.9403) \approx 0.658</math>

At this point, it is helpful to outline the process we just did:

* For each possible value <math>y \in \mathcal{Y}</math>, we calculate <math>H(X|Y=y)</math>.
* Calculate <math>H(X|Y)</math> as the weighted average of the values <math>H(X|Y=y)</math>, using the probabilities <math>p(y)</math> as weights.
 
From this specific example, the following general expression should now make sense:
 
 
<math>H(X|Y) = \sum_{y\in \mathcal{Y}} p(y)H(X|Y=y)</math>
 
{{Note|Checkpoint: Compare the above expression with other expressions in our previous discussion of [[Conditional Entropy and Mutual Information|conditional entropy]].|reminder}}
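
To tie the formula back to the simulated data, here is a minimal sketch that estimates <math>H(X|Y)</math> from the <code>samples</code> list above by computing the weighted average directly; the helper name <code>estimate_cond_entropy</code> is just an illustrative choice:

<syntaxhighlight lang="python3">
import math
from collections import Counter

def estimate_cond_entropy(pairs):
    # empirical H(X|Y): sum over y of p_hat(y) * H_hat(X | Y = y), in bits
    total = len(pairs)
    h = 0.0
    for y_value in set(y for _, y in pairs):
        xs = [x for x, y in pairs if y == y_value]
        counts = Counter(xs)
        h_given_y = -sum((c / len(xs)) * math.log2(c / len(xs)) for c in counts.values())
        h += (len(xs) / total) * h_given_y
    return h

# for the 20 samples shown earlier, this gives 0.3 * 0 + 0.7 * 0.9403, or about 0.658
h_x_given_y = estimate_cond_entropy(samples)
</syntaxhighlight>

Drawing a larger number of samples (a bigger <code>k</code> in <code>random.choices</code>) makes the same estimator settle near the exact value discussed in the next section.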
  
 
==Simulating a random source==
 
As shown in the previous section, we can use Python to create such a source of randomness. From our 20 sampled pairs, we estimated <math>H(X|Y) \approx 0.658</math>, but from the joint distribution, we can calculate the exact value <math>H(X|Y) \approx 0.875</math>, rounded to three decimal places. This means our estimate is off by around 25%! Fortunately, it is possible to use more samples to get as close to the true value as we want. That such precision is possible is a consequence of the law of large numbers.

The law of large numbers is an optimistic statement: it promises that our estimate of the probability of an event can be made very close to the true value, with very small variance. A close inspection of the definitions of information measures reveals that they all boil down to obtaining the necessary probabilities. Hence, if we have an accurate way of estimating the required probabilities, we can also estimate the information measures quite accurately. To reap the guarantee granted by the law of large numbers, we must have a way of generating independent and identically distributed samples.
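
For reference, the exact value follows from the joint pmf and the weighted-average expression of the previous section (all logarithms base 2):

<math>p(Y=0) = 0.1 + 0.3 = 0.4, \qquad H(X|Y=0) = -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{3}{4}\log_2\tfrac{3}{4} \approx 0.8113</math>

<math>p(Y=1) = 0.2 + 0.4 = 0.6, \qquad H(X|Y=1) = -\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3} \approx 0.9183</math>

<math>H(X|Y) = 0.4(0.8113) + 0.6(0.9183) \approx 0.875</math>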
 
Nowadays, producing high-quality sources of randomness is easy since they often come shipped with many of your favorite applications, from numerical software to spreadsheets or even photo-editing software. The Mersenne Twister, which forms the backbone of many of these sources, generates close-to-uniform random bitstreams that can be post-processed and "shaped" into a wide family of probability distributions.
  
== Exercise 1.1: Estimating mutual information (9 points) ==
  
 
With a fairly reliable source of randomness, it is now up to us to estimate information measures from the samples. This is useful when we want to reason about the amount of randomness in certain variables. Mutual information, in particular, finds its use in reasoning about how much information we can obtain about a given random variable <math>X</math> by observing ''another'' random variable <math>Y</math>.
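
Recall from our earlier discussion that mutual information can be expressed through entropies, e.g. <math>I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)</math>, so estimating the required probabilities from the samples is enough to estimate <math>I(X;Y)</math>.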
  
===Task description===
 
For your first graded programming exercise, you are to write a function <code>estimated_mutual_information</code> which will take one argument, <code>xy_samples</code>, which is a list of <code>tuple</code> objects corresponding to pairs of samples <math>(X,Y)</math>. From the list of tuples, the function should return a <code>float</code> corresponding to the estimated mutual information <math>I(X;Y)</math>. To make things simple, you may assume <math>X</math> and <math>Y</math> are integers.
 
 
 
The header script for Exercise 1 will simply import the <code>math</code> module. During testing, the following Python script will run:
{| class="wikitable"
|+ Python test script for execution
|-
! scope="row"| Header
|
<syntaxhighlight lang="python3">
import math
</syntaxhighlight>
|-
! scope="row"| Student submission (function only)
|
<syntaxhighlight lang="python3">
def estimated_mutual_information(xy_samples):
    # your code..
    # more code..

    return ans
</syntaxhighlight>
|-
! scope="row"| Main test script
|
<syntaxhighlight lang="python3">
# Main test script goes here, and will not be visible to students.
# The script will test if the submitted code does the prescribed functionality.
# For successfully validated scripts, test results will be sent via email.
</syntaxhighlight>
|-
|}
  
Please be careful not to call functions imported from libraries not specified in the header script. Doing that will cause the program to throw a <code>NameError</code>. Accessing global variables will similarly cause an error.
  
 
Do not include I/O commands such as <code>input</code>, <code>print</code>, or <code>open</code>. The validator script will return an error upon detecting usage of such functions.
 
===Sample input===
Here are some small test cases that you may try:

{| class="wikitable"
|-
! Input list
! Return value
|-
| <code>[(0,0),(0,1),(1,0),(1,1)]</code>
| <code>0.0</code>
|-
| <code>[(0,0),(0,1),(1,0)]</code>
| <code>0.25162</code>
|-
| <code>[(0,0),(1,1)]</code>
| <code>1.0</code>
|}

In your code, make sure that you use the base-2 logarithm in your calculations.
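
Before submitting, you can sanity-check your implementation against these cases locally; a minimal sketch, assuming your own <code>estimated_mutual_information</code> is already defined (keep the <code>print</code> call out of the submitted code itself):

<syntaxhighlight lang="python3">
# local check only; the submitted file should contain just the function definition
test_cases = [
    ([(0,0), (0,1), (1,0), (1,1)], 0.0),
    ([(0,0), (0,1), (1,0)], 0.25162),
    ([(0,0), (1,1)], 1.0),
]
for xy_samples, expected in test_cases:
    result = estimated_mutual_information(xy_samples)
    print(xy_samples, result, abs(result - expected) < 1e-3)
</syntaxhighlight>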
== Exercise 1.2: Conditional entropy (1 point) ==

Let <math>X</math> and <math>Y</math> be jointly distributed with the following distribution:

{| class="wikitable"
|-
! scope="col"|
! scope="col"| <math>Y=0</math>
! scope="col"| <math>Y=1</math>
|-
! scope="row"| <math>X=0</math>
| 0
| 1/3
|-
! scope="row"| <math>X=1</math>
| 1/6
| 1/2
|}

Now, consider a new random variable <math>Z</math> which is defined as the XOR of <math>X</math> and <math>Y</math>. Calculate the conditional entropy <math>H(X,Y|Z)</math>.

Submit your PDF solution (at most one page) [https://docs.google.com/forms/d/e/1FAIpQLSdE5LPd8nximVqi13JXg-rjWuSgiubKIdmHkb-YkvZj2B2OBw/viewform?usp=sf_link here].
