== Instructions ==
 
There are two parts to this exercise.

1. Read through the discussion about information and language.
2. After understanding the discussion, we'll replicate it so that you can get hands-on experience with the analysis.

'''Submission guidelines:'''
* For every programming exercise, submit your .ipynb file to your respective submission bin.
* To download your .ipynb file, go to '''File > Download > Download .ipynb''' in your Google Colab page.
* Don't forget to rename your .ipynb file using the format '''"class_lastname_firstname_studentnumber.ipynb"'''.
 
== Information and Language ==
 
=== Preliminary Concepts ===

One interesting application of information theory is determining the average information per letter in a language. In this programming exercise, we will explore what entropy means for the English language. Before we begin, let's set the context of our problem.

* We'll characterize the information content of the English language based only on the frequency of letters.
* We'll extend this problem to blocks of letters, but report the entropy relative to a single letter.
* We'll exclude conditional probabilities from this exercise because they are more challenging to build.

Let's begin by defining an '''N-block''' as a group of <math> N </math> letters combined to represent a single outcome of a random variable <math> X </math>. Consider the sentence:

'''"The quick brown fox jumps over the lazy dog"'''

When we say we want to get the '''1-block''', then <math> X </math> contains:

<math>\{t,h,e,q,u,i,c,k,b,r,o,w,n,f,x,j,m,p,s,v,l,a,z,y,d,g,\textrm{space}\} </math>

Of course, the elements can be re-arranged in order:

<math> \{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,\textrm{space}\} </math>

The set also contains the space character. For now, let's treat spaces as an important component of our analysis. The sentence above contains every letter of the alphabet. Remember, a random variable <math> X </math> also has a probability distribution associated with it. We can determine the probabilities based on the frequency of each letter over the entire sentence. For example:

* <math> \textrm{space} </math> appears <math> 8 </math> times
* <math> o </math> appears <math> 4 </math> times
* <math> e </math> appears <math> 3 </math> times
* And so on...

The total number of occurrences of all outcomes is <math> 43 </math>. The probability distribution would be:

* <math> P(a) = \frac{1}{43} = 0.023 </math>
* <math> P(b) = \frac{1}{43} = 0.023 </math>
* <math> P(c) = \frac{1}{43} = 0.023 </math>
* <math> P(d) = \frac{1}{43} = 0.023 </math>
* <math> P(e) = \frac{3}{43} = 0.070 </math>
* <math> P(f) = \frac{1}{43} = 0.023 </math>
* <math> P(g) = \frac{1}{43} = 0.023 </math>
* <math> P(h) = \frac{2}{43} = 0.047 </math>
* <math> P(i) = \frac{1}{43} = 0.023 </math>
* <math> P(j) = \frac{1}{43} = 0.023 </math>
* <math> P(k) = \frac{1}{43} = 0.023 </math>
* <math> P(l) = \frac{1}{43} = 0.023 </math>
* <math> P(m) = \frac{1}{43} = 0.023 </math>
* <math> P(n) = \frac{1}{43} = 0.023 </math>
* <math> P(o) = \frac{4}{43} = 0.093 </math>
* <math> P(p) = \frac{1}{43} = 0.023 </math>
* <math> P(q) = \frac{1}{43} = 0.023 </math>
* <math> P(r) = \frac{2}{43} = 0.047 </math>
* <math> P(s) = \frac{1}{43} = 0.023 </math>
* <math> P(t) = \frac{2}{43} = 0.047 </math>
* <math> P(u) = \frac{2}{43} = 0.047 </math>
* <math> P(v) = \frac{1}{43} = 0.023 </math>
* <math> P(w) = \frac{1}{43} = 0.023 </math>
* <math> P(x) = \frac{1}{43} = 0.023 </math>
* <math> P(y) = \frac{1}{43} = 0.023 </math>
* <math> P(z) = \frac{1}{43} = 0.023 </math>
* <math> P(\textrm{space}) = \frac{8}{43} = 0.186 </math>

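If you want to check this table yourself, here is a tiny Python sketch (the variable names are illustrative only) that builds the same relative-frequency distribution:

<syntaxhighlight lang="python">
from collections import Counter

sentence = "the quick brown fox jumps over the lazy dog"
counts = Counter(sentence)            # e.g. counts['o'] == 4, counts[' '] == 8
total = sum(counts.values())          # 43 occurrences in total

# Relative frequency of every outcome, including the space
probabilities = {symbol: c / total for symbol, c in counts.items()}
print(round(probabilities['e'], 3))   # 0.07
print(round(probabilities[' '], 3))   # 0.186
</syntaxhighlight>
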
We expanded the distribution for clarity and guidance. We can calculate the entropy for a 1-block using:

<math> H(X) = -\sum_{i=1}^n P(x_i) \log_2 (P(x_i)) </math>

Calculating the entropy for our simple sentence example yields <math> H(X) \approx 4.39 </math> bits. We'll define a new metric called the '''relative N-block entropy''' as the entropy of an N-block relative to a single letter. Mathematically this is:

<math> G_N(X) = -\frac{1}{N}\sum_{i=1}^n P(x_i) \log_2 (P(x_i)) = \frac{1}{N} H(X) </math>

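To make the 1-block computation concrete, here is a minimal Python sketch (the names <code>sentence</code>, <code>counts</code>, and <code>H</code> are just for illustration; this is not the graded <code> get_block_entropy(data,N) </code> function) that counts the letter frequencies of the example sentence and evaluates the entropy formula above:

<syntaxhighlight lang="python">
from collections import Counter
import math

# Example sentence, lowercased; the space is kept as a valid outcome.
sentence = "the quick brown fox jumps over the lazy dog"

counts = Counter(sentence)      # frequency of every 1-block outcome (27 of them)
total = sum(counts.values())    # 43 occurrences in total

# H(X) = -sum_i P(x_i) * log2(P(x_i))
H = -sum((c / total) * math.log2(c / total) for c in counts.values())

print(round(H, 2))              # prints 4.39 (bits), matching the value above
</syntaxhighlight>

Since <math> N = 1 </math> here, <math> G_1(X) = H(X) </math>.
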
We will see why we're interested in this later on. For a '''1-block''', <math> H(X) = G_1(X) \approx 4.39 </math> bits. Now, suppose we are interested in the '''2-block''' form. That means we need to cut the entire sentence into consecutive blocks of two letters. For example, in 2-block form <math> X </math> would have the outcomes:

<math>
\begin{align}
\{&(t,h),(e,\textrm{space}),(q,u),(i,c),(k,\textrm{space}), (b,r), (o,w), (n,\textrm{space}), \\
  &(f,o), (x,\textrm{space}), (j,u), (m,p), (s,\textrm{space}), (o,v), (e,r), (\textrm{space},t), (h,e), (\textrm{space},l), (a,z),(y,\textrm{space}),(d,o)\}
\end{align}
</math>

We simply take blocks of two letters and group each block into one outcome. Generating the list above should be straightforward. Here are a few things to consider:

* ''The order matters''. Although neither appears in the example above, the outcome <math> (a,b) </math> is distinct from <math> (b,a) </math>. The two outcomes are NOT the same, so we do not add up their occurrences.
* The last letter <math> g </math> was omitted from our list because it has no follow-up character. For now, let's treat it this way for convenience.
* Interestingly, the outcomes in the list are all distinct and therefore equiprobable. It is then easy to show that the entropy is <math> H(X) = \log_2(21) \approx 4.39 </math> bits.
* The '''relative N-block entropy''' is <math> G_2(X) = \frac{1}{2} H(X) \approx 2.20 </math> bits. Take note that this is the entropy of the 2-block averaged down to a single letter.

The factor <math> \frac{1}{N} </math> averages the entropy of an N-block down to a single letter. We are interested in this metric because we want to compare our results in an ''apples-to-apples'' way. Let's try one last example for a '''3-block''' set. <math> X </math> would contain:

<math>
\begin{align}
\{&(t,h,e),(\textrm{space},q,u),(i,c,k),(\textrm{space},b,r), (o,w,n), (\textrm{space},f,o), (x,\textrm{space},j), \\
  &(u,m,p), (s,\textrm{space},o), (v,e,r), (\textrm{space},t,h), (e,\textrm{space},l), (a,z,y),(\textrm{space},d,o) \}
\end{align}
</math>

For the 3-block list we have:

* Combinations of three letters.
* The last letter <math>g</math> was omitted again because it has no follow-up letters.
* All outcomes are again distinct and equiprobable. This leads to <math> H(X) = \log_2(14) \approx 3.81 </math> bits.
* The '''relative 3-block entropy''' is <math> G_3(X) = \frac{1}{3} H(X) \approx 1.27 </math> bits per letter.

In summary:

* '''N-block''' pertains to the number of letters we group together; each N-block size produces its own list of combined outcomes.
* We defined <math>G_N(X) = \frac{1}{N} H(X)</math> as the '''relative N-block entropy''', i.e., the entropy of an N-block averaged down to a single letter.

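As a sanity check on the worked examples above, here is a minimal Python sketch (the helper name <code>block_entropy_sketch</code> is illustrative only; it is not the graded <code> get_block_entropy(data,N) </code> function) that forms non-overlapping N-blocks, drops the leftover letters, and computes <math> G_N(X) </math>:

<syntaxhighlight lang="python">
from collections import Counter
import math

def block_entropy_sketch(text, n):
    """Entropy (in bits) of the non-overlapping n-blocks of text; leftover characters are dropped."""
    blocks = [text[i:i + n] for i in range(0, len(text) - n + 1, n)]
    counts = Counter(blocks)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sentence = "the quick brown fox jumps over the lazy dog"
for n in (1, 2, 3):
    h = block_entropy_sketch(sentence, n)
    print(f"N={n}: H(X) = {h:.2f} bits, G_N(X) = {h / n:.2f} bits per letter")

# Output reproduces the worked values:
#   N=1: H(X) = 4.39 bits, G_N(X) = 4.39 bits per letter
#   N=2: H(X) = 4.39 bits, G_N(X) = 2.20 bits per letter
#   N=3: H(X) = 3.81 bits, G_N(X) = 1.27 bits per letter
</syntaxhighlight>

Note that this sketch uses non-overlapping blocks, matching the hand-generated lists above.
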
=== Interpretation of N-block Information ===

Let's review some of the results we obtained earlier:

* For the 1-block we have <math>G_1(X) \approx 4.39 </math> bits per letter.
* For the 2-block we have <math>G_2(X) \approx 2.20 </math> bits per letter.
* For the 3-block we have <math>G_3(X) \approx 1.27 </math> bits per letter.

Again, we say per letter because we averaged the entropy by dividing it by the block size. As the block size increases, the entropy per letter decreases. This trend gives an interesting insight into how much information each "letter" carries. For the 1-block case, the relative entropy is high. This makes sense if we imagine a scenario where we receive a single letter only every 10 seconds. For example:

'''t''' - 10 seconds - '''h''' - 10 seconds - '''e''' - 10 seconds - '''space''' - 10 seconds

If we look at the letters one by one, we cannot decipher the message immediately. There is high uncertainty when we look at things on a per-letter basis only. Now, suppose we look at things using a 2-block size. Then we would obtain something like:

'''th''' - 10 seconds - '''e space''' - 10 seconds - '''qu''' - 10 seconds - '''ic''' - 10 seconds

At some point, we start to recognize that some combinations like '''th''' might lead to a word we know. We have less uncertainty about each letter! You might ask, "We computed the relative entropy for a single letter, but you're showing blocks of two letters?" Yes! That is exactly why we look at things on a per-letter basis. When we increase the block size, the letter '''t''' now looks less intimidating: since we know that '''t''' is followed by an '''h''', our mind starts to think about possible follow-up combinations. The same goes for '''h'''. Since '''t''' came before '''h''', '''h''' also becomes less intimidating because our mind starts to connect the dots. Let's look at a 3-block example:

'''the''' - 10 seconds - '''space qu''' - 10 seconds - '''ick''' - 10 seconds - '''space br''' - 10 seconds

Now, the letter '''t''' at the beginning "feels" almost completely certain. As readers, we know it is part of the word '''the''' and we don't worry about it. Hence, the relative entropy of a single letter in a 3-block is less than in a 2-block or 1-block. Technically, the entropies we measured are all joint entropies:

<math> H(X_1,X_2,...,X_N) = -\sum_{i=1}^n \sum_{j=1}^m ... \sum_{k=1}^o P(x_i,x_j,...,x_k) \log_2 (P(x_i,x_j,...,x_k)) </math>

Each joint entropy decomposes into conditional entropies through the chain rule (e.g., <math> P(x_i,x_j) = P(x_i|x_j)P(x_j) </math>, so <math> H(X_1,X_2) = H(X_2) + H(X_1|X_2) </math>). So when we measured the information content of the N-blocks, the dependencies between adjacent letters were implicitly taken into account. Therefore, our observation that the relative entropy decreases as the block size increases makes sense. Our interpretation is that a letter's uncertainty decreases because the adjacent letters add information about that specific letter. This is also consistent with how our mind "sort of" understands information in language. It's interesting to observe that we did not care about the meaning of the sentence; we were interested only in the frequency of outcomes.

 
== Programming Exercise ==
 
In our programming exercise, we will analyze four different languages: English, French, German, and Tagalog. We will study how the information content differs for each language and how it behaves as we increase the N-block size. Go to your UVLE page and make a copy of the Google Colab notebook prepared for you. The exercise is well guided and you only need to work on the <code> get_block_entropy(data,N) </code> function. The details are already in the notebook.

There are 5 steps to this exercise:

* Step 1 - Importing useful tools
* Step 2 - Making the block-entropy function (5 pts.)
* Step 3a - Sanity checking (1 pt.)
* Step 3b - Filling up the table (1 pt.)
* Step 4 - Plotting stage (see the plotting sketch after this list)
* Step 5 - Results and discussion (3 pts.)

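For the plotting stage, here is a minimal sketch of the kind of figure you might produce. It reuses the illustrative <code>block_entropy_sketch</code> helper on the example sentence only; your notebook will plot the actual language data instead.

<syntaxhighlight lang="python">
from collections import Counter
import math
import matplotlib.pyplot as plt

def block_entropy_sketch(text, n):
    """Same illustrative helper as above: entropy (in bits) of non-overlapping n-blocks."""
    blocks = [text[i:i + n] for i in range(0, len(text) - n + 1, n)]
    counts = Counter(blocks)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sentence = "the quick brown fox jumps over the lazy dog"
block_sizes = range(1, 6)
relative_entropies = [block_entropy_sketch(sentence, n) / n for n in block_sizes]

# Plot the relative N-block entropy against the block size
plt.plot(list(block_sizes), relative_entropies, marker="o")
plt.xlabel("Block size N")
plt.ylabel("Relative N-block entropy $G_N(X)$ (bits per letter)")
plt.title("Relative N-block entropy of the example sentence")
plt.grid(True)
plt.show()
</syntaxhighlight>
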
Some notes:

* No need to use recursion.
* The <code> get_block_entropy(data,N) </code> function can be implemented in fewer than 20 lines, although keeping it under 20 lines is not a requirement.
* Please work on this INDIVIDUALLY. Any sign of cheating will be reprimanded. You wouldn't want to get to this part 😡.
* Please appreciate the exercise and have fun! 😍

== Grading Rubrics ==

* The function needs to work. We will check it carefully.
* The points indicated for each step are awarded as listed.
