2S2122 Activity 2.2

 
== Information and Language ==
 
One interesting application of entropy in everyday life is determining the average information per letter in a language. In this programming exercise, we will explore what entropy means for the English language. Before we begin, let's set the context of our problem:

* We'll characterize the information content of the English language based only on the frequency of letters.
* We'll expand the problem to blocks of letters, but always report the entropy per single letter.
* We'll exclude conditional probabilities from this exercise because they are more challenging to build.

Let's begin by defining an '''N-block''' as a group of <math> N </math> letters that we combine to represent a single outcome of a random variable <math> X </math>. Consider the sentence:


'''"The quick brown fox jumps over the lazy dog"'''


When we say we want the '''1-block''', then <math> X </math> takes values in <math>\{t,h,e,q,u,i,c,k,b,r,o,w,n,f,x,j,m,p,s,v,l,a,z,y,d,g,\textrm{space}\} </math>. Of course, the elements can be rearranged into the order <math> \{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,\textrm{space}\} </math>. The set also contains the space; for now, let's treat spaces as an important component of our analysis. The sentence above contains every letter of the alphabet (it is a pangram). Remember that a random variable <math> X </math> also has a probability distribution associated with it. We can determine the probabilities from the frequency of each outcome over the entire sentence. For example, <math> \textrm{space} </math> appears <math> 8 </math> times, the letter <math> o </math> appears <math> 4 </math> times, the letter <math> e </math> appears <math> 3 </math> times, and so on. The total number of occurrences across all outcomes is <math> 43 </math>. The probability distribution would be:

* <math> P(a) = \frac{1}{43} = 0.023 </math>
* <math> P(b) = \frac{1}{43} = 0.023 </math>
* <math> P(c) = \frac{1}{43} = 0.023 </math>
* <math> P(d) = \frac{1}{43} = 0.023 </math>
* <math> P(e) = \frac{3}{43} = 0.070 </math>
* <math> P(f) = \frac{1}{43} = 0.023 </math>
* <math> P(g) = \frac{1}{43} = 0.023 </math>
* <math> P(h) = \frac{2}{43} = 0.047 </math>
* <math> P(i) = \frac{1}{43} = 0.023 </math>
* <math> P(j) = \frac{1}{43} = 0.023 </math>
* <math> P(k) = \frac{1}{43} = 0.023 </math>
* <math> P(l) = \frac{1}{43} = 0.023 </math>
* <math> P(m) = \frac{1}{43} = 0.023 </math>
* <math> P(n) = \frac{1}{43} = 0.023 </math>
* <math> P(o) = \frac{4}{43} = 0.093 </math>
* <math> P(p) = \frac{1}{43} = 0.023 </math>
* <math> P(q) = \frac{1}{43} = 0.023 </math>
* <math> P(r) = \frac{2}{43} = 0.047 </math>
* <math> P(s) = \frac{1}{43} = 0.023 </math>
* <math> P(t) = \frac{2}{43} = 0.047 </math>
* <math> P(u) = \frac{2}{43} = 0.047 </math>
* <math> P(v) = \frac{1}{43} = 0.023 </math>
* <math> P(w) = \frac{1}{43} = 0.023 </math>
* <math> P(x) = \frac{1}{43} = 0.023 </math>
* <math> P(y) = \frac{1}{43} = 0.023 </math>
* <math> P(z) = \frac{1}{43} = 0.023 </math>
* <math> P(\textrm{space}) = \frac{8}{43} = 0.186 </math>

We expanded the distribution for clarity.
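
To make this concrete, here is a minimal Python sketch you could run in your Colab notebook to reproduce the distribution above. The variable names are only illustrative and not part of any exercise template:

<syntaxhighlight lang="python">
from collections import Counter

# Build the 1-block probability distribution of the pangram.
sentence = "The quick brown fox jumps over the lazy dog".lower()

counts = Counter(sentence)           # counts each letter and the space
total = sum(counts.values())         # 43 outcomes for this sentence

probs = {symbol: n / total for symbol, n in counts.items()}

for symbol in sorted(probs):
    label = "space" if symbol == " " else symbol
    print(f"P({label}) = {counts[symbol]}/{total} = {probs[symbol]:.3f}")
</syntaxhighlight>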
  
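
Once the distribution is available, the average information per letter follows from the entropy <math> H(X) = -\sum_x p(x) \log_2 p(x) </math>. The sketch below is one possible way to do this; the function name and the choice of non-overlapping blocks are assumptions for illustration, not requirements of the exercise. It also shows how the same count extends to N-blocks while still reporting the entropy per single letter:

<syntaxhighlight lang="python">
import math
from collections import Counter

def per_letter_entropy(text, n=1):
    """Entropy per letter (in bits), estimated from non-overlapping n-blocks."""
    text = text.lower()
    # Split the text into non-overlapping blocks of n characters.
    blocks = [text[i:i + n] for i in range(0, len(text) - n + 1, n)]
    counts = Counter(blocks)
    total = sum(counts.values())
    # H(X) = -sum p(x) log2 p(x), then divide by n to express it per letter.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / n

sentence = "The quick brown fox jumps over the lazy dog"
print(per_letter_entropy(sentence, n=1))   # roughly 4.4 bits per letter here
print(per_letter_entropy(sentence, n=2))   # per-letter estimate from 2-blocks
</syntaxhighlight>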
 
== Programming Exercise ==
 


=== Instructions ===

There are two parts to this exercise. First, read through the discussion on information and language above. Second, after understanding the discussion, we'll replicate it so that you can get hands-on experience with the analysis.

Submission guidelines:

* For every programming exercise, submit your .ipynb file to your respective submission bin.
* To download your .ipynb file, go to File > Download > Download .ipynb in your Google Colab page.
* Don't forget to rename your .ipynb file using the format "class_lastname_firstname_studentnumber.ipynb".
