Difference between revisions of "Introduction to CoE 161"
Line 189: | Line 189: | ||
\end{align}</math>|{{EquationRef|17}}}} | \end{align}</math>|{{EquationRef|17}}}} | ||
− | Since <math>\sum_{i=1}^n\tfrac{1}{n} = 1</math>, we can use the | + | Since <math>\sum_{i=1}^n\tfrac{1}{n} = 1</math>, we can use the Gibbs Inequality in obtaining the last step. Note that in this case, the equality occurs when <math>p_i = \tfrac{1}{n}</math>. |
We can then rewrite eq. 17 as: | We can then rewrite eq. 17 as: |
Revision as of 22:31, 9 September 2020
Welcome to CoE 161 / CoE 197!
Since we are offering this class remotely, there will be many changes to our normal course delivery:
- There will be no face-to-face lecture classes. All the material will be made available via this site.
- There will be more emphasis on student-centric activities, e.g. analysis, design, and simulations. Thus, you will be mostly "learning by doing". In this context, we will set aside an hour every week for consultations and questions via video-conferencing.
- Grades will be based on the submitted deliverables from the activities. Though we will not be very strict regarding the deadlines, it is a good idea to keep up with the class schedule and avoid cramming later in the semester.
Let's get started!
Contents
Measuring Complexity
For any system we are studying, designing, or building, an interesting and important question is usually: "How complex is this system?" or "How complex should this system be?". In general, we want to be able to compare two systems, and be able to say that one system is more complex than the other. In this case, having a numerical metric would be very convenient.
Possible ways of measuring complexity (obviously not a complete list!):
- Human observation, thus making any kind of rating subjective.
- Complexity based on the number of parts, components, or distinct elements. However, this depends on the definition of what is a "part". This is dependent on the observer's scale, from functional components down to atoms at the extreme case, as shown in Fig. 1.
- Number of dimensions. This is sometimes hard to measure, e.g. degrees of freedom, etc.
- Number of parameters controlling the system.
- The minimal description needed. However, we need to figure out which language we need to use.
- Information content. We then need to figure out how to define and measure information.
- Minimum generator or constructor (to build the system). We need to define what tools and methods we need to use to build the system.
- Minimum energy and/or time to construct the system. However, some systems evolve with time, so defining the beginning and end points can be difficult.
All of these measures are associated with a model of the system in question. Thus observers could use different models, and therefore come up with different notions of the complexity of the said system. We do not expect to come up with a single universal measure of complexity, but we can explore a framework that is optimal for (a) a particular observer, (b) a particular context, and (c) a particular purpose. Let us focus on the measures related to how unexpected an observation is, an approach known as Information Theory.
Some of the ideas and concepts that follow are based on basic probability ideas. You can review them here: Probability Review I.
The Basics of Information Theory
We want to create a metric for measuring the information we get from observing the occurrence of an event that has a probability . Let us ignore all other features of the event, except whether the event occurred or not. We can then think of the event as seeing if a symbol, whose probability of occurring is , appears. Thus, our definition of information is in terms of the probability .
An Axiomatic Approach
Let us define our information measure as . We want to have several properties:
Axiom | Implied Property |
---|---|
Information is a non-negative quantity. |
|
If an event has a , then we get no information from the occurrence of that event. |
|
If two independent events occur, then the information we get from observing the two events is the sum of the two 'informations'. |
|
is a continuous and monotonic function of the probability . |
Using these properties, we can then derive the following expressions:
-
(1)
-
Thus, by induction:
-
(2)
-
We can then write , or equivalently:
-
(3)
-
In general, we get:
-
(4)
-
Since is continuous, and for and for a real number , we get:
-
(5)
-
From this, and by inspection, we can express , for some base as:
-
(5)
-
Note that:
-
(6)
-
Thus, we can easily change the base of the logarithm, and the resulting information measure are differ by just a multiplicative constant, and have the following units:
b | Units |
---|---|
2 | bits (from binary) |
3 | trits (from trinary) |
nats (from natural logarithm) | |
10 | Hartleys |
By default, and unless specified, we will be using , so when we write , we mean .
A Simple Example
If we flip a fair coin, we will get two possible events: , with probabilities . Thus, a one coin flip gives us bit of information, regardless of the result ( or ).
If we flip the coin times, or equivalently, if we flip coins at the same time, the amount of information (in bits) that we would get is equal to:
-
(7)
-
Enumerating the sequence of 16 flips would give us , or using a 1 for and 0 for , we would get 1101011011011011. Notice that fair coin flips gives us bits of information, and requires a binary word that is digits long to describe.
Introducing Entropy
Consider a source generating a stream of symbols. If there are distinct symbols, , and the source emits these symbols with probabilities respectively. Let us assume that the symbols are emitted independently, e.g. the emitted symbol does not depend on past symbols.
If we observe the symbol , we get bits of information from that particular observation.
For observations, the symbol would occur approximately times. Therefore, if we have independent observations, the total information, , that we will get is:
-
(8)
-
The average information that we get per symbol that we observe would then be equal to:
-
(9)
-
Note that since , we will define when .
The Definition of Entropy
Given a discrete probability distribution , the entropy of the distribution is:
-
(10)
-
For a continuous probability distribution, , entropy is defined as:
-
(11)
-
This definition was introduced by Claude Shannon in his landmark 1948 paper[1]. For more information on the life of Claude Shannon, you can read this article[2], or watch this documentary film[3].
Alternative Definition of Entropy
For a discrete probability distribution the expected value of an associated discrete set is given by:
-
(12)
-
Also, for a continuous probability distribution, , and an associated function :
-
(13)
-
Hence, for , or , we can see that:
-
(14)
-
Thus, we can think of entropy of a probability distribution as the expected value of the information of the distribution.
The Gibbs Inequality
Consider the natural logarithm function, . The slope of the the line tangent to at any point is equal to . Thus, at , the equation of the tangent line is , as seen in Fig. 2. Since is concave down, and for , we get the following inequality:
-
(15)
-
The equality occurs when .
For two probability distributions, and , and using eq. 15, we get the Gibbs Inequality:
-
(16)
-
This time, the equality occurs when for all . Though we derived the inequality for base , it is straightforward to show that the inequality holds for any base.
Let's use the Gibb Inequality to find the probability distribution that maximizes the entropy. For a probability distribution, :
-
(17)
-
Since , we can use the Gibbs Inequality in obtaining the last step. Note that in this case, the equality occurs when .
We can then rewrite eq. 17 as:
-
(18)
-
We can now look at the probability distributions at the upper and lower bounds of :
- At the lower bound, when one of the 's is equal to 1, and the rest are all zero.
- At the upper bound, when for all .
Thus, the maximum of the entropy function is equal to , where is the number of all possible events, and occurs when all the events are equally likely.