Independence and Markov Chains


Independence as factorability

If $X$ and $Y$ are discrete random variables, we can use Bayes' theorem to write their joint distribution as $p_{X,Y}(x,y) = p_X(x)\, p_{Y|X}(y|x)$ or $p_{X,Y}(x,y) = p_Y(y)\, p_{X|Y}(x|y)$. If the two random variables are independent, we can get rid of conditioning and write $p_{X,Y}(x,y) = p_X(x)\, p_Y(y)$. As a shorthand, we write $X \perp Y$ to mean that $X$ and $Y$ are independent. Immediately, we see that independence brings some convenience in allowing us to factor a joint distribution into the product of the marginal distributions. The computational ease it brings becomes more apparent when we deal with multiple random variables.
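
A quick way to test independence numerically is to compare the joint table against the outer product of its marginals. The sketch below does this for a small, made-up joint pmf (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical joint pmf p(x, y); rows index x, columns index y.
p_xy = np.array([[0.12, 0.28],
                 [0.18, 0.42]])

p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

# X and Y are independent iff the joint equals the outer product of marginals.
print(np.allclose(p_xy, np.outer(p_x, p_y)))  # True for this table
```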

Mutual independence

For jointly distributed random variables $X_1, X_2, \ldots, X_n$, an extension of Bayes' theorem gives us the following, not-so-convenient, factorization

$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 | x_1)\, p(x_3 | x_1, x_2) \cdots p(x_n | x_1, \ldots, x_{n-1}),$

but if the random variables are mutually independent, then the joint distribution reduces to

$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2) \cdots p(x_n).$

Of course, whenever possible, such simplifications are most welcome. However, mutual independence between many random variables is often too restrictive to hold, and is seldom encountered in practical settings. A more relaxed notion is that of pairwise independence, which only assumes that every pair of random variables is independent. Obviously, pairwise independence immediately follows from mutual independence. The reverse direction, however, is not true in general.

Checkpoint: Give an example of three random variables that are pairwise independent but not mutually independent.
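
One way to attack this checkpoint is to propose a candidate triple and verify both kinds of independence by brute-force enumeration. The sketch below does this for a classic candidate, two independent fair bits and their XOR; `marginal` is an ad-hoc helper written just for this check:

```python
import itertools

# Candidate: X, Y independent fair bits, Z = X XOR Y.
# Each of the four outcomes (x, y) is equally likely.
p = {(x, y, x ^ y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}

def marginal(idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for t, prob in p.items():
        key = tuple(t[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

# Pairwise independence: every pair factorizes into its marginals.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    pij, pi, pj = marginal([i, j]), marginal([i]), marginal([j])
    assert all(abs(pij.get((a, b), 0.0) - pi[(a,)] * pj[(b,)]) < 1e-12
               for a in [0, 1] for b in [0, 1])

# Mutual independence fails: p(0, 0, 0) = 1/4, but the product of the
# three marginals gives (1/2)^3 = 1/8.
print(p[(0, 0, 0)], 0.5 ** 3)  # 0.25 0.125
```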

Conditional independence

Another useful notion of independence is that of conditional independence. Let us start by writing out the definition. We say that two random variables $X$ and $Y$ are conditionally independent given a third random variable $Z$ if we can write the joint distribution as

$p_{X,Y,Z}(x,y,z) = f(x,z)\, g(y,z)$

for some functions $f$ and $g$. This definition should make sense if you temporarily "cover" the variable $z$: what remains is a factorization into separate functions of $x$ and $y$. Note that the functions $f$ and $g$ do not necessarily stand for the joint distributions $p_{X,Z}$ and $p_{Y,Z}$, respectively.
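
To see the definition in action, the following sketch builds a joint pmf of the form $f(x,z)\, g(y,z)$ from two arbitrary nonnegative tables and confirms that $p(x,y|z) = p(x|z)\, p(y|z)$; the tables are random placeholders, not distributions from any particular example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonnegative factors f(x, z) and g(y, z).
f = rng.random((2, 3))  # f[x, z]
g = rng.random((4, 3))  # g[y, z]

# Joint p(x, y, z) proportional to f(x, z) * g(y, z).
p = np.einsum('xz,yz->xyz', f, g)
p /= p.sum()

p_z = p.sum(axis=(0, 1))     # p(z)
p_xy_z = p / p_z             # p(x, y | z)
p_x_z = p.sum(axis=1) / p_z  # p(x | z)
p_y_z = p.sum(axis=0) / p_z  # p(y | z)

# Conditional independence: p(x, y | z) = p(x | z) * p(y | z).
print(np.allclose(p_xy_z, np.einsum('xz,yz->xyz', p_x_z, p_y_z)))  # True
```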

Markov chains

Markov chains are sequences of random variables that take conditional independence a step further. We start with the formal definition, and then illustrate why it makes sense to call them chains.

Definition

A sequence of random variables $X_1, X_2, \ldots, X_n$ is a Markov chain, denoted by $X_1 - X_2 - \cdots - X_n$, if it satisfies the following (equivalent) properties:

  • For any $k$, we have $p(x_{k+1} | x_1, x_2, \ldots, x_k) = p(x_{k+1} | x_k)$.
  • The joint distribution factorizes as follows: $p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 | x_1)\, p(x_3 | x_2) \cdots p(x_n | x_{n-1})$.

The first property is known as the Markov property. Informally, we can think of the game Snakes and Ladders, with $X_{k+1}$ representing our next move. If we have made $k$ moves before the $(k+1)$th one, then the first $k-1$ moves are irrelevant once we know the most recent move $X_k$. We can illustrate this effect by considering a chain linking $X_1, X_2, \ldots, X_n$ together. Once we know a random variable somewhere in the middle, the chain "breaks" and any pair of random variables from opposite sides of the breakage becomes conditionally independent. For example, in the figure below, $X_1$ is conditionally independent of $X_n$ given any intermediate $X_k$.

[Figure: Markov-chain.svg]

The second property should remind us of the factorization property that is (1) simple enough for us not to be bogged down by the full expansion based solely on Bayes' theorem, and (2) much less restrictive than requiring all random variables to be mutually independent. Upon noting that $p(x_1, \ldots, x_{k+1}) = p(x_1, \ldots, x_k)\, p(x_{k+1} | x_k)$ for a Markov chain, the second property provides us the following procedure for getting the joint distribution (a code sketch follows the list):

  • Obtain the probability $p(x_1)$.
  • For $k = 1, 2, \ldots, n-1$, multiply by the conditional probability $p(x_{k+1} | x_k)$, remembering the previous rv ($X_k$) and forgetting all earlier rvs ($X_1, \ldots, X_{k-1}$).
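
Here is a minimal sketch of this procedure for a hypothetical three-variable chain whose initial pmf and transition tables are generated at random; it also confirms the chain-breaking claim from the figure above:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_pmf(*shape):
    """Hypothetical helper: a random table whose last axis sums to 1."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# A short chain X1 - X2 - X3, each variable taking 3 values.
p1 = random_pmf(3)       # step 1: obtain p(x1)
t12 = random_pmf(3, 3)   # p(x2 | x1), rows indexed by x1
t23 = random_pmf(3, 3)   # p(x3 | x2), rows indexed by x2

# Step 2: multiply in p(x_{k+1} | x_k), remembering only the latest rv.
p123 = np.einsum('a,ab,bc->abc', p1, t12, t23)  # indexed [x1, x2, x3]
assert np.isclose(p123.sum(), 1.0)

# The chain "breaks" at X2: p(x1, x3 | x2) = p(x1 | x2) p(x3 | x2).
p2 = p123.sum(axis=(0, 2))              # p(x2)
p1_g2 = p123.sum(axis=2) / p2           # p(x1 | x2), indexed [x1, x2]
p3_g2 = p123.sum(axis=0) / p2[:, None]  # p(x3 | x2), indexed [x2, x3]
lhs = p123 / p2[None, :, None]          # p(x1, x3 | x2)
print(np.allclose(lhs, np.einsum('ab,bc->abc', p1_g2, p3_g2)))  # True
```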

Just like in the case of conditionally independent rvs, we can characterize Markov chains by the following more general factorization:

$p(x_1, x_2, \ldots, x_n) = f_1(x_1, x_2)\, f_2(x_2, x_3) \cdots f_{n-1}(x_{n-1}, x_n)$

for some functions $f_1, f_2, \ldots, f_{n-1}$. Again, these functions do not necessarily correspond to joint distributions between adjacent rvs $X_k$ and $X_{k+1}$.
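
As a quick check of this characterization, the sketch below builds a three-variable joint from two arbitrary nonnegative factors and confirms the Markov property $p(x_3 | x_1, x_2) = p(x_3 | x_2)$; the factors are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical nonnegative factors f1(x1, x2) and f2(x2, x3); any joint
# proportional to their product should be a Markov chain X1 - X2 - X3.
f1 = rng.random((3, 3))
f2 = rng.random((3, 3))

p = np.einsum('ab,bc->abc', f1, f2)  # indexed [x1, x2, x3]
p /= p.sum()

# Markov property: p(x3 | x1, x2) must not depend on x1.
p12 = p.sum(axis=2)           # p(x1, x2)
p3_g12 = p / p12[:, :, None]  # p(x3 | x1, x2)
print(np.allclose(p3_g12, p3_g12[0]))  # True: every x1-slice agrees
```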

Some properties

Markov chains are prevalent in situations where we pass messages drawn from an information source into successive stages of processing. A main result for Markov chains is that data processing, in general, results in a decrease in the amount of information. In the next discussion, we formalize this notion using information measures and identify cases where data processing can be performed without loss of information. As such, we need to be familiar with how we can manipulate Markov chains. Below are some properties that are useful in studying Markov chains:

Let $X_1 - X_2 - \cdots - X_n$ be a Markov chain; then the following are also Markov chains:

  • $X_n - X_{n-1} - \cdots - X_1$.
  • $(X_1, X_2, \ldots, X_k) - X_{k+1} - X_{k+2} - \cdots - X_n$, where $1 \le k < n$.
  • $X_{i_1} - X_{i_2} - \cdots - X_{i_m}$, where $1 \le i_1 < i_2 < \cdots < i_m \le n$.

Let us provide intuitive, English descriptions of the above properties. The first property tells us that Markov chains are bidirectional. The second property is a telescoping property, which allows us to "collapse" a Markov chain by grouping the first few entries. The third property tells us that any order-preserving subsequence from a Markov chain is a Markov chain.
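
These properties can also be verified numerically. The sketch below builds a four-variable chain from random tables and checks reversal, telescoping (via a reshape that groups $X_1$ and $X_2$ into one 9-valued rv), and the subsequence $X_1 - X_2 - X_4$; `random_pmf` and `is_markov` are illustrative helpers, not library functions:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_pmf(*shape):
    """Hypothetical helper: a random table whose last axis sums to 1."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

def is_markov(joint):
    """Check p(x_{k+1} | x_1..x_k) = p(x_{k+1} | x_k) for every k."""
    n = joint.ndim
    for k in range(1, n):
        head = joint.sum(axis=tuple(range(k + 1, n)))  # p(x_1..x_{k+1})
        prev = head.sum(axis=k)                        # p(x_1..x_k)
        lhs = head / prev[..., None]                   # p(x_{k+1} | x_1..x_k)
        pair = joint.sum(axis=tuple(i for i in range(n) if i not in (k - 1, k)))
        rhs = pair / pair.sum(axis=1, keepdims=True)   # p(x_{k+1} | x_k)
        if not np.allclose(lhs, rhs[(None,) * (k - 1)]):
            return False
    return True

# A chain X1 - X2 - X3 - X4, each variable over 3 states.
p1, t12, t23, t34 = random_pmf(3), random_pmf(3, 3), random_pmf(3, 3), random_pmf(3, 3)
p = np.einsum('a,ab,bc,cd->abcd', p1, t12, t23, t34)  # indexed [x1, x2, x3, x4]

assert is_markov(p)                        # the original chain
assert is_markov(p.transpose(3, 2, 1, 0))  # reversal: X4 - X3 - X2 - X1
assert is_markov(p.reshape(9, 3, 3))       # telescoping: (X1, X2) - X3 - X4
assert is_markov(p.sum(axis=2))            # subsequence: X1 - X2 - X4
print("all checks passed")
```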


Checkpoint: Let $X_1 - X_2 - \cdots - X_n$ be a Markov chain. Use the properties in this section to show that the following are Markov chains: