Independence and Markov Chains
== Independence as factorability ==
If <math>X</math> and <math>Y</math> are discrete random variables, we can use Bayes' theorem to write their joint distribution as <math>p(X,Y) = p(X)\,p(Y|X)</math> or <math>p(X,Y) = p(Y)\,p(X|Y)</math>. If the two random variables are independent, we can get rid of the conditioning and write <math>p(X,Y) = p(X)\,p(Y)</math>. As a shorthand, we write <math>X \perp Y</math> to mean that <math>X</math> and <math>Y</math> are independent. Immediately, we see that independence brings some convenience in allowing us to factor a joint distribution into the product of the marginal distributions. The computational ease it brings becomes more apparent when we deal with multiple random variables.
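To make the factorization concrete, here is a minimal numerical sketch (our illustration, not part of the original article; it assumes NumPy is available): we tabulate a small joint pmf and check that it equals the outer product of its marginals.

<syntaxhighlight lang="python">
import numpy as np

# A toy 2x3 joint pmf p(X, Y), chosen so that it factors exactly.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.30, 0.15]])

p_x = p_xy.sum(axis=1)   # marginal p(X)
p_y = p_xy.sum(axis=0)   # marginal p(Y)

# X and Y are independent iff p(x, y) = p(x) p(y) for every pair (x, y).
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # True for this table
</syntaxhighlight>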
=== Mutual independence ===
For jointly distributed random variables <math>X_1, X_2, ..., X_n</math>, an extension of Bayes' theorem gives us the following, not-so-convenient, factorization

<math>p(X_1, X_2, ..., X_n) = p(X_1)\,p(X_2|X_1)\,p(X_3|X_1,X_2) \cdots p(X_n|X_1, ..., X_{n-1})</math>,
but if the <math>n</math> random variables are ''mutually independent'', then the joint distribution reduces to

<math>p(X_1, X_2, ..., X_n) = p(X_1)\,p(X_2) \cdots p(X_n).</math>
Of course, whenever possible, such simplifications are most welcome. However, ''mutual independence'' between many random variables is often too restrictive to be useful and is seldom encountered in practice. A more relaxed notion is that of ''pairwise independence'', which only assumes that every pair of random variables is independent. Pairwise independence follows immediately from mutual independence; the reverse direction, however, is not true in general.

{{Note|Checkpoint: Give an example of three random variables that are pairwise independent but not mutually independent.}}
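For readers who want to experiment with the checkpoint, the following sketch (again our own illustration, assuming NumPy; the helper names are hypothetical) tests both notions on a joint pmf stored as a 3-D array <code>p[x, y, z]</code>:

<syntaxhighlight lang="python">
import numpy as np

def marginal(p, keep):
    """Marginalize the joint pmf tensor p down to the axes listed in `keep`."""
    drop = tuple(a for a in range(p.ndim) if a not in keep)
    return p.sum(axis=drop)

def pairwise_vs_mutual(p):
    """For a joint pmf p[x, y, z], return (pairwise independent?, mutually independent?)."""
    px, py, pz = marginal(p, {0}), marginal(p, {1}), marginal(p, {2})
    pairwise = (np.allclose(marginal(p, {0, 1}), np.outer(px, py)) and
                np.allclose(marginal(p, {0, 2}), np.outer(px, pz)) and
                np.allclose(marginal(p, {1, 2}), np.outer(py, pz)))
    mutual = np.allclose(p, np.einsum('i,j,k->ijk', px, py, pz))
    return pairwise, mutual

# Sanity check with three independent fair bits; both answers are True.
p = np.full((2, 2, 2), 1 / 8)
print(pairwise_vs_mutual(p))   # (True, True)
</syntaxhighlight>

The demo pmf consists of three independent fair bits, so both checks pass; substituting a checkpoint candidate for <code>p</code> should make the two answers diverge.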
=== Conditional independence ===
Another useful notion of independence is ''conditional independence''. Let us start by writing out the definition. We say that two random variables <math>X</math> and <math>Y</math> are conditionally independent given a third random variable <math>Z</math> if we can write the joint distribution as

<math>p(X, Y, Z) = f(X, Z)\,g(Y, Z)</math>

for some functions <math>f</math> and <math>g</math>. This definition should make sense if you temporarily "cover" the variable <math>Z</math>. Note that the functions <math>f</math> and <math>g</math> do not necessarily stand for the joint distributions <math>p(X,Z)</math> and <math>p(Y,Z)</math>, respectively.
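As another illustration of ours (not from the article), the factorization test can be run numerically. The function below checks whether <math>X</math> and <math>Y</math> are conditionally independent given <math>Z</math> using the particular choice <math>f(X,Z)=p(X,Z)</math> and <math>g(Y,Z)=p(Y|Z)</math>, assuming every value of <math>Z</math> has positive probability.

<syntaxhighlight lang="python">
import numpy as np

def cond_independent(p):
    """Check X independent of Y given Z for a joint pmf p[x, y, z],
    assuming p(z) > 0 for all z. Equivalent to the factorization
    p(x, y, z) = f(x, z) g(y, z) with f(x, z) = p(x, z), g(y, z) = p(y | z)."""
    pxz = p.sum(axis=1)         # p(x, z)
    pyz = p.sum(axis=0)         # p(y, z)
    pz = p.sum(axis=(0, 1))     # p(z)
    factored = pxz[:, None, :] * pyz[None, :, :] / pz[None, None, :]
    return np.allclose(p, factored)

# Build a joint where X and Y are drawn independently given Z.
pz = np.array([0.5, 0.5])
px_given_z = np.array([[0.9, 0.2],
                       [0.1, 0.8]])   # px_given_z[x, z] = p(x | z)
py_given_z = np.array([[0.7, 0.4],
                       [0.3, 0.6]])   # py_given_z[y, z] = p(y | z)
p = np.einsum('k,ik,jk->ijk', pz, px_given_z, py_given_z)
print(cond_independent(p))   # True
</syntaxhighlight>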
== Markov chains ==
Markov chains are sequences of random variables that take conditional independence a step further. We start with the formal definition and then illustrate why it makes sense to call them chains.
=== Definition ===
A sequence of random variables <math>X_1, X_2, ..., X_n</math> is a Markov chain, denoted by <math>X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n</math>, if it satisfies the following (equivalent) properties:

* For any <math>k</math>, we have <math>p(X_k | X_1, X_2, ..., X_{k-1}) = p(X_k | X_{k-1})</math>.
* The joint distribution factorizes as follows: <math>p(X_1, X_2, ..., X_n) = p(X_1)\,p(X_2|X_1)\,p(X_3|X_2) \cdots p(X_n|X_{n-1})</math>.
The first property is known as the Markov property. Informally, we can think of the game ''Snakes and Ladders'' with <math>X_k</math> representing our next move. If we had done <math>(k-1)</math> moves before the <math>k</math>th one, then the first <math>k-2</math> moves are irrelevant once we know the most recent move <math>X_{k-1}</math>. We can illustrate this effect by considering a chain linking <math>X_1, ..., X_n</math> together. Once we know a random variable <math>X_k</math> somewhere in the middle, the chain "breaks" and any pair of random variables from opposite sides of the breakage becomes conditionally independent. For example, in the figure below, <math>X_6</math> is conditionally independent of <math>X_1, ..., X_4</math> given <math>X_5</math>.

[[File:Markov-chain.svg]]
The second property should remind us of the factorization property: it is (1) simple enough that we are not bogged down by the full expansion based solely on Bayes' theorem, and (2) much less restrictive than requiring all random variables to be mutually independent. Upon noting that <math>p(X_1,X_2) = p(X_1)\,p(X_2|X_1)</math>, the second property provides us with the following procedure for getting the joint distribution (see the sketch after this list):
* Obtain the probability <math>p(X_1)</math>.
* For <math>k>1</math>, multiply by the conditional probability <math>p(X_k | X_{k-1})</math>, remembering the previous rv (<math>X_{k-1}</math>) and forgetting all earlier rvs (<math>X_1, X_2, ..., X_{k-2}</math>).
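Here is a short sketch of this procedure (our illustration, assuming NumPy; <code>markov_joint</code> is a hypothetical helper, not a standard API). It builds the joint pmf step by step and then verifies the Markov property numerically.

<syntaxhighlight lang="python">
import numpy as np

def markov_joint(p1, transitions):
    """Joint pmf of X1 -> X2 -> ... -> Xn from an initial pmf p1 and a list
    of transition matrices T with T[i, j] = p(X_{k+1} = j | X_k = i)."""
    joint = p1
    for T in transitions:
        # Step 2 of the procedure: multiply by p(x_{k+1} | x_k); broadcasting
        # pairs the new factor with the most recent axis and ignores the rest.
        joint = joint[..., None] * T
    return joint

# A two-state chain observed for three steps.
p1 = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3],
              [0.2, 0.8]])
p = markov_joint(p1, [T, T])   # p[x1, x2, x3], shape (2, 2, 2)
print(p.sum())                 # 1.0

# Markov property: p(x3 | x1, x2) must not depend on x1.
p_x3_given_past = p / p.sum(axis=2, keepdims=True)
print(np.allclose(p_x3_given_past[0], p_x3_given_past[1]))   # True
</syntaxhighlight>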
Just like in the case of conditionally independent rvs, we can characterize a Markov chain by the following more general factorization:

<math>p(X_1, X_2, ..., X_n) = f_1(X_1,X_2)\,f_2(X_2,X_3) \cdots f_{n-1}(X_{n-1},X_n),</math>

for some functions <math>f_1, f_2, ..., f_{n-1}</math>. Again, these functions do not necessarily correspond to joint distributions between adjacent rvs <math>X_k</math> and <math>X_{k+1}</math>.
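Continuing the sketch above, the chain-rule factors themselves give one valid, though not unique, choice of the functions <math>f_k</math>:

<syntaxhighlight lang="python">
# One valid choice of f_1, f_2 for the chain built above:
f1 = p1[:, None] * T   # f1(x1, x2) = p(x1) p(x2 | x1)
f2 = T                 # f2(x2, x3) = p(x3 | x2)
print(np.allclose(p, f1[:, :, None] * f2[None, :, :]))   # True
</syntaxhighlight>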
=== Some properties ===
Markov chains are prevalent in situations where we pass messages drawn from an information source into successive stages of processing. A main result for Markov chains is that data processing, in general, cannot increase the amount of information. In the next discussion, we formalize this notion using information measures and identify cases where data processing can be performed without loss of information. As such, we need to be familiar with how we can manipulate Markov chains. Below are some properties that are useful in studying Markov chains.
Let <math>X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n</math> be a Markov chain. Then the following are also Markov chains:

* <math>X_n \rightarrow X_{n-1} \rightarrow \cdots \rightarrow X_1</math>.
* <math>(X_1, X_2, ..., X_{k-1}) \rightarrow X_k \rightarrow X_{k+1} \rightarrow \cdots \rightarrow X_n</math>.
* <math>X_{i_1} \rightarrow X_{i_2} \rightarrow \cdots \rightarrow X_{i_m}</math>, where <math>1 \le i_1 < i_2 < \cdots < i_m \le n</math>.
Let us provide intuitive, English descriptions of the above properties. The first property tells us that Markov chains are bidirectional. The second is a telescoping property, which allows us to "collapse" a Markov chain by grouping the first few entries. The third property tells us that any order-preserving subsequence of a Markov chain is again a Markov chain.
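As a quick numerical check of the first (reversal) property, continuing the same sketch as before: for the chain <math>X_1 \rightarrow X_2 \rightarrow X_3</math> built earlier, the backward conditional <math>p(X_1 | X_2, X_3)</math> should not depend on <math>X_3</math>.

<syntaxhighlight lang="python">
# Reversal: p(x1 | x2, x3) must not depend on x3 if X3 -> X2 -> X1 is Markov.
p_x1_given_future = p / p.sum(axis=0, keepdims=True)
print(np.allclose(p_x1_given_future[:, :, 0], p_x1_given_future[:, :, 1]))   # True
</syntaxhighlight>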