Difference between revisions of "Shannon's Communication Theory"

Revision as of 13:40, 17 September 2020

A First Look at Shannon's Communication Theory

Figure 1: A general communication system^[1].

In his landmark 1948 paper^[1], Claude Shannon developed a general model for communication systems, as well as a framework for analyzing these systems. The model has three main components: (1) the sender or source, (2) the channel, and (3) the receiver or sink. It also includes a transmitter that encodes the message into a signal, a receiver that decodes the signal back into a message, and the noise of the channel, as shown in Fig. 1.

In Shannon's discrete model, the source provides a stream of symbols from a finite alphabet, $A=\{a_{1},a_{2},\ldots ,a_{n}\}$, which are then encoded. The code is sent through the channel, where it may be corrupted by noise. When the code reaches the other end, it is decoded by the receiver, and the sink then extracts information from the stream of symbols.

Note that sending information in space, e.g. from here to there, is equivalent to sending information in time, e.g. from now to then. Thus, Shannon's theory applies to both information transmission and information storage.

Shannon's Noiseless Coding Theorem

One very important question to ask is: "How efficiently can we encode the information that we want to send through the channel?" To answer this, let us assume: (1) the channel is noiseless, and (2) the receiver can accurately decode the symbols transmitted through the channel. First, let us define a few things.

A code is defined as a mapping from a source alphabet to a code alphabet. The process of transforming a source message into a coded message is coding or encoding. The encoded message may be referred to as an encoding of the source message. The algorithm which constructs the mapping and uses it to transform the source message is called the encoder. The decoder performs the inverse operation, restoring the coded message to its original form.

A code is distinct if each codeword is distinguishable from every other, i.e. the mapping from source messages to codewords is one-to-one. A distinct code is uniquely decodable (UD) if every codeword is identifiable when immersed in a sequence of codewords. A uniquely decodable code is a prefix code (or prefix-free code) if it has the prefix property, which requires that no codeword is a proper prefix of any other codeword. Prefix codes are instantaneously decodable, i.e. they have the desirable property that the coded message can be parsed into codewords without waiting for the end of the message.
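The definitions above can be illustrated with a minimal Python sketch. The codebook and the helper names `is_prefix_code` and `decode_prefix` are hypothetical and chosen only for illustration. The decoder emits a source symbol as soon as a complete codeword has been read, which is what "instantaneously decodable" means in practice.

```python
def is_prefix_code(codewords):
    """Return True if no codeword is a proper prefix of another codeword."""
    words = list(codewords)
    return not any(a != b and b.startswith(a) for a in words for b in words)


def decode_prefix(encoded, codebook):
    """Decode left to right, emitting a source symbol as soon as a codeword ends."""
    inverse = {word: symbol for symbol, word in codebook.items()}
    decoded, buffer = [], ""
    for channel_symbol in encoded:
        buffer += channel_symbol
        if buffer in inverse:  # prefix property: the first match is the only match
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("input ends in the middle of a codeword")
    return decoded


# Hypothetical binary prefix code for a four-symbol source alphabet.
codebook = {"a": "0", "b": "10", "c": "110", "d": "111"}
message = "abad"
encoded = "".join(codebook[s] for s in message)

print(is_prefix_code(codebook.values()))
print(encoded)
print(decode_prefix(encoded, codebook))

# This code is uniquely decodable (its reversed codewords form a prefix code),
# but it is not a prefix code, since "0" is a prefix of "01".
print(is_prefix_code(["0", "01", "11"]))
```

The second example shows that unique decodability does not imply the prefix property: the decoder may have to look ahead before it can commit to a codeword.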

The Kraft-McMillan Inequality

Let $C=\{c_{1},c_{2},\ldots ,c_{r}\}$ be the alphabet of the channel, or in other words, the set of symbols that can be sent through the channel. Thus, encoding the source alphabet $A$ can be expressed as a function, $f:A\rightarrow C^{*}$ , where $C^{*}$ is the set of all possible finite strings of symbols (or codewords) from $C$ .

Let $\ell _{i}=\left|f\left(a_{i}\right)\right|$ where $i=1,2,\ldots ,n$ , i.e. the length of the string of channel symbols encoding the symbol $a_{i}\in A$ .

A uniquely decodable code with codeword lengths $\ell _{1},\ell _{2},\ldots ,\ell _{n}$ exists if and only if:

K=\sum _{i=1}^{n}{\frac {1}{r^{\ell _{i}}}}\leq 1

(1)
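As a quick numerical check of Eq. 1, the following sketch computes $K$ exactly for a few sets of codeword lengths. The function names and the example lengths are assumptions made for illustration; exact fractions are used so that the comparison with 1 is not affected by rounding.

```python
from fractions import Fraction


def kraft_sum(lengths, r=2):
    """Return K, the sum of r**(-l_i), as an exact fraction."""
    return sum(Fraction(1, r ** length) for length in lengths)


def satisfies_kraft(lengths, r=2):
    """True if the lengths satisfy the Kraft-McMillan inequality K <= 1."""
    return kraft_sum(lengths, r) <= 1


# Lengths of the binary code 0, 10, 110, 111: K is exactly 1.
print(kraft_sum([1, 2, 3, 3]), satisfies_kraft([1, 2, 3, 3]))

# Two codewords of length 1 plus one of length 2 over a binary alphabet: K = 5/4.
print(kraft_sum([1, 1, 2]), satisfies_kraft([1, 1, 2]))

# The same kind of check for a ternary channel alphabet (r = 3).
print(kraft_sum([1, 2, 2], r=3), satisfies_kraft([1, 2, 2], r=3))
```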

The proof of the Kraft-McMillan Inequality is interesting since it starts with evaluating $K^{m}$ :

K^{m}=\left(\sum _{i=1}^{n}{\frac {1}{r^{\ell _{i}}}}\right)^{m}=\sum _{i_{1}=1}^{n}\sum _{i_{2}=1}^{n}\cdots \sum _{i_{m}=1}^{n}{\frac {1}{r^{\ell _{i_{1}}+\ell _{i_{2}}+\ldots +\ell _{i_{m}}}}}

(2)

Let $\ell =\max \left(\ell _{1},\ell _{2},\ldots ,\ell _{n}\right)$. Thus, the minimum value of $\ell _{i_{1}}+\ell _{i_{2}}+\ldots +\ell _{i_{m}}$ is $m$, when all the codewords are one symbol long, and the maximum is $m\ell$, when all the codewords have the maximum length. We can then write:

K^{m}=\sum _{k=m}^{m\ell }{\frac {N_{k}}{r^{k}}}

(3)

where $N_{k}$ is the number of sequences of $m$ codewords that have a combined length of $k$. Note that the number of distinct strings of channel symbols of length $k$ is $r^{k}$. If this code is uniquely decodable, then each such string can represent at most one sequence of codewords. Therefore, the number of sequences of codewords whose combined length is $k$ cannot be greater than $r^{k}$, or:

N_{k}\leq r^{k}

(4)

We can then write:

K^{m}\leq \sum _{k=m}^{m\ell }{\frac {r^{k}}{r^{k}}}=m\ell -m+1

(5)

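Thus, we can conclude that $K\leq 1$: if $K>1$, then $K^{m}$ would grow exponentially with $m$ and eventually exceed $m\ell -m+1$, which grows only linearly in $m$. This shows that every uniquely decodable code must satisfy the inequality.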
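The counting argument in Eqs. 2 to 5 can be reproduced numerically. The sketch below is only an illustration under assumed example lengths: it counts $N_{k}$ by repeatedly extending partial sums of codeword lengths, evaluates $K^{m}$ through Eq. 3, and checks whether any $N_{k}$ exceeds $r^{k}$. The function names are hypothetical.

```python
from fractions import Fraction


def combined_length_counts(lengths, m):
    """Return {k: N_k}, the number of ordered m-tuples of codewords with total length k."""
    counts = {0: 1}
    for _ in range(m):
        extended = {}
        for total, ways in counts.items():
            for length in lengths:
                extended[total + length] = extended.get(total + length, 0) + ways
        counts = extended
    return counts


def kraft_power(lengths, m, r=2):
    """Evaluate K**m through Eq. 3, i.e. as the sum of N_k / r**k."""
    counts = combined_length_counts(lengths, m)
    return sum(Fraction(n_k, r ** k) for k, n_k in counts.items())


def violates_counting_bound(lengths, m, r=2):
    """True if some N_k exceeds r**k, which rules out unique decodability (Eq. 4)."""
    counts = combined_length_counts(lengths, m)
    return any(n_k > r ** k for k, n_k in counts.items())


good, bad, m = [1, 2, 3, 3], [1, 1, 2], 12

# Lengths with K = 1: K**m stays at 1 and no N_k exceeds 2**k.
print(kraft_power(good, m), violates_counting_bound(good, m))

# Lengths with K = 5/4: K**m outgrows the linear bound m*l - m + 1 from Eq. 5.
linear_bound = m * max(bad) - m + 1
print(float(kraft_power(bad, m)), linear_bound, violates_counting_bound(bad, m))
```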

Entropy and Coding

Let $Q_{i}$ be equal to:

Q_{i}={\frac {r^{-\ell _{i}}}{K}}

(6)

We call the set of numbers $Q_{i}$ pseudo-probabilities since $0<Q_{i}\leq 1$ for all $i$ , and

\sum _{i=1}^{n}Q_{i}=1

(7)

If $p_{i}$ is the probability of observing $a_{i}$ in the data stream, then we can apply the Gibbs Inequality to get:

\sum _{i=1}^{n}p_{i}\log _{2}\left({\frac {Q_{i}}{p_{i}}}\right)\leq 0

(8)

Rewriting, we get:

\sum _{i=1}^{n}p_{i}\log _{2}\left({\frac {1}{p_{i}}}\right)\leq \sum _{i=1}^{n}p_{i}\log _{2}\left({\frac {1}{Q_{i}}}\right)=\sum _{i=1}^{n}p_{i}\log _{2}\left({\frac {K}{r^{-\ell _{i}}}}\right)

(9)

Note that the left-hand term is the entropy of the source, $H\left(S\right)$. Since $K\leq 1$ implies $\log _{2}\left(K\right)\leq 0$, we get:

H\left(S\right)\leq \sum _{i=1}^{n}p_{i}\left(\log _{2}\left(K\right)-\log _{2}\left(r^{-\ell _{i}}\right)\right)=\log _{2}\left(K\right)+\sum _{i=1}^{n}p_{i}\ell _{i}\log _{2}\left(r\right)\leq \log _{2}\left(r\right)\sum _{i=1}^{n}p_{i}\ell _{i}

(10)

Defining the average length of codewords as $L=\sum _{i=1}^{n}p_{i}\ell _{i}$ and rewriting, we get:

H\left(S\right)\leq L\log _{2}\left(r\right)

(11)

In other words, the entropy of the source gives us a lower bound on the average code length for any uniquely decodable symbol-by-symbol encoding of the source message. For binary encoding, where $r=2$ , we arrive at:

H\left(S\right)\leq L

(12)
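The bound in Eqs. 11 and 12 can be checked on a small example. The probabilities and codeword lengths below are assumptions chosen for illustration, as are the function names. With probabilities that are powers of 1/2 and lengths $\ell _{i}=\log _{2}\left(1/p_{i}\right)$, the bound is met with equality, while a fixed-length code for the same source lies above it.

```python
import math


def entropy(probs):
    """Source entropy H(S) in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


def average_length(probs, lengths):
    """Average codeword length L, the sum of p_i * l_i."""
    return sum(p * length for p, length in zip(probs, lengths))


def within_entropy_bound(probs, lengths, r=2):
    """Check Eq. 11, H(S) <= L * log2(r), with a small floating-point tolerance."""
    return entropy(probs) <= average_length(probs, lengths) * math.log2(r) + 1e-12


# Hypothetical source with four symbols.
probs = [0.5, 0.25, 0.125, 0.125]

matched = [1, 2, 3, 3]   # e.g. the binary prefix code 0, 10, 110, 111
fixed = [2, 2, 2, 2]     # plain two-bit encoding of each symbol

print(entropy(probs))
print(average_length(probs, matched), within_entropy_bound(probs, matched))
print(average_length(probs, fixed), within_entropy_bound(probs, fixed))
```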

Shannon went beyond this and showed that this bound holds even if we group symbols together into "words" before doing our encoding. The generalized form of this inequality is called Shannon's Noiseless Coding Theorem.

Shannon's Theorem

In general, the channel itself can add noise, which introduces an additional layer of uncertainty to our transmissions. Consider a channel with input symbols $A=\{a_{1},a_{2},\ldots ,a_{n}\}$, and output symbols $B=\{b_{1},b_{2},\ldots ,b_{m}\}$. Note that the input and output alphabets do not need to have the same number of symbols. Given the noise in the channel, if we observe the output symbol $b_{j}$, we are not sure which $a_{i}$ was the input symbol. We can then characterize the channel as a set of probabilities $\{P\left(a_{i}\mid b_{j}\right)\}$. Let us consider the information we get from observing a symbol $b_{j}$.

Mutual Information

Given a probability model of the source, we have an a priori estimate $P\left(a_{i}\right)$ that symbol $a_{i}$ will be sent next. Upon observing $b_{j}$ , we can revise our estimate to $P\left(a_{i}\mid b_{j}\right)$ . The change in information, or mutual information, is given by:

I\left(a_{i};b_{j}\right)=\log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)-\log _{2}\left({\frac {1}{P\left(a_{i}\mid b_{j}\right)}}\right)=\log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)

(13)

Let's look at a few properties of mutual information. Expressing the equation above in terms of the self-information $I\left(a_{i}\right)=\log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)$:

I\left(a_{i};b_{j}\right)=I\left(a_{i}\right)+\log _{2}\left(P\left(a_{i}\mid b_{j}\right)\right)

(14)

Since $P\left(a_{i}\mid b_{j}\right)\leq 1$, the second term is never positive. Thus, we can say:

I\left(a_{i};b_{j}\right)\leq I\left(a_{i}\right)

(15)

This is expected since, after observing $b_{j}$, the amount of uncertainty is reduced, i.e. we know a bit more about $a_{i}$. The largest change in information occurs when $a_{i}$ and $b_{j}$ are perfectly correlated, so that $P\left(a_{i}\mid b_{j}\right)=1$ and $I\left(a_{i};b_{j}\right)=I\left(a_{i}\right)$. From Bayes' Theorem, we have the property:

I\left(a_{i};b_{j}\right)=\log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)=\log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)=I\left(b_{j};a_{i}\right)

(16)

Note that if $a_{i}$ and $b_{j}$ are independent, where $P\left(a_{i}\mid b_{j}\right)=P\left(a_{i}\right)$ and $P\left(b_{j}\mid a_{i}\right)=P\left(b_{j}\right)$ , then:

I\left(a_{i};b_{j}\right)=\log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)=\log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)=\log _{2}\left(1\right)=0

(17)
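The properties in Eqs. 13 to 16 can be checked on a small joint distribution. The table below is a hypothetical example (a binary input with $P\left(a_{1}\right)=0.8$ sent through a channel that flips 10% of the symbols), and the function names are illustrative. Python indices start at 0, so index 0 corresponds to $a_{1}$ or $b_{1}$.

```python
import math

# Hypothetical joint distribution P(a_i, b_j): rows index inputs, columns index outputs.
joint = [[0.72, 0.08],
         [0.02, 0.18]]

p_a = [sum(row) for row in joint]          # marginals P(a_i)
p_b = [sum(col) for col in zip(*joint)]    # marginals P(b_j)


def pointwise_mi(i, j):
    """I(a_i; b_j) = log2(P(a_i | b_j) / P(a_i)), as in Eq. 13."""
    p_a_given_b = joint[i][j] / p_b[j]
    return math.log2(p_a_given_b / p_a[i])


def pointwise_mi_reversed(i, j):
    """I(b_j; a_i) = log2(P(b_j | a_i) / P(b_j)), the right-hand side of Eq. 16."""
    p_b_given_a = joint[i][j] / p_a[i]
    return math.log2(p_b_given_a / p_b[j])


def self_information(i):
    """I(a_i) = log2(1 / P(a_i))."""
    return math.log2(1 / p_a[i])


for i in range(2):
    for j in range(2):
        print(i, j, pointwise_mi(i, j), pointwise_mi_reversed(i, j), self_information(i))
```

In this example the value for a single pair can be negative, for instance when the observed output makes the input less likely than it was a priori, even though the bound of Eq. 15 holds for every pair.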

We can get the average mutual information over all the input symbols as:

I\left(A;b_{j}\right)=\sum _{i=1}^{n}P\left(a_{i}\mid b_{j}\right)\cdot I\left(a_{i};b_{j}\right)=\sum _{i=1}^{n}P\left(a_{i}\mid b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i}\mid b_{j}\right)}{P\left(a_{i}\right)}}\right)

(18)

Similarly, for all the output symbols:

I\left(a_{i};B\right)=\sum _{j=1}^{m}P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)

(19)

For both input and output symbols, we get:

{\begin{aligned}I\left(A;B\right)&=\sum _{i=1}^{n}P\left(a_{i}\right)\cdot I\left(a_{i};B\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i}\right)P\left(b_{j}\mid a_{i}\right)\cdot \log _{2}\left({\frac {P\left(b_{j}\mid a_{i}\right)}{P\left(b_{j}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i},b_{j}\right)}{P\left(a_{i}\right)\cdot P\left(b_{j}\right)}}\right)\\&=I\left(B;A\right)\end{aligned}}

(20)
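Eq. 20 translates directly into a short computation over a joint probability table. The sketch below assumes a binary symmetric channel as the example; the helper `binary_symmetric_joint` and its parameters are hypothetical names introduced here, not part of the text above.

```python
import math


def mutual_information(joint):
    """Average mutual information I(A;B) in bits from a joint table P(a_i, b_j)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (p_a[i] * p_b[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0  # cells with P(a_i, b_j) = 0 contribute nothing to the sum
    )


def binary_symmetric_joint(p_input, flip):
    """Joint table for a binary input with P(a_1) = p_input and flip probability `flip`."""
    return [
        [p_input * (1 - flip), p_input * flip],
        [(1 - p_input) * flip, (1 - p_input) * (1 - flip)],
    ]


noisy = binary_symmetric_joint(0.5, 0.1)
transposed = [list(col) for col in zip(*noisy)]

print(mutual_information(noisy))                              # noisy channel
print(mutual_information(transposed))                         # I(B;A), same value
print(mutual_information(binary_symmetric_joint(0.5, 0.5)))   # output independent of input
print(mutual_information(binary_symmetric_joint(0.5, 0.0)))   # noiseless channel
```

With a flip probability of 0.5 the output carries no information about the input, and with a flip probability of 0 the mutual information equals the entropy of the input.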

Non-Negativity of Mutual Information

To show the non-negativity of mutual information, let us use Jensen's Inequality, which states that for a convex function, $f\left(x\right)$, with $\langle \cdot \rangle$ denoting the expected value:

\langle f\left(x\right)\rangle \geq f\left(\langle x\rangle \right)

(21)

Using the fact that $f\left(x\right)=-\log _{2}\left(x\right)$ is convex, and applying this to our expression for mutual information, we get:

{\begin{aligned}I\left(A;B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i},b_{j}\right)}{P\left(a_{i}\right)\cdot P\left(b_{j}\right)}}\right)\\&=-\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot \log _{2}\left({\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)\\&=\left\langle -\log _{2}\left({\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)\right\rangle \\&\geq -\log _{2}\left(\left\langle {\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right\rangle \right)=-\log _{2}\left(\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\cdot {\frac {P\left(a_{i}\right)\cdot P\left(b_{j}\right)}{P\left(a_{i},b_{j}\right)}}\right)\\&=-\log _{2}\left(\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i}\right)\cdot P\left(b_{j}\right)\right)=-\log _{2}\left(\sum _{i=1}^{n}P\left(a_{i}\right)\sum _{j=1}^{m}P\left(b_{j}\right)\right)=-\log _{2}\left(1\right)\\&=0\\\end{aligned}}

(22)

Note that $I\left(A;B\right)=0$ when $A$ and $B$ are independent.
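The Jensen step of Eq. 22 can also be evaluated numerically. The sketch below computes the gap $\langle f\left(x\right)\rangle -f\left(\langle x\rangle \right)$ for $f\left(x\right)=-\log _{2}\left(x\right)$ and applies it to the ratio that appears in the derivation. It assumes that every entry of the joint table is strictly positive; the two tables and the function names are hypothetical examples.

```python
import math


def jensen_gap(values, weights):
    """Return <f(x)> - f(<x>) for the convex function f(x) = -log2(x)."""
    mean_of_f = sum(w * -math.log2(x) for x, w in zip(values, weights))
    f_of_mean = -math.log2(sum(w * x for x, w in zip(values, weights)))
    return mean_of_f - f_of_mean


def mi_jensen_gap(joint):
    """Gap for x_ij = P(a_i) P(b_j) / P(a_i, b_j), weighted by P(a_i, b_j).

    Assumes all joint probabilities are strictly positive.
    """
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    cells = [(i, j, p) for i, row in enumerate(joint) for j, p in enumerate(row)]
    values = [p_a[i] * p_b[j] / p for i, j, p in cells]
    weights = [p for _, _, p in cells]
    return jensen_gap(values, weights)


# A and B dependent: the gap is strictly positive.
dependent = [[0.72, 0.08],
             [0.02, 0.18]]

# Same marginals, but P(a_i, b_j) = P(a_i) P(b_j): the gap vanishes up to rounding.
independent = [[0.592, 0.208],
               [0.148, 0.052]]

print(mi_jensen_gap(dependent))
print(mi_jensen_gap(independent))
```

Because the expected value of the ratio is 1 when all entries are positive, the second term of the gap is zero and the gap coincides with $I\left(A;B\right)$.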

Conditional and Joint Entropy

Given $A$ and $B$ , and their entropies:

H\left(A\right)=\sum _{i=1}^{n}P\left(a_{i}\right)\log _{2}\left({\frac {1}{P\left(a_{i}\right)}}\right)

(23)

H\left(B\right)=\sum _{j=1}^{m}P\left(b_{j}\right)\log _{2}\left({\frac {1}{P\left(b_{j}\right)}}\right)

(24)

The conditional entropy is a measure of the average uncertainty about $B$ when $A$ is known, and we can define it as:

{\begin{aligned}H\left(B\mid A\right)&=\sum _{i=1}^{n}P\left(a_{i}\right)\sum _{j=1}^{m}P\left(b_{j}\mid a_{i}\right)\log _{2}\left({\frac {1}{P\left(b_{j}\mid a_{i}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i}\right)P\left(b_{j}\mid a_{i}\right)\log _{2}\left({\frac {1}{P\left(b_{j}\mid a_{i}\right)}}\right)\\&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(b_{j},a_{i}\right)\log _{2}\left({\frac {P\left(a_{i}\right)}{P\left(b_{j},a_{i}\right)}}\right)\\\end{aligned}}

(25)

And similarly, noting that in general $H\left(A\mid B\right)\neq H\left(B\mid A\right)$:

{\begin{aligned}H\left(A\mid B\right)&=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(b_{j},a_{i}\right)\log _{2}\left({\frac {P\left(b_{j}\right)}{P\left(b_{j},a_{i}\right)}}\right)\\\end{aligned}}

(26)

We can extend the definition of entropy to two (or more) random variables. This is known as the joint entropy, and is defined as:

H\left(A,B\right)=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\log _{2}\left({\frac {1}{P\left(a_{i},b_{j}\right)}}\right)

(27)

Expanding the expression for joint entropy, we get:

H\left(A,B\right)=\sum _{i=1}^{n}\sum _{j=1}^{m}P\left(a_{i},b_{j}\right)\log _{2}\left({\frac {1}{P\left(a_{i}\mid b_{j}\right)P\left(b_{j}\right)}}\right)

(28)
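Splitting the logarithm in Eq. 28 into a term in $P\left(a_{i}\mid b_{j}\right)$ and a term in $P\left(b_{j}\right)$ leads to the standard chain rule $H\left(A,B\right)=H\left(A\mid B\right)+H\left(B\right)$. That step is not written out above, so the sketch below should be read as a numerical check of the identity on a hypothetical joint table rather than as part of the original derivation. The function name and dictionary keys are illustrative.

```python
import math


def entropy_terms(joint):
    """Return H(A), H(B), H(A,B), H(A|B) and H(B|A) in bits for a joint table."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    cells = [(i, j, p) for i, row in enumerate(joint)
             for j, p in enumerate(row) if p > 0]
    return {
        "H(A)": sum(p * math.log2(1 / p) for p in p_a if p > 0),
        "H(B)": sum(p * math.log2(1 / p) for p in p_b if p > 0),
        "H(A,B)": sum(p * math.log2(1 / p) for _, _, p in cells),        # Eq. 27
        "H(A|B)": sum(p * math.log2(p_b[j] / p) for _, j, p in cells),   # Eq. 26
        "H(B|A)": sum(p * math.log2(p_a[i] / p) for i, _, p in cells),   # Eq. 25
    }


# Hypothetical joint distribution P(a_i, b_j).
joint = [[0.72, 0.08],
         [0.02, 0.18]]

h = entropy_terms(joint)
for name, value in h.items():
    print(name, value)

print(math.isclose(h["H(A,B)"], h["H(A|B)"] + h["H(B)"]))
print(math.isclose(h["H(A,B)"], h["H(B|A)"] + h["H(A)"]))
```

For this table the two conditional entropies differ, which illustrates that conditioning is not symmetric in general.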

Channel Capacity

Shannon's Theory for Analog Channels

Kullback-Leibler Information Measure

Sources

Tom Carter's notes on Information Theory
Dan Hirschberg's notes on Data Compression

References

[1] C. E. Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948.

