These are my lecture notes on Prof. Raymond W. Yeung's excellent course on information theory. The course is based on Prog. Yeung's book "Information Theory and Network Coding" and these notes are organized around that.

Information theory is a major research field in communication theory and applied probability. This is not meant as a comprehensive guide but rather a loose collection of definitions and short summaries.

1. The Science of Information

Two key concepts:

Shannon's Seminal Paper (1948)

2. Information Measures

2.1 Independence and Markov Chain

Random variables \(X\) and \(Y\) are called independent, denoted by \(X\perp Y\) if: $$ \forall (x,y) \in \mathcal{X\times Y}\colon\, p(x, y)=p(x)p(y) $$
For \(n\ge3\), random variables \(X_1, X_2,...X_n\) are mutually independent if: $$ \forall (x_1, x_2,...,x_n)\colon\, p(x_1, x_2,...,x_n)=p(x_1)p(x_2)\cdots p(x_n) $$
For \(n\ge3\), random variables \(X_1, X_2,...X_n\) are pairwise independent if all \(X_i\) and \(X_j\) are independent.
Mutual independence \(\implies\) pairwise independence

For random variables \(X\), \(Y\) and \(Z\), \(X\) is independent of \(Z\) conditioning on \(Y\), denoted by \((X\perp Z)\mid Y\) (conditional independence) $$ p(x, y, z) = \Bigg\{ \begin{array}{lr} \frac{p(x,y)p(y,z)}{p(y)}\,(\ast), & \text{if } p(y)>0\\ 0, & \text{otherwise} \end{array} $$
if \(p(y)>0\) then $$ p(x,y,z)=(\ast)\,\frac{p(x,y)p(y,z)}{p(y)}=p(x,y)p(z|y) $$
"In other words, \(A\) and \(B\) are conditionally independent given \(C\) if and only if, given knowledge that \(C\) occurs, knowledge of whether \(A\) occurs provides no information on the likelihood of \(B\) occurring, and knowledge of whether \(B\) occurs provides no information on the likelihood of \(A\) occurring." [1] In the picture on the right \(R\) and \(B\) are conditionally independent given \(Y\) but not given \(\overline Y\)

For random variables \(X\), \(Y\) and \(Z\) $$ \forall (x, y, z),\, p(y)>0\colon\, [(X\perp Z)\mid Y \iff p(x,y,z) = a(x,y)b(y,z)] $$ Meaning \(p(x,y,z)\) can be factorized as the given term, where \(a\) is a function that depends only on \(x, y\) and \(b\) is a function that depends only on \(y, z\).

For random variables \(X_1, X_2,\ldots,X_n\) where \(n\ge 3\), \(X_1→X_2→\cdots→X_n\) forms a Markov chain if $$ p(x_1, x_2,\ldots,x_n) = \\\Bigg\{\begin{array}{lr} p(x_1, x_2)p(x_3|x_2)\cdots p(x_n|x_{n-1}) & \text{if } p(x_2), p(x_3),\ldots,p(x_{n-1})>0\\ 0, & \text{otherwise} \end{array} $$
or equivalently when \(p(x_1, \ldots, x_n)p(x_2)\cdots p(x_{n-1}) = p(x_1, x_2)p(x_2, x_3)\cdots p(x_{n-1}, x_n)\).
Note that this definition allows for non-stationary transition probabilities, i.e. the probability distribution can still change over time, see stationarity below.

Markov subchains: Let \(\mathcal{N}_n=\{1,2,\ldots n\}\) and let \(X_1→X_2→\cdots→X_n\) form a Markov chain. For any subset \(\alpha\) of \(\mathcal{N}_n\) denote \((X_i, i \in \alpha)\) (a collection of random Variables) by \(X_\alpha\). Then for any disjoint subsets \(\alpha_1, \alpha_2, \ldots \alpha_m\) of \(\mathcal{N}_n\) such that $$ k_1 < k_2 < \cdots < k_m $$ for all \(k_j \in \alpha_j, \,j=1,2,\ldots, m\), $$ X_{\alpha_1}→X_{\alpha_2}→\cdots→X_{\alpha_n} $$ forms a Markov chain. Thats is, a subchain of \(X_1→X_2→\cdots→X_n\) is also a Markov chain.

2.2 Shannon's Information Measures

Shannon introduced these basic measures of information:

All of the above information measures can be expressed in terms of conditional mutual information.
The following equalities hold:

2.3 Continuity of Shannon's Information Measures

All of the information measures described above are continuous for fixed finite alphabets with respect to convergence in variational distance (\(\mathcal{L}_1\) distance): $$ V(p, q) = \sum_{x\in\mathcal{X}} |p(x) - q(x)| $$

2.4 Chain Rules

2.5 Informational Divergence

The informational divergence (Kullback-Leibler distance/relative entropy) between two probability distributions \(p\) and \(q\) on a common alphabet \(\mathcal{X}\) is defined as: $$ D(p\|q)=\sum_xp(x)\log\frac{p(x)}{q(x)} = E_p\log\frac{p(X)}{q(X)} \underset{\text{"Div. Ineq."}}{\ge} 0 $$ (where \(E_p\) denotes expectation with respect to \(p\), note that \(D\) is not symmetrical)

Fundamental Inequality: \(\ln a \le a-1\) for \(a > 0\)
Log-Sum Inequality: $$ \sum_i a_i \log \frac{a_i}{b_i} \ge \bigg(\sum_i a_i\bigg) \log \frac{\sum_i a_i}{\sum_i b_i} $$ Log-Sum Inequality \(\iff\) Divergence Inequality

Pinsker's Inequality: \(D(p\|q) \ge \frac{1}{2\ln 2} V^2(p, q)\)
In particular, convergence in informational divergence \(\implies\) convergence in variational distance

2.6 The Basic Inequalities

2.7 Some Useful Information Inequalities

2.8 Fano's Inequality

Theorem: \(H(X) \le \log |\mathcal{X}|\). Therefore if \(\mathcal{X}\) is finite then \(H(X)\) is finite.

A random variable with an infinite alphabet can have finite or infinite entropy, see \(X\) with \(\Pr\{X=i\}=2^{-i}\) (example 2.45) or example 2.46 in the textbook.

Fano's Inequality
Let \(X\) and \(\hat X\) be random variables on the alphabet \(\mathcal{X}\). Then $$ H(X|\hat X) \le h_b(P_e)+P_e \log (|\mathcal{X}|-1) $$ where \(P_e = \Pr\{X\neq\hat X\}\) and \(h_b\) is the binary entropy function.
Suppose \(\hat X\) is an estimate on \(X\). If the error probability \(P_e\) is small, then \(H(X|\hat X)\) should be small as well.

For a finite alphabet \(P_e → 0 \implies H(X|\hat X) → 0\)

Corollary: \(H(X|\hat X) \le 1+P_e \log (|\mathcal{X}|)\)

2.10 Entropy Rate of a Stationary Source

A discrete-time information source can be modeled as a discrete-time random process \(P=\{X_k, k \ge 1\}\). \(P\) is an infinite collection of random variables indexed by the set of positive integers. The index \(k\) is referred to as the "time" index. Random variables \(X_k\) are called letters. We assume that all \(H(X_k)\) are finite.

The total entropy of a finite subset of \(\{X_k\}\) (\(H(X_k, k\in A)\) where \(A \subset P\)) is finite because of the independence bound for entropy and our assumption that all \(H(X_k)\) are finite. Apart from special cases the joint entropy of an infinite collection of letters is usually infinite, therefore it is generally not meaningful to discuss the entropy of \(P\).

An information source \(\{X_k\}\) is called stationary if \(X_1, X_2, \ldots, X_m\) and \(X_{1+l}, X_{2+l}, \ldots, X_{m+l}\) have the same joint distribution for any \(m, l \ge 1\).

For a sequence \((a_n)\), the Cesáro mean is defined as the sequence \((b_n)\), where \(b_n = \frac{1}{n}\sum_{i=1}^n a_i\). If \(a_n → a\) then \(b_n → a\).

Source stationary \(\implies\) the limit \(H_X^\prime\) exists \(\implies\) the entropy rate \(H_X\) exists and is equal to \(H_X^\prime\). (proof using stationarity and Cesáro mean) Therefore for stationary sources, \(H_X^\prime\) is simply an alternative definition/interpretation of the entropy rate.

3. The I-Measure

Shannon's information measures for \(n \ge 2\) random variables have a set-theoretic structure.


3.1 Preliminaries

The field \(\mathcal{F}_n\) generated by sets \(\tilde X_1, \tilde X_2, \ldots, \tilde X_n\) is the collection of sets which can be obtained by any sequence of usual set operations on \(\tilde X_1, \tilde X_2, \ldots, \tilde X_n\).

The atoms of \(\mathcal{F}_n\) are sets of the form \(\bigcap_{i=1}^n Y_i\) where \(Y_i\) is either \(\tilde X_i\) or \(\tilde X_1^C\).

A real function \(\mu\) defined on \(\mathcal{F}_n\) is called a signed measure if it is set-additive: for disjoint \(A, B\) in \(\mathcal{F}_n\): \(\mu(A \cup B) = \mu(A) + \mu(B)\).

A signed measure on \(\mathcal{F}_n\) is completely specified by its values on the atoms of \(\mathcal{F}_n\) , \(\{\mu(A), A \in \mathcal{A} \}\), since the values on other sets in \(\mathcal{F}_n\) can be obtained via set-additivity.
(\(\ast\)) A signed measure is furthermore completely specified by it's values on all unions \(\{\mu(B), B \in \mathcal{B} \}\) where \(\mathcal{B} = \{\tilde X_G : G \subset \mathcal{N_n}\}\).

3.3 Construction of the I-Measure \(\mu^\ast\)

Definition I-measure: Define \(\mu^\ast\) by setting $$\mu^\ast(\tilde X_G) = H(X_G)$$ for all non-empty \(G\subset \mathcal{N_n}\). This completely defines the measure as by (\(\ast\)). \(\mu^\ast\) is the unique signed measure on \(\mathcal{F_n}\) which is consistent with all information measures.

To show that \(\mu^\ast\) is consistent with all information measures we only have to show that it is consistent with conditional mutual information, i.e. that \(\mu^\ast(\tilde X_G \cap \tilde X_{G'}-\tilde X_{G''}) = H(X_G; X_{G'}|X_{G''})\). This can be proved using Lemmas 3.7 and 3.8 in the textbook.

Using the I-measure we can employ set-theoretic tools to manipulate expressions of Shannon's information measures.

3.4 \(\mu^\ast\) can be Negative

\(\mu^\ast\) is positive for all non-empty atoms that correspond to Shannon's information measures. It can however be negative in other cases such as \(\mu^\ast(\tilde X_1 \cap \tilde X_2 \cap \tilde X_3) = I(X_1;X_2;X_3)\), which is not a Shannon's information measure.

For two or fewer random variables all three non-empty atoms correspond to Shannon's information measures so here \(\mu^\ast\) is always positive.

3.5 Information Diagrams

Due to the correspondence between information and set theory we can use Venn diagrams to visualize information measures. In \(n\) dimensions, any information diagram for \(n+1\) dimensions can be displayed perfectly.
Here is an information diagram for four random variables:
We can always omit atoms from an information diagram on which \(\mu^\ast\) takes the value \(0\). In particular, this occurs when certain Markov conditions are imposed on the random variables, see the following information diagram for a Markov chain:

Shannon's Perfect Secrecy Theorem
Given plaintext \(X\), ciphertext \(Y\) and key \(Z\).

These requirements imply that \(H(Z) \ge H(X)\), i.e. the length of the key is \(\ge\) the length of the plaintext.
Imperfect Secrecy Theorem
Given plaintext \(X\), ciphertext \(Y\) and key \(Z\). Decipherability: \(H(X|Y, Z) = 0\)
Generalization of Shannon's perfect secrecy theorem above: \(I(X;Y) \ge H(X) - H(Z)\)
Interpretation: \(I(X; Y)\) measures the leakage of information into the ciphertext
The data processing theorem can also be proved using information diagrams.

Proving things with information diagrams:

4. Zero-Error Data Compression

4.1 The Entropy Bound

A D-ary source code \(\mathcal{C}\) for a source random variable \(X\) is a function \(\mathcal{C}\colon \mathcal{X}→\mathcal{D^\ast}\), where \(\mathcal{D}^\ast\) is the set of all finite strings from a D-ary alphabet.

A code \(\mathcal{C}\) is uniquely decodable if for any string in \(\mathcal{D^\ast}\) the function can be inverted to get the unique source sequence that generated it.

Kraft Inequality Let \(\mathcal{C}\) be a D-ary source code and let \(l_1, l_2, \ldots, l_m\) be the lengths of the codewords. If \(\mathcal{C}\) is uniquely decodable then, $$ \sum_{k=1}^m D^{-l_k}\le 1 $$
Let \(X\) be a source random variable with \(X \sim \{p_1, p_2, \ldots, p_m\}\). Then the expected code length \(L\) (a measure of efficiency) of a source code \(\mathcal{C}\) is \(\sum_i p_il_i\).

For a D-ary uniquely decodable code \(\mathcal{C}\) for a source variable \(X\) we can furthermore establish the entropy bound \(H_D(X) \le L\), since each D-ary symbol can carry at most 1 D-it of information.

The redundancy \(R\) of a D-ary uniquely decodable code is \(L-H_D(X) \underset{H \text{-bound}}{\ge} 0\)

4.2 Prefix Codes

A code is prefix-free if no codeword is a prefix of another codeword. Such codes are called prefix codes. Prefix codes are uniquely decodable.

A tree representation of a prefix code is called a code tree.
Instantaneous decoding can be done by tracing the code tree from the root, starting at the beginning of the stream of coded symbols. In this way, the boundaries of the codewords can be discovered. A prefix code is said to be self-punctuating.

There exists a D-ary prefix code with codeword lengths \(l_1, l_2, \ldots, l_m \iff\) the Kraft inequality is satisfied.

A probability distribution is called D-adic (or dyadic when D=2) when \(p_i = D^{-t_i}\) for all \(i\), where \(t_i\) is an integer. There exists a D-ary prefix code which achieves the entropy bound for a distribution \(\{p_i\}\) if and only if \(\{p_i\}\) is D-adic.

A code for a source is called optimal if it has the shortest expected length of all codes for this source (an optimal code need not achieve the entropy bound). In an optimal code, words with high probability are associated with shorter codewords.
Huffman codes are a simple construction of optimal prefix codes. The expected length of a Huffman code is bounded: \(H_D(X) \underset{H \text{-bound}}{\le} L_\text{Huff} < H_D(X) + 1\).
When encoding a sequence \(X_1, \ldots X_n\) with a Huffman code \(nH(X) \le L_\text{Huff}^n < nH(X) + 1\) holds. So \(H(X) \le \frac{1}{n} (L_\text{Huff})^n < H(X) + \frac{1}{n}\). Therefore \(\frac{1}{n} (L_\text{Huff})^n\), which is called the rate of the code (in D-it per source symbol), goes to \(H(X)\) as \(n→\infty\).

4.3 Redundancy of Prefix Codes

Let \(q_k\) be the reaching probability of an internal node \(k\) in a code tree. Then \(q_k\) is equal to the sum of the probabilities of all the leaves descending from node \(k\).

Let \(\tilde p_{k,j}\) be the branching probability of the \(j\)-th branch of internal node \(k\). Then \(q_k = \sum_j \tilde p_{k,j}\).

The conditional branching distribution at node \(k\) is \(P = \Big \{ \frac{\tilde p_{k,0}}{q_k}, \frac{\tilde p_{k,1}}{q_k}\, \ldots, \frac{\tilde p_{k,D-1}}{q_k} \Big\}\)

The conditional entropy of node \(k\) is \(h_k = H_D(P) \le \log_d D = 1\)
\(h_k\) measures the amount of information that the next symbol to be revealed carries given that the symbols already revealed led to internal node \(k\).

\(H_D(X) = \sum_k q_k h_k\) and \(L=\sum_k q_k\) hold.

The local redundancy of an internal node in a code tree is \(r_k = q_k(1-h_k)\).

An internal node is called balanced if \(\tilde p_{k,j} = \frac{q_k}{D}\). In this case \(r_k = 0\). \(h_k\) measures the amount of information that the next symbol to be revealed carries given that the symbols already revealed led to internal node \(k\).

Local redundancy theorem: \(R = \sum_k r_k\)

5. Weak Typicality

Both weak and strong typicality are formal measures of how typical some outcome is for a sequence of i.i.d. random variables.

5.1 Weak AEP (Weak Asymptotic Equipartition Property)

Weak law of Large Numbers (WLLN) For i.i.d. random variables \(Y_1, Y_2, \ldots\) with generic random variable \(Y\) $$ \frac{1}{n}\sum_{k=1}^n Y_k → E[Y] $$
Weak AEP I Let \(X = (X_1, X_2, \ldots, X_n)\) where \(X_i\) are i.i.d. random variables. $$ \lim_{n→\infty}\bigg(-\frac{1}{n}\log p(X)\bigg) = H(X) $$ This holds due to \(p(X) = p(X_1)p(X_2)\cdots p(X_n)\), the WLLN above and the fact that \(-E[\log p(X)] = H(X)\). Weak AEP I includes two further equivalent statements.

The weakly typical set \(W_{[X]\epsilon}^n\) with respect to \(p(x)\) is the set of all sequences \(x = (x_1, \ldots, x_n) \in \mathcal{X}^n\) such that $$ \Big|-\frac{1}{n}\log p(x) - H(X)\Big| \le \epsilon $$ The empirical entropy of a sequence \(x = (x_1, \ldots, x_n)\) is $$ -\frac{1}{n} \log p(x) = \frac{1}{n}\sum_{k=1}^n [-\log p(x_k)] $$ Therefore the empirical entropy of a weakly typical sequence is close to the true entropy \(H(X)\).

Weak AEP II For \(\epsilon > 0\) and \(x \in W_{[X]\epsilon}^n\) $$ 2^{-n(H(X)+\epsilon)} \le p(x) \le 2^{-n(H(X)-\epsilon)} $$ which follows directly from the definition of \(W_{[X]\epsilon}^n\). Furthermore for \(n\) sufficiently large $$ \Pr\{X\in W_{[X]\epsilon}^n\} > 1-\epsilon $$ and $$ (1-\epsilon)2^{n(H(X)-\epsilon)} \le |W_{[X]\epsilon}^n| \le 2^{n(H(X)+\epsilon)} $$ "The idea is that, although the size of the weakly typical set may be insignificant compared with the size of the set of all sequences, the former has almost all the probability. When \(n\) is large, one can almost think of the sequence \(X\) as being obtained by choosing a sequence from the weakly typical set according to a uniform distribution. [..] The most likely sequence is in general not weakly typical although the probability of the weakly typical set is close to 1 when \(n\) is large."

→ By proving a property for the typical set, this property will hold with a high probability in the general case.

5.2 Source Coding Theorem

The encoder is a function \(f\colon \mathcal{X}^n → I = \{1, 2, \ldots, M\}\). Such a code is called a block code with block length \(n\). The coding rate of the code is \(\frac{\log M}{n}\) in bits per source symbol.

For data compression typically \(R < \log |\mathcal{X}|\).

\(P_e = \Pr\{\hat X \ne X\}\) is called the error probability.

The source coding theorem

6. Strong Typicality

6.1 Strong AEP (Strong Asymptotic Equipartition Property)

Let \(X = (X_1, X_2, \ldots, X_n)\) where \(X_i\) are i.i.d. random variables. Assume \(|\mathcal{X}|<\infty\).
Consider \(\mathbf {x}\in \mathcal{X}^n\). Let \(N(x, \mathbf{x})\) be the number of occurrences of \(x\) in the sequence \(\mathbf{x}\). Then \(\frac{N(x, \mathbf{x})}{n}\) is the relative frequency of \(x\) in \(\mathbf{x}\) and \(\{\frac{N(x, \mathbf x)}{n}: x \in \mathcal{X}\}\) is the empirical distribution of \(\mathbf x\).
The strongly typical set \(T^n_{[X]\delta}\) with respect to \(p(x)\) is the set of sequences \(\mathbf x = (x_1, \ldots, x_n) \in \mathcal{X}^n\) such that: $$ N(x, \mathbf x) = 0 \text{ for } x \notin \mathcal{S}_X $$ (any \(x\) not in the support cannot occur in the sequence) and $$ \sum_x \Bigg |\frac{N(x, \mathbf x)}{n} - p(x) \Bigg| \le \delta $$ (the empirical distribution of \(\mathbf x\) is approximately equal to \(p(x)\))
Strong AEP For \(\mathbf x \in T^n_{[X]\delta}\) there exists \(\eta > 0\) such that \(\eta → 0\) as \(\delta → 0\) and the following hold $$ 2^{-n(H(X)+\eta)} \le p(x) \le 2^{-n(H(X)-\eta)} $$ Furthermore for \(n\) sufficiently large $$ \Pr\{X\in T_{[X]\delta}^n\} > 1-\delta $$ and $$ (1-\delta)2^{n(H(X)-\eta)} \le |T_{[X]\delta}^n| \le 2^{n(H(X)+\eta)} $$