3.1 Information Space
According to Shannon, a message x is a random event. Let p(x) be the probability of occurrence of the event x. If \(p(x)=0\), the event never occurs; if \(p(x)=1\), the event always occurs. When \(p(x) = 0\) or \(p(x) = 1\), the message x may be called trivial information or spam information. Therefore, the real mathematical significance of a message x lies in its uncertainty, that is, \(0<p(x)<1\). Quantitative study of the uncertainty of nontrivial information is the starting point of Shannon's theory; this starting point is now called the information quantity or information entropy, or entropy for short. Shannon and his colleagues at Bell Laboratories took the “bit” as the basic quantitative unit of information. What is a “bit”? We may simply understand it as one digit in the binary system. According to Shannon, an n-digit binary number can express up to \(2^{n}\) values; from the point of view of probability and statistics, each of these \(2^{n}\) values occurs with probability \(\frac{1}{2^{n}}\). Therefore, a bit is the amount of information contained in an event x of probability \(\frac{1}{2}\). Taking this as the starting point, Shannon defined the self-information I(x) contained in a message x as
Therefore, a message x contains I(x) bits of information; when \(p(x)=\frac{1}{2}\), \(I(x)=1\). Equation (3.1) was Shannon's first extraordinary step in the quantification of information. On the other hand, with the emergence of the telegraph and the telephone, binary became widely used in the conversion and transmission of information. We may therefore assert that without binary there would be no Shannon theory, let alone present-day informatics and the information age. The purpose of this section is to deduce and simplify, with full mathematical rigor, the most basic and important conclusions of Shannon's theory. We start with the rationality of the definition in formula (3.1).
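As a quick numerical illustration (a sketch that is not part of the original text; the function name is ours), formula (3.1) with the logarithm taken to base 2, \(I(x)=-\log_2 p(x)\), can be evaluated directly:

```python
import math

def self_information(p: float) -> float:
    """Self-information I(x) = -log2 p(x), in bits, for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return math.log2(1.0 / p)

# An event of probability 1/2 carries exactly 1 bit.
print(self_information(0.5))   # 1.0
# A certain event (p = 1) carries no information.
print(self_information(1.0))   # 0.0
# Independent events add: I(xy) = I(x) + I(y) when p(xy) = p(x)p(y).
print(self_information(0.5 * 0.25) == self_information(0.5) + self_information(0.25))
```

The last line checks the additivity assumption that Lemma 3.1 below turns into a characterization of the logarithm.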
If I(x) denotes the self-information of a random event x, then the greater the probability of occurrence p(x), the smaller the uncertainty; hence I(x) should be a monotonically decreasing function of p(x). If xy is a joint event whose components are statistically independent, that is, \(p(xy)=p(x)p(y)\), then the self-information satisfies \(I(xy)=I(x)+I(y)\). Of course, self-information is nonnegative: \(I(x)\ge 0\). Shannon proved that any self-information I(x) satisfying the above three assumptions must be
where c is a constant. This conclusion can be derived directly from the following mathematical theorem.
Lemma 3.1
If the real function f(x) satisfies the following conditions in interval \([1,+\infty )\):
(i)
\(f(x)\ge 0,\)
(ii)
\(x< y \Rightarrow f(x) < f(y),\)
(iii)
\(f(xy) = f(x)+ f(y).\)
Then \(f(x)=c\log {x}\), where c is a constant.
Proof
Using condition (iii) repeatedly, we obtain
$$f(x^{k})=kf(x), ~k\ge 1$$
for any positive integer k. Taking \(x=1\), the above formula holds if and only if \(f(1)=0\). It follows from (ii) that \(f(x)>0\) when \(x>1\). Let \(x>1\), \(y>1\) and \(k\ge 1\) be given; one can always find a nonnegative integer n satisfying
In Lemma 3.1, let \(I(x)=f(\frac{1}{p(x)})\); then f(x) satisfies conditions (i), (ii) and (iii), and thus \(I(x)=-c\log {p(x)}\), that is, (3.1) holds.
In order to introduce the definition of information space, we use X to represent a finite set of original information, or a countable additive information set, called a source state set. It can be an alphabet, a finite set of symbols, or a set of numbers; for example, the 26 letters of English and the 2-element finite field \(\mathbb {F}_{2}\) are commonly used source state sets. Elements of X may be called messages, events, or characters. We often use capital letters such as X, Y, Z to represent source state sets, and lowercase Greek letters \(\xi , \eta , \ldots \) to represent random variables in a given probability space.
Definition 3.1
The value space of a random variable \(\xi \) is a source state set X; the probability distribution of characters on X as events is defined as
$$\begin{aligned} p(x)=P\{\xi =x\},~\forall ~ x \in X. \end{aligned}$$
(3.2)
We call \((X, \xi )\) an information space in the given probability space; when the random variable \(\xi \) is clear, we usually write the information space \((X, \xi )\) simply as X. If \(\eta \) is another random variable with values in X, and \(\xi \) and \(\eta \) obey the same probability distribution, that is
$$\begin{aligned} P\{\xi =x\}=P\{\eta =x\},~\forall ~ x \in X. \end{aligned}$$
then we call the two information spaces equal, \((X, \xi )=(X, \eta )\), and usually write both as X.
As can be seen from Definition 3.1, an information space X constitutes a finite complete group of events; that is, we have
It should be noted that if there are two random variables \(\xi \) and \(\eta \) with values on X, when the probability distributions obeyed by \(\xi \) and \(\eta \) are not equal, then \((X, \xi )\) and \((X, \eta )\) are two different information spaces; at this point, we must distinguish the two different information spaces with \(X_{1}=(X, \xi )\) and \(X_{2}=(X, \eta )\).
Definition 3.2
Let X and Y be two source state sets, and let the random variables \(\xi \) and \(\eta \) take values on X and Y, respectively. If \(\xi \) and \(\eta \) are compatible random variables, the probability distribution of the joint event \(xy~(x \in X, y \in Y)\) is defined as
$$\begin{aligned} p(xy)=P\{\xi =x, \eta =y \},~\forall ~ x \in X, ~ y \in Y. \end{aligned}$$
(3.4)
Then, we call the joint event set
$$\begin{aligned} XY=\{xy| x \in X, y \in Y\} \end{aligned}$$
together with the corresponding random variables \(\xi \) and \(\eta \); it is called the product space of the information spaces \((X, \xi )\) and \((Y, \eta )\), denoted \((XY, \xi , \eta )\). When \(\xi \) and \(\eta \) are clear, it is abbreviated as \(XY=(XY, \xi , \eta )\). If \(X=Y\) are the same source state set and \(\xi \) and \(\eta \) have the same probability distribution, then the product space XY is denoted \(X^{2}\) and called a power space.
Since an information space is a complete group of events, the product information space yields the following total probability formula and probability product formula:
$$\begin{aligned} p(x)p(y|x)=p(xy), ~\forall ~ x \in X, y \in Y. \end{aligned}$$
where p(y|x) is the conditional probability of y under the condition x.
Definition 3.3
Let \(X_{1}, X_{2}, \ldots , X_{n}~(n \ge 2)\) be n source state sets and \(\xi _{1}, \xi _{2}, \ldots , \xi _{n}\) be n compatible random variables taking values in \(X_{i}\), respectively; the probability distribution of the joint event \(x_{1}x_{2}\cdots x_{n}\) is
$$\begin{aligned} X_{1}X_{2}\cdots X_{n}=\{x_{1}x_{2}\cdots x_{n} |x_{i}\in X_{i},~1\le i \le n \} \end{aligned}$$
is called the product of the n information spaces. In particular, when \(X_{1}= X_{2}= \cdots =X_{n}=X\) and each \(\xi _{i}\) has the same probability distribution on X, we define \(X^{n}=X_{1}X_{2}\cdots X_{n}\), called the n-th power space of the information space X.
Let us give some classic examples of information space.
Example 3.1
(Two-point information space with parameter \(\lambda \)) Let \(X=\{0, 1\}=\mathbb {F}_{2}\) be the binary finite field, and let the random variable \(\xi \) taking values on X obey the two-point distribution with parameter \(\lambda \), that is
where \(0<\lambda <1\); then \((X, \xi )\) is called a two-point information space with parameter \(\lambda \), still denoted X.
Example 3.2
(Equal probability information space) Let \(X=\{x_{1}, x_{2}, \ldots , x_{n}\}\) be a source state set, and let the random variable \(\xi \) on X obey the equal probability distribution, that is
$$\begin{aligned} p(x)=P\{\xi =x \}=\frac{1}{|X|},~\forall ~ x \in X. \end{aligned}$$
Then \((X, \xi )\) is called an equal probability information space, still denoted X.
Example 3.3
(Bernoulli information space) Let \(X_{0}=\{0, 1\}=\mathbb {F}_{2}\), and let the random variable \(\xi _{i}\) describe the i-th Bernoulli trial; then \(\{\xi _{i}\}_{i=1}^{n}\) is a set of independent and identically distributed random variables. We form the product space
Example 3.4
(Degenerate information space) If \(X=\{x\}\) contains only one character, X is called a degenerate information space, or a trivial information space. The random variable \(\xi \) takes the value x with probability 1, that is, \(P\{\xi =x\}=1\); at this time, \(\xi \) is a random variable with a degenerate distribution in probability.
Definition 3.4
Let \(X=\{x_{1}, x_{2}, \ldots , x_{n}\}\) be a source state set; if X is an information space, the information entropy H(X) of X is defined as
If \(p(x_{i})=0\) in the above formula, we adopt the convention \(p(x_{i})\log p(x_{i})=0\). The base of the logarithm can be chosen arbitrarily; if the base is \(D~(D\ge 2)\), then H(X) is called the D-ary entropy, sometimes denoted \(H_{D}(X)\).
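Definition 3.4 translates directly into a short computation. The sketch below (function name and test distributions are ours, not the book's) uses base-2 logarithms and the convention \(0\log 0 = 0\):

```python
import math

def entropy(probs, base=2):
    """D-ary entropy H(X) = -sum p(x) log_D p(x), with the convention 0*log 0 = 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # ≈ 1.75 bits
print(entropy([1.0]))                      # 0.0: a degenerate information space
print(entropy([0.25] * 4))                 # ≈ 2.0 = log2 |X|: equal probability
```

The three calls preview the extreme cases established by Theorem 3.1 below: entropy is 0 for a degenerate space and \(\log|X|\) for an equal probability space.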
Moreover, \(H(X)=0\) if and only if X is a degenerate information space, and \(H(X)= \log |X|\) if and only if X is an equal probability information space.
Proof
\(H(X)\ge 0\) is trivial. We only prove the inequality on the right of Eq. (3.9). Because \(f(x)=\log {x}\) is a strictly concave real function, by Lemma 1.7 in Chap. 1, taking \(g(x)=\frac{1}{p(x)}\) as a positive function with \(p(x)>0\), and letting \(X=\{x_{1},x_{2}, \ldots , x_{m}\}\), we have
$$\begin{aligned} H(X)=\sum _{i=1}^{m}p(x_{i})\log {\frac{1}{p(x_{i})}} \le \log {\sum _{i=1}^{m}\frac{p(x_{i})}{p(x_{i})}}=\log m. \end{aligned}$$
The equality holds if and only if \(p(x_{1})=p(x_{2})=\cdots =p(x_{m})=\frac{1}{m}\), that is, X is an equal probability information space. If \(X=\{x\}\) is a degenerate information space, then \(p(x)=1\) and so \(H(X)=0\). Conversely, suppose \(H(X)=0\) with \(X=\{x_{1},x_{2}, \ldots , x_{m}\}\); if there were some \(x_{i}\in X\) with \(0<p( x_{i})<1\), then
Hence some \(p( x_{i})=1\) and \(p( x_{j})=0\) for \(j\ne i\); at this point, X degenerates into \(X=\{x_{i}\}\), a trivial information space. The theorem holds.
An information space is a dynamic code (it changes with the random variable on it). For a “dynamic code”, that is, for the code rate of an information space X, Shannon used the information entropy in place of \(\frac{1}{n} H(X)\), so information entropy H(X) became the first mathematical quantity describing a dynamic code. By Theorem 3.1, the minimum rate of a dynamic code, 0, is attained when the code is degenerate, and the maximum rate, attained when the code has equal probabilities, is the rate of the usual static code.
Next, we discuss the information entropy of several typical information spaces.
Example 3.5
(i) Let X be the two-point information space with parameter \(\lambda \); then
We defined \(H(\lambda )\) in Chap. 1, where it was called the binary information entropy function; now we see why it is called an entropy function.
(ii)
If \(X=\{x\}\) is a degenerate information space, then \(H(X)=0\).
(iii)
When X is an equal probability information space, \(H(X)=\log |X|\).
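The three cases of Example 3.5 can be checked side by side with the binary entropy function (a sketch; the function name is ours):

```python
import math

def H(lam: float) -> float:
    """Binary entropy function H(lam) = -lam*log2(lam) - (1-lam)*log2(1-lam)."""
    if lam in (0.0, 1.0):          # degenerate information space: no uncertainty
        return 0.0
    return -lam * math.log2(lam) - (1 - lam) * math.log2(1 - lam)

print(H(0.5))             # 1.0 bit: the equal probability case, log2 |X| with |X| = 2
print(H(0.0))             # 0.0: the degenerate case
print(round(H(0.1), 4))   # a biased two-point space carries less than 1 bit
```

Note the symmetry \(H(\lambda)=H(1-\lambda)\) and that the maximum, 1 bit, is attained only at \(\lambda=\frac12\).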
Remark Most authors directly regard a random variable as an information space. Mathematically it is convenient to do so, and the result is called the information measurement of random variables. However, from the perspective of information, the concept of an information space leads to a better understanding and simplification of Shannon's theory, whose core idea is the random measurement of information, not the information measurement of random variables.
3.2 Joint Entropy, Conditional Entropy, Mutual Information
Definition 3.5
Let X, Y be two information spaces, and \(\xi , \eta \) be random variables taking values on them, respectively. If \(\xi \) and \(\eta \) are independent random variables, that is
$$\begin{aligned} P\{\xi =x, \eta =y\}=P\{\xi =x\}\cdot P\{ \eta =y\}, ~\forall ~ x \in X, y \in Y. \end{aligned}$$
then X and Y are called independent information spaces, and the probability distribution of joint events is
$$\begin{aligned} p(xy)=p(x)p(y), ~\forall x \in X, y \in Y. \end{aligned}$$
Definition 3.6
Let X, Y be two information spaces, the information entropy H(XY) of the product space XY is called the joint entropy of X and Y, that is
The equality holds if and only if \(\frac{p(x)p(y)}{p(xy)}=c\) for all \(x\in X\), \(y\in Y\), where c is a constant; thus \(p(x)p(y)=cp(xy)\). Summing both sides over all x and y, we have
thus \(c=1\) and \(p(xy)=p(x)p(y)\). So (3.14) holds with equality if and only if X and Y are independent information spaces. By induction, we obtain (3.15) and (3.16). Theorem 3.2 holds.
By (3.15), we have the following direct corollary: for any information space X and \(n \ge 1\),
$$\begin{aligned} H(X^{n})\le n H(X). \end{aligned}$$
(3.17)
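Inequality (3.17), and the subadditivity \(H(XY)\le H(X)+H(Y)\) it comes from, can be checked numerically on a small joint distribution (a sketch; the joint probabilities are our own made-up numbers):

```python
import math

def entropy(probs):
    """Base-2 entropy of a probability vector, with 0*log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A correlated joint distribution p(xy) on X = Y = {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) = sum_y p(xy) and p(y) = sum_x p(xy).
px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

H_XY, H_X, H_Y = entropy(joint.values()), entropy(px), entropy(py)

# Joint entropy never exceeds the sum of the marginal entropies;
# equality would hold only if X and Y were independent.
print(round(H_XY, 4), "<=", H_X + H_Y)
```

Here the strict inequality reflects the correlation in `joint`; replacing it with a product distribution would give equality.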
Definition 3.7
Let X and Y be two information spaces. We say that X is completely determined by Y if for any given \(x\in X\) there is a subset \(N_{x} \subset Y\) satisfying
The proposition holds for \(n+1\). So the Lemma holds.
3.3 Redundancy
Select an alphabet \(\mathbb {F}_{q}\) or a residue class ring \(\mathbb {Z}_{m}\) modulo m; each element of the alphabet is called a character. In the field of communication, an alphabet is also called a source state, and a character is also called a transmission signal. If the length of a q-ary code is increased, redundant transmission signals or characters will appear in each codeword. The numerical measurement of these “redundant characters” is called redundancy. Adding redundancy is a technical means of improving the accuracy of codeword transmission, and redundancy is the key mathematical quantity describing this technique. Therefore, we start by proving the following lemma.
Let X be a source state set; randomly selecting codewords to enter the channel of information transmission is a discrete random process. This mathematical model can be constructed and studied by letting a group of random variables \(\{\xi _{i}\}_{i \ge 1}\) take values on X. First, we assume that the \(\{\xi _{i}\}_{i \ge 1}\) obey the same probability distribution on X, giving a set of information spaces \(\{X^{i}\}_{i \ge 1}\). Let \(H_{0}=\log |X|\) be the entropy of X as an equal probability information space; for \(n \ge 1\), we let
We now extend the above observation to the general case: let \(\{\xi _{i}\}_{i \ge 1}\) be an arbitrary set of random variables with values on X; for any \(n \ge 1\), we let
In passing from an information space to a source, we move from a single random variable taking values on X to an infinite-dimensional random vector, so that the transmission process of the code X constitutes a discrete random process. By definition, we have
Lemma 3.8
Let X be a source state set, and \(\{\xi _{i}\}_{i \ge 1}\) be a set of random variables valued on X, we write
$$\begin{aligned} X_{i}=(X, \xi _{i}),~ i \ge 1. \end{aligned}$$
(3.28)
(i)
If X is a memoryless source, the joint probability distribution on X satisfies
If X is a stationary source, then for all integers \( t_{1}, t_{2}, \ldots , t_{k}~(k \ge 1)\) and h, we have the following joint probability distribution:
If X is a stationary Markov source, then the conditional probability distribution on X satisfies, for any \(m \ge 1\) and \(x_{1}x_{2}\cdots x_{m} \in X_{1}X_{2}\cdots X_{m}\),
Next, we extend the limit formula for memoryless sources revealed by formula (3.27) to general stationary sources. For this purpose, we first prove two lemmas.
Lemma 3.9
Let \(\{f(n)\}_{n\geqslant 1}\) be a sequence of real numbers satisfying the following subadditivity property:
When \(\varepsilon >0\) is given, N is also fixed accordingly, and the first term of the above formula tends to 0 as \(n\rightarrow \infty \). So for any \(\varepsilon >0\), when \(n>N_0\),
We denote the above common limit as \(H_{\infty }(X)\).
Proof
Because X is a stationary source, for any \(n\geqslant 1\) and \(m\geqslant 1\), the joint probability distribution of the random vector \((\xi _{n+1},\xi _{n+2},\ldots ,\xi _{n+m})\) on X is equal to the joint probability distribution of the random vector \((\xi _1,\xi _2,\ldots ,\xi _m)\); therefore, we have
Let \(f(n)=H(X_1\cdots X_n)\); then \(f(n+m)\leqslant f(n)+f(m)\), so \(\{f(n)\}_{n \geqslant 1}\) is a nonnegative subadditive real sequence, and by Lemma 3.9, we have
So \(\{H(X_n|X_1X_2 \cdots X_{n-1})\}_{n\geqslant 1}\) is a monotonically decreasing sequence with a lower bound, hence \(\lim \limits _{n\rightarrow \infty }H(X_n|X_1X_2 \cdots X_{n-1})\) exists. Further, by the addition formula of Lemma 3.2,
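Lemma 3.9 is a Fekete-type subadditivity lemma: for a subadditive sequence, \(f(n)/n\) converges to \(\inf_n f(n)/n\). A toy numerical check (the sequence here is our own illustration, not the entropy sequence of the text):

```python
import math

# f(n) = n + sqrt(n) is subadditive, because sqrt(n + m) <= sqrt(n) + sqrt(m).
f = lambda n: n + math.sqrt(n)

# Verify subadditivity f(n + m) <= f(n) + f(m) on a grid.
assert all(f(n + m) <= f(n) + f(m) + 1e-12
           for n in range(1, 50) for m in range(1, 50))

# f(n)/n = 1 + 1/sqrt(n) decreases toward its infimum 1, as the lemma predicts.
print([round(f(n) / n, 4) for n in (1, 100, 10000)])  # [2.0, 1.1, 1.01]
```

In the proof above, the role of f(n) is played by \(H(X_1\cdots X_n)\), and the common limit \(f(n)/n \to \inf_n f(n)/n\) is exactly the entropy rate \(H_\infty(X)\).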
In information theory, redundancy describes the effectiveness of the information carried by the source output symbols: the smaller the redundancy, the higher the effectiveness, and vice versa.
3.4 Markov Chain
Let X, Y, Z be three information spaces. If the following conditional probability formula holds,
we say that X and Y are statistically independent given Z.
Definition 3.11
If the information spaces X and Y are statistically independent under condition Z, then X, Z, Y is called a Markov chain, denoted \(X \rightarrow Z\rightarrow Y\).
Theorem 3.5
\(X \rightarrow Z\rightarrow Y\) is a Markov chain if and only if the probability of occurrence of the joint event xzy is
That is, \(X \rightarrow Z\rightarrow Y\) is a Markov chain. Similarly, if (3.40) holds, then \(X \rightarrow Z\rightarrow Y\) is also a Markov chain. The theorem holds.
By the above theorem, or directly by Definition 3.11, if \(X \rightarrow Z\rightarrow Y\) is a Markov chain, then \(Y\rightarrow Z\rightarrow X \) is also a Markov chain.
Definition 3.12
Let U, X, Z, Y be four information spaces such that the probability of the joint event uxzy is
Then U, X, Z, Y is called a Markov chain, denoted \(U\rightarrow X\rightarrow Z\rightarrow Y\).
Theorem 3.6
If \(U\rightarrow X\rightarrow Z\rightarrow Y\) is a Markov chain, then \(U\rightarrow X\rightarrow Z\) and \(U\rightarrow Z\rightarrow Y\) are also Markov chains.
Proof
Assume that \(U\rightarrow X\rightarrow Z\rightarrow Y\) is a Markov chain; then
Information coding theory is usually divided into two parts: channel coding and source coding. Channel coding ensures the success rate of decoding by increasing the length of codewords; channel codes, also known as error correction codes, were discussed in detail in Chap. 2. Source coding compresses data containing redundant information so as to improve the success rate of decoding and recovery after the information or data is stored. Another important result of Shannon's theory is that there are so-called good codes in source coding, characterized by using as few codewords as possible, to improve storage efficiency, while the error of decoding and restoration can be made arbitrarily small. Such codes are also called typical codes. Shannon first proved the asymptotic equipartition property of block codes for memoryless sources, and from it derived the statistical characteristics of typical codes. At Shannon's suggestion, McMillan (1953) and Breiman (1957) proved a similar asymptotic equipartition property for stationary ergodic sources. This is the famous Shannon–McMillan–Breiman theorem of source coding, which constitutes the core of modern typical code theory. The main purpose of this section is to rigorously prove the asymptotic equipartition property for memoryless sources, and from it derive the source coding theorem for data compression (see Theorem 3.10). For the more general Shannon–McMillan–Breiman theorem, Chap. 2 of Ye Zhongxing's Fundamentals of Information Theory (see Zhongxing, 2003 in reference 3) gives a proof under the condition of a stationary ergodic Markov source; interested readers may consult it or the more original documents (see McMillan, 1953; Moy, 1961; Shannon, 1959 in reference 3).
First, let \(X=(X,\xi )\) be an information space; the entropy H(X) of X essentially depends only on the probability function \(p(x)~(x\in X)\) of the random variable \(\xi \). We can define a random variable taking values on X according to p(x).
Therefore, we can regard the entropy H(X) of X as the mathematical expectation of random variable \(\log \frac{1}{p(X)}\).
Lemma 3.11
Let X be a memoryless source, and let \(p(X^{n})\) and \(\log p(X^{n})\) be the two random variables defined on the power space \(X^{n}\); then \(-\frac{1}{n}\log p(X^{n})\) converges to H(X) in probability, that is
Since X is a memoryless source, \(\{ \xi _{i} \}_{i\ge 1}\) is a group of independent and identically distributed random variables, \(X_{i}=(X,\xi _{i})~(i\ge 1)\), and \(X^{n}=X_{1}X_{2}\cdots X_{n}~(n\ge 1)\) is a power space; hence
Because \(\{\xi _{i}\}_{i\ge 1}\) is independent and identically distributed, \(\{p(\xi _{i})\}\) and \(\{\log p(\xi _{i})\}\) are also groups of independent and identically distributed random variables. By Chebyshev's law of large numbers (see Theorem 1.3 of Chap. 1),
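Lemma 3.11 can be observed by simulation: for a memoryless Bernoulli source, \(-\frac1n\log_2 p(X^n)\) concentrates around H(X) as n grows. A sketch with our own parameters and function names:

```python
import math
import random

random.seed(1)
lam = 0.3                               # P(xi = 1); our choice of source parameter
H = -lam * math.log2(lam) - (1 - lam) * math.log2(1 - lam)

def sample_rate(n):
    """Draw x1...xn i.i.d. Bernoulli(lam) and return -(1/n) log2 p(x1...xn)."""
    xs = [1 if random.random() < lam else 0 for _ in range(n)]
    log_p = sum(math.log2(lam) if x else math.log2(1 - lam) for x in xs)
    return -log_p / n

# The per-word quantity fluctuates for small n and settles near H(X) for large n.
for n in (10, 1000, 100000):
    print(n, round(sample_rate(n), 3), "vs H(X) =", round(H, 3))
```

This convergence in probability is exactly what makes the typical sets of the next subsection capture almost all of the probability mass.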
Let X be a memoryless source; the power space \(X^{n}\), also known as a block code, is
$$\begin{aligned} X^{n}=\{x=x_{1}\cdots x_{n}|x_{i}\in X,~ 1\le i \le n\},n\ge 1. \end{aligned}$$
(3.54)
For any given \(\varepsilon >0\) and \(n\ge 1\), we define the typical code, or typical sequence set, \(W_{\varepsilon }^{(n)}\) in the power space \(X^{n}\) as
(Asymptotic equipartition) Let \(|W_{\varepsilon }^{(n)}|\) denote the number of codewords in the typical code \(W_{\varepsilon }^{(n)}\); then for any \(\varepsilon >0\), in binary channels, we have
By Lemma 3.12, for a memoryless source X, the probability distribution p(x) on its power space \(X^{n}\) is approximately
$$p(x) \sim 2^{-nH(X)}, ~\forall ~ x\in X^{n}.$$
The number of codewords \(|W_{\varepsilon }^{(n)}|\) in typical code \(W_{\varepsilon }^{(n)}\) is approximately
$$|W_{\varepsilon }^{(n)}|\sim 2^{nH(X)}.$$
Further analysis shows that the proportion of typical code \(W_{\varepsilon }^{(n)}\) in block code \(X^{n}\) is very small, which can be summarized as the following Lemma.
Lemma 3.13
For a given, sufficiently small \(\varepsilon >0\), when X is not an equal probability information space, we have
By Theorem 3.1, since X is not an equal probability information space, when \(\varepsilon \) is sufficiently small, we have
$$H(X)+\varepsilon <\log |X|.$$
Therefore, when n is sufficiently large, the ratio \(\frac{|W_{\varepsilon }^{(n)}|}{|X|^{n}}\) can be made arbitrarily small. Lemma 3.13 holds.
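Lemma 3.13 can be checked by brute force for small n: enumerate all of \(X^n\), pick out the ε-typical words, and compare their count with \(|X|^n\). A sketch for a Bernoulli source, reading the (omitted) condition (3.55) as \(|-\frac1n\log_2 p(x) - H(X)|\le\varepsilon\); all parameters are our own:

```python
import math
from itertools import product

lam, eps, n = 0.2, 0.11, 16
H = -lam * math.log2(lam) - (1 - lam) * math.log2(1 - lam)

def log2_p(x):
    """log2 of p(x1...xn) for the memoryless Bernoulli(lam) source."""
    k = sum(x)                        # number of ones in the word
    return k * math.log2(lam) + (n - k) * math.log2(1 - lam)

# Typical words: those with -(1/n) log2 p(x) within eps of H(X).
typical = [x for x in product((0, 1), repeat=n)
           if abs(-log2_p(x) / n - H) <= eps]

share = len(typical) / 2 ** n                   # fraction of the block code
mass = sum(2 ** log2_p(x) for x in typical)     # total probability of W
print(len(typical), round(share, 4), round(mass, 4))
```

At this small n the typical set already occupies under 4% of \(X^n\) while carrying a disproportionate probability mass; as n grows, the share shrinks geometrically while the mass tends to 1.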
Combining Lemmas 3.11, 3.12 and 3.13, we can summarize the statistical characteristics of typical codes within block codes as follows.
Corollary 3.5
Assume that X is a memoryless source and the typical sequence (or typical code) \(W_{\varepsilon }^{(n)}\) in the block code \(X^{n}\) is defined by formula (3.55); then for any \(\varepsilon >0\) and \(n\ge 1\), we have
The above statistical characteristics of typical codes are an important theoretical basis for source coding, i.e., data compression. We therefore seek an effective way to compress block code information so that the re-encoded codewords are as few as possible while the error probability of decoding and recovery is as small as possible. An effective method is to divide the codewords of the block code \(X^{n}\) into two parts: the codewords of the typical code \(W_{\varepsilon }^{(n)}\) are uniformly numbered from 1 to M; that is, the codewords in \(W_{\varepsilon }^{(n)}\) are put in one-to-one correspondence with the following set I of positive integers,
Codewords that do not belong to \(W_{\varepsilon }^{(n)}\) are all numbered 1. Obviously, for each \(i\ne 1\), \(1\le i\le M\), there is a unique codeword \(x^{(i)}\in W_{\varepsilon }^{(n)}\), so we can accurately restore i to \(x^{(i)}\); that is, \(i{\mathop {\longrightarrow }\limits ^{\text {decode}}}x^{(i)}\) is a correct decoding. For \(i=1\), we cannot decode correctly, resulting in a decoding recovery error. We denote the code rate of the typical code \(W_{\varepsilon }^{(n)}\) by \(\frac{1}{n}\log M\); by Lemma 3.12,
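The numbering scheme just described is easy to prototype: typical words get distinct numbers, everything else shares the number 1, and a decoding error occurs exactly when the source emits a non-typical word. A sketch with our own parameters, continuing the Bernoulli illustration:

```python
import math
from itertools import product

lam, eps, n = 0.3, 0.2, 10
H = -lam * math.log2(lam) - (1 - lam) * math.log2(1 - lam)

def p(x):
    """Probability of the word x under the memoryless Bernoulli(lam) source."""
    k = sum(x)
    return lam ** k * (1 - lam) ** (n - k)

words = list(product((0, 1), repeat=n))
typical = [x for x in words if abs(-math.log2(p(x)) / n - H) <= eps]

# Typical words get the distinct numbers 1..M; everything else is numbered 1.
number = {x: i + 1 for i, x in enumerate(typical)}
encode = lambda x: number.get(x, 1)
decode = {i: x for x, i in number.items()}

# Every i != 1 decodes correctly; i = 1 is ambiguous, so a decoding error
# occurs exactly when a non-typical word is emitted.
P_e = sum(p(x) for x in words if x not in number)
print(len(typical), round(P_e, 4))
```

With these toy parameters (small n, generous ε) the error probability is still sizable; Theorem 3.10 below says it can be driven to 0 as \(n\to\infty\) whenever the rate exceeds H(X).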
From this, we derive the main result of this section, the so-called source coding theorem.
Theorem 3.10
(Shannon, 1948) Assuming that X is a memoryless source, then
(i)
When the code rate \(R=\frac{1}{n}\log M_{1}>H(X)\), there is an encoding with code rate R such that, as \(n\rightarrow \infty \), the error probability of decoding recovery satisfies \(P_{e}\rightarrow 0\).
(ii)
When the code rate \(R=\frac{1}{n}\log M_{1}<H(X)-\delta \), where \(\delta >0\) does not change as \(n\rightarrow \infty \), then any coding with code rate R has \(\lim \limits _{n\rightarrow \infty }P_{e}=1\).
Proof
The above analysis has given the proof of (i). In fact, if
$$R=\frac{1}{n}\log M_{1}>H(X),$$
then when \(\varepsilon \) is sufficiently small, by (3.59), the typical code in the block code \(X^{n}\) satisfies
Therefore, we construct a code \(C\subset X^{n}\), which satisfies
$$W_{\varepsilon }^{(n)}\subset C, |C|=M_{1}.$$
Thus, the code rate of C is exactly R, and the decoding error probability \(P_{e}(C)\) after compression coding satisfies \(P_{e}(C)<\varepsilon \), because the probability of occurrence of C
Let X be a source state set, \(x=x_1x_2\cdots x_n\in X^n\) a message sequence, and let x be output after compression coding as a codeword \(u=u_1u_2\cdots u_k\in \mathbb {Z}_D^k\) of length k, where \(D \ge 1\) is a positive integer and \(\mathbb {Z}_D\) is the residue class ring \({{\,\mathrm{mod}\,}}D\); \(u=u_1u_2\cdots u_k\in \mathbb {Z}_D^k\) is called a D-ary codeword of length k. The codeword u is decoded and translated back into the message x, that is, \(u\rightarrow x\). The purpose of source coding is to find a good coding scheme that makes the code rate as small as possible while keeping the decoding error sufficiently small. Below, we give strict mathematical definitions of equal length codes and variable length codes.
Definition 3.14
Let X be a source state set, \(\mathbb {Z}_D\) the residue class ring \({{\,\mathrm{mod}\,}}D\), and n, k positive integers. The mapping \(f: X^n\rightarrow \mathbb {Z}_D^k\) is called an equal length coding function, and \(\mathbb {Z}_D^k{\mathop {\longrightarrow }\limits ^{\psi }}X^n\) the corresponding decoding function. For \(\forall ~x=x_1\cdots x_n\in X^n\), \(f(x)=u=u_1\cdots u_k\in \mathbb {Z}_D^k\), and \(u=u_1\cdots u_k\) is called a codeword of length k.
C is called the code produced by f, and \(R=\frac{k}{n}\log D\) is the coding rate of f, also known as the code rate of C. C is called an equal length code; it is sometimes called a block code with block length k.
By Definition 3.14, the error probability of an equal length code coding scheme \((f, \psi )\) is
Let us first consider error-free coding, that is, \(P_e=0\). Obviously, \(P_e=0\) if and only if f is an injection and \(\psi =f^{-1}\) is the left inverse mapping of f. A coding function \(f:X^n\rightarrow \mathbb {Z}_D^k\) can be chosen as an injection if and only if \(|\mathbb {Z}_D^k|\ge |X^n|\), that is, \(D^k\ge N^n\), where \(N=|X|\); taking logarithms on both sides,
Therefore, the code rate of an error-free compression coding f is at least \(\log _2|X|\) bits or \(\ln |X|\) nats.
We now consider asymptotically error-free coding: for any given \(\varepsilon >0\), we require decoding error probability \(P_e\le \varepsilon \). By Theorem 3.10, only a code rate \(R\ge H(X)\) is needed. In fact, take X as an information space and encode the length-n message sequence \(x=x_1x_2\cdots x_n\in X^n\): if \(x\in W_{\varepsilon }^{(n)}\) is a typical sequence, x corresponds to one of the numbers \(1, \ldots , M\), where \(M=|W_{\varepsilon }^{(n)}|\); if \(x\notin W_{\varepsilon }^{(n)}\), x is uniformly coded as 1. If the M codewords in \(W_{\varepsilon }^{(n)}\) are represented by D-ary digits, let \(D^k=M\) (padding where necessary), and the code rate R is
$$R=\frac{1}{n}\log M=\frac{k}{n}\log D.$$
Since M is approximately \(2^{nH(X)}\), R is approximately H(X), that is, \(R=\frac{1}{n}\log M\sim H(X)\). By the asymptotic equipartition property, the error probability of such coding is
$$P_e=P\{x=x_1\cdots x_n\notin W_{\varepsilon }^{(n)} \}<\varepsilon , \ \text{ When } \text{ n } \text{ is } \text{ sufficiently } \text{ large. }$$
However, in practical applications n cannot increase indefinitely, which requires us to find the best coding scheme for a given finite n, so that the code rate is as close as possible to the theoretical value H(X). Moreover, in applications we find that equal length codes are not efficient coding schemes, while variable length codes are more practical. For example:
Example 3.6
Let \(X=\{1, 2, 3, 4 \}\) be an information space, and the probability distribution of random variable \(\xi \) taking value on X is
If an equal length code is used, the code length is 2, and the code is
(code table omitted)
Then the code rate \(R (k=2, n=1)\) is
$$ R=2\log _2 2=2>1.75 \text{ bits }. $$
Obviously, the use efficiency of equal length codes is not high. If the above code is replaced with an unequal length code, and we use l(x) to represent the code length after the source letter x is encoded, then the average code length L required for encoding X is
It can be seen that encoding X with an unequal length code is more efficient. This example also illustrates the following compression coding principle: for characters with high probability of occurrence, prepare a shorter codeword; for characters with low probability of occurrence, prepare a longer codeword, so that the average coding length is as small as possible.
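The code table of Example 3.6 was lost in this copy. A standard prefix code consistent with the stated numbers, lengths 1, 2, 3, 3 for probabilities \(\frac12, \frac14, \frac18, \frac18\) (our reconstruction, not necessarily the book's table), attains average length equal to \(H(X)=1.75\) bits:

```python
probs = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
# A prefix code with lengths 1, 2, 3, 3 (our reconstruction of the lost table).
code = {1: "0", 2: "10", 3: "110", 4: "111"}

# Average code length L = sum_x p(x) * l(x).
avg_len = sum(probs[x] * len(code[x]) for x in probs)
print(avg_len)   # 1.75 bits, versus 2 bits for the equal length code
```

The shortest codeword goes to the most probable character, exactly the compression principle stated above.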
Next, we give the mathematical definition of variable length coding. For this purpose, let \(X^*\) and \(\mathbb {Z}_D^*\) be the sets of finite length sequences over X and \(\mathbb {Z}_D\), respectively; that is, \(X^*=\bigcup _{1 \le k < \infty }X^k\).
Definition 3.15
(i)
\(X^n{\mathop {\longrightarrow }\limits ^{f}}\mathbb {Z}_D^*\) is called a variable length coding function if for any \(x\in X^n\), \(f(x)\in \mathbb {Z}_D^*\), and for different x the code lengths of f(x) need not be the same. We use l(x) to denote the length of f(x), called the coding length of x. \(C=\{f(x)\in \mathbb {Z}_D^* | x\in X^n \}\) is called the variable length codeword set.
(ii)
Let \(f: X^*{\longrightarrow }\mathbb {Z}_D^*\) be a mapping; f is called a coding mapping, and \(f(X^*)\) is called a code.
(iii)
\(f: X^*{\longrightarrow }\mathbb {Z}_D^*\) is called a block code mapping if there is a mapping \(g: X{\longrightarrow }\mathbb {Z}_D^*\) such that for any \(x \in X^n~(n \ge 1)\), writing \(x=x_1x_2\cdots x_n\), we have \(f(x)=g(x_1)g(x_2)\cdots g(x_n)\).
(iv)
\(f: X^*{\longrightarrow }\mathbb {Z}_D^*\) is called a uniquely decodable mapping if f is a block code mapping and f is an injection.
(v)
\(f: X^*{\longrightarrow }\mathbb {Z}_D^*\) is called a real-time code mapping if f is a block code mapping and, for any \(x, y \in X^*\), f(x) and f(y) are not prefixes of each other.
Remark 3.1
Let \(a=a_1a_2\cdots a_n \in \mathbb {Z}_D^n\), \(b=b_1b_2\cdots b_m \in \mathbb {Z}_D^m\). Codeword a is called a prefix of b if \(m\ge n\) and \(a_{i}=b_{i}\) for all \(1 \le i \le n\).
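The prefix relation of Remark 3.1 and the real-time (prefix-free) property of Definition 3.15 can be checked mechanically; a minimal Python sketch, with illustrative codeword sets:

```python
from itertools import combinations

def is_prefix(a: str, b: str) -> bool:
    """Codeword a is a prefix of b iff len(a) <= len(b) and b starts with a."""
    return len(b) >= len(a) and b[:len(a)] == a

def is_real_time(codewords) -> bool:
    """A codeword set is real-time (prefix-free) iff no codeword
    is a prefix of another."""
    return not any(is_prefix(a, b) or is_prefix(b, a)
                   for a, b in combinations(codewords, 2))

print(is_real_time(["0", "10", "110", "111"]))  # True: prefix-free
print(is_real_time(["0", "01", "11"]))          # False: "0" is a prefix of "01"
```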
Lemma 3.14
A block code mapping \(f: X^*{\longrightarrow }\mathbb {Z}_D^*\) is a uniquely decodable mapping if and only if for all \(n \ge 1\), the restriction of f to \(X^n\) is an injection.
Proof
The necessity is obvious; we prove the sufficiency. That is, we prove that for any \(x=x_1x_2\cdots x_n \in X^n\) and \(y=y_1y_2\cdots y_m \in X^m\) with \(x \ne y\), we have \(f(x)\ne f(y)\). Suppose \(f(x)=f(y)\). Because f is a block code mapping, there is a mapping \(g: X{\longrightarrow }\mathbb {Z}_D^*\), and we have
But \(xy \ne yx\), which contradicts the fact that f restricted to \(X^{n+m}\) is an injection.
Lemma 3.15
A real-time code is uniquely decodable, but the converse does not hold.
Proof
Suppose \(f: X^*{\longrightarrow }\mathbb {Z}_D^*\) is a real-time code mapping, and for \(x, y \in X^*\), \(x \ne y\), write \(f(x)=a_1a_2\cdots a_n \in \mathbb {Z}_D^n\), \(f(y)=b_1b_2\cdots b_m \in \mathbb {Z}_D^m\) \((m \ge n)\). Because f(x) is not a prefix of f(y), there exists \(i\) \((1\le i \le n)\) with \(a_i \ne b_i\), thus \(f(x) \ne f(y)\); that is, f is an injection. For the converse, consider the following counterexample,
where \(X=\{1, 2, 3, 4\}\) is the information space and \(f: X\rightarrow \mathbb {Z}_2^*\) is a variable length code. f(1) is a prefix of f(2), so f is not a real-time code mapping, but f is clearly a uniquely decodable mapping. The Lemma holds.
What conditions must the code lengths of a real-time code satisfy? The following Kraft inequality gives a satisfactory answer.
Lemma 3.16
For a uniquely decodable code C with values in \(\mathbb {Z}_D^*\), with \(|C|=m\) and code lengths \(l_1, l_2, \ldots , l_m\), the following McMillan–Kraft inequality holds:
$$\begin{aligned} \sum _{i=1}^m D^{-l_i}\le 1. \end{aligned}$$
(3.66)
Conversely, if the \(l_i\) satisfy this condition, there is a real-time code C whose set of code lengths is \(\{l_1, l_2, \ldots , l_m \}\).
Proof
Consider the n-th power \(\left( \sum _{i=1}^m D^{-l_i}\right) ^n\); after expansion, the form of each item is \(D^{-l_{i_1}-l_{i_2}-\cdots -l_{i_n}}=D^{-k}\), where \(l_{i_1}+l_{i_2}+\cdots +l_{i_n}=k\). Let \(l=\max \{l_1, l_2, \ldots , l_m\}\); then k ranges from n to nl. Define \(N_{k}\) as the number of items of the form \(D^{-k}\); then
The codewords are in \(\mathbb {Z}_D^*\), and because \(f: X^*{\longrightarrow }\mathbb {Z}_D^*\) is an injection, \(N_{k}\le D^{k}\). Then we have
If \(x=\sum _{i=1}^m D^{-l_i}> 1\), then \(x^n >nl\) when n is sufficiently large. But the above bound holds for all n. Hence \(\sum _{i=1}^m D^{-l_i}\le 1.\)
Conversely, suppose the Kraft inequality holds, that is, the given lengths \(l_i (1\le i \le m)\) satisfy formula (3.66); we now construct a real-time code with these lengths, where the \(l_i (1\le i \le m)\) need not all be distinct. Define \(n_j\) as the number of codewords with length j; if \(l=\max \{l_1, l_2, \ldots , l_m\}\), then
Because \(n_{1}\le D\), we can choose these \(n_{1}\) codewords arbitrarily, and the remaining \(D-n_{1}\) symbols of length 1 can serve as prefixes of other codewords. Therefore, there are \((D-n_{1})D\) choices for codewords of length 2, that is, \(n_{2}\le D^{2}-n_1D\). Similarly, \((D-n_{1})D-n_{2}\) codewords of length 2 can serve as prefixes of subsequent codewords, so there are at most \(((D-n_{1})D-n_{2})D\) choices for codewords of length 3, that is, \(n_{3}\le D^{3}-n_1D^{2}-n_2D\). Continuing in this way, we can always construct a real-time code with lengths \(\{l_1, l_2, \ldots , l_m\}\). The Lemma holds!
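The construction in this proof can be sketched in code for the binary case \(D=2\): check the Kraft sum, then assign codewords of the given lengths greedily so that no chosen codeword is a prefix of a later one. The canonical-code style of assignment below is an implementation choice, not fixed by the text.

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of the McMillan-Kraft inequality."""
    return sum(D ** -l for l in lengths)

def build_prefix_code(lengths):
    """Sketch of the proof's construction for D = 2: if the Kraft inequality
    holds, assign binary codewords of the given lengths so that no codeword
    is a prefix of another (greedy/canonical assignment)."""
    assert kraft_sum(lengths) <= 1
    code, next_val, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_val <<= (l - prev_len)            # extend to the new length
        code.append(format(next_val, f"0{l}b"))
        next_val += 1
        prev_len = l
    return code

print(build_prefix_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```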
Let us give an example of a code that is not uniquely decodable.
Example 3.7
Let \(X=\{1, 2, 3, 4 \}\), \(\mathbb {Z}_D=\mathbb {F}_2\), the coding scheme is
Because the encoder outputs and the decoder receives a continuous stream of codeword symbols, if the string received by the decoder is 001101, there may be two decoding results, 112212 and 3412. This shows that \(f^*\) is not an injection, that is, the code compiled by f is not uniquely decodable.
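The ambiguity can be reproduced with a brute-force parser. The codebook below is hypothetical (the original table is not reproduced in the text), chosen so that 001101 admits both stated decodings, 112212 and 3412:

```python
def decodings(bits, code):
    """All ways to parse the string `bits` as a concatenation of codewords."""
    if not bits:
        return [[]]
    out = []
    for sym, w in code.items():
        if bits.startswith(w):
            out += [[sym] + rest for rest in decodings(bits[len(w):], code)]
    return out

# Hypothetical codebook consistent with the two decodings stated in the text.
code = {1: "0", 2: "1", 3: "00", 4: "11"}
results = decodings("001101", code)
# Both 1,1,2,2,1,2 and 3,4,1,2 appear among the parses,
# so this code is not uniquely decodable.
```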
By Lemma 3.16, real-time codes, or more generally uniquely decodable codes, must satisfy the Kraft inequality. However, a variable length code compiled according to the Kraft inequality is not necessarily an optimal code, because from the perspective of random coding, an optimal code not only requires accurate decoding but must also be efficient, that is, the average code length must be the shortest. We state the strict mathematical definition of an optimal code as follows.
Definition 3.16
Let \(X=\{x_1, x_2, \ldots , x_m\}\) be an information space. A real-time code \(C=\{f(x_1), f(x_2),\ldots , f(x_m) \}\) is called an optimal code if its average code length
is the smallest, where \(p_i=p(x_i)\) is the occurrence probability of \(x_i\) and \(l_i\) is the code length of \(x_i\).
For a source state set X, when its statistical characteristics are determined, that is, after X becomes an information space, the probability distribution \(\{p(x)| x\in X \}\) is given. Therefore, finding the optimal compression coding scheme for an information space X amounts to finding the optimal solution \(\{l_1, l_2, \ldots , l_m\}\) of (3.67) under the Kraft inequality constraint. Usually, we use the Lagrange multiplier method to find the optimal solution. Let
That is, L is the D-ary information entropy \(H_D(X)\) of X. From this, we obtain the main result of this section.
Theorem 3.11
The average length L of any D-ary real-time code on an information space X satisfies
$$ L\ge H_D(X). $$
The equal sign holds if and only if \(p_i = D^{-l_i}\).
Next, we give another proof of Theorem 3.11. To this end, consider two random variables \(\xi \) and \(\eta \) on a source state set X, with probability distributions
By Theorem 3.11, coding according to probability, then the code length of D-ary optimal code is
$$ l_i=\log _D\frac{1}{p_i}, ~1\le i\le m. $$
But in general, \(\log _D\frac{1}{p_i}\) is not an integer; we use \(\left\lceil a \right\rceil \) to denote the smallest integer not less than the real number a. Take
$$\begin{aligned} l_i=\left\lceil \log _D\frac{1}{p_i}\right\rceil , ~1\le i\le m. \end{aligned}$$
Then the code length defined by formula (3.72) is \(\{l_1, l_2, \ldots , l_m\}\) and satisfies Kraft inequality. From Lemma 3.16, we can define the corresponding real-time code.
Definition 3.17
Let \(X=\{x_1, x_2, \ldots , x_m \}\) be an information space, \(p_i=p(x_i)\),
$$ l(f(x_i))=l_i=\left\lceil \log _D\frac{1}{p_i}\right\rceil , ~1\le i\le m. $$
Then the real-time code corresponding to \(\{l_1, l_2, \ldots , l_m\}\) is called Shannon code.
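A sketch of computing Shannon code lengths for a hypothetical distribution; since \(l_i=\lceil \log _D\frac{1}{p_i}\rceil \ge \log _D\frac{1}{p_i}\), the Kraft sum is automatically at most 1, so by Lemma 3.16 a real-time code with these lengths exists.

```python
import math

def shannon_lengths(probs, D=2):
    """Shannon code lengths l_i = ceil(log_D(1/p_i))."""
    return [math.ceil(math.log(1 / p, D)) for p in probs]

p = [0.4, 0.3, 0.2, 0.1]           # a hypothetical distribution
ls = shannon_lengths(p)
print(ls)                          # [2, 2, 3, 4]
print(sum(2 ** -l for l in ls))    # Kraft sum <= 1, so a real-time code exists
```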
Corollary 3.6
The code length \(l(f(x_i))\) of a Shannon code \(C=\{f(x_i)|1\le i\le m\}\) satisfies
In variable length codes, in order to make the average code length as close to the source entropy as possible, the code lengths should match the occurrence probabilities of the corresponding characters as closely as possible. The principle of probabilistic coding is that characters with high occurrence probability are assigned short codewords and characters with low occurrence probability are assigned long codewords, so as to make the average code length as close to the source entropy as possible. This idea existed long before Shannon's theory. For example, Morse code, invented in 1838, uses three symbols, dot, dash, and space, to encode the 26 letters of English. Expressed in binary, a dot is 10 (2 bits), a dash is 1110 (4 bits), and the space is 000 (3 bits). The commonly used English letter E is represented by a single dot, while the infrequently used letter Q is represented by two dashes, one dot, and one dash, which makes the average codeword length of English text shorter. However, Morse code does not completely match the occurrence probabilities, so it is not an optimal code, and it is rarely used now. The following table is the coding table of Morse code (Fig. 3.1).
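Under the binary convention just described (dot = 10, dash = 1110, letter space = 000; an illustrative simplification of real Morse timing), the contrast between E and Q can be made concrete:

```python
# Morse letters as dot/dash sequences (E and Q as cited in the text).
MORSE = {"E": ".", "Q": "--.-"}

def to_binary(word):
    """Binary encoding described above: dot = 10, dash = 1110,
    letters separated by 000."""
    out = []
    for ch in word:
        out.append("".join("10" if s == "." else "1110" for s in MORSE[ch]))
    return "000".join(out)

print(to_binary("E"))   # 10               (2 bits for a frequent letter)
print(to_binary("Q"))   # 11101110101110   (14 bits for a rare letter)
```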
It is worth noting that in its early days Morse code was used as a kind of cipher, widely applied to the transmission and storage of sensitive information (such as military intelligence). Early cipher machines were also built on the principle of Morse code, which quickly mechanized encryption and decryption. In this sense, Morse code played an important role in promoting the development of cryptography.
3.7.2 Huffman Codes
Shannon, Fano and Huffman all studied coding methods for variable length codes, among which Huffman codes have the highest coding efficiency. We focus on the coding methods of binary and ternary Huffman codes.
Let \(X=\{x_1, x_2,\ldots , x_m \}\) be a source letter set of m symbols. Arrange the m symbols in order of occurrence probability, take the two letters with the lowest probability and assign them the digits “0” and “1,” respectively, then add their probabilities to form a new letter and re-sort it together with the remaining source letters in order of probability. Again take the two letters with the lowest probability, assign them the digits “0” and “1,” add their probabilities to form a new letter, and re-sort; continue this process until the probabilities of the remaining letters add up to 1. At this point, every source letter corresponds to a string of “0”s and “1”s, and we obtain a variable length code, called the Huffman code. Take the information space \(X=\{1, 2, 3, 4, 5\}\) as an example, with the corresponding probability distributions
respectively. The binary Huffman coding diagram of X is (Fig. 3.2).
The ternary Huffman coding diagram of X is (Fig. 3.3).
In summary, the Huffman code has the following characteristics. Suppose the occurrence probability of the i-th source letter is \(p_i\) and the corresponding code length is \(l_i\); then
(1)
If \(p_i>p_j\), then \(l_i\le l_j\); that is, source letters with lower probability have longer codewords;
(2)
The longest two codewords have the same code length;
(3)
The two longest codewords differ only in their last letter; their preceding letters are the same;
(4)
Among real-time codes, the average code length of the Huffman code is the smallest. In this sense, the Huffman code is an optimal code.
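The merging procedure described above can be sketched with a binary heap. The probability distribution on \(X=\{1,\ldots ,5\}\) below is hypothetical, since the distribution used in the figures is not reproduced here.

```python
import heapq
from itertools import count

def huffman(probs):
    """Binary Huffman code built by the merging procedure described above:
    repeatedly combine the two least probable letters, prepending '0'/'1'."""
    tick = count()   # tie-breaker so the heap never compares dicts
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two smallest probabilities
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]

# Hypothetical distribution on X = {1,...,5}.
p = {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1}
code = huffman(p)
avg = sum(p[s] * len(w) for s, w in code.items())
print(code, avg)   # average length 2.2 bits
```

Tie-breaking when probabilities are equal can change individual codewords, but the average length of a Huffman code is the same for every valid tie-break.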
Huffman codes have been applied in practice, mainly in fax image compression standards. However, in actual data compression, the statistical characteristics of some sources change over time. In order to make the statistical characteristics underlying the coding adapt to changes in the actual statistical characteristics of the source, adaptive coding techniques have been developed: in each coding step, the coding of a new message is based on the statistical characteristics of previous messages. For example, R. G. Gallager first proposed the step-by-step updating technique for Huffman codes in 1978, and D. E. Knuth turned this technique into a practical algorithm in 1985. Adaptive Huffman coding requires complex data structures and continuous updating of the codeword set according to the statistical characteristics of the source; we will not go into details here.
3.7.3 Shannon–Fano Codes
The Shannon–Fano code is an arithmetic code. Let X be an information space. By Corollary 3.6 of the previous section, the code length of the Shannon code on X is
Here, we introduce a constructive coding method that uses the cumulative distribution function to allocate codewords, commonly known as the Shannon–Fano coding method. Without loss of generality, assume that \(p(x)>0\) for each letter x in X, and define the cumulative distribution function F(x) and the modified distribution function \(\bar{F}(x)\) as
where \(X=\{1, 2, \ldots , m \}\) is a given information space. Without loss of generality, let \(p(1)\le p(2)\le \cdots \le p(m)\).
As can be seen from the definition, if \(x\in X\), then \(p(x)=F(x)-F(x-1)\); in particular, if \(x, y\in X\) with \(x\ne y\), then we have
$$ \bar{F}(x)\ne \bar{F}(y). $$
So when we know \(\bar{F}(x)\), we can recover the corresponding x. The basic idea of the Shannon–Fano arithmetic code is to use \(\bar{F}(x)\) to encode x. Because \(\bar{F}(x)\) is a real number, we take the first l(x) bits of its binary expansion, denoted \(\{\bar{F}(x)\}_{l(x)}\); then
This contradicts the fact that \(\bar{F}(x)\) and \(\bar{F}(y)\) lie in the same interval. Therefore, f is a real-time code; that is, the Shannon–Fano code is a real-time code. Considering its average code length L,
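A sketch of the cumulative-distribution assignment in the Shannon–Fano–Elias style; the choice \(l(x)=\lceil \log _2\frac{1}{p(x)}\rceil +1\) is a common one assumed here for illustration, and it makes the truncated expansions prefix-free:

```python
import math

def shannon_fano(probs):
    """Codeword of x = first l(x) bits of the binary expansion of the
    modified cumulative distribution Fbar(x) = F(x-1) + p(x)/2,
    with l(x) = ceil(log2(1/p(x))) + 1 (assumed choice)."""
    F, code = 0.0, {}
    for sym, p in probs.items():
        Fbar = F + p / 2
        l = math.ceil(math.log2(1 / p)) + 1
        # first l bits of the binary expansion of Fbar
        bits = "".join(str(int(Fbar * 2 ** (k + 1)) % 2) for k in range(l))
        code[sym] = bits
        F += p
    return code

probs = {1: 0.25, 2: 0.25, 3: 0.5}   # a hypothetical distribution
print(shannon_fano(probs))           # {1: '001', 2: '011', 3: '11'}
```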
Let \(n\ge 1\) and let \(X^n\) be the power space of the information space; \(x=x_1\cdots x_n\in X^n\) is called a message sequence of length n. In order to improve coding efficiency, it is often necessary to compress the power space \(X^n\); this is called arithmetic coding. The Shannon–Fano code can also be used for arithmetic coding. The basic method is to find a fast algorithm for computing the joint probability distribution \(p(x_1x_2\cdots x_n)\) and the cumulative distribution function F(x), and then use the Shannon–Fano method to encode \(x=x_1\cdots x_n\). We will not go into the details here.
3.8 Channel Coding Theorem
Let X be the input alphabet and Y the output alphabet, and let \(\xi \) and \(\eta \) be two random variables taking values in X and Y, respectively. The probability functions p(x) and p(y) of X and Y and the conditional probability function p(y|x) are
If X and Y are finite sets, the conditional probability matrix \(T=(p(y|x))_{|X|\times |Y|}\) is called the transition probability matrix from X to Y, i.e.,
where \(|X|=M\), \(|Y|=N\). By (3.78), each row of the transition probability matrix T sums to 1.
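A small illustration of a transition probability matrix (here a hypothetical binary symmetric channel with crossover probability 0.1) and the row-sum property:

```python
# A hypothetical binary symmetric channel written as a transition
# probability matrix T = (p(y|x)), rows indexed by x, columns by y.
T = [[0.9, 0.1],
     [0.1, 0.9]]

# Each row of T must sum to 1: given the input x, some output y always occurs.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in T)

def output_dist(px, T):
    """p(y) = sum_x p(x) p(y|x)."""
    return [sum(px[i] * T[i][j] for i in range(len(px)))
            for j in range(len(T[0]))]

print(output_dist([0.5, 0.5], T))   # approximately [0.5, 0.5]
```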
Definition 3.18
(i)
A discrete channel consists of a finite information space X as the input alphabet, a finite information space Y as the output alphabet, and a transition probability matrix T from X to Y; we denote this discrete channel by \(\{X, T, Y\}\). If \(X=Y=\mathbb {F}_q\) is the finite field with q elements, then \(\{X, T, Y\}\) is a discrete q-ary channel. In particular, if \(q=2\), then \(\{X, T, Y\}\) is called a discrete binary channel.
(ii)
If \(\{X, T, Y\}\) is a discrete q-ary channel and \(T=I_q\) is the identity matrix of order q, then \(\{X, I_q, Y\}\) is called a noiseless channel.
(iii)
If \(\{X, T, Y\}\) is a discrete q-ary channel and \(T=T'\) is a symmetric matrix of order q, then \(\{X, T, Y\}\) is called a symmetric channel.
In discrete channel \(\{X, T, Y\}\), codeword spaces \(X^n\) and \(Y^n\) with length n are defined as
then X and Y become memoryless sources, and \(X^n\) and \(Y^n\) are their power spaces, respectively.
Definition 3.19
Discrete channel \(\{X, T, Y\}\) is called a memoryless channel if for any positive integer \(n\ge 1\), \(x=x_1\cdots x_n\in X^n\), \(y=y_1\cdots y_n\in Y^n\), we have
The above formula shows that in a memoryless channel, each output symbol \(y_i\) depends only on the corresponding input symbol \(x_i\), not on the other symbols.
Definition 3.19 describes the statistical characteristics of a memoryless channel. The following lemma gives a mathematical characterization of a memoryless channel.
Lemma 3.19
A discrete channel \(\{X, T, Y\}\) is a memoryless channel if and only if the product information space XY is a memoryless source with power space \((XY)^n=X^nY^n\).
Proof
If XY is a memoryless source (see Definition 3.9), then for any \(n\ge 1\), \(x=x_1\cdots x_n\in X^n\), \(y=y_1\cdots y_n\in Y^n\), and \(xy\in X^{n}Y^{n}\), there is
\(p(x_iy_i)=p(x_1y_1)\) follows from the definition of a memoryless source, so \(\{X, T, Y\}\) is a memoryless channel. Conversely, if \(\{X, T, Y\}\) is a memoryless channel, then by (3.81),
$$ p(xy)=\prod _{i=1}^np(x_iy_i) $$
and \(p(x_iy_i)=p(x_1y_1)\), then for any \(a=a_1a_2\cdots a_n\in (XY)^n\), where \(a_i=x_iy_i\), we have
and \(p(a_i)=p(a_1)\); therefore, XY is a memoryless source, that is, a sequence of independent and identically distributed random vectors \(\xi =(\xi _1, \xi _2, \ldots , \xi _n, \ldots )\) takes values in XY, and \((XY)^n=X^nY^n\) is its power space. The Lemma holds.
The following lemma further characterizes the statistical characteristics of a memoryless channel.
Lemma 3.20
If \(\{X, T, Y\}\) is a discrete memoryless channel, the conditional entropy \(H(Y^n|X^n)\) and mutual information \(I(X^n, Y^n)\) of the information spaces \(X^n\) and \(Y^n\) satisfy, for all \(n\ge 1\),
Let us define the channel capacity of a discrete channel; this concept plays an important role in channel coding. First, we note that the joint probability distribution p(xy) on the product space XY is uniquely determined by the probability distribution p(x) on X and the transition probability matrix T, that is, \(p(xy)=p(x)p(y|x)\); therefore, the mutual information I(X, Y) of X and Y is also uniquely determined by p(x) and T. In fact,
where the maximum in formula (3.84) is taken over all probability distributions p(x) on X.
Lemma 3.21
The channel capacity B of a discrete memoryless channel \(\{X, T, Y\}\) is estimated as follows:
$$ 0\le B\le \min \{\log |X|,\ \log |Y| \}. $$
Proof
The mutual information of two information spaces satisfies \(I(X, Y)\ge 0\) (see Lemma 3.5), so \(B\ge 0\). By Lemma 3.4,
$$ I(X, Y)=H(X)-H(X|Y)\le H(X)\le \log |X| $$
and
$$ I(X, Y)=H(Y)-H(Y|X)\le H(Y)\le \log |Y|, $$
so we have
$$ 0\le B\le \min \{\log |X|,\ \log |Y| \}. $$
The calculation of channel capacity is a constrained convex optimization problem. We will not discuss it in detail here, but we compute the channel capacity of two simple channels.
Example 3.8
The channel capacity of a noiseless channel \(\{X, T, Y\}\) is \(B=\log |X|\).
Proof
Let \(\{X, T, Y\}\) be a noiseless channel; then \(|X|=|Y|\), and the transition probability matrix T is the identity matrix, so
Let a be the random variable on the input space \(\mathbb {F}_2\) and b the random variable on the output space \(\mathbb {F}_2\), both obeying a two-point distribution; then the transition matrix T of the symmetric binary channel can be represented by the following clearer schematic diagram:
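For this binary symmetric channel, maximizing \(I(X, Y)=H(Y)-H(Y|X)\) numerically over the input distribution recovers the closed form \(B=1-H(p)\) cited later for Example 3.9; a sketch with the illustrative crossover probability 0.1:

```python
import math

def H2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_info(px, p_flip):
    """I(X,Y) for a binary symmetric channel with crossover probability p_flip
    and input distribution (px, 1-px): I = H(Y) - H(Y|X) = H2(p(Y=0)) - H2(p_flip)."""
    py = px * (1 - p_flip) + (1 - px) * p_flip   # P(Y = 0)
    return H2(py) - H2(p_flip)

p_flip = 0.1
# Grid search over input distributions; the maximum is at px = 0.5.
B = max(mutual_info(px / 1000, p_flip) for px in range(1001))
print(B, 1 - H2(p_flip))   # both approximately 0.531
```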
In order to state and prove the channel coding theorem, we introduce the concept of the joint typical sequence. By Definition 3.13 of Sect. 5 of this chapter, if X is a memoryless source, then for any small \(\varepsilon >0\) and positive integer \(n\ge 1\), in the power space \(X^n\) we define the typical sequence \(W_{\varepsilon }^{(n)}\) as
If \(\{X, T, Y\}\) is a memoryless channel, then by Lemma 3.19, XY is a memoryless source, and in the power space \((XY)^n=X^nY^n\) we define the joint typical sequence \(W_{\varepsilon }^{(n)}\) as (Fig. 3.4)
Lemma 3.22
(Asymptotic equipartition) In a memoryless channel \(\{X, T, Y\}\), the joint typical sequence \(W_{\varepsilon }^{(n)}\) satisfies the following asymptotic equipartition properties:
The following lemma has important applications in the proof of the channel coding theorem. In fact, the conclusion of the lemma is valid in a general probability space.
Lemma 3.23
In a memoryless channel \(\{X, T, Y\}\), if codeword \(y\in Y^n\) is uniquely determined by \(x\in X^n\), and \(x'\in X^n\) is independent of x, then y and \(x'\) are also independent.
Proof
If y is uniquely determined by x, then \(p(x)=p(y)=p(xy)\), or \(p(y|x)=1\). Therefore, the probability of joint event \(yxx'\) is
$$ p(yxx')=p(xx')=p(x)p(x')=p(y)p(x'). $$
on the other hand,
$$ p(yxx')=p(yx'). $$
Thus
$$ p(yx')=p(y)p(x'). $$
The Lemma holds.
In order to define the error probability of channel transmission, we first introduce the workflow of channel coding. After source compression coding, a set of source input messages is generated,
The injection \(f:W\rightarrow X^n\) is called the coding function; f encodes each input message \(w\in W\) as \(f(w)\in X^n\). The codeword \(x=f(w)\in X^n\) is transmitted through the channel \(\{X, T, Y\}\) and received as a codeword \(y\in Y^n\); we write \(x{\mathop {\longrightarrow }\limits ^{T}}y\), or \(y=T(x)\). The mapping \(g: Y^n\rightarrow W\) is called the decoding function. Therefore, a channel code is a pair of mappings (f, g). Obviously,
$$ C=f(W)=\{f(w)|w\in W\}\subset X^n $$
is a code of length n in the codeword space \(X^n\), with number of codewords \(|C|=|W|=M\). C is called the code of f. The code rate \(R_C\) is
$$ R_C=\frac{1}{n}\log |C|=\frac{1}{n}\log M. $$
For each input message \(w\in W\), if \(g(T(f(w)))\ne w\), we say that a channel transmission error occurs; the transmission error probability \(\lambda _w\) is
$$\begin{aligned} \lambda _w=P\{g(T(f(w)))\ne w \},\ \ w\in W. \end{aligned}$$
(3.86)
The transmission error probability of codeword \(x=f(w)\in C\) is denoted \(P_e(x)\); obviously, \(P_e(x)=\lambda _w\), that is, \(P_e(x)\) is the conditional probability
Conversely, if the transmission error probability of the code \(C_n=(n, 2^{[nR]})\) satisfies Eq. (3.89), there is an absolute constant \(N_0\) such that the code rate \(R_{C_n}\) of \(C_n\) satisfies
$$ R_{C_n}\le B, \ \text{ when } n\ge N_0. $$
If \(C_n=(n, 2^{[nR]})\), by Lemma 2.27 of Chap. 2,
$$\begin{aligned} R-\frac{1}{n}<R_{C_n}\le R. \end{aligned}$$
(3.90)
So (i) of Theorem 3.12 indicates that for code rates sufficiently close to the channel capacity B, “good codes” with sufficiently small transmission error probability exist. (ii) indicates that the code rate of such a good code does not exceed the channel capacity. Shannon’s proof of Theorem 3.12 uses the random coding technique; this idea of using a random method to prove a deterministic result is widely used in information theory, and it now has more and more applications in other fields.
Proof
(Proof of Theorem 3.12) First, a probability function \(p(x_i)\) is arbitrarily chosen on the input alphabet X, and the joint probability on the power space \(X^n\) is defined as
In this way, we obtain a memoryless source X and power space \(X^n\), which constitute the codeword space of the channel coding. Then \(M=2^{[nR]}\) codewords are randomly selected in \(X^n\) to obtain a random code \(C_n=(n, 2^{[nR]})\). To describe the randomness of codeword selection, we use the source message set \(W=\{1, 2, \ldots , M \}\), where \(M=2^{[nR]}\). For every message w, \(1\le w\le M\), the randomly generated codeword is denoted \(X^{(n)}(w)\). So we obtain a random code
where \(X^{(n)}(w)=x_1(w)x_2(w)\cdots x_n(w)\in X^n\).
We take \(A_n=\{C_n\}\) as the set of all random codes \(C_n\), called the random code set. The average transmission error probability over the random code set \(A_n\) is defined as
If we prove that for any \(\varepsilon >0\), when n is sufficiently large, \(\bar{P}_e(A_n)<\varepsilon \), then there is at least one code \(C_n\in A_n\) such that \(P_e(C_n)<\varepsilon \), which proves (i). We proceed in two steps.
(1) Principles of constructing random codes and of encoding and decoding
We select each message in the source message set \(W=\{1, 2, \ldots , M \}\) with equal probability, that is \(w\in W\), the selection probability of w is
$$ p(w)=\frac{1}{M}=2^{-[nR]},\ w=1, 2, \ldots , M. $$
In this way, W becomes an equal probability information space. For each input message w, it is randomly coded as \(X^{(n)}(w)\in X^n\), where
Codeword \(X^{(n)}(w)\) is transmitted through memoryless channel \(\{X, T, Y \}\) with conditional probability
$$ p(y|X^{(n)}(w))=\prod _{i=1}^np(y_i|x_i(w)) $$
and the codeword \(y=y_1y_2\cdots y_n\in Y^n\) is received. The decoding principle for y is: if \(X^{(n)}(w)\) is the unique input codeword such that \(X^{(n)}(w)y\) is jointly typical, that is, \(X^{(n)}(w)y\in W_{\varepsilon }^{(n)}\), then decode \(g(y)=w\); if there is no such codeword \(X^{(n)}(w)\), or if two or more codewords \(X^{(n)}(w)\) are jointly typical with y, then y cannot be decoded correctly.
(2) Estimating the average error probability of random code set \(A_n\)
where \(\lambda _w\) is given by Eq. (3.86). Because w is input with equal probability, in other words, w is encoded with equal probability, the transmission error probability \(\lambda _w\) does not depend on w, that is,
$$\lambda _1=\lambda _2=\cdots =\lambda _M.$$
By (3.93), we have \(\bar{P}_e(A_n)=\lambda _1\). To estimate \(\lambda _1\), we define
Obviously, codeword \(X^{(n)}(1)\) and the other codewords \(X^{(n)}(i)\), \(i=2, \ldots , M\), are mutually independent (see (3.91)). By Lemma 3.23, \(y=T(X^{(n)}(1))\) and \(X^{(n)}(i)~(i\ne 1)\) are also mutually independent. Then by property (iii) of Lemma 3.22,
If \(R<I(X, Y)\), then \(I(X, Y)-R-3\varepsilon >0\) (when \(\varepsilon \) is sufficiently small), so when n is large enough, we have \(\bar{P}_e(A_n)<2\varepsilon \). Since the channel capacity is \(B=\max \{I(X, Y)\}\), we can choose p(x) so that \(B=I(X, Y)\). So when \(R<B\), we have \(\bar{P}_e(A_n)<2\varepsilon \), which completes the proof of (i).
To prove (ii), let us first look at a special case. If the error probability of \(C=(n, 2^{[nR]})\) is \(P_e(C)=0\), then the code rate of C satisfies \(R_C<B+\frac{1}{n}\), so when n is sufficiently large, \(R_C\le B\).
In fact, because \(P_e(C)=0\), the decoding function \(g: Y^n\rightarrow W\) determines W uniquely, so \(H(W|Y^n)=0\). Because W is an equal probability information space,
$$H(W)=\log |W|=[nR].$$
Using the decomposition of mutual information, we have
$$R_{C}\le B, \quad \text{when } n \text{ is sufficiently large}.$$
The above formula shows that when the transmission error probability is 0, as long as n is sufficiently large, \(R_{C}\le B\). Next, suppose transmission errors are allowed, that is, the error probability of \(C_{n}\) satisfies \(P_{e}(C_{n})<\varepsilon \), where \(C_{n}=(n,2^{[nR]})\). Then when n is sufficiently large, we still have \(R_{C_{n}}\le B\).
To prove this conclusion, we note that the error probability of the random code \(C_{n}\) is
Note that E is uniquely determined by \(Y^{n}\) and W, so \(H(E|WY^{n})=0\); at the same time, E is a binary information space, so \(H(E)\le \log 2=1\), and
$$H(E|Y^{n})\le H(E)\le 1.$$
On the other hand, the random variable \(\xi _{w}\) is only related to \(w\in W\), so
When n is sufficiently large, we obtain \(R_{C_{n}}\le B\), which completes the proof of the theorem.
It can be seen from Example 3.9 that the channel capacity of a binary symmetric channel is \(B=1-H(p)\). Therefore, Theorem 3.12 extends Theorem 2.10 of the previous chapter to a more general memoryless channel; at the same time, it also proves that the code rate of a good code does not exceed the channel capacity.
Exercise 3
1.
The joint probability functions of the two information spaces X and Y are as follows:
$$\begin{array}{c|cc} X\backslash Y &{} 0 &{} 1 \\ \hline 0 &{} \frac{1}{4} &{} \frac{1}{4} \\ 1 &{} \frac{1}{12} &{} \frac{5}{12} \end{array}$$
Compute H(X), H(Y), H(XY), H(X|Y), H(Y|X), and I(X, Y).
2.
Let \(X_{1},X_{2},X_{3}\) be three information spaces on \(\mathbb {F}_{2}\). Given that \(I(X_{1}, X_{2})=0\) and \(I(X_{1}, X_{2}, X_{3})=1\), prove:
$$H(X_{3})=1, \text {and}~H(X_{1}X_{2}X_{3})=2.$$
3.
Give an example to illustrate \(I(X,Y|Z)\ge I(X,Y)\).
4.
Can \(I(X,Y|Z)=0\) be derived from \(I(X,Y)=0\)? Conversely, does \(I(X,Y|Z)=0\) imply \(I(X,Y)=0\)? Prove or give counterexamples.
5.
Let X, Y, Z be three information spaces, prove:
(i)
\(H(XY|Z)\ge H(X|Z);\)
(ii)
\(I(XY,Z)\ge I(X,Z);\)
(iii)
\(H(XYZ)-H(XY)\le H(XZ)-H(X);\)
(iv)
\(I(X,Z|Y)=I(X,Y|Z)-I(X,Y)+I(X,Z).\)
Also explain under what conditions the equality signs hold.
6.
Does \(I(X,Y)=0\) imply \(I(X,Z)=I(X,Z|Y)\)?
7.
Let the information space be \(X=\{0,1,2,\ldots \}\) and the value probability p(n) of random variable \(\xi \) be
$$p(n)=P\{\xi =n\},n=0,1,\ldots .$$
Given the mathematical expectation \(E\xi =A>0\) of \(\xi \), find the probability distribution \(\{p(n)|n=0, 1, \ldots \}\) that maximizes H(X), and the corresponding maximum information entropy.
8.
Let the information space be \(X=\{0,1,2,\ldots \}\); give an example of a random variable \(\xi \) taking values in X such that \(H(X)=\infty \).
9.
Let \(X_{1}=(X,\xi ),X_{2}=(X,\eta )\) be two information spaces with \(\xi \) a function of \(\eta \); prove \(H(X_{1})\le H(X_{2})\), and explain this result.
10.
Let \(X_{1}=(X,\xi ),X_{2}=(X,\eta )\) be two information spaces and \(\eta =f(\xi )\), prove
(i)
\(H(X_{1})\ge H(X_{2})\), give the conditions under which the equal sign holds.
(ii)
\(H(X_{1}|X_{2})\ge H(X_{2}|X_{1})\), give the conditions under which the equal sign holds.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.