Open Access | 2022 | Chapter

1. Preparatory Knowledge

Author: Zhiyong Zheng

Published in: Modern Cryptography Volume 1

Publisher: Springer Singapore


Abstract

Modern cryptography and information theory form a rapidly developing branch of mathematics. Almost all mathematical knowledge, such as algebra, geometry, analysis, probability and statistics, has very important applications in information theory. In particular, some modern mathematical theories, such as algebraic geometry, elliptic curves and ergodic theory, play increasingly important roles in coding and cryptography. It can be said that information theory is the most dynamic branch of modern mathematics, with wide applications and strong interdisciplinary connections. This chapter requires the reader to have a preliminary knowledge of analysis, algebra, number theory, and probability and statistics.

1.1 Injective

Let \(\sigma \) be a mapping between two nonempty sets A and B, denoted as \(A{\mathop {\longrightarrow }\limits ^{\sigma }} B\). Generally, the mappings between sets can be divided into three categories: injective, surjective and bijective.
Definition 1.1
Let \(\sigma \) be a mapping between two nonempty sets \(A\rightarrow B\); we define
(i)  For \(a,b\in A\), if \(a\ne b \Rightarrow \sigma (a) \ne \sigma (b)\), we call \(\sigma \) an injection of \(A\rightarrow B\), injective for short.
(ii)  If for any \(b\in B\) there is an \(a\in A\) such that \(\sigma (a)=b\), we call \(\sigma \) a surjection of \(A\rightarrow B\).
(iii)  If \(A{\mathop {\longrightarrow }\limits ^{\sigma }} B\) is both injective and surjective, we call \(\sigma \) a bijection of \(A\rightarrow B\).
(iv)  Let \(1_{A}\) be the identity mapping of \(A\rightarrow A\), which is defined as
$$\begin{aligned} 1_{A}(a)=a, \forall ~ a \in A. \end{aligned}$$
(v) Suppose \(A{\mathop {\longrightarrow }\limits ^{\sigma }} B{\mathop {\longrightarrow }\limits ^{\tau }} C\) are two mappings; we define the product mapping of \(\tau \) and \(\sigma \), \(\tau \sigma : A\rightarrow C\), by
$$\begin{aligned} \tau \sigma (a)=\tau (\sigma (a)), ~\forall ~ a \in A. \end{aligned}$$
Obviously, the product of two mappings is not commutative in general, but it satisfies the following associative law.
Property 1
Let \(A{\mathop {\longrightarrow }\limits ^{\sigma }} B {\mathop {\longrightarrow }\limits ^{\tau }} C {\mathop {\longrightarrow }\limits ^{\delta }}D\) be three mappings, then we have
$$\begin{aligned} (\delta \cdot \tau )\cdot \sigma =\delta \cdot (\tau \cdot \sigma ). \end{aligned}$$
(1.1)
Proof
It can be verified directly by definition.
If \(A{\mathop {\longrightarrow }\limits ^{\sigma }} B\) is a given mapping, then obviously
$$\begin{aligned} \sigma 1_{A}=\sigma , ~1_{B}\sigma =\sigma . \end{aligned}$$
(1.2)
The above formula shows that identity mapping plays the role of multiplication identity in the product of mapping.
Definition 1.2
(i) Suppose \(A{\mathop {\longrightarrow }\limits ^{\sigma }} B {\mathop {\longrightarrow }\limits ^{\tau }} A\) are two mappings; if \(\tau \sigma =1_{A}\), we call \(\tau \) a left inverse mapping of \(\sigma \), and \(\sigma \) a right inverse mapping of \(\tau \).
(ii) Let \(A{\mathop {\longrightarrow }\limits ^{\sigma }} B {\mathop {\longrightarrow }\limits ^{\tau }} A\). If \(\tau \sigma =1_{A}\) and \(\sigma \tau =1_{B}\), we call \(\tau \) an inverse mapping of \(\sigma \), denoted as \(\tau =\sigma ^{-1}\).
The essential properties of injective, surjective and bijective between sets are described by the following lemma.
Lemma 1.1
(i) If \(A{\mathop {\longrightarrow }\limits ^{\sigma }} B\) has an inverse mapping \(B {\mathop {\longrightarrow }\limits ^{\tau }} A\), that is \(\sigma \tau =1_{B}\) and \(\tau \sigma =1_{A}\), then \(\tau \) is unique (denoted as \(\tau =\sigma ^{-1}\)).
(ii) \(A{\mathop {\longrightarrow }\limits ^{\sigma }}B\) is injective if and only if \(\sigma \) has a left inverse mapping \(B {\mathop {\longrightarrow }\limits ^{\tau }} A\), that is \(\tau \sigma =1_{A} \).
(iii) \(A{\mathop {\longrightarrow }\limits ^{\sigma }}B\) is surjective if and only if \(\sigma \) has a right inverse mapping \(B {\mathop {\longrightarrow }\limits ^{\tau }} A\), that is \(\sigma \tau =1_{B}\).
(iv) \(A{\mathop {\longrightarrow }\limits ^{\sigma }}B\) is bijective if and only if \(\sigma \) has an inverse mapping \(\tau \), and \(\tau \) is unique.
Proof
First of all, prove (i). Let \(B {\mathop {\longrightarrow }\limits ^{\tau _{1}}} A\) and \(B {\mathop {\longrightarrow }\limits ^{\tau _{2}}} A\) be two inverse mappings of \(\sigma \), then we have
$$\begin{aligned} \tau _{1}\sigma =1_{A}, \tau _{2}\sigma =1_{A}, \text {and}~\sigma \tau _{1}=1_{B}, \sigma \tau _{2}=1_{B}, \end{aligned}$$
From (1.2), we have
$$\begin{aligned} \tau _{1}=\tau _{1}1_{B}=\tau _{1}(\sigma \tau _{2})=(\tau _{1}\sigma )\tau _{2}=1_{A}\tau _{2}=\tau _{2}, \end{aligned}$$
so if \(\sigma \) has an inverse mapping, then the inverse mapping is unique.
To prove (ii), we note that if \(\sigma \) has a left inverse mapping \(\tau \), that is \(\tau \sigma =1_{A}\), then \(\sigma \) must be injective: if \(a, b\in A\) with \(a\ne b\) but \(\sigma (a)= \sigma (b)\), then \(\tau (\sigma (a))= \tau (\sigma (b))\), that is \(a=b\), contradicting \(a\ne b\). Conversely, if \(A{\mathop {\longrightarrow }\limits ^{\sigma }} B \) is injective, then for each element \(\sigma (a) \in \sigma (A) \subset B\) we let \(\tau (\sigma (a))=a\) (this is well defined by injectivity), and for the elements in the difference set \(B\backslash \sigma (A)\) we assign images arbitrarily; then \(B {\mathop {\longrightarrow }\limits ^{\tau }} A\) satisfies \(\tau \sigma =1_{A}\). Similarly, we can prove (iii) and (iv). We thus complete the proof.
In many books on information theory, injective and bijective mappings are often confused; they are two different concepts in mathematics, and the distinction deserves attention.
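For finite sets, the three categories in Definition 1.1 can be checked by direct enumeration. The following minimal sketch (an illustration in Python, not part of the original text; the example mapping and target set are hypothetical) classifies a mapping given as a dictionary:

```python
# Minimal sketch (illustrative): classify a mapping sigma: A -> B between
# finite sets, where sigma is given as a Python dict and B as a set.

def is_injective(sigma):
    # distinct elements of A must have distinct images
    return len(set(sigma.values())) == len(sigma)

def is_surjective(sigma, B):
    # every element of B must be the image of some element of A
    return set(sigma.values()) == set(B)

def is_bijective(sigma, B):
    return is_injective(sigma) and is_surjective(sigma, B)

sigma = {1: 'a', 2: 'b', 3: 'c'}
print(is_injective(sigma))                         # True
print(is_surjective(sigma, {'a', 'b', 'c', 'd'}))  # False: 'd' has no preimage
print(is_bijective(sigma, {'a', 'b', 'c'}))        # True
```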

1.2 Computational Complexity

In a binary computing environment, the complexity of an algorithm is measured by the number of bit operations. A bit, short for “binary digit,” is the basic unit of information: one bit represents one binary digit, two bits represent two binary digits. So what, then, is a “bit operation”?
To understand “bit operation” accurately, we start with the b-ary expression of a real number. Let \(b>1\) be a positive integer; then any nonnegative real number x can be uniquely expanded into the following geometric series.
$$\begin{aligned} \begin{aligned} x&=\sum _{-\infty <i\le k-1}d_{i}b^{i}\\&=d_{k-1}b^{k-1}+d_{k-2}b^{k-2}+ \cdots +d_{1}b+d_{0}+\sum _{i=1}^{+\infty }d_{-i}b^{-i}, \end{aligned} \end{aligned}$$
(1.3)
where \(\forall ~ d_{i}\) satisfies \(0 \le d_{i}<b\). So we can express x as
$$\begin{aligned} x=(d_{k-1}d_{k-2}\cdots d_{0}d_{-1}d_{-2}\cdots )_{b}, \end{aligned}$$
(1.4)
where \((d_{k-1}d_{k-2}\cdots d_{0})_{b}\) is called a b-ary integer, \((0.d_{-1}d_{-2}\cdots )_{b}\) is called a b-ary decimal, and
$$\begin{aligned} x=(d_{k-1}d_{k-2}\cdots d_{0})_{b}+(0.d_{-1}d_{-2}\cdots )_{b}. \end{aligned}$$
(1.5)
If \(b=2\), then \(x=(d_{k-1}d_{k-2}\cdots d_{0}d_{-1}\cdots )_{2}\) is called the binary representation of x. If \(b=10\), then
$$\begin{aligned} \begin{aligned} x&=(d_{k-1}d_{k-2}\cdots d_{1}d_{0}d_{-1}d_{-2}\cdots )_{10}\\&=d_{k-1}d_{k-2}\cdots d_{1}d_{0}.d_{-1}d_{-2}\cdots . \end{aligned} \end{aligned}$$
(1.6)
This is our customary decimal expression. It is worth noting that, between any two systems, integers correspond one to one with integers, and decimals correspond one to one with decimals. For example, an integer in the decimal system corresponds to an integer in the binary system, and likewise for decimals. In other words, the real numbers of the interval (0, 1) on the real axis correspond one to one to the binary decimals in (0, 1). It should be noted that binary decimals are often ignored; in fact, they are the main technical support of various arithmetic codes, such as the Shannon code.
Now let us consider the b-ary expression of a positive integer n. Let
$$n=(d_{k-1}d_{k-2}\cdots d_{1}d_{0})_{b},0 \le d_{i}<b,d_{k-1}\ne 0,$$
k is the number of b-ary digits of n.
Lemma 1.2
The number k of b-ary digits of positive integer n can be calculated according to the following formula:
$$\begin{aligned} k=[\log _{b}{n}]+1, \end{aligned}$$
(1.7)
[x] denotes the largest integer not greater than the real number x.
Proof
Since \(d_{k-1} \ne 0\), we have \(b^{k-1}\le n<b^{k}\), that is
$$\begin{aligned} k-1 \le \log _{b}{n}<k, \end{aligned}$$
The left inequality gives \(k-1 \le [\log _{b}{n}]\), the right one gives \([\log _{b}{n}]+1 \le k\), and together they yield
$$\begin{aligned} k=[\log _{b}{n}]+1. \end{aligned}$$
We complete the proof of Lemma 1.2.
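As a quick numerical illustration of (1.7) (a sketch in Python, not part of the text), the digit count obtained by repeated division agrees with the formula \(k=[\log _{b}{n}]+1\):

```python
import math

def digits_by_division(n, b):
    # count the base-b digits of n by repeated division
    k = 0
    while n > 0:
        n //= b
        k += 1
    return k

def digits_by_formula(n, b):
    # k = [log_b n] + 1, as in Lemma 1.2 (beware of floating point
    # rounding when n is an exact power of b)
    return math.floor(math.log(n, b)) + 1

for n, b in [(1000, 2), (12345, 10), (255, 16)]:
    print(n, b, digits_by_division(n, b), digits_by_formula(n, b))
```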
Now let us consider the addition operation in the b-ary system. For simplicity, we consider the addition of two positive integers in binary. Let \(n=(1111000)_{2}\), \(m=(11110)_{2}\), then \(n+m=1111000+0011110=10010110\), that is \(n+m=(10010110)_{2}\). The addition of the digits in the same bit position actually involves the following five cases (or operations).
1. Observe the digits in the same bit position and note whether there is a carry from the position to the right (every two carries one).
2. If both digits in the position are 0 and there is no carry from the right, the sum in this position is 0.
3. If both digits in the position are 0 but there is a carry, or if one digit is 0 and the other is 1 and there is no carry, the sum in this position is 1.
4. If one digit is 0 and the other is 1 and there is a carry, or if both digits are 1 and there is no carry, the sum in this position is 0 and a carry is produced.
5. If both digits are 1 and there is a carry, the sum in this position is 1 and a carry is produced.
Definition 1.3
A bit operation is an addition operation on one bit position in binary addition. Suppose A is an algorithm in the binary system; we use \(\text {Time}(A)\) to represent the number of bit operations in algorithm A, that is, \(\text {Time}(A)=\) the total number of bit operations needed to complete algorithm A.
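The five cases listed above amount to a full adder applied position by position. The following short sketch (illustrative Python, not from the text) carries out the procedure and counts one bit operation per position:

```python
# Illustrative sketch: add two binary integers position by position,
# counting one "bit operation" per position processed.

def binary_add(n, m):
    width = max(n.bit_length(), m.bit_length())   # max{k, l} positions
    carry, result, bit_ops = 0, 0, 0
    for i in range(width):
        a = (n >> i) & 1          # digit of n in position i
        b = (m >> i) & 1          # digit of m in position i
        s = a + b + carry         # cases 2-5 of the list above
        result |= (s & 1) << i
        carry = s >> 1            # "every two carries one"
        bit_ops += 1
    result |= carry << width      # possible final carry
    return result, bit_ops

total, ops = binary_add(0b1111000, 0b11110)
print(bin(total), ops)            # 0b10010110, 7 bit operations
```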
It is easy to deduce from the definition the number of bit operations for binary addition and subtraction. Let n, m be two positive integers whose binary expressions have k and l bits, respectively; then
$$\begin{aligned} \text {Time}(n \pm m)=\max \{k,l\}. \end{aligned}$$
(1.8)
In the same way, the number of bit operations required for the multiplication of n and m in the binary system satisfies
$$\begin{aligned} \text {Time}(nm)\le (k+l)\cdot \min \{k,l\}\le 2kl. \end{aligned}$$
(1.9)
It is very convenient to estimate the number of bit operations by using the symbol \(``O''\) commonly used in number theory. If f(x) and g(x) are two real valued functions, \(g(x)>0\), suppose there are two absolute constants B and C such that when \(|x|>B\), we have
$$\begin{aligned} |f(x)|\le Cg(x), \ \ \ \text {written as}~f(x)=O(g(x)). \end{aligned}$$
This notation indicates that when \(x\rightarrow \infty \), the order of growth of f(x) does not exceed that of g(x). For example, let \(f(x)=a_{d}x^{d}+a_{d-1}x^{d-1}+\cdots +a_{1}x+a_{0}(a_{d}>0)\), then
$$\begin{aligned} f(x)=O(|x|^{d}), \ \ \ \ \text {or}~ f(n)=O(n^{d}), ~n\ge 1. \end{aligned}$$
For any \(\varepsilon >0\), there is
$$\begin{aligned} \log n=O(n^{\varepsilon }), ~n\ge 1. \end{aligned}$$
From Lemma 1.2, (1.8) and (1.9), we have
Lemma 1.3
Let n, m be two positive integers, and let k and l be the numbers of bits of their binary expressions, respectively; if \(m\le n\), then \(l\le k\), and
$$ \text {Time}(n\pm m)=O(k)=O(\log {n});$$
$$ \text {Time}(nm)=O(kl)=O(\log {n}\log {m});$$
$$\text {and}\ \ \ \text {Time} (\frac{n}{m})=O(kl)=O(\log {n}\log {m}).$$
In the above lemma, division is treated similarly to multiplication. Next, we discuss the number of bit operations required to convert a binary representation into a decimal representation, and the number of bit operations required to compute n! in binary.
Lemma 1.4
Let k be the number of digits in binary of n, then
$$\begin{aligned} \text {Time}(n \text {~convert to decimal expression})=O(k^{2})=O(\log ^{2}{n}) \end{aligned}$$
and
$$\begin{aligned} \text {Time}(n!)=O(n^{2}k^{2})=O(n^{2}\log ^{2}{n}). \end{aligned}$$
Proof
To convert \(n=(d_{k-1}d_{k-2}\cdots d_{1}d_{0})_{2}\) to its decimal expression, divide n by \(10=(1010)_{2}\); the remainder is one of the binary numbers 0, 1, 10, 11, 100, 101, 110, 111, 1000, 1001, and these ten numbers correspond to the digits from 0 to 9. Denote it by \(a_{0}(0\le a_{0}\le 9)\) and take \(a_{0}\) as the units digit of n in decimal. Similarly, divide the quotient by \(10=(1010)_{2}\), and the remainder, converted into a number from 0 to 9, is the tens digit of n in the decimal system. Continuing in this way, we use division \([\log _{10}{n}]+1\) times, and the bit operations required for each division are O(4k), so
$$\begin{aligned} \text {Time}(n\text {~convert to decimal expression})\le k\cdot O(4k)=O(k^{2}). \end{aligned}$$
In the same way, we can prove the bit operation estimation of n!. We complete the proof of Lemma 1.4.
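The conversion procedure in the proof can be sketched as follows (an illustration in Python, not the book's algorithm): repeatedly divide by ten and collect the remainders as decimal digits, from the units digit upward.

```python
# Illustrative sketch of the proof of Lemma 1.4: repeated division by (1010)_2.

def binary_to_decimal_digits(n):
    digits = []
    while n > 0:
        n, r = divmod(n, 10)      # one division by ten = (1010)_2
        digits.append(r)          # r in {0, ..., 9} is the next decimal digit
    return digits[::-1] or [0]

print(binary_to_decimal_digits(0b11111111))   # [2, 5, 5], since (11111111)_2 = 255
```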
Let us deduce the computational complexity of some common number theory algorithms. Let m and n be two positive integers; then there is a nonnegative integer r such that \(m~\equiv r({{\,\mathrm{mod}\,}}n)\), where \(0\le r < n\). We call r the smallest nonnegative residue of m under \({{\,\mathrm{mod}\,}}n\) and denote it as \(r=m {{\,\mathrm{mod}\,}}n\). If \(1 \le m \le n\), Euclid’s division method is usually used to find the greatest common divisor (n, m) of n and m. If \((m, n)=1\), then there is a positive integer a such that \(ma~\equiv 1({{\,\mathrm{mod}\,}}n)\); a is called the multiplicative inverse of m under \({{\,\mathrm{mod}\,}}n\), denoted as \(m^{-1} {{\,\mathrm{mod}\,}}n\). By the Bezout formula, if \((n, m)=1\), then there are integers x and y such that \(xm+yn=1\); we usually use the extended Euclid algorithm to find x and y. Once we find x, we have actually calculated \(m^{-1} {{\,\mathrm{mod}\,}}n\). Under the above definitions and notations, we have
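A minimal sketch of the extended Euclid algorithm just described (illustrative Python, not the book's code): it returns integers x, y with \(xm+yn=(m,n)\), and when \((m,n)=1\) the value \(x {{\,\mathrm{mod}\,}}n\) is the multiplicative inverse \(m^{-1} {{\,\mathrm{mod}\,}}n\).

```python
# Illustrative sketch of the extended Euclid algorithm.

def extended_gcd(m, n):
    # returns (g, x, y) with x*m + y*n = g = gcd(m, n)
    if n == 0:
        return m, 1, 0
    g, x, y = extended_gcd(n, m % n)
    # back-substitution: g = x*n + y*(m % n) = y*m + (x - (m // n)*y)*n
    return g, y, x - (m // n) * y

def mod_inverse(m, n):
    g, x, _ = extended_gcd(m, n)
    assert g == 1, "the inverse exists only when (m, n) = 1"
    return x % n

print(mod_inverse(7, 40))   # 23, since 7 * 23 = 161 = 4 * 40 + 1
```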
Lemma 1.5
(i) Suppose m and n are two positive integers, then
$$ \text {Time}(\text {calculate}~ m {{\,\mathrm{mod}\,}}n)=O(\log n \cdot \log m).$$
(ii) Suppose m and n are two positive integers, and \(m \le n\), then
$$ \text {Time}(\text {calculate}~ (n, m))=O(\log ^{3} n).$$
(iii) Suppose m and n are two positive integers, and \((m, n)=1\), then
$$ \text {Time}(\text {calculate}~ m^{-1} {{\,\mathrm{mod}\,}}n )=O(\log ^{3} \max (n, m)).$$
(iv) Suppose nmb are positive integers, \(b<n\), then
$$\text {Time}(b^{m} {{\,\mathrm{mod}\,}}n)=O(\log m \cdot \log ^{2} n).$$
Proof
To find the minimum nonnegative residue r of m under \({{\,\mathrm{mod}\,}}n \) is actually a division with remainder!
$$m=kn+r, ~0\le r < n.$$
From Lemma 1.3,
$$ \text {Time}(\text {calculate}~ m {{\,\mathrm{mod}\,}}n)=O(\log n \cdot \log m), $$
(i) holds. The Euclid algorithm for calculating the greatest common divisor (n, m) of n and m is in fact a sequence of \(O(\log n)\) divisions with remainder, so
$$ \text {Time}(\text {calculate}~ (n, m) )=O(\log ^{3}n).$$
In the Euclid algorithm, we can obtain x and y such that \(xm+yn=1\) by working back from the last division to the first; this back-substitution process is called the extended Euclid algorithm. Therefore, if \(m \le n\), then
$$ \text {Time}(\text {calculate}~ m^{-1} {{\,\mathrm{mod}\,}}n )=\text {Time}~(\text {calculate} (n, m) )=O(\log ^{3} n).$$
(iv) concerns the computational complexity of the power of an integer under \({{\,\mathrm{mod}\,}}n\); the proof method is the famous “repeated squaring method”. Let
$$\begin{aligned} \begin{aligned} m&=(m_{k-1}m_{k-2}\cdots m_{1}m_{0})_{2}\\&=m_{0}+2m_{1}+4m_{2}+\cdots +2^{k-1}m_{k-1} \end{aligned} \end{aligned}$$
be the binary representation of m, where \(m_{i}=0\) or 1. First, let \(a=1\). If \(m_{0}=1,\) replace a with b; if \(m_{0}=0,\) then \(a=1\) remains unchanged. Let \(b_{1}=b^{2} {{\,\mathrm{mod}\,}}n\); this is the first squaring. If \(m_{1}=1,\) replace a with \(ab_{1} {{\,\mathrm{mod}\,}}n\); if \(m_{1}=0,\) a remains unchanged. Let \(b_{2}=b_{1}^{2} {{\,\mathrm{mod}\,}}n\); this is the second squaring. Continuing to the j-th squaring, we have
$$b_{j} \equiv b^{2^{j}} ({{\,\mathrm{mod}\,}}n).$$
The calculation ends after the \((k-1)\)-th squaring; at this point, there is
$$a \equiv b^{m_{0}+2m_{1}+4m_{2}+\cdots +2^{k-1}m_{k-1}} \equiv b^{m} ({{\,\mathrm{mod}\,}}n).$$
Obviously, the number of bit operations per squaring is \(O((\log n^{2})^{2})=O(\log ^{2}n)\). There are in total at most k squaring operations (and at most k multiplications by a), and \(k=O(\log m)\). So (iv) holds. We have completed the proof.
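The repeated squaring method used in the proof of (iv) can be sketched as follows (illustrative Python, not the book's code); it performs \(O(\log m)\) squarings and multiplications \({{\,\mathrm{mod}\,}}n\).

```python
# Illustrative sketch of the repeated squaring method for b^m mod n.

def power_mod(b, m, n):
    a = 1
    b %= n
    while m > 0:
        if m & 1:             # current binary digit of m is 1: multiply a by b_j
            a = (a * b) % n
        b = (b * b) % n       # squaring step: b_{j+1} = b_j^2 mod n
        m >>= 1
    return a

print(power_mod(7, 560, 561), pow(7, 560, 561))   # both give 1
```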
Definition 1.4
If an algorithm f involves positive integers \(n_{1}, n_{2}, \ldots , n_{r}\), whose numbers of binary digits are \(k_{1}, k_{2}, \ldots , k_{r}\), and there are absolute (fixed) nonnegative integers \(d_{1}, d_{2}, \ldots , d_{r}\) such that
$$\begin{aligned} \text {Time}(f)=O(k_{1}^{d_{1}}k_{2}^{d_{2}}\cdots k_{r}^{d_{r}}), \end{aligned}$$
(1.10)
then the complexity of algorithm f is called polynomial; otherwise, it is called nonpolynomial.
From Lemmas 1.3 and 1.4, we can see that addition, subtraction, multiplication and division of positive integers are polynomial algorithms, while the computation of n! is the simplest example of a nonpolynomial algorithm. If we do not need the exact value of n! and only need an approximate value, we can obtain one by a polynomial-time algorithm based on the Stirling formula (see Sect. 1.4 of this chapter). In formula (1.10), if \(d_{1}=d_{2}=\cdots =d_{r}=0\), the complexity of algorithm f is constant; if \(d_{1}=d_{2}=\cdots =d_{r}=1\), the complexity of the algorithm f is said to be linear (and similarly for quadratic, cubic, etc.). In order to characterize nonpolynomial algorithms, we introduce two concepts: exponential and subexponential algorithms.
Definition 1.5
Suppose that an algorithm f involves a positive integer n whose number of binary digits is k. If
$$\begin{aligned} \text {Time}(f)=O(t^{g(k)}), \end{aligned}$$
(1.11)
where t is a constant greater than 1 and g(k) is a polynomial function of k with \(\deg g \ge 1\), then the computational complexity of f is called exponential. If g(k) is not a polynomial but grows more slowly than a polynomial of degree 1, so that \(\text {Time}(f)\) is, for example, of the form \(e^{\sqrt{k\log k}}\), then the computational complexity of f is called subexponential.
From the above definitions, we can determine the computational complexity of n!. Let k be the number of binary digits of n; from Lemma 1.2, \(n=O(2^{k})\), and then from Lemma 1.4,
$$ \text {Time} (n!)=O(n^{2}k^{2})=O(k^{2}2^{2k})=O(2^{3k}),$$
So the computational complexity of n! in binary system is exponential. This is the simplest example of exponential algorithm.
Bit operations can be used not only to define computational complexity but also to describe the running speed of a computer and the time complexity of an algorithm. The so-called computer speed refers to the total number of bit operations that the computer can complete in unit time (such as one second or one microsecond). Therefore, for a fixed speed there is no essential difference between the computational complexity and the time complexity of an algorithm. We illustrate this with the table below: suppose that a computer can complete \(10^{6}\) bit operations in one second. When the binary length of the input is \(k=10^{6}\), Table 1.1 lists the running times of algorithms with different computational complexity on this computer.
Table 1.1 Time requirements of algorithms with different computational complexity (\(k=10^6\))

Algorithm type | Complexity | Number of bit operations | Time
Constant | O(1) | 1 | 1 microsecond
Linear | O(k) | \(10^6\) | 1 s
Quadratic | \(O(k^2)\) | \(10^{12}\) | 11.6 days
Cubic | \(O(k^3)\) | \(10^{18}\) | 32000 years
Subexponential | \(O(e^{\sqrt{k\log k}})\) | about \(1.8\times 10^{1618}\) | \(6 \times 10^{1604}\) years
Exponential | \(O(2^k)\) | \(10^{301030}\) | \(3\times 10^{301016}\) years
Note that 1 year \(\approx 3\times 10^{7}\) seconds and the age of the universe is about \(10^{10}\) years; when the number of binary digits k is large, an algorithm with exponential or subexponential computational complexity is practically impossible to complete on the computer, and the only remedy is to improve the speed of the computer.
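The entries of Table 1.1 follow from simple arithmetic; the following quick check (an illustration only, under the stated assumptions of \(10^{6}\) bit operations per second and 1 year \(\approx 3\times 10^{7}\) seconds) reproduces their orders of magnitude.

```python
import math

k = 10**6
ops_per_second = 10**6
seconds_per_year = 3 * 10**7

quadratic_days = k**2 / ops_per_second / 86400
cubic_years = k**3 / ops_per_second / seconds_per_year
# 2^k bit operations: work with log10 to avoid huge numbers
exponential_log10_years = k * math.log10(2) - math.log10(ops_per_second * seconds_per_year)

print(round(quadratic_days, 1))             # about 11.6 days
print(round(cubic_years))                   # about 33000 years (Table 1.1 rounds to 32000)
print(math.floor(exponential_log10_years))  # 301016, i.e. on the order of 10^301016 years
```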
Computational complexity is often used to describe the complexity of a problem, because computational complexity is also time complexity when the computer hardware conditions (such as computing speed and storage capacity) remain unchanged. At present, the complexity of algorithms is defined in a model called the Turing machine. A Turing machine is a kind of finite state machine with unlimited read and write capability. If the result of each operation and the next operation to be performed are uniquely determined, such a Turing machine is called a deterministic Turing machine. Accordingly, a deterministic polynomial algorithm is one carried out on a deterministic Turing machine.
Definition 1.6
If a problem can be solved by a polynomial algorithm on a deterministic Turing machine, it is called a P class problem, and P class problems are often called tractable problems. If a problem can be solved by a polynomial algorithm on a nondeterministic Turing machine, it is called an NP class problem.
According to the definition, a P class problem is certainly an NP class problem, because if it can be solved by a polynomial algorithm on a deterministic Turing machine, it can also be solved by a polynomial algorithm on a nondeterministic Turing machine. On the other hand, is the NP class strictly larger than the P class? This is an open problem in theoretical computer science. There is neither a rigorous proof nor a counterexample showing that a problem solvable by a polynomial algorithm on a nondeterministic Turing machine cannot be solved by a polynomial algorithm on a deterministic Turing machine. It is widely conjectured that the P class and the NP class are not equal, and this conjecture is also the cornerstone of many cryptosystems.

1.3 Jensen Inequality

A real valued function f(x) on the interval (a, b) is called a strictly concave function (convex upward), if for \(\forall ~ x_{1}, x_{2} \in (a, b), \lambda _{1}>0, \lambda _{2}>0, \lambda _{1}+\lambda _{2}=1\), we have
$$\begin{aligned} \lambda _{1}f(x_{1})+\lambda _{2}f(x_{2})\le f(\lambda _{1}x_{1}+\lambda _{2}x_{2}), \end{aligned}$$
and equality holds if and only if \(x_{1}=x_{2}\). By induction, we can prove the Jensen inequality as follows.
Lemma 1.6
If f(x) is a strictly concave function over (a, b), then for any positive integer \(n>1\), any positive numbers \(\lambda _{i}(1\le i\le n)\) with \(\lambda _{1}+\lambda _{2}+\cdots +\lambda _{n}=1\) and any \(x_{i}\in (a, b)(1\le i\le n)\), we have
$$\begin{aligned} \sum _{i=1}^{n}\lambda _{i}f(x_{i})\le f(\sum _{i=1}^{n}\lambda _{i}x_{i}), \end{aligned}$$
(1.12)
and equality holds if and only if \(x_{1}=x_{2}=\cdots =x_{n}\).
Proof
We argue by induction. The proposition holds when \(n=1\) and \(n=2\). Suppose the proposition holds for \(n-1\). When \(n>2\), let
$$\begin{aligned} x'=\frac{\lambda _{1}}{\lambda _{1}+\lambda _{2}}x_{1}+\frac{\lambda _{2}}{\lambda _{1}+\lambda _{2}}x_{2}, \end{aligned}$$
it can be seen that \(x'\in (a,b)\) and \((\lambda _{1}+\lambda _{2})x'=\lambda _{1}x_{1}+\lambda _{2}x_{2}\), therefore,
$$\begin{aligned} \begin{aligned} \sum _{i=1}^{n}\lambda _{i}f(x_{i})&=\lambda _{1}f(x_{1})+\lambda _{2}f(x_{2})+ \sum _{i=3}^{n}\lambda _{i}f(x_{i})\\&\le (\lambda _{1}+\lambda _{2})f(x')+\sum _{i=3}^{n}\lambda _{i}f(x_{i})\\&\le f(\lambda _{1}x_{1}+\lambda _{2}x_{2}+\cdots +\lambda _{n}x_{n}). \end{aligned} \end{aligned}$$
Hence the proposition holds for n, and the inequality (1.12) follows.
From the knowledge of mathematical analysis, f(x) is strictly concave on the interval (a, b) if and only if \(f^{''}(x)<0\). Take \(f(x)=\log x\), then \(f^{''}(x)=\frac{-1}{x^{2}\ln 2}\), thus \(\log x\) is strictly concave on the interval \((0, +\infty )\); from the Jensen inequality, we have the following inequality.
Lemma 1.7
Let g(x) be a positive function, that is \(g(x)>0\); then for any positive numbers \(\lambda _{i}(1\le i\le n)\) with \(\lambda _{1}+\lambda _{2}+\cdots +\lambda _{n}=1\) and any \(a_{1}, a_{2}, \ldots , a_{n}\), we have
$$\begin{aligned} \sum _{i=1}^{n}\lambda _{i} \log g(a_{i})\le \log \sum _{i=1}^{n}\lambda _{i}g(a_{i}), \end{aligned}$$
(1.13)
and equality holds if and only if \(g(a_{1})=g(a_{2})=\cdots =g(a_{n})\).
Proof
Because \(\log x\) is strictly concave, let \(x_{i}=g(a_{i})\); then \(x_{i}\in (0, +\infty )(1\le i\le n)\), and by the Jensen inequality,
$$\begin{aligned} \begin{aligned} \sum _{i=1}^{n}\lambda _{i} \log g(a_{i})&=\sum _{i=1}^{n}\lambda _{i} \log x_{i}\\&\le \log (\sum _{i=1}^{n}\lambda _{i} x_{i})\\&=\log (\sum _{i=1}^{n}\lambda _{i} g(a_{i})). \end{aligned} \end{aligned}$$
So the lemma holds.
A real valued function f(x) is called a strictly convex function on the interval (a, b), if for \(\forall ~ x_{1}, x_{2} \in (a, b), \lambda _{1}>0, \lambda _{2}>0, \lambda _{1}+\lambda _{2}=1\), we have
$$\begin{aligned} f(\lambda _{1}x_{1}+\lambda _{2}x_{2}) \le \lambda _{1}f(x_{1})+\lambda _{2}f(x_{2}), \end{aligned}$$
and equality holds if and only if \(x_{1}=x_{2}\). By induction, we can prove the following general inequality.
Lemma 1.8
If f(x) is a strictly convex function on the interval (a, b), then for any positive integer \(n\ge 2\), any positive numbers \(\lambda _{i}(1\le i\le n)\) with \(\lambda _{1}+\lambda _{2}+\cdots +\lambda _{n}=1\) and any \(x_{i}\in (a, b)(1\le i\le n)\), we have
$$\begin{aligned} f(\sum _{i=1}^{n}\lambda _{i}x_{i})\le \sum _{i=1}^{n}\lambda _{i}f(x_{i}), \end{aligned}$$
(1.14)
and equality holds if and only if \(x_{1}=x_{2}=\cdots =x_{n}\).
We know that f(x) is strictly convex on the interval (a, b) if and only if \(f^{''}(x)>0\). Let \(f(x)=x\log x\); then \(f^{''}(x)=\frac{1}{x\ln 2}>0\) when \(x\in (0, +\infty )\). Then we have the following logarithmic inequality.
Lemma 1.9
If \(a_{1}, a_{2}, \ldots , a_{n}\) and \(b_{1}, b_{2}, \ldots , b_{n}\) are two groups of positive numbers, then there are
$$\begin{aligned} \sum _{i=1}^{n} a_{i} \log \frac{a_{i}}{b_{i}}\ge (\sum _{i=1}^{n}a_{i})\log \frac{\sum _{i=1}^{n}a_{i}}{\sum _{i=1}^{n} b_{i}}. \end{aligned}$$
(1.15)
Proof
Because \(f(x)=x\log x\) is a strictly convex function, from Lemma 1.8 we have
$$\begin{aligned} f(\sum _{i=1}^{n}\lambda _{i}x_{i})\le \sum _{i=1}^{n}\lambda _{i}f(x_{i}), \end{aligned}$$
where \( \sum _{i=1}^{n}\lambda _{i}=1\). Take \(\lambda _{i}=\frac{b_{i}}{\sum _{j=1}^{n}b_{j}}, x_{i}=\frac{a_{i}}{b_{i}}\), then
$$\begin{aligned} \frac{1}{\sum _{j=1}^{n}b_{j}}\sum _{i=1}^{n} a_{i} \log \frac{\sum _{i=1}^{n}a_{i}}{\sum _{i=1}^{n}b_{i}}\le \sum _{i=1}^{n}\frac{a_{i}}{\sum _{j=1}^{n}b_{j}}\log \frac{a_{i}}{b_{i}}, \end{aligned}$$
Multiplying both sides by \(\sum _{j=1}^{n}b_{j}\), we obtain
$$\begin{aligned} (\sum _{i=1}^{n}a_{i})\log \frac{\sum _{i=1}^{n}a_{i}}{\sum _{i=1}^{n}b_{i}} \le \sum _{i=1}^{n}a_{i}\log \frac{a_{i}}{b_{i}}, \end{aligned}$$
thus (1.15) holds.
The above inequality is called the log sum inequality, which is often used in information theory.
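A small numerical check of the log sum inequality (1.15) (illustrative Python, not part of the text; logarithms are taken to base 2 as elsewhere in this chapter, and the values of a and b are arbitrary examples):

```python
import math

def log_sum_lhs(a, b):
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))

def log_sum_rhs(a, b):
    A, B = sum(a), sum(b)
    return A * math.log2(A / B)

a = [0.5, 0.2, 0.3]
b = [0.1, 0.6, 0.3]
print(log_sum_lhs(a, b), ">=", log_sum_rhs(a, b))   # about 0.84 >= 0.0
```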

1.4 Stirling Formula

In number theory (see Apostol 1976), we can obtain average asymptotic formulas for some arithmetic functions by using the Euler summation formula; the most important of these is the following Stirling formula. For all real numbers \(x\ge 1\), we have
$$\begin{aligned} \sum _{1\le m \le x}\log m=x\log x-x+O(\log x), \end{aligned}$$
(1.16)
where the O-constant is an absolute constant. Taking \(x=n\ge 1\) a positive integer, we obtain the Stirling formula
$$\begin{aligned} \log n!=n\log n-n+O(\log n). \end{aligned}$$
(1.17)
In number theory, the Stirling formula appears in the more precise form below,
$$n!\approx \sqrt{2\pi n}(\frac{n}{e})^n$$
or
$$\lim _{n\rightarrow \infty }\frac{n!}{\sqrt{2\pi n}(\frac{n}{e})^n}=1.$$
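As a quick numerical comparison (an illustration, not part of the text; natural logarithms are used here, which is the form in which (1.16) and (1.17) hold):

```python
import math

for n in [10, 100, 1000]:
    exact = math.lgamma(n + 1)                    # ln(n!)
    crude = n * math.log(n) - n                   # n ln n - n, cf. (1.17)
    precise = 0.5 * math.log(2 * math.pi * n) + n * math.log(n) - n  # precise Stirling form
    print(n, round(exact, 2), round(crude, 2), round(precise, 2))
```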
Lemma 1.10
Let n, m be nonnegative integers with \(0\le m \le n\), and let \(\left( {\begin{array}{c}n\\ m\end{array}}\right) \) be the binomial coefficient (combination number); then
$$\begin{aligned} \left( {\begin{array}{c}n\\ m\end{array}}\right) \le \frac{n^{n}}{m^{m}(n-m)^{n-m}}. \end{aligned}$$
(1.18)
Proof
$$\begin{aligned} n^{n}=(m+(n-m))^{n}\ge \left( {\begin{array}{c}n\\ m\end{array}}\right) m^{m}(n-m)^{n-m}, \end{aligned}$$
Then (1.18) follows at once.
We define the binary entropy function \(H(x)(0\le x\le 1)\) as follows.
$$\begin{aligned} H(x)={\left\{ \begin{aligned}&0, \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text {if}~ x=0.\\&-x\log {x}-(1-x)\log {(1-x)},\ \ \ \ \ \ \text {if}~0<x\le 1. \end{aligned} \right. } \end{aligned}$$
(1.19)
It is obvious that \(H(x)=H(1-x)\), so we only need to consider the case \(0\le x \le \frac{1}{2}\). H(x) is the information entropy of the binary information space (see Example 3.5 in Sect. 1.1 of Chap. 3); its graph is shown in Fig. 1.1.
Lemma 1.11
Let \(0\le \lambda \le \frac{1}{2}\), then we have
(i) \(\sum _{0\le i \le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) \le 2^{nH(\lambda )}.\)
(ii) \(\log \sum _{0\le i \le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) \le nH(\lambda ).\)
(iii) \(\lim _{n\rightarrow \infty }\frac{1}{n}\log \sum _{0\le i \le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) = H(\lambda ).\)
Proof
We first prove (i); (ii) is obtained directly by taking the logarithm of (i).
$$\begin{aligned} \begin{aligned} 1=(\lambda +(1-\lambda ))^{n}&\ge \sum _{0\le i\le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) \lambda ^{i}(1-\lambda )^{n-i}\\&=\sum _{0\le i\le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) (1-\lambda )^{n}(\frac{\lambda }{1-\lambda })^{i}\\&\ge \sum _{0\le i\le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) (1-\lambda )^{n}(\frac{\lambda }{1-\lambda })^{\lambda n}\\&=2^{-nH(\lambda )}\sum _{0\le i\le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) . \end{aligned} \end{aligned}$$
To prove (iii), we write \(m=[\lambda n]=\lambda n+O(1)\); from (ii), we have
$$\begin{aligned} \frac{1}{n}\log \sum _{0\le i \le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) \le H(\lambda ). \end{aligned}$$
On the other hand,
$$\begin{aligned} \begin{aligned} \frac{1}{n}\log \sum _{0\le i \le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) \ge \frac{1}{n}\log \left( {\begin{array}{c}n\\ m\end{array}}\right) \\ =\frac{1}{n}\{ \log n!-\log m!-\log (n-m)!\}. \end{aligned} \end{aligned}$$
From the Stirling formula (1.17), we have
$$\begin{aligned} \log n!-\log m!-\log (n-m)!=n\log n-m\log m-(n-m)\log (n-m)+O(\log n). \end{aligned}$$
Therefore,
$$\begin{aligned} \begin{aligned} \frac{1}{n}\log \sum _{0\le i \le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right)&\ge \log n-\lambda \log \lambda n-(1-\lambda )\log n(1-\lambda )+O(\frac{\log n}{n})\\&=-\lambda \log \lambda -(1-\lambda )\log (1-\lambda )+O(\frac{\log n}{n})\\&=H(\lambda )+O(\frac{\log n}{n}). \end{aligned} \end{aligned}$$
Letting \(n\rightarrow \infty \), we obtain
$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{1}{n}\log \sum _{0\le i \le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) = H(\lambda ). \end{aligned}$$
Lemma 1.11 holds.
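A numerical illustration of Lemma 1.11 (a sketch, not part of the text; \(\lambda =0.3\) is an arbitrary example): the quantity \(\frac{1}{n}\log \sum _{0\le i \le \lambda n}\left( {\begin{array}{c}n\\ i\end{array}}\right) \) stays below \(H(\lambda )\) and approaches it as n grows.

```python
import math

def H(x):
    # binary entropy function (1.19)
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def partial_sum(n, lam):
    return sum(math.comb(n, i) for i in range(int(lam * n) + 1))

lam = 0.3
for n in [20, 200, 2000]:
    s = partial_sum(n, lam)
    print(n, round(math.log2(s) / n, 4), "<=", round(H(lam), 4))
```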

1.5 n-fold Bernoulli Experiment

In a given probability space, suppose that x and y are random events. We denote the probability of occurrence of event x by p(x); the probability of joint occurrence of x and y is denoted by p(xy), and the probability of occurrence of x under the condition that event y occurs is denoted by p(x|y), which is called the conditional probability. Obviously, there is the following multiplication formula:
$$\begin{aligned} p(xy)=p(y)p(x|y). \end{aligned}$$
(1.20)
For two events x and y, if \(p(xy)=0\), we say x and y are incompatible; if \(p(xy)=p(x)p(y)\), we say the two events are independent, or independent of each other.
A finite set of events \(\{x_{1}, x_{2}, \ldots , x_{n}\}\) is called a complete event group, if
$$\begin{aligned} \sum _{i=1}^{n}p(x_{i})=1, ~\text {and}~ p(x_{i}x_{j})=0, ~\text {when}~i\ne j . \end{aligned}$$
(1.21)
In a complete event group, we can assume that \(0<p(x_{i})\le 1 (1\le i \le n).\)
Total probability formula: If \(\{x_{1}, x_{2}, \ldots , x_{n}\}\) is a complete event group, y is any random event, then we have
$$\begin{aligned} p(y)=\sum _{i=1}^{n}p(yx_{i}) \end{aligned}$$
(1.22)
and
$$\begin{aligned} p(y)=\sum _{i=1}^{n}p(x_{i})p(y|x_{i}). \end{aligned}$$
(1.23)
Lemma 1.12
Let \(\{x_{1}, x_{2}, \ldots , x_{n}\}\) be a complete event group; then the event y can occur only together with exactly one of the \(x_{i}\), so for any \(i, ~1\le i \le n\), we have the following Bayes formula:
$$\begin{aligned} p(x_{i}|y)=\frac{p(x_{i})p(y|x_{i})}{\sum _{j=1}^{n}p(x_{j})p(y|x_{j})}, ~1\le i \le n. \end{aligned}$$
(1.24)
Proof
From the product formula (1.20), we have
$$\begin{aligned} p(x_{i}y)=p(y)p(x_{i}|y)=p(x_{i})p(y|x_{i}). \end{aligned}$$
then there is
$$\begin{aligned} p(x_{i}|y)=\frac{p(x_{i})p(y|x_{i})}{p(y)}. \end{aligned}$$
And from the total probability formula (1.23), then we can know
$$\begin{aligned} p(x_{i}|y)=\frac{p(x_{i})p(y|x_{i})}{\sum _{j=1}^{n}p(x_{j})p(y|x_{j})}, ~1\le i \le n, \end{aligned}$$
the Bayes formula (1.24) is proved.
Now we discuss the n-fold Bernoulli experiment. In statistical testing, an experiment with only two possible outcomes is called a Bernoulli experiment, and an experiment satisfying the following conventions is called an n-fold Bernoulli experiment:
(1) There are only two possible results in each experiment: a or \(\bar{a}\).
(2) The probability p of occurrence of a in each experiment remains unchanged.
(3) The experiments are statistically independent of each other.
(4) A total of n experiments are carried out.
Lemma 1.13
(Bernoulli theorem) In a Bernoulli experiment, let the probability of event a be p; then in the n-fold Bernoulli experiment, the probability B(k; n, p) of a appearing \(k (0 \le k \le n)\) times is
$$\begin{aligned} B(k;n,p)=\left( {\begin{array}{c}n\\ k\end{array}}\right) p^{k}q^{n-k}, ~q=1-p. \end{aligned}$$
(1.25)
Proof
Record the result of the i-th Bernoulli trial as \(x_{i}(x_{i}=a~ \text {or}~\bar{a})\); then the n-fold Bernoulli experiment forms the following joint event x
$$\begin{aligned} x=x_{1}x_{2}\cdots x_{n}, ~x_{i}=a~ \text {or}~\bar{a}. \end{aligned}$$
Because of the independence of the experiment, when there are exactly k \(x_{i}=a\), the occurrence probability of x is
$$\begin{aligned} p(x)=p(x_{1})p(x_{2})\cdots p(x_{n})=p^{k}q^{n-k}. \end{aligned}$$
Obviously, the number of joint events in which exactly k of the \(x_{i}\) equal a is \(\left( {\begin{array}{c}n\\ k\end{array}}\right) \), so
$$\begin{aligned} B(k;n,p)=\left( {\begin{array}{c}n\\ k\end{array}}\right) p^{k}q^{n-k}. \end{aligned}$$
Lemma 1.13 holds.
In the same way, we can calculate the probability that event a appears for the first time in the k-th trial of a multiple Bernoulli experiment.
Lemma 1.14
Suppose that a and \(\bar{a}\) are two possible events in Bernoulli experiments, then the probability of the first appearance of a in the k-th Bernoulli experiment is \(pq^{k-1}\).
Proof
Consider the joint event \(x=x_{1}x_{2} \cdots x_{k}\) formed by a k-fold Bernoulli experiment, where the first \(k-1\) results satisfy \(x_{i}=\bar{a}\) and \(x_{k}=a\); then
$$\begin{aligned} p(x)=p(x_{1})\cdots p(x_{k-1})p(x_{k})=pq^{k-1}. \end{aligned}$$
We have completed the proof.
The n-fold Bernoulli experiment is not only the most basic probability model in probability and statistics, but also a common tool in the field of communication. Next, we take transmission errors in a binary channel as an example.
Example 1.1
(Error probability of binary channel) In binary channel transmission, a codeword x of length n is a vector \(x=(x_{1}, x_{2}, \ldots , x_{n})\) in the n-dimensional vector space \(\mathbb {F}_{2}^{n}\), where \(x_{i}=0\) or \(1(1\le i \le n)\). For convenience, we write \(x=x_{1} x_{2} \cdots x_{n}\). Due to channel interference, the characters 0 and 1 may be transmitted incorrectly, that is, 0 becomes 1 and 1 becomes 0; let the error probability be p (p may be very small), and assume the error probability of each transmission is constant. Under the above assumptions, the transmission of a codeword x of length n can be regarded as an n-fold Bernoulli experiment, and the probability B(k; n, p) that x suffers \(k(0\le k \le n)\) errors in transmission is
$$\begin{aligned} B(k;n,p)=\left( {\begin{array}{c}n\\ k\end{array}}\right) p^{k}q^{n-k}, ~q=1-p. \end{aligned}$$
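As a numerical illustration (not part of the text; the codeword length n = 7 and error probability p = 0.01 are hypothetical values), the error probabilities of Example 1.1 can be computed directly:

```python
from math import comb

def B(k, n, p):
    # probability of exactly k errors among n transmitted bits
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 7, 0.01
for k in range(4):
    print(k, round(B(k, n, p), 6))
print("P(at least one error) =", round(1 - B(0, n, p), 6))
```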

1.6 Chebyshev Inequality

A variable \(\xi \) defined on a probability space and taking real values is called a random variable. For any real number \(x\in (-\infty , +\infty )\), p(x) is defined as the probability that the random variable \(\xi \) takes the value x, i.e.,
$$\begin{aligned} p(x)=P\{\xi =x\}, \end{aligned}$$
(1.26)
We call p(x) the probability function of \(\xi \). If \(\xi \) takes only a finite number of values or countably infinitely many values, that is, the value space of \(\xi \) consists of finitely many or countably infinitely many real numbers, then \(\xi \) is called a discrete random variable; otherwise, \(\xi \) is called a continuous random variable. The distribution function F(x) of a random variable \(\xi \) is defined as
$$\begin{aligned} F(x)=P\{\xi \le x\}, ~x\in (-\infty , +\infty ). \end{aligned}$$
(1.27)
Obviously, the distribution function F(x) of \(\xi \) is a monotonically increasing function on the whole real axis \((-\infty , +\infty )\), and it is right continuous, that is, \(F(x_{0})=\lim \limits _{x\rightarrow x_{0}+0}F(x)\). The probability distribution of a random variable \(\xi \) is completely determined by its distribution function F(x); in fact, for any x,
$$\begin{aligned} p(x)=P\{\xi =x\}=F(x)-F(x-0). \end{aligned}$$
Let f(x) be a nonnegative integrable function on the real axis such that
$$\begin{aligned} F(x)=\int \limits _{-\infty }^{x}f(t)\mathrm {~d}t, \end{aligned}$$
(1.28)
then f(x) is called the density function of the random variable \(\xi \). Obviously, the density function satisfies:
$$\begin{aligned} f(x)\ge 0, ~\forall x \in (-\infty , +\infty ), \int \limits _{-\infty }^{+\infty }f(x)\mathrm {~d}x=1. \end{aligned}$$
(1.29)
Conversely, any function f(x) satisfying formula (1.29) is the density function of some random variable. Here, we introduce several common continuous random variables and their probability distributions.
1. Uniform distribution (equal probability distribution)
A random variable \(\xi \) that takes values in the interval [a, b] with equal probability is said to be uniformly distributed, or called a uniformly distributed random variable; its density function is
$$\begin{aligned} f(x)={\left\{ \begin{aligned}&\frac{1}{b-a}, \ \ \ \ \ \ \ ~a\le x\le b.\\&0,\ \ \ \ \ \ \ \ \ \ \ \ \ \text {otherwise}. \end{aligned} \right. } \end{aligned}$$
Its distribution function F(x) is
$$\begin{aligned} F(x)={\left\{ \begin{aligned}&0, \ \ \ \ \ \ \ \ \ \ \ \ \text {when}~x< a.\\&\frac{x-a}{b-a}, \ \ \ \ \ \ \text {when}~a\le x\le b.\\&1,\ \ \ \ \ \ \ \ \ \ \ \ \text {when}~x> b. \end{aligned} \right. } \end{aligned}$$
2. Exponential distribution
The density function of random variable \(\xi \) is
$$\begin{aligned} f(x)={\left\{ \begin{aligned}&\lambda \mathrm {e}^{-\lambda x}, \ \ \ \ \text {when}~x\ge 0.\\&0,\ \ \ \ \ \ \ \ \ \ \text {when}~x<0. \end{aligned} \right. } \end{aligned}$$
where the given parameter satisfies \( \lambda > 0\), and its distribution function is
$$\begin{aligned} F(x)={\left\{ \begin{aligned}&1-\mathrm {e}^{-\lambda x}, \ \ \ \ \ \ \text {when}~x\ge 0.\\&0,\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text {when}~x<0. \end{aligned} \right. } \end{aligned}$$
We call \(\xi \) an exponentially distributed random variable with parameter \(\lambda \), or say that \(\xi \) follows the exponential distribution.
3. Normal distribution
A continuous random variable \(\xi \) whose density function f(x) is defined as:
$$\begin{aligned} f(x)=\frac{1}{\sqrt{2\pi }\sigma }\mathrm {e}^{-\frac{(x-\mu )^{2}}{2\sigma ^{2}}}, ~x\in (-\infty , +\infty ). \end{aligned}$$
where \(\mu \) and \(\sigma \) are constants, \(\sigma > 0\). We say that \(\xi \) obeys the normal distribution with parameters \(\mu \) and \(\sigma ^{2}\), denoted as \(\xi \sim N \left( \mu , \sigma ^{2}\right) \). By the Poisson integral,
$$ \int \limits _{-\infty }^{+\infty } \mathrm {e}^{-x^{2}} \mathrm {~d} x=\sqrt{\pi }, $$
it is not hard to verify
$$ \int \limits _{-\infty }^{+\infty } f(x) \mathrm {~d} x=1. $$
The distribution function F(x) of normal distribution \(N\left( \mu , \sigma ^{2}\right) \) is
$$ F(x)=\frac{1}{\sqrt{2 \pi } \sigma } \int \limits _{-\infty }^{x} \mathrm {e}^{-\frac{(t-\mu )^{2}}{2 \sigma ^{2}}} \mathrm {~d} t. $$
When \(\mu =0, \sigma =1\), N(0, 1) is called standard normal distribution.
Let us define the mathematical expectation and variance of a random variable \(\xi \). First, let us see the mathematical expectation of a discrete random variable .
(1) Let \(\xi \) be a discrete random variable whose value space is \(\{x_{1}, x_{2}, \ldots , x_{n}, \ldots \}\). And let \(p(x_{i})=P\left\{ \xi =x_{i}\right\} \). If
$$ \sum _{i=1}^{+\infty }\left| x_{i}\right| p(x_{i})<\infty , $$
Then the mathematical expectation \(E(\xi )\) of \(\xi \) is defined as
$$\begin{aligned} E\xi =E(\xi )=\sum _{i=1}^{+\infty }x_{i} p(x_{i}). \end{aligned}$$
(1.30)
(2) Let \(\xi \) be a continuous random variable and f(x) be its density function, if
$$ \int \limits _{-\infty }^{+\infty }|x| f(x) \mathrm {d} x<+\infty , $$
Then the mathematical expectation \(E(\xi )\) of \(\xi \) is defined as
$$\begin{aligned} E\xi =E(\xi )=\int \limits _{-\infty }^{+\infty }x f(x) \mathrm {d} x. \end{aligned}$$
(1.31)
(3) Let h(x) be a real valued function; then \(h(\xi )\) is also a random variable, and \(h(\xi )\) is called a function of the random variable \(\xi \). The mathematical expectation \(E(h(\xi ))\) of \(h(\xi )\) is denoted by \(E_{h}(\xi )\).
Lemma 1.15
(1) Let \(\xi \) be a discrete random variable whose value space is \(\{x_{1}, x_{2}, \ldots \), \(x_{n}, \ldots \}\), if \(E(\xi )\) exists, then \(E_ {h}(\xi )\) also exists, and
$$ E_ {h}(\xi )=\sum _{i=1}^{+\infty } h\left( x_{i}\right) p(x_{i}). $$
(2) If \(\xi \) is a continuous random variable, and \(E(\xi )\) exists, then \(E_ {h}(\xi )\) also exists, and
$$ E_ {h}(\xi )=\int \limits _{-\infty }^{+\infty } h(x) f(x) \mathrm {d} x. $$
Proof
Let the value space of \(\eta =h(\xi )\) be \(\{y_{1}, y_{2}, \ldots , y_{n}, \ldots \}\), then
$$ P\left\{ \eta =y_{j}\right\} =P(\bigcup _{\begin{array}{c} i=1 \\ h(x_{i})=y_{j} \end{array}}^{+\infty }\{\xi =x_{i}\})=\sum _{\begin{array}{c} i=1\\ h(x_{i})=y_{j} \end{array}}^{+\infty } P\left\{ \xi =x_{i}\right\} . $$
By the definition of \(E(\eta )\), then
$$\begin{aligned} \begin{aligned} E_ {h}(\xi )=E(\eta )&=\sum _{j=1}^{+\infty } y_{j} P\left\{ \eta =y_{j}\right\} \\&=\sum _{j=1}^{+\infty } y_{j} \sum _{\begin{array}{c} i=1\\ h(x_{i})=y_{j} \end{array}}^{+\infty } P\left\{ \xi =x_{i}\right\} \\&=\sum _{i=1}^{+\infty }(\sum _{\begin{array}{c} j=1\\ h(x_{i})=y_{j} \end{array}}^{+\infty } y_{j})P\left\{ \xi =x_{i}\right\} \\&=\sum _{i=1}^{+\infty }h(x_{i})p(x_{i}). \end{aligned} \end{aligned}$$
(2) can be proved in the same way.
The following basic properties of mathematical expectation are easy to prove.
Lemma 1.16
(1) If \(\xi =c\) is constant, then \(E(\xi )=c\).
(2) If a and b are constants, then \(E(a\xi +b)=aE(\xi )+b\).
(3) If \(a\le \xi \le b\), then \(a\le E(\xi )\le b\).
If the mathematical expectation \(E(\xi )\) of a random variable exists, then \((\xi -E\xi )^{2}\) is also a random variable (take \(h(x)=(x-a)^{2}\), where \(a=E(\xi )\)). We define the mathematical expectation \(E_h{(\xi )}\) of this \(h(\xi )\) as the variance of \(\xi \), denoted as \(D(\xi )\), that is
$$\begin{aligned} D(\xi )=E((\xi -E\xi )^{2}). \end{aligned}$$
We denote by \(\sigma =\sqrt{D(\xi )}\) the standard deviation of \(\xi \). Here are some basic properties of the variance.
Lemma 1.17
(1) \(D(\xi )=E(\xi ^2)-E^2(\xi )\).
(2) If \(\xi =a\) is constant, then \(D(\xi )=0\).
(3) \(D(\xi +c)=D(\xi )\).
(4) \(D(c\xi )=c^2D(\xi )\).
(5) If \(c\ne E\xi \), then \(D(\xi )<E((\xi -c)^2)\).
Proof
(1) can be seen from the definition,
$$\begin{aligned} \begin{aligned} D(\xi )&=E((\xi -E\xi )^{2})\\&=E(\xi ^{2}-2\xi E\xi +E^{2}(\xi ))\\&=E(\xi ^{2})-2(E\xi )^{2}+(E(\xi ))^{2}\\&=E(\xi ^{2})-(E\xi )^{2}. \end{aligned} \end{aligned}$$
(2) is trivial. Let us prove (3). By (1),
$$\begin{aligned} \begin{aligned} D(\xi +c)&=E((\xi +c)^{2})-(E(\xi +c))^{2}\\&=E(\xi ^{2}+2c\xi +c^{2})-((E\xi )^{2}+2cE(\xi )+c^{2})\\&=E(\xi ^{2})+2cE(\xi )+c^{2}-(E\xi )^{2}-2cE(\xi )-c^{2}\\&=E(\xi ^{2})-(E\xi )^{2}=D(\xi ). \end{aligned} \end{aligned}$$
(4) can also be derived directly from (1). In fact,
$$\begin{aligned} \begin{aligned} D(c\xi )&=E(c^{2}\xi ^{2})-(E(c\xi ))^{2}\\&=c^{2}E(\xi ^{2})-c^{2}(E\xi )^{2}\\&=c^{2}D(\xi ). \end{aligned} \end{aligned}$$
To prove (5), we note from Lemma 1.16 that \(E(\xi -c)=E\xi -c\), which is nonzero when \(c\ne E(\xi )\). By (3), we have
$$D(\xi )=D(\xi -c)=E((\xi -c)^{2})-(E(\xi -c))^{2}.$$
Since the last term \((E(\xi -c))^{2}\) of the above formula is strictly positive, we always have
$$D(\xi )<E((\xi -c)^{2}).$$
(5) holds. This property indicates that \(E((\xi -c)^{2})\) reaches the minimum value \(D(\xi )\) at \(c=E\xi \). We have completed the proof.
Now we give the main results of this section; in mathematics, they are called Chebyshev type inequalities, which are essentially so-called moment inequalities, because the mathematical expectation \(E\xi \) of a random variable \(\xi \) is the first-order origin moment and the variance is the second-order central moment.
Theorem 1.1
Let h(x) be a nonnegative real valued function of x and \(\xi \) a random variable such that the expectation \(E_{h}(\xi )\) exists; then for any \(\varepsilon >0\), we have
$$\begin{aligned} P\{h(\xi )\ge \varepsilon \}\le \frac{E_{h}(\xi )}{\varepsilon }, \end{aligned}$$
(1.32)
and
$$\begin{aligned} P\{h(\xi )>\varepsilon \}<\frac{E_{h}(\xi )}{\varepsilon }. \end{aligned}$$
(1.33)
Proof
We prove the theorem only for continuous random variable \(\xi \). Let f(x) be density function of \(\xi \), then by Lemma 1.15,
$$\begin{aligned} \begin{aligned} E_{h}(\xi )&=\int \limits _{-\infty }^{+\infty }h(x)f(x)\mathrm {~d}x\\&\ge \int \limits _{\begin{array}{c} h(x)\ge \varepsilon \end{array}}h(x)f(x)\mathrm {~d}x\\&\ge \varepsilon \int \limits _{\begin{array}{c} h(x)\ge \varepsilon \end{array}}f(x)\mathrm {~d}x\\&=\varepsilon P\{h(\xi )\ge \varepsilon \}. \end{aligned} \end{aligned}$$
so (1.32) holds. Similarly, we can prove (1.33).
From the theorem, we can obtain different Chebyshev inequalities by choosing h suitably, for example \(h(\xi )=(\xi -E\xi )^{2}\).
Corollary 1.1
(Chebyshev) If the variance \(D(\xi )\) of the random variable \(\xi \) exists, then for any \(\varepsilon >0\), we have
$$\begin{aligned} P\{|\xi -E\xi |\ge \varepsilon \}\le \frac{D(\xi )}{\varepsilon ^{2}}. \end{aligned}$$
(1.34)
Proof
Take \(h(\xi )=(\xi -E\xi )^{2}\) in Theorem 1.1; then \(|\xi -E\xi |\ge \varepsilon \) if and only if \(h(\xi )\ge \varepsilon ^{2}\), and by definition \(E_{h}(\xi )=D(\xi )\). Thus
$$\begin{aligned} \begin{aligned} P\{|\xi -E\xi |\ge \varepsilon \}=P\{h(\xi )\ge \varepsilon ^{2}\}\le \frac{E_{h}(\xi )}{\varepsilon ^{2}}. \end{aligned} \end{aligned}$$
The Corollary holds.
Corollary 1.2
(Chebyshev) Suppose that both the expected value \(E\xi \) and the variance \(D(\xi )\) of the random variable \(\xi \) exist, then for any \(k>0\), we have
$$\begin{aligned} P\{|\xi -E\xi |\ge k\sqrt{D(\xi )}\}\le \frac{1}{k^{2}}. \end{aligned}$$
(1.35)
Proof
Take \(\varepsilon =k\sqrt{D(\xi )}\) in Corollary 1.1, then
$$\begin{aligned} \begin{aligned} P\{|\xi -E\xi |\ge k\sqrt{D(\xi )}\}\le \frac{D(\xi )}{k^{2}D(\xi )}=\frac{1}{k^{2}}. \end{aligned} \end{aligned}$$
Corollary 1.2 holds.
In mathematics, \(\mu \) is often used as the expected value, \(\sigma =\sqrt{D(\xi )}(\sigma \ge 0)\) as the standard deviation, that is
$$ \mu =E\xi ,~\sigma =\sqrt{D(\xi )},~\sigma \ge 0. $$
Then the Chebyshev inequality in the Corollary 1.2 can be written as follows:
$$\begin{aligned} P\{|\xi -\mu |\ge k\sigma \}\le \frac{1}{k^{2}}. \end{aligned}$$
(1.36)
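A numerical illustration of (1.36) (a sketch, not part of the text), with \(\xi \sim b(n,p)\) for the hypothetical values n = 100, p = 0.3, where the tail probability is computed exactly:

```python
from math import comb, sqrt

n, p = 100, 0.3
mu, sigma = n * p, sqrt(n * p * (1 - p))

def pmf(i):
    # binomial probability P{xi = i}
    return comb(n, i) * p**i * (1 - p)**(n - i)

for k in [1, 2, 3]:
    tail = sum(pmf(i) for i in range(n + 1) if abs(i - mu) >= k * sigma)
    print(k, round(tail, 4), "<=", round(1 / k**2, 4))
```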
Corollary 1.3
(Markov) If for a positive integer \(k\ge 1\) the expected value \(E{|\xi |^{k}}\) of the random variable \(\xi \) exists, then
$$\begin{aligned} \begin{aligned} P\{|\xi |\ge \varepsilon \}\le \frac{E{|\xi |^{k}}}{\varepsilon ^{k}}. \end{aligned} \end{aligned}$$
Proof
Take \(h(\xi )=|\xi |^{k}\) in Theorem 1.1 and replace \(\varepsilon \) with \(\varepsilon ^{k}\); then the Markov inequality follows directly from Theorem 1.1.
Next, we introduce several common discrete random variables and their probability distribution and calculate their expected value and variance.
Example 1.2
(Degenerate distribution) A random variable \(\xi \) takes a constant value a with probability 1, that is \(\xi =a\), \(P\{\xi =a\}=1\); \(\xi \) is said to have a degenerate distribution. From Lemma 1.16 (1), \(E\xi =a\), and its variance is \(D(\xi )=0\).
Example 1.3
(Two point distribution) A random variable \(\xi \) has only two values \(\{x_{1},x_{2}\}\), and its probability distribution is
$$\begin{aligned} \begin{aligned} P\{\xi =x_{1}\}=p,~P\{\xi =x_{2}\}=1-p,~0<p<1. \end{aligned} \end{aligned}$$
\(\xi \) is called a two-point distribution with parameter p, and its mathematical expectation and variance are
$$\begin{aligned} \begin{aligned} {\left\{ \begin{aligned}&E(\xi )=x_{1}p+x_{2}(1-p),\\&D(\xi )=p(1-p)(x_{1}-x_{2})^{2}. \end{aligned} \right. } \end{aligned} \end{aligned}$$
Specially, take \(x_{1}=1,x_{2}=0\), then the expected value and variance of the two-point distribution are
$$\begin{aligned} E(\xi )=p, ~D(\xi )=p(1-p). \end{aligned}$$
Example 1.4
(Equal probability distribution) Let a random variable \(\xi \) have n values \(\{x_{1},x_{2}, \ldots , x_{n}\}\), each taken with equal probability, that is
$$\begin{aligned} \begin{aligned} P\{\xi =x_{i}\}=\frac{1}{n},1\le i\le n. \end{aligned} \end{aligned}$$
\(\xi \) is said to have an equal probability distribution, or uniform distribution, on the n points \(x_{1},x_{2}, \ldots , x_{n}\). The expected value and variance are
$$\begin{aligned} \begin{aligned} E(\xi )=\frac{1}{n}\sum _{i=1}^{n}x_{i},~D(\xi )=\frac{1}{n}\sum _{i=1}^{n}(x_{i}-E(\xi ))^{2}. \end{aligned} \end{aligned}$$
Example 1.5
(Binomial distribution) In the n-fold Bernoulli experiment, the number of times \(\xi \) of event a is a random variable from 0 to n. The probability distribution is \((\text {see Bernoulli experiment})\)
$$\begin{aligned} \begin{aligned} P\{\xi =k\}=b(k;n,p)=\left( {\begin{array}{c}n\\ k\end{array}}\right) p^{k}q^{n-k}, \end{aligned} \end{aligned}$$
where \(0\le k\le n\) and p is the probability of event a occurring in each experiment. \(\xi \) is said to follow a binomial distribution with parameters n, p, denoted as \(\xi \sim b(n,p)\). In fact, the b(k; n, p) are the terms of the expansion of the binomial \((p+q)^{n}\).
Lemma 1.18
Let \(\xi \sim b(n,p)\), then
$$\begin{aligned} E(\xi )=np,~D(\xi )=npq,~q=1-p. \end{aligned}$$
Proof
By definition,
$$\begin{aligned} \begin{aligned} E(\xi )&=\sum _{k=0}^{n}kb(k;n,p)=\sum _{k=1}^{n}k\left( {\begin{array}{c}n\\ k\end{array}}\right) p^{k}q^{n-k}\\&=np\sum _{k=1}^{n}\left( {\begin{array}{c}n-1\\ k-1\end{array}}\right) p^{k-1}q^{(n-1)-(k-1)}\\&=np\sum _{k=0}^{n-1}\left( {\begin{array}{c}n-1\\ k\end{array}}\right) p^{k}q^{n-1-k}\\&=np\sum _{k=0}^{n-1}b(k;n-1,p)\\&=np. \end{aligned} \end{aligned}$$
Similarly, it can be calculated
$$\begin{aligned} \begin{aligned} E(\xi ^{2})=\sum _{k=0}^{n}k^{2}b(k;n,p)=n^{2}p^{2}+npq. \end{aligned} \end{aligned}$$
thus
$$\begin{aligned} \begin{aligned} D(\xi )=E(\xi ^{2})-(E(\xi ))^{2}=npq. \end{aligned} \end{aligned}$$
We have completed the proof.
Lemma 1.19
Let \(p_{n}\) be the probability of event a in the n-fold Bernoulli experiment. If \(np_{n}\rightarrow \lambda \) as \(n\rightarrow \infty \), then we have
$$\begin{aligned} \lim _{n\rightarrow \infty }b(k;n,p_{n})=\frac{\lambda ^{k}}{k!}e^{-\lambda }. \end{aligned}$$
Proof
Write \(\lambda _{n}=np_{n}\), then
$$\begin{aligned} \begin{aligned} b(k;n,p_{n})&=\left( {\begin{array}{c}n\\ k\end{array}}\right) (p_{n})^{k}(1-p_{n})^{n-k}\\&=\frac{n(n-1)\cdots (n-(k-1))}{k!}(\frac{\lambda _{n}}{n})^{k}(1-\frac{\lambda _{n}}{n})^{n-k}\\&=\frac{(\lambda _{n})^{k}}{k!}(1-\frac{1}{n})\cdots (1-\frac{k-1}{n})(1-\frac{\lambda _{n}}{n})^{n-k}. \end{aligned} \end{aligned}$$
Because for fixed k, there is \(\lim \limits _{n\rightarrow \infty }(\lambda _{n})^{k}=\lambda ^{k}\), and
$$\begin{aligned} \begin{aligned} \lim _{n\rightarrow \infty }(1-\frac{\lambda _{n}}{n})^{n-k}=e^{-\lambda }, \end{aligned} \end{aligned}$$
also
$$\begin{aligned} \lim _{n\rightarrow \infty }(1-\frac{1}{n})(1-\frac{2}{n})\cdots (1-\frac{k-1}{n})=1. \end{aligned}$$
Therefore,
$$\begin{aligned} \begin{aligned} \lim _{n\rightarrow \infty }b(k;n,p_{n})=\frac{\lambda ^{k}}{k!}e^{-\lambda }. \end{aligned} \end{aligned}$$
So Lemma 1.19 holds.
Example 1.6
(Poisson distribution) A discrete random variable \(\xi \) takes the values \(0,1,\ldots ,n,\ldots \), and \(\lambda \ge 0\) is a nonnegative real number. If the probability distribution of \(\xi \) is
$$\begin{aligned} P\{\xi =k\}=p(k,\lambda )=\frac{\lambda ^{k}}{k!}e^{-\lambda }, \end{aligned}$$
then \(\xi \) is called a random variable obeying the Poisson distribution. It can be proved that the expected value and the variance of the Poisson distribution \(\xi \) are both \(\lambda \). When p is very small, the random variable \(\xi _{n}\) of the n-fold Bernoulli experiment can be considered close to the Poisson distribution \(\xi \). In this case, the probability distribution function b(k; n, p) can be approximately replaced by the Poisson distribution, that is
$$\begin{aligned} b(k;n,p)\approx \frac{(np)^{k}}{k!}e^{-np}. \end{aligned}$$
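A quick numerical check of this approximation (illustrative only; n = 1000 and p = 0.002, so \(\lambda = np = 2\), are hypothetical values):

```python
from math import comb, exp, factorial

n, p = 1000, 0.002
lam = n * p
for k in range(5):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)     # b(k; n, p)
    poisson = lam**k / factorial(k) * exp(-lam)      # Poisson approximation
    print(k, round(binom, 6), round(poisson, 6))
```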

1.7 Stochastic Process

The so-called stochastic process considers the statistical characteristics of a family of random variables \(\{\xi _i\}_{i=1}^{n}\); we can describe it as an n-dimensional random vector. Let \(\{\xi _i\}_{i=1}^{n}\) be n random variables on a given probability space; then \(\xi =(\xi _1,\xi _2,\ldots ,\xi _n)\) is called an n-dimensional random vector with values in \(\mathbb {R}^n\) in this probability space.
A stochastic process or a n dimensional random vector \(\xi =(\xi _1,\xi _2,\ldots ,\xi _n)\) is uniquely determined by the occurrence probability of the following joint events. Let \(A(\xi _i)\subset \mathbb {R}\) be the value space of random variable \(\xi _i(1\le i\le n)\); then for any \((x_1,x_2, \ldots , x_n)\in A(\xi _1)\times A(\xi _2)\times \cdots \times A(\xi _n)\subset \mathbb {R}^n\), the probability of occurrence of the following joint event is denoted as
$$\begin{aligned} p(x_1x_2\cdots x_n)=p((x_1,x_2,\ldots ,x_n)) =P\{ \xi _1=x_1,\xi _2=x_2,\ldots ,\xi _n=x_n \}. \end{aligned}$$
Definition 1.7
If for any \(x_i\in \mathbb {R}(1\le i\le n)\), we have
$$\begin{aligned} p(x_1x_2\cdots x_n)=p(x_1)p(x_2)\cdots p(x_n). \end{aligned}$$
then the stochastic process \(\{\xi _i\}_{i=1}^{n}\) is said to be statistically independent.
Strictly speaking, each real number \(x_i\) in Definition 1.7 should be replaced by a Borel set on the line, to ensure that the event \(\{ \xi _i=x_i \}\) generated by \(\xi _i\) is an event in the given probability space.
Similarly, we can define a vector function \(F(x_1, x_2, \ldots , x_n)\) in \(\mathbb {R}^n\) as
$$\begin{aligned} F(x_1, x_2, \ldots , x_n)=P\{\xi _1\le x_1, \xi _2\le x_2, \ldots ,\xi _n\le x_n\}, \end{aligned}$$
This is the distribution function of random vector \(\xi =(\xi _1,\xi _2,\ldots ,\xi _n)\). Its marginal distribution function is
$$\begin{aligned} F_i(x_i)=P\{ \xi _i\le x_i \}=F(+\infty , +\infty ,\ldots , x_i,+\infty ,\ldots ,+\infty ). \end{aligned}$$
We state the following properties of stochastic processes without proof; the reader can find them in classical probability theory textbooks (see R\(\acute{e}\)nyi 1970; Li 2010; Long 2020).
Lemma 1.20
(1) A stochastic process \(\{ \xi _i \}_{i=1}^{n}\) is statistically independent if and only if
$$\begin{aligned} F(x_1, x_2, \ldots , x_n)=F_1(x_1)F_2(x_2)\cdots F_n(x_n). \end{aligned}$$
(2) Suppose \(\{\xi _i\}_{i=1}^{n}\) is statistically independent; then for any real valued functions \(g_i(x)\), \(\{g_i(\xi _i)\}_{i=1}^{n}\) is also statistically independent.
(3) If \(\xi _1, \xi _2, \ldots , \xi _n\) are n random variables whose expected values exist, then
$$\begin{aligned} E(\xi _1+\xi _2+\cdots + \xi _n)=E(\xi _1)+E(\xi _2)+\cdots + E(\xi _n). \end{aligned}$$
(4) If \(\{\xi _i\}_{i=1}^{n}\) is statistically independent and the expected value \(E(\xi _i)\) of each random variable exists, then the expected value of the product \(\xi _1\xi _2\cdots \xi _n\) exists, and
$$\begin{aligned} E(\xi _1\xi _2\cdots \xi _n)=E(\xi _1)E(\xi _2)\cdots E(\xi _n). \end{aligned}$$
Definition 1.8
Let \(\{\xi _i\}_{i=1}^{\infty }\) be a sequence of random variables and \(\xi \) a given random variable. If for any \(\varepsilon >0\), we have
$$\begin{aligned} \lim _{n\rightarrow \infty }P\{|\xi _n-\xi |>\varepsilon \}=0, \end{aligned}$$
it is called \(\{\xi _n\}\) converges to \(\xi \) in probability, denoted as \(\xi _n {\mathop {\longrightarrow }\limits ^{P}} \xi \).
Obviously, \(\xi _n {\mathop {\longrightarrow }\limits ^{P}} \xi \) if and only if for any \(\varepsilon >0\), there is
$$\begin{aligned} \lim _{n\rightarrow \infty }P\{|\xi _n-\xi |\le \varepsilon \}=1. \end{aligned}$$
If the occurrence probability of an event is p, then the frequency of the event in repeated statistical trials gradually approaches its probability p. The precise mathematical statement and proof of this fact constitute Bernoulli’s law of large numbers.
Theorem 1.2
(Bernoulli) Let \(\mu _{n}\) be the number of occurrences of an event A in an n-fold Bernoulli experiment, where the probability that A occurs in each trial is \(p\,(0<p<1)\). Then the frequency \(\{\frac{\mu _{n}}{n}\}\) of A converges in probability to p; that is, for any \(\varepsilon >0\), we have
$$\begin{aligned} \lim _{n\rightarrow \infty }P\{|\frac{\mu _{n}}{n}-p|>\varepsilon \}=0. \end{aligned}$$
Proof
Consider \(\frac{\mu _{n}}{n}\) as a random variable; its expected value and variance are
$$ E(\frac{\mu _{n}}{n})=\frac{1}{n}E(\mu _{n})=p $$
and
$$ D(\frac{\mu _{n}}{n})=\frac{1}{n^2}D(\mu _{n})=\frac{pq}{n},~ q=1-p. $$
respectively. By the Chebyshev inequality (1.34), we have
$$\begin{aligned} P\{|\frac{\mu _{n}}{n}-p|>\varepsilon \}<\frac{pq}{n\varepsilon ^{2}}. \end{aligned}$$
For any given \(\varepsilon >0\), we have
$$\begin{aligned} \lim _{n\rightarrow \infty }P\{|\frac{\mu _{n}}{n}-p|>\varepsilon \}=0. \end{aligned}$$
So Bernoulli’s law of large numbers holds.
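The convergence asserted by the theorem can also be observed numerically. The following Python sketch (the success probability p = 0.3 and the sample sizes are illustrative assumptions) simulates n-fold Bernoulli trials and prints the frequency \(\mu _{n}/n\):
```python
import random

# Simulate n independent Bernoulli trials with success probability p and
# print the frequency mu_n / n; by Theorem 1.2 it converges to p in
# probability.  p = 0.3 and the sample sizes are illustrative assumptions.
random.seed(0)
p = 0.3

for n in (100, 10_000, 1_000_000):
    mu_n = sum(1 for _ in range(n) if random.random() < p)
    print(f"n={n}: frequency = {mu_n / n:.4f}")
```
As n grows, the printed frequencies settle near p, and the Chebyshev bound \(pq/(n\varepsilon ^{2})\) used in the proof quantifies how fast the exceptional probability shrinks.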
In order to better understand Bernoulli’s law of large numbers, we can describe it by means of a stochastic process. Define
$$\begin{aligned} \xi _{i}={\left\{ \begin{aligned}&1, \quad \text {if event} \; A \; \text {occurs in the}\; i\text {-th experiment},\\&0, \quad \text {if event} \; A \; \text {does not occur in the}\; i\text {-th experiment}. \end{aligned} \right. } \end{aligned}$$
Then \(\xi _{i}\) follows a two-point distribution with parameter p (see Sect. 1.6, example 1.3), and \(\{\xi _{i}\}_{i=1}^{+\infty }\) is an independent and identically distributed stochastic process. Obviously,
$$\begin{aligned} \mu _{n}=\sum _{i=1}^{n}\xi _{i}, ~E(\xi _{i})=p. \end{aligned}$$
So Bernoulli’s law of large numbers can be rewritten as follows:
$$\begin{aligned} \lim _{n\rightarrow \infty }P\left\{ |\frac{1}{n}\sum _{i=1}^{n}\xi _{i}-\frac{1}{n}E (\sum _{i=1}^{n}\xi _{i})|<\varepsilon \right\} =1, \end{aligned}$$
where \(\{\xi _{i}\}\) is a sequence of independent random variables with the same two-point \(0-1\) distribution with parameter p. It is not difficult to generalize this conclusion to a more general setting.
Theorem 1.3
(Chebyshev’s law of large numbers) Let \(\{\xi _{i}\}_{i=1}^{+\infty }\) be a sequence of independent random variables whose expected values \(E(\xi _{i})\) and variances \(D(\xi _{i})\) exist, with the variances uniformly bounded, i.e., \(D(\xi _{i})\le C\) for all \(i\ge 1\). Then for any \(\varepsilon >0\), we have
$$\begin{aligned} \begin{aligned} \lim _{n\rightarrow \infty }P\{|\frac{1}{n}\sum _{i=1}^{n}\xi _{i}-\frac{1}{n}\sum _{i=1}^{n}E(\xi _{i})|<\varepsilon \}=1. \end{aligned} \end{aligned}$$
Proof
By Chebyshev inequality,
$$\begin{aligned} \begin{aligned}&P\{|\frac{1}{n}\sum _{i=1}^{n}\xi _{i}-\frac{1}{n}E(\sum _{i=1}^{n}\xi _{i})|\ge \varepsilon \}\\&\le \frac{D(\frac{1}{n}\sum _{i=1}^{n}\xi _{i})}{\varepsilon ^{2}}\\&=\frac{D(\sum _{i=1}^{n}\xi _{i})}{n^{2}\varepsilon ^{2}}\\&=\frac{1}{n^{2}\varepsilon ^{2}}\sum _{i=1}^{n}D(\xi _{i})\\&\le \frac{C}{n\varepsilon ^{2}}. \end{aligned} \end{aligned}$$
Therefore,
$$\begin{aligned} \begin{aligned} \lim _{n\rightarrow \infty }P\{|\frac{1}{n}\sum _{i=1}^{n}\xi _{i}-\frac{1}{n}\sum _{i=1}^{n}E(\xi _{i})|\ge \varepsilon \}=0. \end{aligned} \end{aligned}$$
That is, Theorem 1.3 holds.
Chebyshev’s law of large numbers is more general than Bernoulli’s: for a sequence of independent random variables \(\{\xi _{i}\}\), the arithmetic mean of the random variables converges in probability to the arithmetic mean of their expected values.
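To see Chebyshev’s law of large numbers at work for variables that are independent but not identically distributed, one may take, as a purely illustrative assumption, \(\xi _{i}\) uniform on \((0,a_i)\) with \(a_i\) cycling through 1, 2, 3, so that \(E(\xi _{i})=a_i/2\) and \(D(\xi _{i})=a_i^{2}/12\le 3/4\) is uniformly bounded. A short Python sketch:
```python
import random

# Independent but not identically distributed variables: xi_i is uniform on
# (0, a_i) with a_i cycling through 1, 2, 3, so E(xi_i) = a_i / 2 and
# D(xi_i) = a_i^2 / 12 <= 3/4 is uniformly bounded (the setting of Theorem 1.3).
# The parameters are illustrative assumptions, not taken from the text.
random.seed(1)
N = 1_000_000
a = [1 + (i % 3) for i in range(N)]

for n in (1_000, 100_000, N):
    mean_of_xi = sum(random.uniform(0, a[i]) for i in range(n)) / n
    mean_of_expectations = sum(a[i] / 2 for i in range(n)) / n
    print(f"n={n}: difference = {abs(mean_of_xi - mean_of_expectations):.4f}")
```
The printed difference between the arithmetic mean of the variables and the arithmetic mean of their expectations shrinks toward zero as n grows, as the theorem predicts.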
As a special case, we consider an independent and identically distributed stochastic process \(\{\xi _{i}\}\). Since the variables share the same probability distribution, they share the same expectation and variance.
Corollary 1.4
Let \(\{\xi _{i}\}\) be an independent and identically distributed stochastic process with common expectation \(\mu \) and variance \(\sigma ^{2}\), that is, \(E(\xi _{i})=\mu ,D(\xi _{i})=\sigma ^{2}\,(i=1,2,\ldots )\). Then we have
$$\begin{aligned} \begin{aligned} \lim _{n\rightarrow \infty }P\{|\frac{1}{n}\sum _{i=1}^{n}\xi _{i}-\mu |<\varepsilon \}=1, \end{aligned} \end{aligned}$$
that is \(\frac{1}{n}\sum _{i=1}^{n}\xi _{i}{\mathop {\longrightarrow }\limits ^{P}} \mu \).
In the above corollary, the existence of the variance is unnecessary: Khinchin proved that for an independent and identically distributed stochastic process \(\{\xi _{i}\}\), as long as the expected value \(E(\xi _{i})=\mu \) exists, \(\frac{1}{n}\sum _{i=1}^{n}\xi _{i}\) converges to its expected value in probability. This conclusion is called Khinchin’s law of large numbers.
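Khinchin’s law of large numbers can be illustrated by the following sketch (an added example with assumed parameters): a Pareto random variable with shape parameter 1.5 has expected value 3 but infinite variance, so Corollary 1.4 does not apply directly, yet the sample means still stabilize near 3:
```python
import random

# A Pareto variable with shape alpha = 1.5 (and minimum 1) has expected value
# alpha / (alpha - 1) = 3 but infinite variance.  Khinchin's law of large
# numbers still applies: the sample mean converges to 3 in probability,
# although noticeably more slowly than in the bounded-variance case.
random.seed(2)
alpha, mu = 1.5, 3.0

for n in (1_000, 100_000, 1_000_000):
    sample_mean = sum(random.paretovariate(alpha) for _ in range(n)) / n
    print(f"n={n}: sample mean = {sample_mean:.3f}   (mu = {mu})")
```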
Finally, we state the so-called Lindeberg-Lévy central limit theorem without proof.
Theorem 1.4
(central limit theorem) Let \(\{\xi _{i}\}_{i=1}^{+\infty }\) be an independent and identically distributed stochastic process with expected value \(E(\xi _{i})=\mu \) and variance \(D(\xi _{i})=\sigma ^{2}>0\,(i=1,2,\ldots )\). Then for any x, we have
$$\begin{aligned} \lim _{n\rightarrow \infty }P\{\frac{\sum _{i=1}^{n}\xi _{i}-n\mu }{\sigma \sqrt{n}}\le x\}=\frac{1}{\sqrt{2\pi }}\int \limits _{-\infty }^{x} e^{-\frac{t^{2}}{2}} \mathrm {d} t, \end{aligned}$$
That is, the standardized variable of the sum \(\sum _{i=1}^{n}\xi _{i}\) converges in distribution to the standard normal distribution N(0, 1).
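This statement can be checked by simulation. The following Python sketch (an added illustration; the choice of n = 50 uniform(0, 1) summands, for which \(\mu =1/2\) and \(\sigma ^{2}=1/12\), and 20,000 repetitions are assumptions made only for the example) compares the empirical distribution of the standardized sums with the standard normal distribution function:
```python
import math
import random

# Monte Carlo check of the central limit theorem: standardize the sum of
# n i.i.d. uniform(0, 1) variables (mu = 1/2, sigma^2 = 1/12) and compare
# the empirical distribution of the standardized sums with the standard
# normal distribution function Phi.  n and the number of repetitions are
# illustrative assumptions.
random.seed(3)
n, trials = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)

def standardized_sum() -> float:
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

samples = [standardized_sum() for _ in range(trials)]

def phi(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

for x in (-1.0, 0.0, 1.0):
    empirical = sum(1 for s in samples if s <= x) / trials
    print(f"x = {x:+.1f}: empirical = {empirical:.3f}, Phi(x) = {phi(x):.3f}")
```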
Exercise 1 (Nie and Ding 2000)
1.
Let A, B, C be three nonempty sets, \(A{\mathop {\longrightarrow }\limits ^{\sigma }}B\) a given mapping, and \(\tau _{1}\) and \(\tau _{2}\) any two mappings of \(B\rightarrow C\). Prove: if \(\sigma \) is surjective and \(\tau _{1}\sigma =\tau _{2}\sigma \), then \(\tau _{1}=\tau _{2}\).
 
2.
Let \(\tau _{1}\) and \(\tau _{2}\) be any two mappings of \(A\rightarrow B\), and let \(\sigma \) be a given mapping of \(B\rightarrow C\). Prove: if \(\sigma \) is injective and \(\sigma \tau _{1}=\sigma \tau _{2}\), then \(\tau _{1}=\tau _{2}\).
 
3.
Let \(A{\mathop {\longrightarrow }\limits ^{\sigma }}B\) be an injective mapping and \(\tau :B\rightarrow A\) a left inverse of \(\sigma \). Is the left inverse \(\tau \) of \(\sigma \) unique?
 
4.
Let \(A{\mathop {\longrightarrow }\limits ^{\sigma }}B\) be a surjective mapping. Is the right inverse of \(\sigma \) unique?
 
5.
Suppose that a, m, n are integers with \(a\ge 0, m\ge 1, n\ge 1\) and \(m\ne n\), prove
$$(a^{2^{m}}+1,a^{2^{n}}+1)=1~\text {or}~ 2.$$
Thus prove Pólya’s theorem: there are infinitely many primes.
 
6.
On the positive integer set, the Möbius function \(\mu (n)\) is defined as
$$ \mu (n)= {\left\{ \begin{array}{ll}1,\quad \ \ &{} \text {when} ~n=1,\\ 0,\quad \ \ &{} \text {when}~n~\text {contains a square factor},\\ (-1)^{t},\quad \ \ &{} \text {when}~n=p_{1}p_{2}\cdots p_{t},~ p_{i}~\text {are distinct primes}. \end{array}\right. } $$
Prove Möbius identity
$$ \sum _{d|n}\mu (d)= {\left\{ \begin{array}{ll}1,\quad \ \ &{} \text {when} ~n=1,\\ 0,\quad \ \ &{} \text {when} ~n>1. \end{array}\right. } $$
 
7.
Suppose \(\varphi (n)\) is the Euler function; prove
$$\varphi (n)=n\sum _{d|n} \frac{\mu (d)}{d}.$$
 
8.
Let \(n>1\) be a positive integer; prove Wilson’s theorem:
$$(n-1)!+1\equiv 0({{\,\mathrm{mod}\,}}n)$$
if and only if n is prime.
 
9.
Let n and b be positive integers with \(b\ge 2\) and \(n>b\); prove that n can be uniquely expressed in the following b-ary form:
$$n=b_{0}+b_{1}b+b_{2}b^{2}+\cdots +b_{r-1}b^{r-1}, \text {where}~0\le b_{i}<b, r\ge 1.$$
\(n=(b_{r-1}b_{r-2}\cdots b_{1}b_{0})_{b}\) is called the b-ary expression of n, and r is called the number of b-ary digits of n.
 
10.
Let f(n) be a complex-valued function on the set of positive integers, and prove the Möbius inversion formula:
$$F(n)=\sum _{d|n}f(d), ~\forall n\ge 1\Leftrightarrow f(n)=\sum _{d|n} \mu (d)F(\frac{n}{d}), ~\forall n\ge 1.$$
 
11.
Prove the following sum formula for \(n\ge 2\):
$$\sum _{\begin{array}{c} 1\le r\le n \\ (r,n)=1 \end{array}}r=\frac{n\varphi (n)}{2}.$$
 
12.
Prove: there are infinitely many primes p satisfying \(p\equiv -1({{\,\mathrm{mod}\,}}6)\).
 
13.
Solve the congruence equation: \(27x\equiv 25({{\,\mathrm{mod}\,}}31)\).
 
14.
Let p be a prime and \(n\ge 1\) a positive integer; find the number of solutions of the quadratic congruence equation \(x^{2}\equiv 1({{\,\mathrm{mod}\,}}p^{n})\).
 
15.
In order to reduce the number of games, 20 teams are divided into two groups, each with 10 teams. Find the probability that the two strongest teams are in the same group, and the probability that they are in different groups.
 
16.
(Banach’s matchbox problem) A mathematician has two boxes of matches, each containing N matches. Whenever he needs a match, he takes one from a box chosen at random. Find the probability that, at the moment one box is found to be empty, the other box contains exactly k matches.
 
17.
A stick of length l is broken at two points chosen at random; find the probability that the three pieces can form a triangle.
 
18.
There are k jars, each containing n balls numbered from 1 to n. One ball is drawn at random from each jar; find the probability that m is the largest number drawn.
 
19.
Take any three of the five numbers 1, 2, 3, 4, 5 and arrange them in increasing order. Let X denote the middle number; find the probability distribution of X.
 
20.
Let F(x) be the distribution function of a continuous random variable and \(a>0\); prove
$$ \int \limits _{-\infty }^{+\infty } |F(x+a)-F(x)|dx =a.$$
 
21.
(Generalization of Bernoulli’s law of large numbers) Let \(\mu _{n}\) be the number of occurrences of event A in the first n trials of a sequence of independent Bernoulli experiments, where the probability that A occurs in the i-th trial is \(p_{i}\). State the corresponding law of large numbers and prove it.
 
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Literature
Hardy, G. H., & Wright, E. M. (1979). An introduction to the theory of numbers. Oxford University Press.
Jacobson, N. (1989). Basic algebra (I). Translated by the Department of Algebra, Department of Mathematics, Shanghai Normal University. Beijing: Higher Education Press (in Chinese).
LeVeque, W. J. (1977). Fundamentals of number theory. Addison-Wesley.
Li, X. (2010). Basic probability theory. Higher Education Press (in Chinese).
Lidl, R., & Niederreiter, H. (1983). Finite fields. Addison-Wesley.
Long, Y. (2020). Probability theory and mathematical statistics. Higher Education Press (in Chinese).
Nie, L., & Ding, S. (2000). Introduction to algebra. Higher Education Press (in Chinese).
Rényi, A. (1970). Probability theory. North-Holland.
Rosen, K. H. (1984). Elementary number theory and its applications. Addison-Wesley.
Spencer, D. (1982). Computers in number theory. Computer Science Press.
van der Waerden, B. L. (1963). Algebra (I). Translated by Shisun Ding, Kencheng Zeng, & Fuxin Hao. Beijing: Science Press (in Chinese).
van der Waerden, B. L. (1976). Algebra (II). Translated by Xihua Cao, Kencheng Zeng, & Fuxin Hao. Beijing: Science Press (in Chinese).
van Lint, J. H. (1991). Introduction to coding theory. Springer.