2020 | Original Paper | Book Chapter
Published in:
Statistical Learning with Math and R
In this chapter, we consider constructing a classification rule from covariates to a response that takes values in a finite set, such as \(\pm 1\) or the digits \(0,1,\ldots ,9\). For example, we may wish to classify a postal code from handwritten characters, i.e., to construct a rule between them. First, we consider logistic regression, which minimizes the error rate in the test data after constructing a classifier from the training data. The second approach is to draw borders that separate the regions of the responses, using linear and quadratic discriminators and the k-nearest neighbor algorithm. The linear and quadratic discriminators draw linear and quadratic borders, respectively, and both introduce the notion of prior probability to minimize the average error probability. The k-nearest neighbor method searches for the border more flexibly than the linear and quadratic discriminators. On the other hand, we take into account the balance between two risks, such as classifying a sick person as healthy and classifying a healthy person as sick. In particular, we consider an alternative approach beyond minimizing the average error probability. The regression method in the previous chapter and the classification methods in this chapter are two significant issues in the field of machine learning.
19.
We assume that there exist \(\beta _0\in {\mathbb R}\) and \(\beta \in {\mathbb R}^p\) such that for \(x\in {\mathbb R}^p\), the probabilities of \(Y=1\) and \(Y=-1\) are \(\displaystyle \frac{e^{\beta _0+x\beta }}{1+e^{\beta _0+x\beta }}\) and \(\displaystyle \frac{1}{1+e^{\beta _0+x\beta }}\), respectively. Show that the probability of \(Y=y\in \{-1,1\}\) can be written as \(\displaystyle \frac{1}{1+e^{-y(\beta _0+x\beta )}}\).
20.
For \(p=1\) and \(\beta >0\), show that the function \(f(x)=\displaystyle \frac{1}{1+e^{-(\beta _0+x\beta )}}\) is monotonically increasing for \(x\in {\mathbb R}\), and is convex for \(x<-\beta _0/\beta \) and concave for \(x>-\beta _0/\beta \). How does the function change as \(\beta \) increases? Execute the following to answer this question.
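The code block referred to here did not survive extraction; the following is a minimal sketch that plots \(f(x)\) for several slopes \(\beta \), assuming \(\beta _0=0\) (an illustrative choice, not from the original).

```r
# Sketch (the printed code was lost): plot the sigmoid for several
# slopes beta, assuming beta_0 = 0.
f <- function(x, beta_0, beta) 1 / (1 + exp(-(beta_0 + x * beta)))
x <- seq(-10, 10, 0.1)
beta_seq <- c(0.5, 1, 2, 10)
plot(0, 0, type = "n", xlim = c(-10, 10), ylim = c(0, 1),
     xlab = "x", ylab = "f(x)")
for (i in seq_along(beta_seq)) lines(x, f(x, 0, beta_seq[i]), col = i + 1)
legend("topleft", legend = paste("beta =", beta_seq),
       col = seq_along(beta_seq) + 1, lwd = 1)
```

As \(\beta \) increases, the transition around \(x=-\beta _0/\beta \) becomes steeper, approaching a step function.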
21.
We wish to obtain the estimates of \(\beta _0\in {\mathbb R}\) and \(\beta \in {\mathbb R}^p\) from observations \((x_1,y_1),\ldots ,(x_N,y_N)\in {\mathbb R}^{p}\times \{-1,1\}\) by maximizing the likelihood \(\displaystyle \prod _{i=1}^N\frac{1}{1+e^{-y_i(\beta _0+x_i\beta )}}\), or equivalently, by minimizing the negated logarithm
$$l(\beta _0,\beta )=\sum _{i=1}^N\log (1+v_i)\ ,\quad v_i=e^{-y_i(\beta _0+x_i\beta )}$$
(maximum likelihood). Show that \(l(\beta _0,\beta )\) is convex by obtaining the derivative \(\nabla l(\beta _0,\beta )\) and the second derivative \(\nabla ^2 l(\beta _0,\beta )\). Hint: Let \(\nabla l(\beta _0,\beta )\) be the column vector of size \(p+1\) whose jth element is \(\displaystyle \frac{\partial l}{\partial \beta _j}\), and let \(\nabla ^2 l(\beta _0,\beta )\) be the matrix of size \((p+1)\times (p+1)\) whose (j, k)th element is \(\displaystyle \frac{\partial ^2 l}{\partial \beta _j\partial \beta _k}\). It suffices to show that the matrix is nonnegative definite. To this end, show that \(\nabla ^2 l(\beta _0,\beta )=X^TWX\) for a diagonal matrix W with nonnegative entries. Since W is diagonal, it can be written as \(W=U^TU\), where the diagonal elements of U are the square roots of those of W, which means \(\nabla ^2 l(\beta _0,\beta )=(UX)^TUX\).
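As a check on the calculation the problem asks for, the loss, its gradient, and its Hessian can be coded directly. A sketch, assuming X carries a leading column of ones and y takes values in \(\{-1,1\}\):

```r
# Negated log-likelihood l, its gradient, and its Hessian (a sketch).
nll <- function(b, X, y) sum(log(1 + exp(-y * as.vector(X %*% b))))
grad_nll <- function(b, X, y) {
  v <- exp(-y * as.vector(X %*% b))
  -t(X) %*% (y * v / (1 + v))          # column vector of size p + 1
}
hess_nll <- function(b, X, y) {
  v <- exp(-y * as.vector(X %*% b))
  w <- v / (1 + v)^2                   # diagonal of W; always positive
  t(X) %*% (w * X)                     # X^T W X, hence nonnegative definite
}
```

The last line makes the hint concrete: the Hessian is \(X^TWX\) with positive diagonal weights \(v_i/(1+v_i)^2\), so it is nonnegative definite.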
22.
Solve the following equations via the Newton–Raphson method by constructing an R program.
(a)
For \(f(x)=x^2-1\), set \(x=2\) and repeat the recursion \(x\leftarrow x-f(x)/f'(x)\) 100 times.
(b)
For \(f(x,y)=x^2+y^2-1\) and \(g(x,y)=x+y\), set \((x,y)=(1,2)\) and repeat the recursion
$$ \left[ \begin{array}{c} x\\ y\\ \end{array} \right] \leftarrow \left[ \begin{array}{c} x\\ y\\ \end{array} \right] - \left[ \begin{array}{c@{\quad }c} \displaystyle \frac{\partial f(x,y)}{\partial x}&{}\displaystyle \frac{\partial f(x,y)}{\partial y}\\[3mm] \displaystyle \frac{\partial g(x,y)}{\partial x}&{}\displaystyle \frac{\partial g(x,y)}{\partial y}\\ \end{array} \right] ^{-1} \left[ \begin{array}{c} f(x,y)\\ g(x,y)\\ \end{array} \right] $$
100 times.
Hint: Define the procedure and repeat it 100 times.
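A minimal sketch of both recursions (the starting points are those given in the text; the derivatives are hard-coded for these particular f and g):

```r
# Newton-Raphson sketches for (a) and (b).
# (a) f(x) = x^2 - 1, starting from x = 2
f1 <- function(x) x^2 - 1
x <- 2
for (i in 1:100) x <- x - f1(x) / (2 * x)   # f'(x) = 2x
# x is now (numerically) the root 1

# (b) f(x,y) = x^2 + y^2 - 1, g(x,y) = x + y, starting from (x, y) = (1, 2)
z <- c(1, 2)
for (i in 1:100) {
  J <- matrix(c(2 * z[1], 2 * z[2],          # df/dx  df/dy
                1,        1),                # dg/dx  dg/dy
              2, 2, byrow = TRUE)
  z <- z - solve(J, c(z[1]^2 + z[2]^2 - 1, z[1] + z[2]))
}
# z is a point on the unit circle satisfying x + y = 0
```

The iterate in (b) converges to one of the two intersections of the circle \(x^2+y^2=1\) with the line \(x+y=0\); which one depends on the starting point.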
23.
We wish to solve \(\nabla l(\beta _0,\beta )=0\), \((\beta _0,\beta )\in {\mathbb R}\times {\mathbb R}^p\), in Problem 21 via the Newton–Raphson method using the recursion
$$(\beta _0,\beta )\leftarrow (\beta _0,\beta )-\{\nabla ^2 l(\beta _0,\beta )\}^{-1}\nabla l(\beta _0,\beta )\ ,$$
where \(\nabla f(v) \in {\mathbb R}^{p+1}\) and \(\nabla ^2 f(v) \in {\mathbb R}^{(p+1)\times (p+1)}\) are the vector whose ith element is \(\displaystyle \frac{\partial f}{\partial v_i}\) and the square matrix whose (i, j)th element is \(\displaystyle \frac{\partial ^2 f}{\partial v_i\partial v_j}\), respectively. In the following, for ease of notation, we write \((\beta _0,\beta )\in {\mathbb R}\times {\mathbb R}^p\) as \(\beta \in {\mathbb R}^{p+1}\). Show that the update rule can be written as follows:
$$\begin{aligned} \beta _{ new}\leftarrow (X^TW X)^{-1}X^TW z\ , \end{aligned}$$
(3.5)
where \(u \in {\mathbb R}^{N}\) is such that \(\nabla l(\beta _{ old})=-X^Tu\), \(W\in {\mathbb R}^{N\times N}\) is such that \(\nabla ^2 l(\beta _{ old})=X^TW X\), \(z\in {\mathbb R}^N\) is defined by \(z:=X\beta _{ old}+W^{-1}u\), and \(X^TWX\) is assumed to be nonsingular. Hint: The update rule can be written as \(\beta _{ new}\leftarrow \beta _{ old}+(X^TWX)^{-1}X^Tu\).
24.
We construct a procedure to solve Problem 23. Fill in blanks (1), (2), and (3), and check that the procedure works.
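The blanked-out program itself was not preserved here; the following is a sketch of the update (3.5) with all variable names and the simulated data our own (X carries a leading column of ones, y takes values in \(\{-1,1\}\)):

```r
# Sketch: maximum likelihood via the update (3.5).
logistic_newton <- function(X, y, iter = 20) {
  beta <- rep(0, ncol(X))
  for (t in 1:iter) {
    s <- as.vector(X %*% beta)
    v <- exp(-y * s)
    u <- y * v / (1 + v)               # so that grad l = -X^T u
    w <- v / (1 + v)^2                 # diagonal of W
    z <- s + u / w                     # z = X beta_old + W^{-1} u
    beta <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
  }
  beta
}

# Hypothetical data for illustration
set.seed(1)
N <- 500; beta_true <- c(0.5, 1)
x <- rnorm(N)
y <- ifelse(runif(N) < 1 / (1 + exp(-(beta_true[1] + beta_true[2] * x))), 1, -1)
beta_hat <- logistic_newton(cbind(1, x), y)
```

Each iteration is a weighted least-squares fit of z on X, which is why the update is often called iteratively reweighted least squares.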
25.
If the condition \(y_i(\beta _0+x_i\beta )\ge 0\), \((x_i,y_i)\in {\mathbb R}^{p}\times \{-1,1\}\), \(i=1,\ldots ,N\), is met (i.e., the data are linearly separable), we cannot obtain the parameters of logistic regression via maximum likelihood. Why?
26.
For \(p=1\), we wish to estimate the parameters of logistic regression from N/2 training data and to predict the responses of the N/2 test data that are not used for training. Fill in the blanks and execute the program.
Hint: For prediction, check whether \(\beta _0+x\beta _1\) is positive or negative.
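The blanked program was not preserved; the experiment can be sketched as follows. The data-generating parameters are hypothetical, and glm is used here in place of the hand-coded estimator:

```r
# Sketch of the train/test experiment for p = 1.
set.seed(1)
N <- 1000
beta_0 <- 0.3; beta_1 <- 1.5                 # hypothetical true parameters
x <- rnorm(N)
y <- ifelse(runif(N) < 1 / (1 + exp(-(beta_0 + beta_1 * x))), 1, -1)
train <- 1:(N / 2)
df <- data.frame(x = x, y01 = (y + 1) / 2)   # glm expects y in {0, 1}
fit <- glm(y01 ~ x, family = binomial, data = df[train, ])
b <- coef(fit)
y_hat <- ifelse(b[1] + b[2] * x[-train] > 0, 1, -1)  # sign of beta_0 + x beta_1
acc <- mean(y_hat == y[-train])              # test accuracy
```

The prediction step is exactly the hint: classify by the sign of \(\hat\beta _0+x\hat\beta _1\).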
27.
In linear discrimination, let \(\pi _k\) be the prior probability of \(Y=k\) for \(k=1,\ldots ,m\) (\(m\ge 2\)), and let \(f_k(x)\) be the probability density function of the p covariates \(x\in {\mathbb R}^p\) given the response \(Y=k\), with mean \(\mu _k\in {\mathbb R}^p\) and covariance matrix \(\Sigma _k\in {\mathbb R}^{p\times p}\). We consider the set \(S_{k,l}\) of \(x\in {\mathbb R}^p\) such that
$$\frac{\pi _k f_k(x)}{\displaystyle \sum _{j=1}^m\pi _j f_j(x)}=\frac{\pi _l f_l(x)}{\displaystyle \sum _{j=1}^m\pi _j f_j(x)}$$
for \(k,l=1,\ldots , m\), \(k\not =l\).
(a)
Show that when \(\pi _k=\pi _l\), \(S_{k,l}\) is the set of \(x\in {\mathbb R}^p\) on the quadratic surface
$$ -(x-\mu _k)^T\Sigma ^{-1}_k(x-\mu _k)+(x-\mu _l)^T\Sigma ^{-1}_l(x-\mu _l)=\log \frac{\det \Sigma _k}{\det \Sigma _l}\ . $$
(b)
Show that when \(\Sigma _k=\Sigma _l\) (\(=\Sigma \)), \(S_{k,l}\) is the set of \(x\in {\mathbb R}^p\) on the surface \(a^Tx+b=0\) with \(a\in {\mathbb R}^p\) and \(b\in {\mathbb R}\), and express a and b using \(\mu _k,\mu _l,\Sigma ,\pi _k,\pi _l\).
(c)
When \(\pi _k=\pi _l\) and \(\Sigma _k=\Sigma _l\), show that the surface of (b) passes through the midpoint \(x=(\mu _k+\mu _l)/2\).
28.
In the following, we wish to estimate the distributions of two classes and draw the boundary line along which the class of maximum posterior probability switches. If the covariance matrices are assumed to be equal, how does the boundary change? Modify the program.
Hint: Modify the lines marked with #.
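The program to modify was not preserved; a minimal sketch of the underlying computation, on simulated data with our own parameter choices:

```r
# Sketch: fit a Gaussian to each class and classify by the larger
# posterior (equal priors assumed). Replacing S1 and S0 below by the
# pooled covariance -- the modification the exercise asks for -- turns
# the quadratic boundary into a linear one.
set.seed(1)
n <- 100
x1 <- matrix(rnorm(2 * n), ncol = 2) + 1    # class 1 centered at (1, 1)
x0 <- matrix(rnorm(2 * n), ncol = 2) - 1    # class 0 centered at (-1, -1)
mu1 <- colMeans(x1); S1 <- var(x1)
mu0 <- colMeans(x0); S0 <- var(x0)
log_ratio <- function(x) {                  # log posterior ratio (class 1 vs 0)
  as.numeric(
    -0.5 * t(x - mu1) %*% solve(S1, x - mu1) - 0.5 * log(det(S1)) +
     0.5 * t(x - mu0) %*% solve(S0, x - mu0) + 0.5 * log(det(S0)))
}
log_ratio(c(1, 1))    # positive: classified into class 1
```

The boundary is the zero set of log_ratio; with a common covariance the quadratic terms cancel and the zero set is a line.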
29.
Even in the case of three or more classes, we can select the class that maximizes the posterior probability. From the four covariates (sepal length, sepal width, petal length, petal width) of Fisher's iris data, we wish to identify the three species of iris (Setosa, Versicolor, and Virginica) via quadratic discrimination. Specifically, we learn rules from training data and evaluate them with test data. We assume \(N=150\) and \(p=4\); each of the three species contains 50 samples, so the prior probabilities are expected to be equal to 1/3. If we instead find that the prior probabilities of Setosa, Versicolor, and Virginica are 0.5, 0.25, and 0.25, respectively, how should the program be changed to determine the maximum posterior probability?
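A sketch of quadratic discrimination with the modified priors; the train/test split and variable names are our own illustration:

```r
# Sketch: quadratic discrimination on Fisher's iris data with priors
# (0.5, 0.25, 0.25); setting pi_k <- rep(1/3, 3) recovers the
# equal-prior rule. The log(pi_k[k]) term is the only change needed.
set.seed(1)
X <- as.matrix(iris[, 1:4]); y <- iris[, 5]
train <- sample(1:150, 75)
pi_k <- c(0.5, 0.25, 0.25)                  # modified prior probabilities
mu <- list(); S <- list()
for (k in 1:3) {
  Xk <- X[train, ][y[train] == levels(y)[k], , drop = FALSE]
  mu[[k]] <- colMeans(Xk); S[[k]] <- var(Xk)
}
score <- function(x, k)                     # log posterior up to a constant
  as.numeric(-0.5 * t(x - mu[[k]]) %*% solve(S[[k]], x - mu[[k]]) -
             0.5 * log(det(S[[k]])) + log(pi_k[k]))
pred <- apply(X[-train, ], 1,
              function(x) which.max(sapply(1:3, function(k) score(x, k))))
acc <- mean(levels(y)[pred] == y[-train])   # test accuracy
```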
30.
In the k-nearest neighbor method, we do not construct a specific rule from the training data \((x_1,y_1),\ldots ,(x_N,y_N)\in {\mathbb R}^{p}\times \)(finite set). Suppose that, given a new data point \(x_*\), the \(x_i\), \(i\in S\), are the k training data whose distances to \(x_*\) are the smallest, where S is a subset of \(\{1,\ldots ,N\}\) of size k. The k-nearest neighbor method predicts the response \(y_*\) of \(x_*\) by majority voting among the \(y_i\), \(i \in S\). If the vote is tied, we remove the \(i \in S\) such that the distance between \(x_i\) and \(x_*\) is the largest among the \(x_j\), \(j \in S\), and vote again. If S contains exactly one element, the majority is determined. The following procedure assumes a single test data point, but the method can be extended to more than one. Then, apply the method to the data in Problem 29.
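The procedure referred to above was not preserved; a sketch under the tie-breaking scheme just described, applied to the iris data of Problem 29 (the split is our own):

```r
# Sketch: k-nearest neighbor prediction for one test point, with ties
# broken by removing the farthest of the remaining neighbors and revoting.
knn_one <- function(X, y, x_star, k) {
  d <- sqrt(rowSums((X - matrix(x_star, nrow(X), ncol(X), byrow = TRUE))^2))
  S <- order(d)[1:k]
  repeat {
    tab <- table(y[S])
    top <- names(tab)[tab == max(tab)]
    if (length(top) == 1) return(top)        # unique majority (or |S| = 1)
    S <- S[-which.max(d[S])]                 # drop the farthest and revote
  }
}

set.seed(1)
X <- as.matrix(iris[, 1:4]); y <- as.character(iris[, 5])
train <- sample(1:150, 75)
pred <- sapply((1:150)[-train],
               function(i) knn_one(X[train, ], y[train], X[i, ], k = 3))
acc <- mean(pred == y[-train])               # test accuracy
```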
31.
Let \(f_1(x)\) and \(f_0(x)\) be the distributions of a measurement x for people with a disease and for those without it, respectively. For each positive \(\theta \), the person is judged to have the disease according to whether
$$\frac{f_1 (x)}{f_0 (x)} \ge \theta \ .$$
In the following, we suppose that the distributions of sick and healthy people are N(1, 1) and \(N(-1,1)\), respectively. Fill in the blank and draw the ROC curve.
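A sketch of the completed computation: since the likelihood ratio \(f_1(x)/f_0(x)\) for N(1, 1) versus \(N(-1,1)\) is increasing in x, sweeping \(\theta \) is equivalent to sweeping a cutoff c on x, and the true and false positive rates have closed forms.

```r
# Sketch: ROC curve for sick ~ N(1, 1) against healthy ~ N(-1, 1).
c_seq <- seq(-5, 5, 0.1)                    # cutoffs on x (monotone in theta)
tpr <- 1 - pnorm(c_seq, mean = 1, sd = 1)   # P(x >= c | sick)
fpr <- 1 - pnorm(c_seq, mean = -1, sd = 1)  # P(x >= c | healthy)
plot(fpr, tpr, type = "l", xlab = "False Positive", ylab = "True Positive",
     main = "ROC curve")
abline(0, 1, lty = 2)                       # chance line
```

The curve lies above the diagonal because the two means are separated; increasing the separation pushes it toward the top-left corner.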
Footnote 1: In this chapter, instead of \(\beta \in {\mathbb R}^{p+1}\), we separate the slope \(\beta \in {\mathbb R}^p\) and the intercept \(\beta _0\in {\mathbb R}\).
Title: Classification
DOI: https://doi.org/10.1007/9789811575686_3
Author: Joe Suzuki
Publisher: Springer Singapore
Sequence number: 3
Chapter number: Chapter 3