
2016 | Book

New Theory of Discriminant Analysis After R. Fisher

Advanced Research by the Feature Selection Method for Microarray Data


About this Book

This is the first book to compare eight LDFs on different types of datasets: Fisher's iris data, medical data with collinearities, Swiss banknote data, which are linearly separable data (LSD), a student pass/fail determination using student attributes, 18 pass/fail determinations using exam scores, Japanese automobile data, and six microarray datasets (the datasets), which are LSD. We developed the 100-fold cross-validation for small sample method (Method 1) instead of the leave-one-out (LOO) method. We proposed a simple model selection procedure that chooses the best model, the one with the minimum M2, and found that Revised IP-OLDF, based on the MNM criterion, is better than the other LDFs (its M2 is the smallest) on the above datasets.
We compared two statistical LDFs and six MP-based LDFs: Fisher's LDF, logistic regression, three SVMs, Revised IP-OLDF, and two other OLDFs. Only a hard-margin SVM (H-SVM) and Revised IP-OLDF can discriminate LSD theoretically (Problem 2). We solved the defect of the generalized inverse matrix technique (Problem 3).
For more than 10 years, many researchers have struggled to analyze microarray datasets, which are LSD (Problem 5). If we call a linearly separable model a "Matroska," each dataset contains numerous smaller Matroskas. We developed the Matroska feature-selection method (Method 2), which reveals the surprising structure of the dataset: the disjoint union of several small Matroskas. Our theory and methods reveal new facts for gene analysis.

Table of Contents

Frontmatter
Chapter 1. New Theory of Discriminant Analysis
Abstract
A new theory of discriminant analysis ("the Theory") after R. Fisher is explained. There are five serious problems with discriminant analysis. I solve these problems completely through five mathematical programming-based linear discriminant functions (MP-based LDFs). First, I develop an optimal linear discriminant function using integer programming (IP-OLDF) based on the minimum number of misclassifications (minimum NM, MNM) criterion. We consider discriminating data with n cases and p variables. Each case xi = (x1i, …, xpi) is a p-vector (i = 1, …, n). Because I formulate IP-OLDF in the p-dimensional discriminant coefficient space b, the n linear hyperplanes (xi × b + 1 = 0) divide the coefficient space into finitely many convex polyhedrons (CPs). All LDFs that correspond to interior points of the same CP misclassify the same k cases, and this clearly reveals the relationship between NM and the discriminant coefficients. Because there are finitely many CPs in the discriminant coefficient space, we should select a CP interior point with MNM. We call this CP the "optimal CP" (OCP). MNM decreases monotonously (MNMp ≥ MNM(p+1)). Therefore, if MNMp = 0, the MNMs of all models that include these p variables are zero. If the data are in general position, IP-OLDF finds a vertex of the true OCP. However, if the data are not in general position, such as the Student data, IP-OLDF might not find a vertex of the true OCP. Therefore, I develop Revised IP-OLDF, which searches for an interior point of the true OCP directly. If an LDF corresponds to a CP vertex or edge, more than p cases lie on the discriminant hyperplane, and the LDF cannot discriminate these cases correctly (Problem 1). This means that the reported NM might not be correct. Only Revised IP-OLDF is free from Problem 1. When IP-OLDF discriminates the Swiss banknote data, which have six variables, the MNM of the two-variable model (X4, X6) is zero. Therefore, the MNMs of the 16 models that include (X4, X6) are zero, and the other 47 models are not linearly separable. Although a hard-margin SVM (H-SVM) indicates linearly separable data (LSD) clearly, there is little research on LSD discrimination. Most statisticians erroneously believe that the purpose of discrimination is to discriminate overlapping data, not LSD. All LDFs, with the exception of H-SVM and Revised IP-OLDF, might not discriminate LSD correctly (Problem 2). Moreover, such LDFs cannot determine whether the data overlap or are LSD, whereas MNM = 0 means LSD and MNM > 0 means overlap. I demonstrate that Fisher's LDF and a quadratic discriminant function (QDF) cannot judge the pass/fail determination using examination scores and that the 18 error rates of both discriminant functions are very high. Using the Japanese-automobile data, I explain the defect of the generalized inverse matrix technique and show that QDF misclassifies all cases of class 1 to class 2 in a particular case (Problem 3). Fisher never formulated an equation for the standard errors (SEs) of the error rate and discriminant coefficient (Problem 4). The k-fold cross-validation for small sample method (Method 1) solves Problem 4. It offers the error rate means M1 and M2 from the training and validation samples, in addition to the 95 % confidence intervals (CIs) of the error rate and coefficients. I propose a simple and powerful model selection procedure that selects the model with the minimum M2 as the best model, instead of the leave-one-out (LOO) procedure. The best models of Revised IP-OLDF are better than those of the seven other LDFs.
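The MNM criterion above can be expressed as a compact mixed-integer program. The following is a minimal sketch, assuming a big-M formulation in the spirit of Revised IP-OLDF; it is written in Python with the PuLP modeling library rather than the author's LINGO, and the names X, y, and big_m are illustrative assumptions, not identifiers from the book.

import pulp

def mnm_ldf(X, y, big_m=10000.0):
    # X: n cases, each a list of p variable values; y: class labels in {+1, -1}.
    n, p = len(X), len(X[0])
    prob = pulp.LpProblem("MNM_sketch", pulp.LpMinimize)
    b = [pulp.LpVariable(f"b{j}") for j in range(p)]                # discriminant coefficients
    e = [pulp.LpVariable(f"e{i}", cat="Binary") for i in range(n)]  # 1 if case i may be misclassified
    prob += pulp.lpSum(e)                                           # objective: number of misclassifications (NM)
    for i in range(n):
        # Intercept fixed to 1, mirroring the hyperplanes x_i * b + 1 = 0 above;
        # if e[i] = 0, case i must lie on its own side of the discriminant hyperplane.
        prob += y[i] * (pulp.lpSum(X[i][j] * b[j] for j in range(p)) + 1) >= 1 - big_m * e[i]
    prob.solve()
    return [v.value() for v in b], int(round(sum(v.value() for v in e)))

An objective value of zero means MNM = 0, that is, a linearly separable model; a positive value is the minimum number of misclassifications for that set of variables.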
For more than ten years, many researchers have struggled to analyze the microarray dataset (the dataset), which is LSD (Problem 5). We call the linearly separable dataset the largest Matroska. Only Revised IP-OLDF can select features naturally and find a smaller gene set or subspace (a smaller Matroska) in the dataset. When we discriminate this smaller Matroska again, we can find a still smaller Matroska. When we can no longer find a smaller Matroska, I call the last one a small Matroska (SM), which is a linearly separable gene subspace. Because the dataset has this Matroska structure, I develop the Matroska feature-selection method (Method 2), which finds the surprising structure of the dataset: the disjoint union of several SMs, which are linearly separable subspaces or models. Now, we can analyze each SM very quickly because all SMs are small samples. The Theory is most suitable for analyzing these datasets.
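A rough outline of how Method 2 peels off SMs may help clarify the Matroska structure. The sketch below, in Python, is only a conceptual illustration under assumed names (columns, solver, and a tolerance of 1e-8); it is not the author's program, and it assumes a routine such as the MNM sketch above that returns the discriminant coefficients and the number of misclassifications.

def matroska_method2(columns, y, solver):
    # columns: dict mapping gene name -> list of expression values per case.
    pool = list(columns.keys())            # genes of the largest Matroska
    sms = []                               # disjoint small Matroskas (SMs)
    while pool:
        current = list(pool)
        while True:                        # shrink until the Matroska is irreducible
            X = [[columns[g][i] for g in current] for i in range(len(y))]
            coef, nm = solver(X, y)
            if nm > 0:                     # the remaining genes are no longer LSD
                return sms
            used = [g for g, c in zip(current, coef) if abs(c) > 1e-8]
            if not used or set(used) == set(current):
                break                      # an SM: linearly separable and irreducible
            current = used                 # descend into the smaller Matroska
        sms.append(current)
        pool = [g for g in pool if g not in current]   # peel the SM off and repeat
    return sms

Each returned SM is a linearly separable gene subspace; by the monotonic decrease of MNM, any model containing an SM is also linearly separable.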
Shuichi Shinmura
Chapter 2. Iris Data and Fisher’s Assumption
Abstract
Anderson collected the Iris data, which consist of three species: setosa, versicolor, and virginica. Each species has 50 cases with four variables. Because Fisher evaluated his LDF with these data, they are very popular for the evaluation of discriminant functions. Therefore, we call them "Fisher's Iris data." Because we can easily separate setosa from virginica and versicolor through a scatter plot, we usually discriminate two classes, such as virginica and versicolor. In this book, our main policy of discrimination consists of two parts: (1) Discriminate the original data by six MP-based LDFs, QDF, and RDA, in addition to two statistical LDFs. LINGO solves the six MP-based LDFs, namely Revised IP-OLDF, Revised LP-OLDF, Revised IPLP-OLDF, two S-SVMs, and H-SVM, explained in Sect. 2.3.3. Downloading a free version of LINGO with its manual from LINDO Systems Inc. allows anyone to analyze the data. JMP discriminates the data by QDF and RDA, in addition to two LDFs, namely Fisher's LDF and logistic regression. We evaluate nine discriminant functions by NM, except for H-SVM. (2) Generate resampling samples from the original data and discriminate them by the 100-fold cross-validation for small sample method (Method 1). We compare five MP-based LDFs and two statistical LDFs by the mean error rate of the validation sample (M2) and the 95 % CIs of the discriminant coefficients. We explain LINGO Program 2 of Method 1 in Chap. 9. Because the differences among the seven NMs obtained by LINGO Program 1 in Sect. 2.3.3 (excluding H-SVM) are small and we cannot rank the seven LDFs clearly, we should no longer use the Iris data to evaluate discriminant functions. Fisher proposed his LDF under Fisher's assumption. However, there are no actual test statistics to determine whether data satisfy Fisher's assumption. If the data satisfy Fisher's assumption, the NM of Fisher's LDF converges to MNM. Although there is no actual test for Fisher's assumption, we can confirm it by this idea. Section 2.3.3 describes LINGO Program 1 for the six MP-based LDFs that discriminate conventional data.
Shuichi Shinmura
Chapter 3. Cephalo-Pelvic Disproportion Data with Collinearities
Abstract
I discriminate the cephalo-pelvic disproportion (CPD) data. These data have a significant relationship with the Theory: (1) We evaluated a heuristic OLDF with these data. However, we could only evaluate a six-variable model because our CPU power was poor and because of the limitations of the heuristic OLDF; therefore, we could not extend our research. (2) These data consist of 240 patients with 19 independent variables. We identified three collinearities in these data and established how to remove them. (3) We found a strange trend in the NMs of QDF and found that QDF is fragile under collinearities. Moreover, the NM of Fisher's LDF did not decrease over the 19 models, from the one-variable model to the 19-variable model, selected by the forward and backward stepwise procedures. On the other hand, the NMs of our three MP-based optimal LDFs (OLDFs) decreased. (4) In the CPD data, we determined that a four-variable model is useful by the regression model selection procedure (plug-in rule 1). However, the new model selection procedure that uses Method 1 recommends a nine-variable model as the best model. We believe that the many variables and/or collinearities cause this difference. Because the Iris data have four variables and can satisfy Fisher's assumption, both the model selection procedure by regression analysis and the best-model procedure select the full model for seven LDFs, namely Revised IP-OLDF, Revised LP-OLDF, Revised IPLP-OLDF, SVM4, SVM1, Fisher's LDF, and logistic regression. This is the reason we should no longer use the Iris data as evaluation data. (5) The CPD data have many OCPs. This implies that Revised IP-OLDF can find several OCPs with the same MNM but different coefficient groups belonging to different OCPs. As a result, it is difficult to evaluate the 95 % CIs of the discriminant coefficients. In this chapter, we solve these problems by the 100-fold cross-validation for small sample method (Method 1) and the best models of the seven LDFs.
Shuichi Shinmura
Chapter 4. Student Data and Problem 1
Abstract
Student data consist of 40 students with six variables: study hours per day (X1), spending money per month (X2), drinking days per week (X3), gender (X4), smoking (X5), and examination score (X6). The amount of data is not large. We published four statistical books on SAS, SPSS, Statistica, and JMP using these data because readers can easily understand the meaning of the variables and data. Although we never believed the data would be helpful for our research, we discriminated them after we completed the analysis of the Iris, CPD, and random number data in 1999. When we discriminated these data using five variables with 70 points as the passing mark, we found a defect in IP-OLDF. Because the four numerical variables take integer values, two variables are the binary integers 0/1, and there are many overlapping cases, the obtained vertex of the convex polyhedron (CP) consists of more than (p + 1) cases and the obtained solution is not the true MNM. Although we had recognized Problems 1 and 4 before 1980, we did not realize that Problem 1 causes a defect in IP-OLDF. From the scatter plot of two variables, as indicated in Table 1.1, we found that the defect in IP-OLDF is a result of Problem 1. However, we could not fix this problem until 2006, when Revised IP-OLDF solved Problem 1 completely. In 2004, IP-OLDF found that the Swiss banknote data are LSD, and no LDFs, with the exception of Revised IP-OLDF and H-SVM, could discriminate LSD theoretically (Problem 2). In 2005, we were able to validate the discrimination of the original data (the training sample) by 20,000 resampling samples (the validation sample). After 2006, we could compare six MP-based LDFs and two statistical LDFs. After 2009, we developed the 100-fold cross-validation for small sample method (Method 1). Method 1 solves Problem 4, and the best model provides a clear evaluation of the eight LDFs. Although we could not yet explain the useful meaning of the 95 % CI of the coefficients, we completed the basic research in 2010. After 2010, applied research started on LSD discrimination using the pass/fail determination that employs examination scores. We found Problem 3, which was solved in 2013. In 2015, the applied research was completed because we could successfully explain the useful meaning of the 95 % CI of the coefficients, and Method 1 of the Theory solved Problem 4 completely. In October 2015, a young researcher, Ishii, presented challenging results on microarray datasets using principal component analysis (PCA). Because the researcher pointed to six microarray datasets on the homepage http://www.bioinf.ucd.ie/people/ian/, we developed the Matroska feature-selection method (Method 2) within 41 days. For more than ten years, many researchers have struggled to analyze microarray datasets because the datasets consist of few cases with a huge number of genes (Problem 5). The Theory is most suitable for Problem 5. Recently, many researchers have expected LASSO to solve Problem 5. Because Revised IP-OLDF selects features naturally, they should compare their results to ours on the Swiss banknote data, Japanese-automobile data, Student linearly separable data, and six microarray datasets. Such a comparison should be helpful for LASSO research.
Shuichi Shinmura
Chapter 5. Pass/Fail Determination Using Examination Scores
A Trivial Linear Discriminant Function
Abstract
In this chapter, we examine the k-fold cross-validation for small sample method (Method 1), which combines the resampling technique with k-fold cross-validation. With this breakthrough, we obtain the error rate means M1 and M2 for the training and validation samples, respectively, and the 95 % CIs of the discriminant coefficients and the error rate. Moreover, we propose a straightforward and powerful model selection procedure in which we select the model with the minimum M2 as the best model. We apply Method 1 and the model selection procedure to the pass/fail determination using examination scores. By setting the intercept to one for the seven LDFs, we obtain several good results, as follows: (1) M2 of Fisher's LDF is over 4.6 % worse than that of Revised IP-OLDF. (2) The soft-margin SVM (S-SVM) with penalty c = 1 (SVM1) is worse than the other five MP-based LDFs and logistic regression. (3) We obtain the 95 % CIs of the discriminant coefficients. If we take the coefficient medians of the seven LDFs, with the exception of Fisher's LDF, the coefficient medians are almost the same as the trivial LDF for the linearly separable model. (4) Although these datasets are LSD and show natural feature-selection, we do not introduce this theme because there are only four independent variables.
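The best-model rule itself can be stated in a few lines. The following Python fragment is an illustrative sketch, assuming a dictionary results that maps each candidate model to its 100 per-fold pairs of training and validation error rates; the names are placeholders, not identifiers from the book or its LINGO/JMP programs.

def best_model(results):
    summary = {}
    for model, folds in results.items():
        m1 = sum(tr for tr, va in folds) / len(folds)   # mean training error rate (M1)
        m2 = sum(va for tr, va in folds) / len(folds)   # mean validation error rate (M2)
        summary[model] = (m1, m2)
    # The best model is the candidate with the minimum M2.
    return min(summary.items(), key=lambda item: item[1][1])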
Shuichi Shinmura
Chapter 6. Best Model for Swiss Banknote Data
Explanation 1 of Matroska Feature-Selection Method (Method 2)
Abstract
When we discriminate the Swiss banknote data by IP-OLDF, we find that these data are linearly separable data (LSD). Because we examine all possible combinations of variables, we find that the two-variable model (X4, X6) is the minimum linearly separable model. A total of 16 models that include these two variables are linearly separable by the monotonic decrease of MNM (MNMp ≥ MNM(p+1)), and the other 47 models are not linearly separable. Therefore, we compare the eight LDFs by the best models with the minimum mean error rate in the validation sample (M2) and obtain good results. Although we could not explain the useful meaning of the 95 % CI of the discriminant coefficients until now, the pass/fail determination using examination scores provides a clear understanding by normalizing the coefficients in Chap. 5. Seven LDFs become trivial LDFs; only Fisher's LDF is not trivial. The seven LDFs are Revised IP-OLDF based on MNM, Revised LP-OLDF, Revised IPLP-OLDF, three SVMs, and logistic regression. We successfully explain the meaning of the coefficients. Therefore, we discuss the relationship between the best model and the coefficients more precisely using the Swiss banknote data in Chap. 6. We study LSD discrimination precisely by the Swiss banknote data, the Student linearly separable data in Chap. 4, six pass/fail determinations using examination scores in Chap. 5, and the Japanese-automobile data in Chap. 7. When we discriminate the six microarray datasets that are LSD in Chap. 8, only Revised IP-OLDF can naturally perform feature-selection and drastically reduce the high-dimensional gene space to a small gene space. In gene analysis, we call every linearly separable model a "Matroska." The full model is the largest Matroska, which includes all smaller Matroskas in it. As we already know, the smallest Matroska (the basic gene set, BGS) can explain the Matroska structure completely through the monotonic decrease of MNM. We propose the Matroska feature-selection method for the microarray dataset (Method 2). Because LSD discrimination is no longer popular, we explain Method 2 through detailed examples of the Swiss banknote and Japanese-automobile data. On the other hand, LASSO attempts feature-selection; if it cannot find the small Matroskas (SMs) in the dataset, it cannot explain the Matroska structure. The Swiss banknote data, Japanese-automobile data, and six microarray datasets are helpful for evaluating the usefulness of other feature-selection methods, including LASSO.
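The all-possible-combinations search over the six banknote variables can be illustrated as follows. This Python sketch is an assumption-laden illustration, not the author's program: columns holds the six variables, y the two classes, and mnm stands for any routine that returns the minimum number of misclassifications, such as the sketch in Chap. 1.

from itertools import combinations

def linearly_separable_models(columns, y, mnm):
    names = list(columns.keys())                      # e.g. X1, ..., X6
    separable = []
    for p in range(1, len(names) + 1):
        for subset in combinations(names, p):
            X = [[columns[v][i] for v in subset] for i in range(len(y))]
            if mnm(X, y) == 0:                        # MNM = 0 means a linearly separable model
                separable.append(subset)
    return separable

On the Swiss banknote data, such a search would list the 16 models that include (X4, X6) and none of the other 47 models.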
Shuichi Shinmura
Chapter 7. Japanese-Automobile Data
Explanation 2 of Matroska Feature-Selection Method (Method 2)
Abstract
Japanese-automobile data consist of 29 regular cars and 15 small cars with six independent variables: emission rate (X1), price (X2), number of seats (X3), CO2 (X4), fuel (X5), and sales (X6). The following points are important for this book: (1) LSD discrimination: We can easily recognize that these data are LSD because X1 and X3 each separate the two classes completely, as shown by two box–whisker plots. (2) Problem 3: The forward stepwise procedure selects X1, X2, X3, X4, X5, and X6 in this order. Although the MNM of Revised IP-OLDF and the NM of QDF are zero in the one-variable model (X1), QDF misclassifies all regular cars as small cars after X3 enters the model, because the value of X3 in all small cars is four (Problem 3). These data are very suitable for explaining Problem 3 because they are simpler than examination scores that use 100 items. (3) Explanation of Method 2 by these data: When we discriminate the six microarray datasets by eight LDFs, only Revised IP-OLDF can naturally perform feature-selection and reduce the high-dimensional gene space to a small gene subspace that is a linearly separable model. We call these subspaces "Matroskas." We establish the Matroska feature-selection method for the microarray dataset (Method 2), and the data consist of several disjoint small Matroskas with MNM = 0. Because LSD discrimination is not popular now and Method 2 involves several unfamiliar ideas, we explain these ideas by these data in addition to the Swiss banknote data in Chap. 6 and the Student linearly separable data in Chap. 4. If the data are LSD, the full model is the largest Matroska, which contains many smaller Matroskas in it. We already know that the smallest Matroska (the basic gene set or subspace, BGS) can describe the Matroska structure completely because MNM decreases monotonously. On the other hand, LASSO attempts feature-selection; if it cannot find the BGSs in the dataset, it cannot explain the dataset structure. Therefore, LASSO researchers had better examine their method on these two common datasets before examining microarray datasets. If they are not successful on these ordinary data, it is not logical for them to expect a successful result in gene analysis. In particular, the Japanese-automobile data are simple data for feature-selection because only two one-variable models are linearly separable and are BGSs.
Shuichi Shinmura
Chapter 8. Matroska Feature-Selection Method for Microarray Dataset (Method 2)
Abstract
In this chapter, we introduce the Matroska feature-selection method (Method 2) for the microarray dataset (the dataset). We have already established the new theory of discriminant analysis (the Theory) and developed Revised IP-OLDF. Discriminant analysis has five serious problems. We could not discriminate cases on the discriminant hyperplane correctly (Problem 1); only Revised IP-OLDF solves this problem theoretically. Only H-SVM and Revised IP-OLDF can discriminate linearly separable data (LSD) theoretically (Problem 2). Problem 3 was that the generalized inverse matrix technique and QDF misclassify all cases to another class in a particular case; we solved Problem 3. Fisher never formulated the standard-error equation for the error rate and discriminant coefficient (Problem 4). We developed the 100-fold cross-validation for small sample method (Method 1) instead of the LOO procedure. Method 1 offers 95 % CIs for the error rate and coefficients. We obtained two error rate means, M1 and M2, in the training and validation samples, and proposed a simple model selection procedure to choose the best model with the minimum M2. We compared two statistical LDFs and six MP-based LDFs: Fisher's LDF, logistic regression, H-SVM, two S-SVMs, Revised IP-OLDF, and two other OLDFs. The best model of Revised IP-OLDF, based on the MNM criterion, was found to be better than the seven other best models (M2s) in the six different types of data. For more than ten years, many researchers have struggled to analyze the microarray dataset (Problem 5). Only Revised IP-OLDF can naturally select features. We developed the Matroska feature-selection method (Method 2), which finds a surprising dataset structure: the disjoint union of several linearly separable subspaces (small Matroskas, SMs). Now, we can analyze each SM very quickly. Recently, many researchers have focused on LASSO for feature-selection, the same goal as Method 2. This chapter offers useful datasets and results for LASSO research on the following points:
1. Can LDF by LASSO discriminate our eight different types of datasets exactly?
2. Can LDF by LASSO find the Matroska structure correctly and list all of the smallest basic gene sets or subspaces (BGSs)?
Shuichi Shinmura
Chapter 9. LINGO Program 2 of Method 1
Abstract
Although Fisher established statistical discriminant analysis based on the variance–covariance matrix, he did not define an equation for the SE of the error rate or discriminant coefficient. Therefore, we propose the 100-fold cross-validation for small sample method (Method 1). Method 1 is the combination of resampling and k-fold cross-validation. We generate a large sample as the validation sample by resampling and undertake 100-fold cross-validation using this large sample. Method 1 is as follows: (1) We copy the original data 100 times using JMP. (2) We add a uniform random number as a new variable, sort the data in ascending order, and divide them into 100 subsets. (3) We evaluate the eight LDFs by Method 1 using these 100 subsets as 100 training samples and the large sample as the validation sample. I develop LINGO Program 2 of Method 1 for the six MP-based LDFs and a JMP script for Fisher's LDF and logistic regression. In this chapter, we explain LINGO Program 2; because explaining the JMP script would require more pages, we omit it. LINGO Program 2 supports the six MP-based LDFs, namely Revised IP-OLDF, Revised LP-OLDF, Revised IPLP-OLDF, H-SVM, SVM4, and SVM1. There is merit in using 100-fold cross-validation because we can easily calculate the 95 % CIs of the discriminant coefficients and error rates. Moreover, the two error rate means, M1 and M2, in the training and validation samples offer a direct and powerful model selection procedure, namely the best model. We can show that the best models of Revised IP-OLDF are better than those of the other LDFs.
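To make the three steps above concrete, here is a minimal sketch of the resampling logic in Python rather than JMP and LINGO; fit and error_rate are placeholders for training and evaluating any of the eight LDFs, and the uniform-random sort mirrors step (2). It is an illustration under these assumptions, not the author's Program 2.

import random

def method1(data, fit, error_rate, k=100, seed=0):
    rng = random.Random(seed)
    big = [row for _ in range(k) for row in data]       # step (1): 100 copies form the large validation sample
    big.sort(key=lambda _: rng.random())                # step (2): sort by a uniform random number
    size = len(big) // k
    folds = [big[i * size:(i + 1) * size] for i in range(k)]   # 100 subsets = 100 training samples
    train_err, valid_err = [], []
    for fold in folds:                                  # step (3): evaluate an LDF on each subset
        model = fit(fold)
        train_err.append(error_rate(model, fold))       # error rate in the training sample
        valid_err.append(error_rate(model, big))        # error rate in the large validation sample
    m1 = sum(train_err) / k                             # error rate mean M1
    m2 = sum(valid_err) / k                             # error rate mean M2
    ci = (sorted(valid_err)[int(0.025 * k)], sorted(valid_err)[int(0.975 * k)])  # rough empirical 95 % CI
    return m1, m2, ci

The 95 % CIs of the discriminant coefficients are obtained in the same way, from the 100 coefficient sets returned by the 100 training runs.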
Shuichi Shinmura
Backmatter
Metadata
Title
New Theory of Discriminant Analysis After R. Fisher
Author
Shuichi Shinmura
Copyright Year
2016
Publisher
Springer Singapore
Electronic ISBN
978-981-10-2164-0
Print ISBN
978-981-10-2163-3
DOI
https://doi.org/10.1007/978-981-10-2164-0
