2019 | Book

High-dimensional Microarray Data Analysis

Cancer Gene Diagnosis and Malignancy Indexes by Microarray

Author: Prof. Shuichi Shinmura

Publisher: Springer Singapore

Part of: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book shows how to decompose high-dimensional microarrays into small subspaces (Small Matryoshkas, SMs), analyze them statistically, and perform cancer gene diagnosis. It is useful as a practical textbook for genetics experts, anyone who analyzes genetic data, and students.

Discriminant analysis is the best approach for microarrays consisting of normal and cancer classes. Microarrays are linearly separable data (LSD, Fact 3). However, because most linear discriminant functions (LDFs) cannot, in theory, discriminate LSD correctly and their error rates are high, no one had discovered Fact 3 until now. Hard-margin SVM (H-SVM) and Revised IP-OLDF (RIP) find Fact 3 easily. LSD has a Matryoshka structure and is easily decomposed into many SMs (Fact 4). Because every SM is a small sample and is itself LSD, statistical methods can be applied to SMs easily; however, they do not yield useful results. H-SVM and RIP, on the other hand, discriminate the two classes in each SM completely. RatioSV is the ratio of the SV distance to the discriminant range, expressed as a percentage. The maximum RatioSVs of the six microarrays are over 11.67%. This shows that the SVs separate the two classes by a window whose width is more than 11.67% of the discriminant range. Yet this seemingly easy discrimination problem had remained unresolved since 1970. The reason is revealed by the facts presented here, so the book can be read and enjoyed like a mystery novel.
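As a rough illustration of the RatioSV statistic described above, the following Python sketch computes it from a list of discriminant scores. It assumes that the support vectors lie at DS = 1 and DS = −1, so that the SV distance is 2, and that the discriminant range is the difference between the largest and smallest score. The function name and the example scores are hypothetical and are not taken from the book.

```python
def ratio_sv(scores, sv_distance=2.0):
    """Return RatioSV in percent: SV distance divided by the discriminant range.

    Assumption: support vectors sit at DS = +1 and DS = -1, so the default
    SV distance is 2. `scores` holds the discriminant scores of all cases.
    """
    ds_range = max(scores) - min(scores)
    if ds_range <= 0:
        raise ValueError("discriminant range must be positive")
    return 100.0 * sv_distance / ds_range


# Hypothetical scores: class 2 has DS <= -1, class 1 has DS >= 1.
example_scores = [-9.0, -1.0, 1.0, 3.0, 7.0]
print(ratio_sv(example_scores))  # range is 16, so 2 / 16 = 12.5 percent
```

A larger RatioSV means that the empty window between the two classes takes up a larger share of the whole score range.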

Many studies point out that it is difficult to separate signal from noise in a high-dimensional gene space. However, the definition of the signal has not been clear. This book presents convincing evidence that LSD is a signal. Statistical analysis of the genes contained in an SM cannot provide useful information, but the discriminant score (DS) produced by RIP or H-SVM for an SM is itself LSD. For example, the Alon microarray has 2,000 genes, which can be divided into 66 SMs. If the 66 DSs are used as variables, the result is 66-dimensional data. These signal data can be analyzed by principal component analysis and cluster analysis to find malignancy indexes.
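The construction of signal data from discriminant scores can be sketched in Python as follows. This is only an illustration: each SM is represented here by a hypothetical triple of gene indices, weights, and an intercept, and the tiny data set is invented. The book's LDF coefficients come from RIP or H-SVM, which are not reproduced here.

```python
def signal_data(samples, sm_models):
    """Replace each sample's gene values by one discriminant score per SM.

    samples   : list of rows, one row of gene expression values per case
    sm_models : list of (gene_indices, weights, intercept), one per SM
                (a hypothetical representation of a fitted LDF)
    Returns a matrix with one row per case and one column per SM.
    """
    return [
        [
            intercept + sum(w * row[g] for g, w in zip(genes, weights))
            for genes, weights, intercept in sm_models
        ]
        for row in samples
    ]


# Invented toy data: 3 cases, 4 genes, 2 SMs.
samples = [
    [2.0, 1.0, 0.5, 3.0],
    [1.5, 2.0, 0.0, 2.0],
    [-1.0, -2.0, 0.5, -3.0],
]
sm_models = [
    ((0, 1), (1.0, 1.0), 0.0),  # SM 1 uses genes 0 and 1
    ((3,), (0.5,), 0.0),        # SM 2 uses gene 3
]
signal = signal_data(samples, sm_models)
print(signal)  # 3 rows, 2 columns: the 4 genes are reduced to 2 DS variables
```

With 66 SMs instead of 2, the same function would yield the 66-dimensional data mentioned above, which could then be passed to PCA or cluster analysis.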

Frontmatter

Chapter 1. New Theory of Discriminant Analysis and Cancer Gene Analysis

Abstract

This chapter explains the “New Theory of Discriminant Analysis after R. Fisher” (the Theory) and its application to the first successful cancer gene analysis (Problem 5). The Theory consists of four Optimal Linear Discriminant Functions (Optimal LDFs, OLDFs), two facts of discriminant analysis, two methods, and two statistics: the minimum number of misclassifications (MNM) and RatioSV. Section 1.1 summarizes the Theory and explains new results. Section 1.2 explains the two facts: (1) the relation between the number of misclassifications (NM) and the LDF coefficients, which solves Problem 1 (the defect of NM), and (2) the monotonic decrease of MNM, which is important for Problem 5. It also explains why statisticians and machine learning researchers have been unable to solve cancer gene analysis since 1970. Only RIP and Revised LP-OLDF can decompose microarrays into many SMs, a fact that is vital for cancer gene diagnosis. Section 1.3 introduces five serious problems of discriminant analysis. Section 1.4 introduces four OLDFs and three SVMs in addition to the statistical discriminant functions. Section 1.5 explains the Matryoshka feature selection method (Method2), which solves Problem 5 completely. Section 1.6 describes how to validate Method2 with two common data sets, the Swiss banknote data and the Japanese car data, both of which are LSD. This section thus indicates that Method2 is useful for LSD, including both the common data and microarrays. Section 1.7 concludes by explaining why only RIP and Revised LP-OLDF can decompose a microarray into many SMs; this is also the answer to why statisticians and machine learning researchers could not solve cancer gene analysis since 1970.

Shuichi Shinmura
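The abstract of Chapter 1 describes decomposing linearly separable data into many SMs plus a remainder. The listing does not spell out the algorithm, so the following Python sketch is only a guess at the general loop: while the remaining genes still separate the two classes, extract a small separating gene subset, record it as an SM, and remove it. A simple perceptron with an epoch cap and greedy backward elimination stand in for RIP and LINGO Program3, which are not reproduced here. All names and the toy data are hypothetical.

```python
def separates(X, y, cols, max_epochs=1000):
    """Heuristic separability check: run a perceptron on the chosen columns.

    Returns True if a full pass with no mistakes occurs within max_epochs.
    (A stand-in for an exact method such as RIP; labels are +1 / -1.)
    """
    w = [0.0] * (len(cols) + 1)  # intercept followed by one weight per column
    for _ in range(max_epochs):
        mistakes = 0
        for row, label in zip(X, y):
            feats = [1.0] + [row[c] for c in cols]
            score = sum(wi * fi for wi, fi in zip(w, feats))
            if label * score <= 0:
                w = [wi + label * fi for wi, fi in zip(w, feats)]
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def shrink(X, y, cols):
    """Greedy backward elimination: drop genes while the rest still separate."""
    cols = list(cols)
    for c in list(cols):
        trial = [k for k in cols if k != c]
        if trial and separates(X, y, trial):
            cols = trial
    return cols


def decompose(X, y):
    """Split gene indices into separable subsets (SM-like) and a remainder."""
    remaining = list(range(len(X[0])))
    subsets = []
    while remaining and separates(X, y, remaining):
        sm = shrink(X, y, remaining)
        subsets.append(sm)
        remaining = [c for c in remaining if c not in sm]
    return subsets, remaining


# Invented toy data: genes 0 and 1 each separate the classes, gene 2 does not.
X = [[2, 1, 1], [3, 2, -1], [-2, -1, 1], [-3, -2, -1]]
y = [1, 1, -1, -1]
subsets, noise = decompose(X, y)
print(subsets, noise)
```

In this toy case each extracted subset holds a single gene and the last gene is left over as the non-separable remainder. Real SMs contain many genes, and the perceptron check can miss separable data if the epoch cap is too small.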

Chapter 2. Overview of Cancer Gene Diagnosis

Abstract

This chapter explains cancer gene diagnosis using all the Small Matryoshkas (SMs) of six microarrays found in 2016. Section 2.2 explains the different roles of cancer gene analysis and cancer gene diagnosis, because these technical terms are our own. Section 2.3 analyzes the 64 SMs obtained by RIP from Alon’s microarray. Section 2.4 shows the usefulness of the 64 RIP discriminant scores (RipDSs) and of the new data made from the 64 RipDSs instead of the 2,000 genes. We therefore regard the RipDS data, rather than the 64 SMs, as the signal. Section 2.5 applies the same analysis to the 130 BGSs of Alon’s microarray found by LINGO Program4 in 2016. A BGS is comparable to Yamanaka’s four genes in iPS research. Section 2.6 presents the cancer gene diagnosis of the other five microarrays, which are analyzed in the same way as Alon’s. Section 2.7 is the conclusion. Alon’s and Singh’s microarrays consist of cancer and normal classes; the other four microarrays consist of two different types of cancer. It is vital that the six results are almost the same. We therefore expect other microarrays to give similar results if medical researchers control the two classes strictly.

Shuichi Shinmura

Chapter 3. Cancer Gene Diagnosis of Alon’s microarray by RIP and Revised LP-OLDF

Abstract

This chapter discusses the following three points. (1) Chapter 2 introduced only the SMs obtained with RIP, which Program3 found using an arbitrary number of iterations. In 2017, we increased the number of iterations successively from 1 and selected the iteration number at which the number of SMs obtained becomes constant. Moreover, we compare the two types of SMs obtained by RIP and Revised LP-OLDF and evaluate eight LDFs and QDF by RatioSV and the number of misclassifications (NM). (2) Microarrays are linearly separable data (LSD). However, because the statistical discriminant functions cannot, in theory, discriminate LSD correctly, many researchers have been unable to solve cancer gene analysis completely since 1970 (Problem5). In contrast, the Matryoshka feature selection method (Method2) and LINGO Program3 can decompose a microarray into many SMs that are LSD. Although all SMs are small samples, many statistical methods cannot find the linearly separable facts. RIP, Revised LP-OLDF, and H-SVM, however, discriminate all SMs correctly. We realized that the three data sets made by these three LDFs are signal data, which reduces the high-dimensional microarray to low-dimensional signal data. (3) We propose a standard procedure for analyzing all SMs. Specialists in gene analysis can solve cancer gene analysis and approach cancer gene diagnosis from a new angle. Statisticians, on the other hand, can recognize the difficulties of cancer gene analysis and understand how easy cancer gene diagnosis becomes with statistical methods. Users of statistics can analyze the many SMs, which are a gift from high-dimensional data, and improve their statistical skills for solving practical applications.

Shuichi Shinmura

Chapter 4. Further Examinations of SMs—Defect of Revised LP-OLDF and Correlations of Genes

Abstract

In this chapter, we analyze Alon’s microarray in 2018 and obtain two sets of SMs, one from RIP and one from Revised LP-OLDF. In Sect. 4.2, RIP separates the microarray into a union of 62 SMs (1,968 genes) and a noise subspace (32 genes). Six MP-based LDFs find that the union of SMs is LSD and that the noise subspace is not LSD. In Sect. 4.3, Revised LP-OLDF separates the microarray into a union of 32 SMs (1,005 genes) and a noise subspace (995 genes). Six MP-based LDFs find that both subspaces are LSD, which suggests that the noise subspace still contains other SMs. Revised LP-OLDF therefore cannot find all SMs in the microarray. We conjecture that Problem1 causes this defect of Revised LP-OLDF. Section 4.4 analyzes the 62 SMs found by RIP and evaluates them by RatioSV and NM. Moreover, the 1,891 correlations among the 62 RIP discriminant scores (RipDSs) are computed. At first, we considered the gene set included in each SM to be cancer genes and a signal subspace. However, standard statistical methods cannot show the linearly separable facts. Thus, we conclude that the gene sets included in the SMs are not signals, and we regard the data made from the RipDSs as the signal data. The signal data of SM13, which has the maximum RatioSV, and of SM62, which has the minimum RatioSV, are validated. Section 4.5 analyzes the two signal data sets made from the RipDSs and HsvmDSs of the 62 SMs found by RIP. The results are almost the same as those in Chaps. 2 and 3. However, these findings can open a new field of cancer gene diagnosis only after verification of the subjects used in the study of Alon et al. (Proc Natl Acad Sci USA, 96(12): 6745–6750, 1999). Section 4.6 explains why standard statistical methods could not find the linearly separable facts. Section 4.9 is the conclusion.

Shuichi Shinmura
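The figure of 1,891 correlations for 62 RipDSs in the abstract of Chapter 4 is the number of distinct pairs, 62 × 61 / 2. The Python sketch below checks that count and shows one way such pairwise Pearson correlations could be computed. The helper names and the three example columns are invented for illustration and do not come from the book.

```python
from itertools import combinations
from math import sqrt


def n_pairs(k):
    """Number of distinct pairs among k variables: k * (k - 1) / 2."""
    return k * (k - 1) // 2


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(vx * vy)


def pairwise_correlations(columns):
    """Return {(i, j): r} for every pair of columns with i < j."""
    return {
        (i, j): pearson(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    }


print(n_pairs(62))  # 1891

# Invented example: three short DS columns give three correlations.
columns = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 2.0]]
print(pairwise_correlations(columns))
```

The same count explains the 7,626 correlations reported for the 124 LpDSs of the Chiaretti microarray, since 124 × 123 / 2 = 7,626.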

Chapter 5. Cancer Gene Diagnosis of Golub et al. Microarray

Abstract

The Golub microarray consists of 72 patients and 7,129 genes. Golub et al. analyzed the microarray by various statistical methods. For example, they analyzed “marker” genes having the highest correlation with the target class by class-separation statistics (signal-to-noise ratio), weighted votes, and SOM. Fundamentally, discriminant analysis is the most suitable method for identifying oncogenes. However, because statistical discriminant analysis was of no use at all, medical researchers developed many other methods. Our theory shows that the six microarrays are LSD (MNM = 0). Method2 can decompose a microarray into many Small Matryoshkas (SMs) that are LSD. By analyzing the SMs, we achieved cancer gene diagnosis by malignancy indexes. If Golub et al. validate our results, cancer gene diagnosis will improve further. Method2 already obtained different sets of SMs in Chap. 2. In 2018, we changed the number of iterations of RIP and Revised LP-OLDF in Method2 and decided the proper number of iterations in the same way as for Alon's microarray in Chap. 4. We obtained the SMs with those iteration numbers and examined the signal data made from the RIP discriminant scores (RipDSs). We confirm that Revised LP-OLDF cannot find all SMs, as with Alon's microarray. Thus, we analyze only the 179 SMs obtained by RIP and examine the correlation coefficients of the 179 RipDSs. We compare the RatioSVs of six MP-based LDFs and the NMs of the statistical discriminant functions. Then, cluster analysis and PCA are applied to the signal data made by RIP and H-SVM. We propose the possibility of cancer gene diagnosis using malignancy indexes, and we propose a way to find the new subclasses of cancer pointed out by Golub et al. (Science 286(5439): 531–537, 1999).

Shuichi Shinmura

Chapter 6. Cancer Gene Diagnosis of Shipp et al. Microarray

Abstract

The Shipp microarray consists of 77 patients and 7,129 genes. Shipp et al. analyzed the microarray by various statistical methods and uploaded a 67-page supplemental document. They used almost the same methods as Golub et al. except for SVM and nearest-neighbor clustering. Fundamentally, discriminant analysis is the most appropriate method for identifying oncogenes from a microarray. However, because statistical discriminant analysis was of no use at all, medical researchers developed many other methods for cancer gene analysis. Our theory shows that the six microarrays are LSD (Fact3). Method2 decomposes the microarrays into many SMs (Fact4). By analyzing the SMs, we propose cancer gene diagnosis and malignancy indexes. If Shipp et al. validate our research results, we will be able to improve cancer gene diagnosis. Method2 already obtained SMs twice in Chap. 2. In this research, we changed the number of iterations of RIP and Revised LP-OLDF in Method2 and decided the proper number of iterations; we obtained the SMs with those iteration numbers in 2018. We examined the signal subspace made by all SMs and the noise subspace. However, as in Chap. 4, Revised LP-OLDF cannot correctly find all SMs in the Shipp microarray. Thus, we analyze only the 237 SMs obtained by RIP and examine the correlation coefficients of the RipDSs. RIP, Revised LP-OLDF, and H-SVM are evaluated by their RatioSVs. Then, we analyze the two signal data sets made by RIP and H-SVM and their transposed data. Using hierarchical cluster analysis and PCA, we propose the possibility of cancer gene diagnosis using malignancy indexes.

Shuichi Shinmura

Chapter 7. Cancer Gene Diagnosis of Singh et al. Microarray

Abstract

Chapter 1 explained the New Theory of Discriminant Analysis after R. A. Fisher (the Theory). The Theory solved five problems completely. In particular, Revised IP-OLDF (RIP) and Method2 were the first to succeed in cancer gene analysis. RIP found that the six microarrays were LSD (Fact3). LINGO Program3 of Method2 decomposed each microarray into many SMs and a noise subspace (Fact4). In Chap. 2, we made signal data from the RIP discriminant scores (RipDSs). Our breakthrough opens a new frontier of cancer gene diagnosis and malignancy indexes. We also found a new problem (Problem6): “Why could no researchers find the linearly separable facts in microarrays and SMs since 1970?” In this book, we give several answers to Problem6. In this chapter, we survey how different RipDSs are made from the many SMs, and we explain why a microarray consists of many SMs and different RipDSs. Based on these results, we hope to classify SMs into several categories of malignancy indexes in the future.

Shuichi Shinmura

Chapter 8. Cancer Gene Diagnosis of Tian et al. Microarray

Abstract

We developed the New Theory of Discriminant Analysis after R. A. Fisher (the Theory). Although discriminant analysis has five serious problems, the Theory solves all five completely. In particular, Revised IP-OLDF (RIP), which is based on MNM, and Method2 were the first to succeed in cancer gene analysis (Problem5), which had been unsolved since 1970. As explained in Chap. 1, RIP decomposes the six microarrays into many SMs that are signals (MNM = 0). Although Revised LP-OLDF also decomposes a microarray into many SMs, Chap. 4 showed its defect: it cannot find all SMs in the microarray. However, Revised LP-OLDF finds many SMs faster than RIP, so it may be convenient for many researchers to analyze the SMs it finds. Tian’s microarray consists of 173 subjects (36 False subjects and 137 True patients) and 12,625 genes. In this chapter, Revised LP-OLDF decomposes Tian’s microarray into 104 SMs. We analyze the 104 SMs by standard statistical methods such as one-way ANOVA, t-tests, Ward cluster analysis, PCA, logistic regression, and Fisher’s LDF. Although we expected standard statistical methods to be useful for cancer gene diagnosis, only logistic regression discriminated the 104 SMs correctly; the other methods did not show the linearly separable facts. Because Revised LP-OLDF discriminates the 104 SMs and the range of the 104 RatioSVs is [8.34%, 22.79%], we make signal data from the 104 Revised LP-OLDF discriminant scores (LpDSs) instead of the 12,625 genes. With this breakthrough, hierarchical cluster methods separate the two classes completely into two clusters. In addition, the Prin1 axis of PCA indicates a proper malignancy index, as do the 104 malignancy indexes. Thus, we again conclude that the signal data is the signal. Moreover, as in Chap. 7, we examine the characteristics of the 104 LpDSs in detail using correlation analysis.

Shuichi Shinmura

Chapter 9. Cancer Gene Diagnosis of Chiaretti et al. Microarray

Abstract

This chapter introduces the cancer gene diagnosis of the Chiaretti microarray, which consists of 128 patients and 12,625 genes. RIP finds 128 SMs, and Revised LP-OLDF finds 124 SMs, which again confirms the defect of Revised LP-OLDF. Because both sets of SMs give almost the same results, we introduce only the results of the 124 SMs. In Sect. 9.2, we confirm that the 7,626 correlations among the 124 LpDSs are all greater than 0.359 and that standard statistical methods cannot find the linearly separable facts of the SMs. Thus, we conclude that the three signal data sets made by RIP, Revised LP-OLDF, and H-SVM are a better definition of the signal than the SMs. We also explain how to build the 124 LpDSs. In Sect. 9.3, the 124 SMs are evaluated by the RatioSVs of six MP-based LDFs and the NMs of the statistical discriminant functions. In Sect. 9.4, five hierarchical cluster methods analyze the three signal data sets of 124 RipDSs, LpDSs, and HsvmDSs. In Sects. 9.5 and 9.6, PCA analyzes the signal data and the transposed signal data. Section 9.7 concludes that the six microarrays have almost the same results. We believe that the consistency of these results confirms the reliability of cancer gene diagnosis.

Shuichi Shinmura

Chapter 10. LINGO Programs of Cancer Gene Analysis

Abstract

In “New Theory of Discriminant Analysis after R. Fisher” (2016), Shinmura already explained LINGO Program1 in Chap. 2. LINGO Program1 defines six MP-based LDFs: Revised IP-OLDF (RIP), Revised LP-OLDF, Revised IPLP-OLDF, H-SVM, and two soft-margin SVMs, SVM4 (penalty c = 10000) and SVM1 (penalty c = 1). With it, anyone can evaluate the six MP-based LDFs on the training samples at once. If you understand these models, you can develop your own bespoke models. LINGO Program2 discriminates a small training sample by Method1 instead of the LOO method. If you understand LINGO Program2, you can build complex MP models that control several optimization models with many data sets held as arrays. LINGO can control such complex optimization models. This chapter explains LINGO Program3, which implements the Matryoshka feature selection method (Method2), in addition to Linus’s Linear Discriminant Function. Section 10.1 introduces the roles of the three LINGO programs. Section 10.2 introduces a LINGO sample model (DiscrmSwiss.lng) downloaded from the LINDO Systems Inc. website (https://www.lindo.com/). Anyone can download many fine models, manuals, textbooks, and evaluation solvers such as LINGO, What’s Best! (an Excel add-in), and LINGO/API (C libraries for developing bespoke models and systems) for free. To simplify the program, we assume, as explained in Chap. 1, that class1 has a discriminant score (DS) of 1 or more and class2 a DS of −1 or less. The program then converts the original data of class2 by multiplying it by −1, so that a case is judged correctly when its extended DS is 1 or more. The LDF introduced in this section can be used without converting the sign of the class2 data. Section 10.3 introduces the six MP-based LDFs. Section 10.4 introduces LINGO Program3 of Method2. Section 10.5 introduces the validation of Method2 by LINGO Program1 using common data. Section 10.6 is the conclusion.

Shuichi Shinmura
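The sign conversion mentioned in the abstract of Chapter 10 can be illustrated with a short Python sketch. It assumes that each case is extended with a constant 1 for the intercept and that the class2 rows, including that constant, are multiplied by −1, so that the two conditions "DS ≥ 1 for class1" and "DS ≤ −1 for class2" collapse into a single condition "extended DS ≥ 1". This is not the LINGO model itself; the function names, coefficients, and data are hypothetical.

```python
def extend_and_flip(rows, labels):
    """Prepend the constant 1 to each row and negate the rows of class 2.

    labels use 1 for class1 and 2 for class2 (an assumption of this sketch).
    """
    converted = []
    for row, label in zip(rows, labels):
        extended = [1.0] + list(row)
        converted.append(extended if label == 1 else [-v for v in extended])
    return converted


def all_judged_correctly(converted, coefficients):
    """True if every extended discriminant score is 1 or more."""
    return all(
        sum(c * v for c, v in zip(coefficients, row)) >= 1
        for row in converted
    )


# Invented one-variable example: class1 has positive, class2 negative values.
rows = [[3.0], [2.0], [-2.0], [-4.0]]
labels = [1, 1, 2, 2]
converted = extend_and_flip(rows, labels)

# Coefficients are (intercept, slope). DS = x gives scores 3, 2, -2, -4,
# which become 3, 2, 2, 4 after the conversion.
print(all_judged_correctly(converted, [0.0, 1.0]))   # True
print(all_judged_correctly(converted, [0.0, 0.25]))  # False: 0.25 * 3 < 1
```

The benefit of the conversion is that an optimization model needs only one form of constraint for both classes.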

Backmatter

Title: High-dimensional Microarray Data Analysis
Author: Prof. Shuichi Shinmura
Publisher: Springer Singapore
Electronic ISBN: 978-981-13-5998-9
Print ISBN: 978-981-13-5997-2
DOI: https://doi.org/10.1007/978-981-13-5998-9
