Skip to main content
Erschienen in: Advances in Data Analysis and Classification 3/2019

07.08.2018 | Regular Article

A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification

verfasst von: Zakariya Yahya Algamal, Muhammad Hisyam Lee

Erschienen in: Advances in Data Analysis and Classification | Ausgabe 3/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The common issues of high-dimensional gene expression data are that many of the genes may not be relevant, and there exists a high correlation among genes. Gene selection has been proven to be an effective way to improve the results of many classification methods. Sparse logistic regression using least absolute shrinkage and selection operator (lasso) or using smoothly clipped absolute deviation is one of the most widely applicable methods in cancer classification for gene selection. However, this method faces a critical challenge in practical applications when there are high correlations among genes. To address this problem, a two-stage sparse logistic regression is proposed, with the aim of obtaining an efficient subset of genes with high classification capabilities by combining the screening approach as a filter method and adaptive lasso with a new weight as an embedded method. In the first stage, sure independence screening method as a screening approach retains those genes representing high individual correlation with the cancer class level. In the second stage, the adaptive lasso with new weight is implemented to address the existence of high correlations among the screened genes in the first stage. Experimental results based on four publicly available gene expression datasets have shown that the proposed method significantly outperforms three state-of-the-art methods in terms of classification accuracy, G-mean, area under the curve, and stability. In addition, the results demonstrate that the top selected genes are biologically related to the cancer type. Thus, the proposed method can be useful for cancer classification using DNA gene expression data in real clinical practice.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Algamal ZY, Lee MH (2015a) Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst Appl 42:9326–9332CrossRef Algamal ZY, Lee MH (2015a) Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst Appl 42:9326–9332CrossRef
Zurück zum Zitat Algamal ZY, Lee MH (2015b) Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification. Comput Biol Med 67:136–145CrossRef Algamal ZY, Lee MH (2015b) Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification. Comput Biol Med 67:136–145CrossRef
Zurück zum Zitat Algamal ZY, Lee MH (2015c) Applying penalized binary logistic regression with correlation based elastic net for variables selection. J Mod Appl Stat Methods 14:168–179CrossRef Algamal ZY, Lee MH (2015c) Applying penalized binary logistic regression with correlation based elastic net for variables selection. J Mod Appl Stat Methods 14:168–179CrossRef
Zurück zum Zitat Algamal ZY, Lee MH (2015d) High dimensional logistic regression model using adjusted elastic net penalty. Pak J Stat Oper Res 11:667–676MathSciNetCrossRef Algamal ZY, Lee MH (2015d) High dimensional logistic regression model using adjusted elastic net penalty. Pak J Stat Oper Res 11:667–676MathSciNetCrossRef
Zurück zum Zitat Algamal ZY, Lee MH (2015e) Adjusted adaptive lasso in high-dimensional Poisson regression model. Mod Appl Sci 9:170–176CrossRef Algamal ZY, Lee MH (2015e) Adjusted adaptive lasso in high-dimensional Poisson regression model. Mod Appl Sci 9:170–176CrossRef
Zurück zum Zitat Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 96:6745–6750CrossRef Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 96:6745–6750CrossRef
Zurück zum Zitat Asar Y, Genç A (2015) New shrinkage parameters for the Liu-type logistic estimators. Commun Stat Simul Comput 45:1094–1103MathSciNetCrossRefMATH Asar Y, Genç A (2015) New shrinkage parameters for the Liu-type logistic estimators. Commun Stat Simul Comput 45:1094–1103MathSciNetCrossRefMATH
Zurück zum Zitat Ben Brahim A, Limam M (2016) A hybrid feature selection method based on instance learning and cooperative subset search. Pattern Recogn Lett 69:28–34CrossRef Ben Brahim A, Limam M (2016) A hybrid feature selection method based on instance learning and cooperative subset search. Pattern Recogn Lett 69:28–34CrossRef
Zurück zum Zitat Bielza C, Robles V, Larrañaga P (2011) Regularized logistic regression without a penalty term: an application to cancer classification with microarray data. Expert Syst Appl 38:5110–5118CrossRef Bielza C, Robles V, Larrañaga P (2011) Regularized logistic regression without a penalty term: an application to cancer classification with microarray data. Expert Syst Appl 38:5110–5118CrossRef
Zurück zum Zitat Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A (2012) An ensemble of filters and classifiers for microarray data classification. Pattern Recogn 45:531–539CrossRef Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A (2012) An ensemble of filters and classifiers for microarray data classification. Pattern Recogn 45:531–539CrossRef
Zurück zum Zitat Bootkrajang J, Kabán A (2013) Classification of mislabelled microarrays using robust sparse logistic regression. Bioinformatics 29:870–877CrossRef Bootkrajang J, Kabán A (2013) Classification of mislabelled microarrays using robust sparse logistic regression. Bioinformatics 29:870–877CrossRef
Zurück zum Zitat Cawley GC, Talbot NLC (2006) Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics 22:2348–2355CrossRef Cawley GC, Talbot NLC (2006) Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics 22:2348–2355CrossRef
Zurück zum Zitat Chen Y, Wang L, Li L, Zhang H, Yuan Z (2016) Informative gene selection and the direct classification of tumors based on relative simplicity. BMC Bioinform 17:44–57CrossRef Chen Y, Wang L, Li L, Zhang H, Yuan Z (2016) Informative gene selection and the direct classification of tumors based on relative simplicity. BMC Bioinform 17:44–57CrossRef
Zurück zum Zitat Cui Y, Zheng CH, Yang J, Sha W (2013) Sparse maximum margin discriminant analysis for feature extraction and gene selection on gene expression data. Comput Biol Med 43:933–941CrossRef Cui Y, Zheng CH, Yang J, Sha W (2013) Sparse maximum margin discriminant analysis for feature extraction and gene selection on gene expression data. Comput Biol Med 43:933–941CrossRef
Zurück zum Zitat Drotar P, Gazda J, Smekal Z (2015) An experimental comparison of feature selection methods on two-class biomedical datasets. Comput Biol Med 66:1–10CrossRef Drotar P, Gazda J, Smekal Z (2015) An experimental comparison of feature selection methods on two-class biomedical datasets. Comput Biol Med 66:1–10CrossRef
Zurück zum Zitat Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360MathSciNetCrossRefMATH Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360MathSciNetCrossRefMATH
Zurück zum Zitat Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B (Stat Methodol) 70:849–911MathSciNetCrossRefMATH Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B (Stat Methodol) 70:849–911MathSciNetCrossRefMATH
Zurück zum Zitat Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38:3567–3604MathSciNetCrossRefMATH Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38:3567–3604MathSciNetCrossRefMATH
Zurück zum Zitat Ferreira AJ, Figueiredo MAT (2012) Efficient feature selection filters for high-dimensional data. Pattern Recogn Lett 33:1794–1804CrossRef Ferreira AJ, Figueiredo MAT (2012) Efficient feature selection filters for high-dimensional data. Pattern Recogn Lett 33:1794–1804CrossRef
Zurück zum Zitat Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22CrossRef Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22CrossRef
Zurück zum Zitat Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531–537CrossRef Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531–537CrossRef
Zurück zum Zitat Gordon GJ, Jensen RV, Hsiao L-L, Gullans SR, Blumenstock JE, Ramaswamy S, Richards WG, Sugarbaker DJ, Bueno R (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res 62:4963–4967 Gordon GJ, Jensen RV, Hsiao L-L, Gullans SR, Blumenstock JE, Ramaswamy S, Richards WG, Sugarbaker DJ, Bueno R (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res 62:4963–4967
Zurück zum Zitat Guo S, Guo D, Chen L, Jiang Q (2016) A centroid-based gene selection method for microarray data classification. J Theor Biol 400:32–41MathSciNetCrossRefMATH Guo S, Guo D, Chen L, Jiang Q (2016) A centroid-based gene selection method for microarray data classification. J Theor Biol 400:32–41MathSciNetCrossRefMATH
Zurück zum Zitat Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH
Zurück zum Zitat Han B, Li L, Chen Y, Zhu L, Dai Q (2011) A two step method to identify clinical outcome relevant genes with microarray data. J Biomed Inf 44:229–238CrossRef Han B, Li L, Chen Y, Zhu L, Dai Q (2011) A two step method to identify clinical outcome relevant genes with microarray data. J Biomed Inf 44:229–238CrossRef
Zurück zum Zitat Huang HH, Liu XY, Liang Y (2016) Feature selection and cancer classification via sparse logistic regression with the hybrid L1/2 + 2 regularization. PLoS ONE 11:1–15 Huang HH, Liu XY, Liang Y (2016) Feature selection and cancer classification via sparse logistic regression with the hybrid L1/2 + 2 regularization. PLoS ONE 11:1–15
Zurück zum Zitat Kalina J (2014) Classification methods for high-dimensional genetic data. Biocybern Biomed Eng 34:10–18CrossRef Kalina J (2014) Classification methods for high-dimensional genetic data. Biocybern Biomed Eng 34:10–18CrossRef
Zurück zum Zitat Kalousis A, Prados J, Hilario M (2006) Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst 12:95–116CrossRef Kalousis A, Prados J, Hilario M (2006) Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst 12:95–116CrossRef
Zurück zum Zitat Korkmaz S, Zararsiz G, Goksuluk D (2014) Drug/nondrug classification using support vector machines with various feature selection strategies. Comput Methods Programs Biomed 117:51–60CrossRef Korkmaz S, Zararsiz G, Goksuluk D (2014) Drug/nondrug classification using support vector machines with various feature selection strategies. Comput Methods Programs Biomed 117:51–60CrossRef
Zurück zum Zitat Li S, Tan EC (2005) Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE/ACM Trans Comput Biol Bioinform 2:166–175CrossRef Li S, Tan EC (2005) Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE/ACM Trans Comput Biol Bioinform 2:166–175CrossRef
Zurück zum Zitat Li S, Wu X, Tan M (2008) Gene selection using hybrid particle swarm optimization and genetic algorithm. Soft Comput 12:1039–1048CrossRef Li S, Wu X, Tan M (2008) Gene selection using hybrid particle swarm optimization and genetic algorithm. Soft Comput 12:1039–1048CrossRef
Zurück zum Zitat Li J, Jia Y, Zhao Z (2012) Partly adaptive elastic net and its application to microarray classification. Neural Comput Appl 22:1193–1200CrossRef Li J, Jia Y, Zhao Z (2012) Partly adaptive elastic net and its application to microarray classification. Neural Comput Appl 22:1193–1200CrossRef
Zurück zum Zitat Liang Y, Liu C, Luan X-Z, Leung K-S, Chan T-M, Xu Z-B, Zhang H (2013) Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinform 14:198–211CrossRef Liang Y, Liu C, Luan X-Z, Leung K-S, Chan T-M, Xu Z-B, Zhang H (2013) Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinform 14:198–211CrossRef
Zurück zum Zitat Liao JG, Chin K-V (2007) Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics 23:1945–1951CrossRef Liao JG, Chin K-V (2007) Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics 23:1945–1951CrossRef
Zurück zum Zitat Ma S, Huang J (2008) Penalized feature selection and classification in bioinformatics. Brief Bioinform 9:392–403CrossRef Ma S, Huang J (2008) Penalized feature selection and classification in bioinformatics. Brief Bioinform 9:392–403CrossRef
Zurück zum Zitat Mai Q, Zou H (2013) The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika 100:229–234MathSciNetCrossRefMATH Mai Q, Zou H (2013) The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika 100:229–234MathSciNetCrossRefMATH
Zurück zum Zitat Mao Z, Cai W, Shao X (2013) Selecting significant genes by randomization test for cancer classification using gene expression data. J Biomed Inf 46:594–601CrossRef Mao Z, Cai W, Shao X (2013) Selecting significant genes by randomization test for cancer classification using gene expression data. J Biomed Inf 46:594–601CrossRef
Zurück zum Zitat Pappua V, Panagopoulosb OP, Xanthopoulosb P, Pardalosa PM (2015) Sparse proximal support vector machines for feature selection in high dimensional datasets. Expert Syst Appl 42:9183–9191CrossRef Pappua V, Panagopoulosb OP, Xanthopoulosb P, Pardalosa PM (2015) Sparse proximal support vector machines for feature selection in high dimensional datasets. Expert Syst Appl 42:9183–9191CrossRef
Zurück zum Zitat Park MY, Hastie T (2008) Penalized logistic regression for detecting gene interactions. Biostatistics 9:30–50CrossRefMATH Park MY, Hastie T (2008) Penalized logistic regression for detecting gene interactions. Biostatistics 9:30–50CrossRefMATH
Zurück zum Zitat Shevade SK, Keerthi SS (2003) A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19:2246–2253CrossRef Shevade SK, Keerthi SS (2003) A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19:2246–2253CrossRef
Zurück zum Zitat Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1:203–209CrossRef Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1:203–209CrossRef
Zurück zum Zitat Sun H, Wang S (2012) Penalized logistic regression for high-dimensional DNA methylation data with case-control studies. Bioinformatics 28:1368–1375CrossRef Sun H, Wang S (2012) Penalized logistic regression for high-dimensional DNA methylation data with case-control studies. Bioinformatics 28:1368–1375CrossRef
Zurück zum Zitat Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Stat Methodol) 58:267–288MathSciNetMATH Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Stat Methodol) 58:267–288MathSciNetMATH
Zurück zum Zitat Wang SL, Li X, Zhang S, Gui J, Huang DS (2010) Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction. Comput Biol Med 40:179–189CrossRef Wang SL, Li X, Zhang S, Gui J, Huang DS (2010) Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction. Comput Biol Med 40:179–189CrossRef
Zurück zum Zitat Yang L, Qian Y (2016) A sparse logistic regression framework by difference of convex functions programming. Appl Intell 45:241–254CrossRef Yang L, Qian Y (2016) A sparse logistic regression framework by difference of convex functions programming. Appl Intell 45:241–254CrossRef
Zurück zum Zitat Yap Y, Zhang X, Ling MT, Wang X, Wong YC, Danchin A (2004) Classification between normal and tumor tissues based on the pair-wise gene expression ratio. BMC Cancer 4:72CrossRef Yap Y, Zhang X, Ling MT, Wang X, Wong YC, Danchin A (2004) Classification between normal and tumor tissues based on the pair-wise gene expression ratio. BMC Cancer 4:72CrossRef
Zurück zum Zitat Zhang L, Qian L, Ding C, Zhou W, Li F (2015) Similarity-balanced discriminant neighbor embedding and its application to cancer classification based on gene expression data. Comput Biol Med 64:236–245CrossRef Zhang L, Qian L, Ding C, Zhou W, Li F (2015) Similarity-balanced discriminant neighbor embedding and its application to cancer classification based on gene expression data. Comput Biol Med 64:236–245CrossRef
Zurück zum Zitat Zheng S, Liu W (2011) An experimental comparison of gene selection by Lasso and Dantzig selector for cancer classification. Comput Biol Med 41:1033–1040CrossRef Zheng S, Liu W (2011) An experimental comparison of gene selection by Lasso and Dantzig selector for cancer classification. Comput Biol Med 41:1033–1040CrossRef
Zurück zum Zitat Zhenqiu L, Feng J, Guoliang T, Suna W, Fumiaki S, Ming T (2007) Sparse logistic regression with Lp penalty for biomarker identification. Stat Appl Genet Mol Biol 6:1–22MathSciNetMATH Zhenqiu L, Feng J, Guoliang T, Suna W, Fumiaki S, Ming T (2007) Sparse logistic regression with Lp penalty for biomarker identification. Stat Appl Genet Mol Biol 6:1–22MathSciNetMATH
Zurück zum Zitat Zhu J, Hastie T (2004) Classification of gene microarrays by penalized logistic regression. Biostatistics 5:427–443CrossRefMATH Zhu J, Hastie T (2004) Classification of gene microarrays by penalized logistic regression. Biostatistics 5:427–443CrossRefMATH
Zurück zum Zitat Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Stat Methodol) 67:301–320MathSciNetCrossRefMATH Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Stat Methodol) 67:301–320MathSciNetCrossRefMATH
Metadaten
Titel
A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification
verfasst von
Zakariya Yahya Algamal
Muhammad Hisyam Lee
Publikationsdatum
07.08.2018
Verlag
Springer Berlin Heidelberg
Erschienen in
Advances in Data Analysis and Classification / Ausgabe 3/2019
Print ISSN: 1862-5347
Elektronische ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-018-0334-1

Weitere Artikel der Ausgabe 3/2019

Advances in Data Analysis and Classification 3/2019 Zur Ausgabe

Premium Partner