Published in: The Journal of Supercomputing 6/2021

06-11-2020

Feature clustering and feature discretization assisting gene selection for molecular classification using fuzzy c-means and expectation–maximization algorithm

Author: Hung-Yi Lin

Abstract

This paper develops a novel gene selection method that benefits from feature clustering and feature discretization. Across large numbers of genes, an unsupervised fuzzy clustering algorithm facilitates the analysis of both similarities and dissimilarities. A supervised process, adopting information gain and the statistical chi-square test, is then applied to approve the relevant gene clusters. Next, the expectation–maximization algorithm discretizes the candidate genes and helps to recognize their distinguishability. Using our previously proposed selection criterion, we finalize gene selection and generate the gene subsets for molecular classification. The scheme is particularly suitable for high-dimensional datasets congested with erroneous or ambiguous information. Experimental results verify its efficiency and effectiveness.
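The fuzzy clustering step can be illustrated with a minimal sketch of standard fuzzy c-means, in which each gene (an expression profile across samples) receives a degree of membership in every cluster rather than a single hard label. This is not the paper's implementation: the function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the random initialization, the stopping rule, and the toy expression values are all assumptions chosen for illustration.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Cluster `points` (list of equal-length float lists) with fuzzy c-means.

    Returns (centers, memberships), where memberships[i][k] is the degree to
    which point i belongs to cluster k; each row sums to 1.
    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalized so each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(n)]
            wsum = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ])
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij) ** (2 / (m - 1)).
        shift = 0.0
        for i, p in enumerate(points):
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            new_row = [
                1.0 / sum((dk / dj) ** (2.0 / (m - 1.0)) for dj in dists)
                for dk in dists
            ]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        # Stop once no membership changed by more than `tol`.
        if shift < tol:
            break
    return centers, u


# Toy expression profiles: rows are genes, columns are samples (made-up values).
genes = [
    [1.0, 1.1, 0.9], [1.2, 0.9, 1.0], [0.8, 1.0, 1.1],
    [5.0, 5.2, 4.9], [5.1, 4.8, 5.0], [4.9, 5.1, 5.2],
]
centers, memberships = fuzzy_c_means(genes, n_clusters=2)
# Hard label per gene: the cluster with the highest membership degree.
hard_labels = [row.index(max(row)) for row in memberships]
```

The soft memberships are what distinguish this from k-means: a gene lying between two groups keeps a noticeable membership in both, which is one way to examine similarities and dissimilarities among genes before any supervised filtering is applied.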

Metadata
Title
Feature clustering and feature discretization assisting gene selection for molecular classification using fuzzy c-means and expectation–maximization algorithm
Author
Hung-Yi Lin
Publication date
06-11-2020
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 6/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-020-03480-y
