Microarray medical data classification using kernel ridge regression and modified cat swarm optimization based gene selection system
Introduction
Microarray analysis and classification are essential for the early diagnosis and treatment of dreaded diseases such as cancer. Cancer shows the highest rate of morbidity and mortality in economically developed countries and stands second in developing countries [1]. Human beings suffer from about 200 types of cancer, and microarray technology is adopted to keep records of them [2]. The GLOBOCAN database, the World Health Organization, the Global Health Observatory and the United Nations World Population Prospects report that the four most common cancers occurring worldwide are lung, female breast, bowel and prostate cancer [3]. Cancer causes abnormal and uncontrolled cell growth; it is related to the genome and caused by oncogenes. Molecular analysis reveals that different cancer types have different gene expression profiles [4], [5], which may then be utilized to diagnose different cancers. High-density DNA microarrays measure the activities of several thousand genes in parallel. This approach helps in giving better therapeutic measures to cancer patients by diagnosing cancer types with improved accuracy [5]. Early detection of any type of cancer increases the victim's chance of survival. This detection is often formulated as a classification problem [6].
Microarray technology produces large datasets with gene expression values for thousands of genes (6,000–60,000) in a cell mixture [7]. Hence, it becomes economically prohibitive to obtain a large sample size. This phenomenon is called the curse of dimensionality, where the number of samples (n) ≪ the number of features (p) [8]. To overcome this problem, microarray medical datasets need dimension reduction [8]. Dimensionality reduction methods are broadly classified into two types, i.e. feature extraction [6], [7] and feature selection [8], [9], [10], [11]. During feature extraction, the features are projected into a new feature space of low dimensionality, where the new features are generated as combinations of the original features. Widely used feature extraction techniques are principal component analysis (PCA) [12], [13], [14], kernel principal component analysis (KPCA) [14], linear discriminant analysis (LDA) [12], [13] and canonical correlation analysis (CCA) [15]. On the other hand, feature selection methods select a subset of highly discriminating features from the original feature set without any transformation. Hence, feature selection is superior to feature extraction in terms of readability and interpretability [11].
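As a quick illustration of the feature extraction idea, the following sketch (toy random data, not the paper's datasets) projects an n ≪ p sample matrix onto its top principal components via SVD; note how each new feature mixes all original genes, which is why interpretability suffers:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "microarray" matrix: 20 samples, 500 genes, so n << p.
X = rng.normal(size=(20, 500))

# PCA via SVD of the centered data; each new feature is a linear
# combination of all original genes (hence reduced interpretability).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
Z = Xc @ Vt[:k].T   # reduced representation: 20 samples x 5 features
print(Z.shape)      # (20, 5)
```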
Feature selection algorithms are classified into supervised, unsupervised and semi-supervised, depending on the presence or absence of class labels [9]. Supervised feature selection methods include filter, wrapper and embedded models. Filter models do not use any classifier [9]. This technique evaluates the significance of features by looking at the intrinsic properties of the data: all the features are scored and ranked based on certain statistical criteria, the highest-ranking features are selected, and the low-scoring features are removed. Compared to other feature selection methods, filter methods are faster, but they have three major limitations: (1) they ignore the interaction with the classifier; (2) each feature is considered independently, thus ignoring feature dependencies; and (3) it is very difficult to determine the threshold point for ranking the features.
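To make the filter idea concrete, here is a minimal sketch on synthetic data (the t-like statistic is just one of the many possible statistical criteria): genes are scored and ranked without consulting any classifier.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 200
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(n, p))
X[y == 1, :5] += 2.0   # make the first 5 genes informative

# Filter scoring: a t-like statistic per gene; no classifier involved.
m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
se = np.sqrt(X[y == 0].var(axis=0) / 20 + X[y == 1].var(axis=0) / 20)
score = np.abs(m0 - m1) / (se + 1e-12)

# Keep the 10 highest-ranked genes; the informative ones dominate.
top10 = np.argsort(score)[::-1][:10]
print(sorted(int(g) for g in top10 if g < 5))
```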
The wrapper model uses the predictive accuracy of a predetermined learning algorithm to determine the quality of selected features. This method is computationally expensive for large datasets with many features. The embedded model bridges the gap between these two models by combining the advantages of both techniques [9]. Feature selection methods proposed in the literature include fast correlation based filter (FCBF) [16], the relief algorithm [17], support vector machine recursive feature elimination [18], sequential forward selection (SFS) [19] and sequential backward elimination (SBE) [19]. Among these, SFS and SBE are extensively used due to their simplicity and low computational overhead, but they have their own limitations. The major drawback of sequential search methods is the nesting effect: in backward search, a deleted feature cannot be reselected, and in forward search, a selected feature cannot be deleted [20]. That is why stochastic search strategies are adopted, where some randomness is introduced into the search process, making feature selection less sensitive to the particular dataset. The most popular stochastic methods of feature selection are the genetic algorithm [21], simulated annealing [22], ant colony optimization [23], particle swarm optimization [24], [25], [26], differential evolution [27], [28], bacterial foraging optimization [29], harmony search [30], cuckoo search [31], firefly [32], the bat algorithm [33] and cat swarm optimization [34]. In summary, the major advantages of feature selection are selection without transformation, better readability and reduced computational overhead [6].
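A minimal wrapper-style SFS sketch (toy data; leave-one-out 1-NN accuracy stands in for the predetermined learning algorithm) illustrates both the greedy search and the nesting effect:

```python
import numpy as np

def loo_1nn_acc(X, y):
    """Leave-one-out 1-NN accuracy: the wrapper's fitness function."""
    hits = 0
    for i in range(len(y)):
        d = ((X - X[i]) ** 2).sum(axis=1)
        d[i] = np.inf
        hits += int(y[int(np.argmin(d))] == y[i])
    return hits / len(y)

def sfs(X, y, k):
    """Sequential forward selection: once a feature is added it is
    never removed again -- exactly the nesting effect."""
    selected = []
    while len(selected) < k:
        best = max((f for f in range(X.shape[1]) if f not in selected),
                   key=lambda f: loo_1nn_acc(X[:, selected + [f]], y))
        selected.append(best)
    return selected

rng = np.random.default_rng(3)
y = np.array([0] * 15 + [1] * 15)
X = rng.normal(size=(30, 10))
X[y == 1, 0] += 5.0   # only gene 0 separates the two classes
print(sfs(X, y, 2))   # gene 0 is picked first
```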
Dimensionality reduction improves the accuracy of microarray medical data classification. The important role of a medical data classifier is to provide explanation and justification for the accurate prediction of the disease [6]. Many traditional classifiers like KNN [35], naïve Bayes (NB) [36], decision tree [37], random forest [38], ID3 [39], C4.5 [40] and various neural network based classifiers like the multilayer perceptron (MLP) [41], RBFNN [42], FLANN [43] and SVM [44], [45], [46], [47] are found in the literature. Among all the classifiers, the ANN and its variants are extensively used by researchers to classify medical datasets [48]. The success of an ANN based classifier depends mostly on the number of hidden layers, the number of nodes in each hidden layer, the values of the weights between the input and hidden layers and between the hidden and output layers, and the learning algorithm. The literature shows that when an ANN is trained with a gradient descent learning algorithm, training becomes time consuming and the computational overhead increases [49]. Besides this, due to the initial random choice of parameters, the convergence rate of the gradient descent learning algorithm becomes very slow, and it often gets trapped in local minima. To avoid these limitations, pseudo-inverse based neural networks [50], [51], [52], [53], [54], [55] have been proposed by researchers such as Schmidt [54], Pao [50], and Broomhead and Lowe [51]. The pseudo-inverse based neural network has recently been renamed the extreme learning machine (ELM) [56], with the bias in ELM set to zero. This paper explores the possibility of using kernel ridge regression (KRR) [57], [58], recently renamed kernel ELM [59], for microarray data classification.
The architecture of ridge regression has some similarity with the RVFL [52] and the pseudo-inverse based neural network, as it uses randomly assigned weights between the input layer and hidden layer, while the weights between the hidden layer and output layer are learnt using a pseudo-inverse formulation. However, ridge regression produces a large variation in classification accuracy across trials with the same number of hidden nodes. The kernel function addresses this problem by replacing the hidden layer of ridge regression. The main advantages of kernel ridge regression are that the kernel function does not need to satisfy Mercer's theorem and that no randomness is needed in assigning the connection weights between the input and hidden layers. The literature suggests that kernel ridge regression is very similar to the kernel pseudo-inverse based neural network (KPINN) [58]. It exploits quadratic programming algorithms for convex optimization from mathematical programming, borrows the idea of kernel representations from mathematical analysis, and adopts the objective of finding a maximum margin classifier from machine learning theory [60].
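The determinism described above can be sketched as follows (toy data; the RBF kernel, regularization value and label construction are illustrative choices, not the paper's settings): KRR training is a single linear solve, with no random input-to-hidden weights to vary across trials.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = np.where(X[:, 0] + 0.1 > 0, 1.0, -1.0)   # labels in {-1, +1}

# KRR training: alpha = (K + lam*I)^(-1) y -- one deterministic solve,
# unlike plain ridge regression whose random hidden weights differ per trial.
lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

pred = np.sign(rbf_kernel(X, X) @ alpha)     # training predictions
train_acc = float((pred == y).mean())
print(train_acc)
```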
This paper proposes a modified cat swarm optimization (MCSO) technique to select the optimal features from microarray medical datasets and kernel ridge regression (WKRR and RKRR) to classify the features obtained from the MCSO algorithm. The literature in this domain shows that CSO performs better than PSO, though its computational complexity is higher [61]. In addition, both PSO and DE [62] sometimes suffer from premature convergence and stagnation problems [63], which CSO avoids. The MCSO based feature selection method improves search efficiency within the entire problem space and is used to obtain the best candidate features from the high dimensional microarray medical datasets. The proposed feature selection method employs the k-nearest neighbor algorithm as the classifier and uses five-fold cross validation to determine the classification accuracy.
The paper is organized as follows: Sections 2 and 3 describe the process model and the benchmark microarray medical datasets, respectively. Section 4 deals with the modified cat swarm optimization based feature selection method (MCSO). All the classifiers used in this study, i.e. RR, OSRR, KRR, SVM and random forest, are discussed in Section 5. Performance evaluation measures are presented in Section 6. Simulation results and their analysis appear in Sections 7 and 8. Finally, the conclusion is drawn in Section 9.
The process model for the classification of microarray datasets
All the microarray medical datasets are normalized using the max–min normalization method, as shown in Eq. (1). The modified cat swarm optimization algorithm (MCSO) is then used to select the optimal feature subsets from these normalized datasets. For each dataset, MCSO derives 10 subsets consisting of 10–100 genes in intervals of 10. To get the optimal candidate features, the k-nearest neighbor (KNN) classifier is used to find the classification accuracy. The subset with the lowest classification error is chosen as the final feature subset.
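The max–min normalization step can be sketched as follows (a small numeric example; the subsequent MCSO/KNN subset search is omitted here):

```python
import numpy as np

def min_max_normalize(X):
    """Max-min normalization: rescale each gene (column) to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Guard against constant genes, where hi == lo.
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0]])
Xn = min_max_normalize(X)
print(Xn)   # column 0 -> [0, 0.5, 1]; column 1 -> [0, 1, 0.5]
```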
Datasets
This section introduces eight benchmark microarray datasets [64], [65], [66], [67], [68], [69], [70], [71] downloaded from http://www.gems-system.org [64] and http://datam.i2r.a-star.edu.sg/datasets/krbd/ [65]. Out of the eight datasets, four are binary: breast cancer, prostate cancer, colon tumor and leukemia. The other four are multi-class: leukemia1, leukemia2, brain tumor1 and SRBCT. Each dataset is divided into two data files, i.e. training and testing. The output is 0 or 1 for binary classification.
Feature selection
Feature selection is in itself one of the important research areas in the domain of machine learning. Its main advantage is obtaining the optimal candidate features, which helps improve classification accuracy while reducing computational overhead, resource demand and storage space requirements. In the process, the most influential features are selected so that the user can interpret the relation between the features and classes [8], [25].
RR, OSRR, KRR, SVM and random forest classifiers
This section discusses all five classifiers – RR, OSRR, KRR, SVM, Random Forest – used to classify both binary and multi-class microarray medical datasets.
Performance evaluation measures
The performance of all the classifiers is evaluated using different measures such as training accuracy, testing accuracy, the confusion matrix, the receiver operating characteristic (ROC) curve, sensitivity, specificity, Gmean and F-score [76], [77], [78], [79].
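For reference, the scalar measures listed above can be computed directly from a 2x2 confusion matrix (the counts below are made up for illustration):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Scalar measures derived from a binary confusion matrix."""
    sens = tp / (tp + fn)   # sensitivity (true positive rate)
    spec = tn / (tn + fp)   # specificity (true negative rate)
    prec = tp / (tp + fp)   # precision
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sens,
        "specificity": spec,
        "gmean": math.sqrt(sens * spec),
        "f_score": 2 * prec * sens / (prec + sens),
    }

m = binary_metrics(tp=40, fp=10, fn=5, tn=45)
print(m["accuracy"])   # 0.85
```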
Simulation results
This section discusses the results obtained from all seven models – RR, OSRR, WKRR, RKRR, SVMRBF, SVMPoly and random forest – for both binary and multi-class microarray medical datasets. Due to space constraints, a few results are omitted. All the classifiers are implemented in the following environment: Operating System: Windows XP Professional; CPU: Intel Core i3-370M (2.4 GHz); Memory: 4 GB RAM.
The number of hidden nodes 'L' in RR varies from 2 to 22 in increments of 2 neurons.
Result analysis
In this paper, binary and multi-class microarray medical datasets are classified using the RR, OSRR, WKRR, RKRR, SVMRBF, SVMPoly and random forest models. The meta-heuristic algorithms CSO and MCSO are used to select the optimal candidate subsets. The number of genes selected and the classification accuracy obtained from KNN in the feature selection step are analyzed and compared. Table 13 clearly establishes that MCSO yields better results than CSO.
Testing accuracy obtained from KRR
Conclusion
Microarray data analysis and classification are essential for the effective diagnosis of cancer. However, microarray medical datasets always suffer from the curse of dimensionality. To select the most relevant features, MCSO has been proposed and compared with CSO, and the proposed technique proves superior. The selected features have been classified by applying two variations of KRR, namely WKRR and RKRR. Other models like RR, OSRR, SVM (both SVMRBF and SVMPoly) and random forest have also been implemented for comparison.
References (80)
Cancer incidence and mortality patterns in Europe: estimates for 40 countries in 2012, Eur. J. Cancer (2013)
A fuzzy-based data transformation for feature extraction to increase classification performance with small medical data sets, Artif. Intell. Med. (2011)
A review of microarray datasets and applied feature selection methods, Inf. Sci. (2014)
EEG signal classification using PCA, ICA, LDA and support vector machines, Expert Syst. Appl. (2010)
SVM-based CAD system for early detection of the Alzheimer's disease using kernel PCA and LDA, Neurosci. Lett. (2009)
A GA-based feature selection and parameters optimization for support vector machines, Expert Syst. Appl. (2006)
Parameter determination of support vector machine and feature selection using simulated annealing approach, Appl. Soft Comput. (2008)
An improved feature selection method based on ant colony optimization (ACO) evaluated on face recognition system, Appl. Math. Comput. (2008)
A discrete particle swarm optimization method for feature selection in binary classification problems, Eur. J. Oper. Res. (2010)
An improved particle swarm optimization for feature selection, J. Bionic Eng.
Feature subset selection using differential evolution and a wheel based search strategy, Swarm Evolut. Comput.
Efficient training and improved performance of multilayer perceptron in pattern classification, Neurocomputing
Breast mass classification based on cytological patterns using RBFNN and SVM, Expert Syst. Appl.
Predicting breast cancer survivability: a comparison of three data mining methods, Artif. Intell. Med.
Evolutionary generalized radial basis function neural networks for improving prediction accuracy in gene classification using feature selection, Appl. Soft Comput.
Learning and generalization characteristics of the random vector functional-link net, Neurocomputing
A novel artificial neural network method for biomedical prediction based on matrix pseudo-inversion, J. Biomed. Inf.
IIR system identification using cat swarm optimization, Expert Syst. Appl.
Cat swarm optimization algorithm for optimal linear phase FIR filter design, ISA Trans.
GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data, Int. J. Med. Inf.
The improvement of breast cancer prognosis accuracy from integrated gene expression and clinical data, Expert Syst. Appl.
An experimental comparison of performance measures for classification, Pattern Recognit. Lett.
The rising burden of cancer in the developing world, Ann. Oncol.
Cancer statistics, CA Cancer J. Clin.
Classification of multiple cancer types by multicategory support vector machines using gene expression data, Bioinformatics
An Introduction to Feature Extraction, in: Feature Extraction
A review of feature selection techniques in bioinformatics, Bioinformatics
Comparative Study of Kernel Based Classification and Feature Selection Methods with Gene Expression Data
Graph-driven feature extraction from microarray data using diffusion kernels and kernel CCA, Adv. Neural Inf. Process. Syst.
Gene selection algorithm by combining reliefF and mRMR, BMC Genom.
Multiple SVM-RFE for gene selection in cancer classification with expression data, IEEE Trans. Nanobiosci.
Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell.
Solutions to instability problems with sequential wrapper-based approaches to feature selection, J. Mach. Learn. Res.
Selecting optimal feature set in high-dimensional data by swarm search, J. Appl. Math.
Face recognition using bacteria foraging optimization-based selected features, Int. J. Adv. Comput. Sci. Appl.