Article

Evaluation and Exploration of Machine Learning and Convolutional Neural Network Classifiers in Detection of Lung Cancer from Microarray Gene—A Paradigm Shift

1 Department of Information Technology, Bannari Amman Institute of Technology, Sathyamangalam 638401, India
2 Department of Electronics and Communication Engineering, Bannari Amman Institute of Technology, Sathyamangalam 638401, India
* Author to whom correspondence should be addressed.
Bioengineering 2023, 10(8), 933; https://doi.org/10.3390/bioengineering10080933
Submission received: 6 July 2023 / Revised: 3 August 2023 / Accepted: 4 August 2023 / Published: 6 August 2023
(This article belongs to the Section Biosignal Processing)

Abstract: Microarray gene expression-based detection and classification of medical conditions have been prominent in research studies over the past few decades. However, extracting relevant data from high-volume microarray gene expression, with its inherent nonlinearity and inseparable noise components, raises significant challenges during data classification and disease detection. The dataset used for the research is the Lung Harvard 2 (LH2) dataset, which consists of 150 Adenocarcinoma subjects and 31 Mesothelioma subjects. The paper proposes a two-level strategy involving feature extraction and selection methods before the classification step. The feature extraction step utilizes the Short-Time Fourier Transform (STFT), and the feature selection step employs the Particle Swarm Optimization (PSO) and Harmony Search (HS) metaheuristic methods. The classifiers employed are Nonlinear Regression, Gaussian Mixture Model, Softmax Discriminant, Naive Bayes, SVM (Linear), SVM (Polynomial), and SVM (RBF). The classification results obtained with the two-level extracted features are compared with raw-data classification results, including a Convolutional Neural Network (CNN) methodology. Among the methods, STFT with PSO feature selection and the SVM (RBF) classifier produced the highest accuracy of 94.47%.

Graphical Abstract

1. Introduction

Lung cancer remains one of the most significant health concerns worldwide, with high morbidity and mortality rates. It is primarily associated with exposure to tobacco smoke, either through active or passive smoking. However, as mentioned in Dubin et al. [1], non-smokers may also develop lung cancer due to environmental pollutants, occupational hazards, genetic predisposition, and underlying lung diseases. The risk increases with prolonged exposure to carcinogens such as asbestos, radon, and certain industrial chemicals. In recent years, there has been increasing emphasis on early detection and intervention in lung cancer. At early stages, the tumor is localized; hence the treatment options are more effective, and the chances of successful intervention and cure are significantly higher. Unfortunately, as discussed in Selman et al. [2], lung cancer often remains asymptomatic or presents nonspecific symptoms until it reaches an advanced stage, making early detection challenging.
Various detection methods have been employed to identify lung cancer early. Infante et al. [3] utilized chest X-rays and Computed Tomography (CT) imaging techniques to detect suspicious lung nodules, masses, or other abnormalities that may indicate the presence of lung cancer. The sputum cytology method used in Thunnissen et al. [4] analyzes a sputum sample for the presence of cancer cells. There are also procedures such as bronchoscopy, studied in Andolfi et al. [5], that allow the collection of lung tissue samples and visualization of the airways using a thin, flexible tube and camera. All of the above methods or procedures are typically applicable only after lung cancer cells have spread in the individual. However, Zhu et al. [6] used microarray-based gene expression analysis to detect lung cancer at its earliest stages, facilitating timely intervention and improving patient outcomes. Microarray analysis compares the gene expression profiles of healthy lung tissue with those of cancerous tissue. The analysis aids researchers in identifying distinct patterns and signatures associated with different lung cancer types.

1.1. Review of Previous Works

As mentioned, microarray-based gene expression analysis is widely used for the detection and classification of various health conditions. A microarray is a tool used in molecular biology to study gene expression. As described by Churchill et al. [7], a microarray is a collection of microscopic spots containing DNA molecules representing individual genes. These spots are arranged in a grid pattern on a small glass slide or a silicon chip. Microarray gene expression analysis can investigate changes in gene expression patterns in cancer cells compared to normal cells, as discussed in Ross et al. [8]. By simultaneously measuring the expression levels of thousands of genes, microarrays can provide a comprehensive view of the changes in cancer cells. Also, as reported by Reis et al. [9], microarray gene expression analysis can be utilized to classify different types of cancer, which can help guide treatment decisions at early stages.
As indicated in Lapointe et al. [10], different types of cancer have distinct gene expression profiles, which reflect the underlying molecular and genetic changes that drive the disease. In Dwivedi et al. [11], researchers used microarray analysis to identify gene expression patterns that distinguish between different types of leukemia, such as acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Similarly, Rody et al. [12] used microarray analysis to distinguish between different breast cancer subtypes, such as estrogen receptor-positive and HER2-positive breast cancer. In Sanchez et al. [13], microarray gene expression analysis was employed to predict the likelihood of tumors and lung cancer. Kerkentzes et al. [14] classified Adenocarcinoma (Adeno) and Mesothelioma (Meso) lung cancer from microarray genes. Adeno and Meso are considered serious cancer conditions for the diverse reasons mentioned in Wagner et al. [15]. Adeno can exhibit rapid tumor growth and potentially spread to other body parts, including distant organs. Meso is extremely aggressive, making treatment difficult when diagnosed at advanced stages. Both Adeno and Meso have limited treatment options compared to other forms of lung cancer. Although surgery, chemotherapy, and radiation therapy can be used, these treatments may not always be curative, especially if the cancer has spread throughout the body.
Further, these cancer conditions develop resistance to standard cancer treatments due to genetic mutations in tumor cells, limiting the effectiveness of targeted therapies. Above all, the significance of this research investigation lies in the poor prognosis of Adeno and Meso conditions compared to other forms of lung cancer. The outcome of this research will help in timely detection, early intervention, and personalized treatment approaches to improve the conditions of cancer-diagnosed individuals. Microarray gene expression analysis-based lung cancer detection offers several advantages over traditional histopathology diagnostic methods, as discussed in Weigelt et al. [16]. Unlike histopathology, gene expression analysis provides molecular insights into the underlying biology of lung cancer, which helps in understanding lung cancer subtypes, disease progression, and treatment responses. This molecular information goes beyond the structural or morphological features captured by imaging or histopathology, providing a deeper understanding of the disease. Also, traditional diagnostic methods may only detect lung cancer when it has already reached a more advanced stage. Microarray gene expression analysis also yields potential biomarkers with higher sensitivity and specificity than traditional diagnostic methods. All of these benefits motivate us to use microarray gene expression analysis for early diagnosis of lung cancer.

1.2. Review of Feature Extraction Techniques

Feature extraction techniques are crucial in acquiring relevant features from microarray gene data. As described in Hira et al. [17], these methods aim to reduce the dimensionality of the data while retaining important information for subsequent analysis. The high-dimensional data are reduced so that the result can still explain most of the variance in the dataset. The reduced data also contain the essential patterns and relationships between genes, enabling a more manageable and meaningful representation of the data. Feature extraction methods are advantageous in removing noise, as they extract the underlying signal or patterns in the data by focusing on the most significant features.
In this paper, we employ Short-Time Fourier Transform (STFT) for analyzing and expressing the microarray gene expression data. The STFT provides a time-frequency representation of the signal, which can capture changes in gene expression over time. The STFT provides information about the temporal localization of signal events, allowing researchers to identify specific time points or intervals where gene expression changes occur as mentioned in Qi et al. [18]. This representation can also reveal temporal patterns and relationships between genes, allowing for the identification of dynamic gene expression changes associated with specific conditions or biological processes. The STFT can also extract frequency-domain features from gene expression data. These features provide additional information about the distribution or characteristics of gene expression patterns that help to improve classification accuracy. Also, STFT identifies specific frequency bands representing relevant genes associated with biological processes and disease conditions.

1.3. Review of Feature Selection Techniques

After feature extraction, feature selection methods further improve lung cancer classification from microarray gene expression data. As described in Abdelwahab et al. [19], these methods help identify a subset of relevant genes that are informative for distinguishing between different lung cancer types. In this paper, we employ the metaheuristic feature selection methods Particle Swarm Optimization (PSO) and Harmony Search (HS) to select the subset of genes most discriminatory for lung cancer classification. As elaborated in Shukla et al. [20], by focusing on the most discriminative genes, these metaheuristic methods can reduce the risk of overfitting and improve the generalization capability of the classifier, leading to better performance. This is extremely useful for classifying the LH2 dataset, as the number of samples is limited. Metaheuristic feature selection methods can also handle correlations and dependencies between genes, removing redundant genes while maximizing their relevance to lung cancer classification. Overall, metaheuristic feature selection identifies a biologically relevant gene subset and can give insights into the underlying biological processes associated with lung cancer. However, the choice of metaheuristic algorithm plays a vital role in selecting the relevant genes from the dataset, so it is important to evaluate different metaheuristic methods and consider their performance, computational complexity, and suitability for the specific problem.

1.4. Review of CNN Methodology

Apart from machine learning classifiers, Convolutional Neural Networks (CNNs) may make a significant difference in the classification of lung cancer from microarray gene expression data. Without manual feature extraction and selection, CNNs can automatically learn relevant features from raw input data. In the context of gene expression data, CNNs can extract meaningful patterns and relationships between genes. This allows CNNs to capture complex interactions that may be difficult to discern manually. For gene expression data, genes can be considered spatial entities, and their expression across samples can be treated as spatial patterns. Here, CNNs excel at learning hierarchical representations, starting from low-level gene expression features and gradually building up to higher-level features that capture more abstract characteristics. CNNs can also identify gene expression patterns relevant to lung cancer, even if they occur in different regions or orders within the microarray. CNNs employ parameter optimization and can draw conclusions from limited training data such as microarray gene expression datasets. Further, CNNs are robust to inherent noise, variability, and batch-effect variations; they can learn to identify important patterns despite these variations, making them suitable for analyzing microarray gene expression datasets. CNN models also inherently capture the nonlinear relationships between features that are crucial for modeling the complex interactions within gene expression data. Therefore, we also experiment with the CNN methodology to classify lung cancer from microarray gene expression data. The overall methodologies for dimensionality reduction and classification of microarray gene expression data for lung cancer employed in this paper are shown in Figure 1.
Despite the numerous benefits of using CNNs for classifying lung cancer from microarray gene expression data, they also have certain limitations. Even though CNNs are good at learning features, it is often challenging to interpret the physical significance behind the prediction and classification of gene expression patterns from the learned features due to their black-box nature. Further, CNNs are usually prone to overfitting due to the limited dataset availability; hence, it is important to carefully tune and select the parameters in the CNN model. The feature extraction and selection methods can solve the overfitting problem to a certain extent in CNN models. However, these stages also add computational complexity to CNNs’ inherent complexity. For the microarray gene expression dataset used in this paper (Lung Harvard 2 Dataset), the imbalanced class distributions in the lung cancer dataset can pose challenges. Due to class imbalance, CNNs may tend to bias towards the Adeno class, leading to lower accuracy for the Meso class. The performance of CNN can be improved by carefully choosing tuning parameters like learning rate, network architecture, and regularization strength. Also, suitable classifiers are integrated into the softmax layer to improve classification performance in CNN.

1.5. Description of the Dataset

In this research, we have utilized the Lung Harvard 2 (LH2) microarray gene expression dataset [21]. This dataset contains translated microarray data of patient samples affected by Adenocarcinoma (Adeno) and Mesothelioma (Meso). The gene expression ratio technique is the translation technique used in the dataset for a simpler representation. The technique can accurately differentiate between genetically disparate and normal tissues based on thresholds. The dataset contains a total of 150 Adeno and 31 Meso tissue samples. Each subject is characterized by a 12,533 × 1 vector of gene expression values. The final row of the dataset labels the samples as Adenocarcinoma or malignant Mesothelioma. This microarray gene expression dataset is widely used for early and reliable lung cancer classification.
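As a concrete illustration of this layout, the sketch below builds a synthetic array with the stated dimensions. The values are random placeholders, not the actual LH2 data; loading the real file would depend on its distribution format.

```python
import numpy as np

# Synthetic stand-in for the LH2 layout: 12,533 genes per subject,
# 150 Adeno + 31 Meso subjects, with a final row carrying class labels.
rng = np.random.default_rng(0)

n_genes, n_adeno, n_meso = 12533, 150, 31
expression = rng.random((n_genes, n_adeno + n_meso))   # genes x subjects
labels = np.array([0] * n_adeno + [1] * n_meso)        # 0 = Adeno, 1 = Meso

dataset = np.vstack([expression, labels])              # final row = labels
print(dataset.shape)  # (12534, 181)
```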

1.6. Relevance of This Research

As mentioned earlier, lung cancer detection from microarray gene data helps to diagnose cancer at an earlier stage, which is crucial because treatment options are limited in advanced stages. Early detection can thus significantly improve patient outcomes and survival rates. In this paper, we perform classification of Adeno and Meso cancer from the LH2 dataset. The dataset is of high volume and contains inherent nonlinearity and class imbalance. Therefore, it is important to adopt suitable preprocessing and classification methods to enhance the classification performance. In addition, microarray gene data analysis enables the identification of potential biomarkers associated with lung cancer to help in diagnosis, prognosis, and monitoring of treatment response.

2. Methodology

In this research, a two-level strategy is adopted to improve the overall efficiency of the classifiers. First, feature extraction is performed using STFT to address the curse of dimensionality. The dimension of the dataset is reduced to lessen the computational overhead of the classifier. The second step then selects features using the PSO and HS metaheuristic algorithms. The feature selection step further improves the classification methodology by selecting relevant features and patterns and reducing the risk of overfitting and underfitting.

2.1. Feature Extraction Using STFT

The Short-Time Fourier Transform (STFT) is a frequency-domain analysis of information over a short period. When applied to microarray gene data, the STFT can reduce the data dimension by extracting relevant and useful features. Gupta et al. [22] performed QRS complex detection using STFT. The authors identified that STFT is useful in providing a time-frequency representation of the data, which helps researchers investigate how gene expression levels change over time and across different frequency components. Identifying specific time intervals where certain frequency components are prominent is possible with STFT, giving a localized representation of frequency content over time. For microarray gene expression analysis, this feature of STFT is useful in identifying genes that exhibit temporal patterns active during specific time intervals. Moreover, STFT reduces dimensionality by extracting the important genes or gene clusters associated with specific frequency components. The dimensionality of the microarray gene expression dataset is reduced to 2049 × 150 and 2049 × 31 for the Adeno and Meso subjects, respectively. For performing STFT, the Blackman window, which minimizes spectral leakage, is used as the windowing function. Thus, STFT can provide biological insights by finding the frequency patterns and fundamental relationships in the dataset.
$$X(m,\omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n-m]\, e^{-j\omega n}$$
Here, x[n] represents the input data of length N, with n = 0, 1, 2, …, N − 1. The STFT window is represented by w[n] of length M, with m = 0, 1, 2, …, M − 1, and j = √(−1) denotes the imaginary unit. The Blackman window is given by the expression
$$w[n] = 0.42 - 0.5\cos\!\left(\frac{2\pi n}{M}\right) + 0.08\cos\!\left(\frac{4\pi n}{M}\right), \qquad 0 \le n \le M-1$$
The next section analyzes the various statistical parameters associated with the STFT dimensionality reduction.
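A minimal sketch of this feature-extraction step, assuming each subject's 12,533-gene profile is treated as a one-dimensional signal and using SciPy's `stft` with a Blackman window of segment length 4096, which yields 4096/2 + 1 = 2049 frequency bins, matching the reduced dimension reported above. The averaging over segments is an illustrative choice, not necessarily the paper's exact reduction.

```python
import numpy as np
from scipy.signal import stft

# One subject's gene profile as a synthetic 1-D signal of length 12,533.
rng = np.random.default_rng(1)
profile = rng.random(12533)

# Blackman window, segment length 4096 -> 2049 one-sided frequency bins.
f, t, Zxx = stft(profile, window="blackman", nperseg=4096)

# Collapse the time axis to obtain one 2049-element feature vector
# per subject (average spectral magnitude per frequency bin).
features = np.abs(Zxx).mean(axis=1)
print(features.shape)  # (2049,)
```

Stacking such vectors over all subjects would produce the 2049 × 150 (Adeno) and 2049 × 31 (Meso) feature matrices mentioned in the text.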

2.2. Statistical Analysis on STFT

After the dimensionality reduction of the microarray genes, the resulting outputs are analyzed with statistical parameters such as mean, variance, skewness, kurtosis, the Pearson Correlation Coefficient (PCC), f-test, t-test, p-value, and Canonical Correlation Analysis (CCA) to identify whether the outcomes represent the underlying microarray gene properties in the reduced subspace. Table 1 shows the statistical feature analysis for the STFT Dimensionally Reduced (DR) Adeno Carcinoma and Meso cancer cases of microarray genes. As shown in Table 1, the STFT features exhibit high mean, variance, and kurtosis values in both classes. Skewness values are similar, with positive skewness in both cases. The PCC values indicate high within-class correlation of the attained outputs. The permutation and sample entropy for the Adeno and Meso cancer cases are similar. The f-test, t-test, and p-value show no statistically significant differences in the dimensionally reduced outputs of the microarray genes. The statistical f-test and t-test are performed for the Null Hypothesis. The t-test value is low for the Meso cancer class but does not indicate significance, as the p-value is higher than 0.01. This statistical analysis indicates that the data are non-Gaussian and nonlinear. The histograms, normal probability plots, and scatter plots of the DR technique outputs further confirm this. CCA visualizes the correlation of the DR method outcomes between the Adeno Carcinoma and Meso cancer cases. The low CCA values in Table 1 indicate that the DR outcomes are weakly correlated between the two cancer classes.
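This statistical screening can be reproduced on stand-in data with SciPy; the gamma-distributed samples below are illustrative only and merely mimic the positive skew noted in Table 1, not the actual STFT outputs.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the 2049-dimensional DR outputs of each class.
rng = np.random.default_rng(2)
adeno = rng.gamma(2.0, 1.0, size=2049)   # positively skewed, non-Gaussian
meso = rng.gamma(2.2, 1.0, size=2049)

# First four moments, as in the Table 1 screening.
for name, x in [("Adeno", adeno), ("Meso", meso)]:
    print(name, x.mean(), x.var(), stats.skew(x), stats.kurtosis(x))

# Two-sample t-test for the null hypothesis of equal means; a p-value
# above 0.01 means the difference is not significant at that level.
t_stat, p_value = stats.ttest_ind(adeno, meso)
print(t_stat, p_value)
```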
Figure 2 explores the Scatter plot for the STFT-based Dimensionality Reduction Method in Meso and Adeno Carcinoma Cancer Classes.
After applying STFT-DR, the Meso and Adeno Carcinoma Cancer dataset seems clustered and overlapped in the scatter plot as most data are amalgamated into a single group. The data after STFT-DR also does not show any signs of Gaussian distribution. Therefore, the scatter plot emphasizes the need for the feature selection process that enhances the classification accuracy of the classifiers.

2.3. Feature Selection Using Metaheuristic Algorithms

As described before, the dimensionality reduction step reduces computational complexity and removes noise in the data. The feature selection step further reduces the number of features after the dimensionality reduction step making subsequent computations faster and more efficient. Also, the feature selection step acts as one more noise filter stage that captures only the relevant features that contribute to the underlying structure of the data. The feature selection step reveals the relevant features from the dimensionally reduced data. This subset of the significant features is the most informative one that holds the gene patterns and functionality to develop a simpler model. The feature selection using metaheuristic algorithms helps to overcome the problems of computationally expensive and noise-hidden datasets by employing efficient search strategies. The algorithms, namely PSO (Particle Swarm Optimization) and Harmony Search (HS), are known for effectively handling high dimensional and complex datasets like microarray gene datasets. Metaheuristic algorithms reduce the microarray gene expression datasets considering the original number of samples and the required number of genes. The reduced dataset contains complex relationships between genes and their associations with biological processes. Metaheuristic algorithms evaluate the collective behavior of the population to arrive at optimal feature subsets. The optimal feature subset assists the subsequent stages, like disease classification.
PSO is an optimization algorithm inspired by the food-search and social behavior of a flock of birds. When applied in the feature selection step, PSO explores and refines the population by moving through the entire search space, balancing exploration and exploitation. Over the iterations, PSO gradually refines the solution space towards the global optimum. The implementation of PSO is simple and efficient, as it involves fewer control parameters than other metaheuristic algorithms. PSO therefore converges relatively fast, making it computationally efficient for feature selection from large datasets like microarray gene expression datasets. HS, on the other hand, is inspired by the process of improvisation in music: the generation of harmonious musical notes represents potential solutions. HS employs a unique approach to enhance search space exploration by dodging local optima. HS incorporates adaptive parameters that change dynamically during the search process; this behavior is useful because HS can easily adapt to the characteristics of the microarray gene expression dataset. HS is also robust to noise and nonlinearities in data, making it a suitable method for feature selection from microarray datasets. Based on these characteristics of PSO and HS, the two algorithms are used as the feature selection step in this research to understand gene interactions, capture complex relationships, balance exploration and exploitation, and thus identify the most relevant genes. The objective (fitness) function in the PSO problem is the minimization of the Mean Square Error (MSE) of the training process. The relevant features selected by the metaheuristic methods further reduce the dimensionality of the STFT-extracted features to 2049 × 30 and 2049 × 6 for the Adeno and Meso cancer subjects, respectively.
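A compact sketch of binary PSO feature selection under these ideas. The nearest-centroid training MSE below stands in for the paper's training-process MSE fitness, and the data shapes are synthetic; it is a sketch of the technique, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((40, 2049))                    # subjects x STFT features
y = np.array([0] * 30 + [1] * 10)             # imbalanced two-class labels

def fitness(mask):
    """Training MSE of a nearest-centroid rule on the selected features."""
    if mask.sum() == 0:
        return np.inf
    Xs = X[:, mask]
    centroids = np.stack([Xs[y == c].mean(axis=0) for c in (0, 1)])
    d = ((Xs[:, None, :] - centroids[None]) ** 2).sum(axis=2)
    pred = d.argmin(axis=1)
    return ((pred - y) ** 2).mean()

n_particles, n_feat, n_iter = 8, X.shape[1], 20
pos = rng.random((n_particles, n_feat))       # continuous particle positions
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.full(n_particles, np.inf)
gbest, gbest_fit = pos[0].copy(), np.inf

for _ in range(n_iter):
    masks = pos > 0.5                          # binarize positions to a mask
    fits = np.array([fitness(m) for m in masks])
    improved = fits < pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fits[improved]
    if fits.min() < gbest_fit:
        gbest_fit, gbest = fits.min(), pos[fits.argmin()].copy()
    # Standard velocity update balancing exploration and exploitation.
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)

print(gbest_fit)                               # best training MSE found
```

The final mask `gbest > 0.5` would give the selected feature subset; a top-k cut on `gbest` could instead enforce a fixed subset size such as 30 or 6 features.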
The normal probability plot for the STFT DR method followed by the PSO feature selection technique for the Adeno Carcinoma cancer class is shown in Figure 3. It is observed from Figure 3 that the PSO feature selection clustered the STFT features into a sigmoid-shaped structure around a mean value of 2.56. The normal plot also implies that the PSO features are non-Gaussian and exhibit minimal segregation. Data 1 through data 5 represent the reference line, data 6 through data 10 represent the upper bound, and data 11 through data 15 are the feature-selected PSO values for the Adeno class.
The normal probability plot for the STFT DR method followed by the PSO feature selection technique for the Meso cancer class is shown in Figure 4. It is observed from Figure 4 that the PSO feature selection clustered the STFT features around a mean value of 2.47, again in a sigmoid-shaped structure. The normal plot also implies that the PSO features are non-Gaussian and exhibit minimal segregation. Data 1 through data 5 represent the reference line, data 6 through data 10 represent the upper bound, and data 11 through data 15 are the feature-selected PSO values for the Meso class.
Figure 5 displays the histogram of the STFT DR method followed by the Harmony Search feature selection technique for the Adeno Carcinoma cancer class. The histogram depicts Harmony Search features with outliers, wide gaps, downward trends, and a non-Gaussian nature. The STFT dimensionality-reduced Harmony Search feature selection data for four Adeno Carcinoma class patients are denoted by x(:,1), x(:,2), x(:,3), and x(:,4).
Figure 6 exhibits the histogram of the STFT DR method followed by the Harmony Search feature selection technique for the Meso cancer class. The histogram displays Harmony Search features with outliers, wide gaps, skewness, and a non-Gaussian nature. The STFT dimensionality-reduced Harmony Search feature selection data for four Meso class patients are denoted by x(:,1), x(:,2), x(:,3), and x(:,4).
Table 2 shows the analysis of the Friedman test for the feature selection methods on STFT data. It is observed from Table 2 that, for both cancer classes, the PSO and Harmony Search feature selection methods yield non-significant p-values.
The Friedman test is a non-parametric statistical test corresponding to a two-way analysis of variance by ranks. It is an extension of the sign test for matched pairs and is used when the data arise from more than two related samples. It is applied to determine whether any dependent variable significantly affects the feature selection method. It is employed in this research as a tool to compare the performance of feature selection algorithms on the same dataset and assess their relative performance. Thus, the Friedman test helps to identify the most effective method among the evaluated set. Since it is a non-parametric test, it is particularly useful when the data are heterogeneous or do not follow a normal distribution. The Friedman test also provides a statistical measure to determine whether the observed differences in performance are significant. Overall, the test is useful in deciding the relative efficacy of the feature selection techniques. First, suitable parameters obtained from the Friedman test are decided, and performance is evaluated multiple times to account for variability. After that, the methods are ranked such that the best method receives a rank of 1, the second best a rank of 2, and so on; the average rank of each method is then calculated. Finally, the Friedman statistic is computed using the following expression.
$$\chi^2_r = \frac{12n}{k(k+1)} \sum_{i=1}^{k} R_i^2 - 3n(k+1)$$
Here, k represents the number of feature selection methods, n denotes the number of repetitions, and Ri indicates the average rank of the ith method. This research measures the efficacy of the feature selection methods, namely PSO and HS, on STFT data. The parameters reported from the Friedman test are the χ²r statistic, the p-value, and its significance at the p < 0.01 confidence level.
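The test itself is available in SciPy as `friedmanchisquare`; the scores below are illustrative placeholders (one list of accuracy values per method, one entry per repetition), not the paper's measured values.

```python
from scipy.stats import friedmanchisquare

# Illustrative accuracy scores for three methods across five repetitions.
pso_scores = [0.92, 0.90, 0.93, 0.91, 0.92]
hs_scores  = [0.89, 0.91, 0.90, 0.90, 0.88]
raw_scores = [0.85, 0.86, 0.84, 0.87, 0.85]

# Friedman test: ranks the methods within each repetition and tests
# whether the average ranks differ more than chance would allow.
stat, p_value = friedmanchisquare(pso_scores, hs_scores, raw_scores)
print(stat, p_value)

# Significance is judged at p < 0.01, as in the text.
print("significant" if p_value < 0.01 else "not significant")
```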

3. Classification Using Machine Learning Techniques

This research employs various machine learning classifiers for lung cancer classification. The choice of classifier is very significant in the classification methodology. Martín et al. [23] employed a Nonlinear Regression classifier for heterogeneous and highly nonlinear datasets related to the manufacturing process. For this research problem, a classifier that can capture complex nonlinear dependencies between gene expression levels and class labels is suitable, since linear classifiers cannot effectively capture these complex patterns and nonlinear decision boundaries. Also, as discussed in Dalatu et al. [24], nonlinear regression models can be more robust to outliers than linear classifiers. Ficklin et al. [25] utilized Gaussian Mixture Models (GMMs) for the classification of microarray gene data for various cancer types, namely lower-grade glioma, thyroid, glioblastoma, ovarian, and bladder cancer. Since GMMs are probabilistic models, they can work effectively on microarray gene expression datasets that are combinations of multi-modal distributions. GMMs can capture complex data distributions by representing them as a combination of Gaussian components. Further, each Gaussian component in the mixture model can represent a different class, and the combination of these components enables the modeling of nonlinear decision boundaries, which is particularly useful when many samples overlap among the classes. Shah et al. [26] used the Softmax Discriminant Classifier for lung cancer classification from microarray gene expression. The classifier estimates a probability for each class, producing a probability distribution over all possible classes. This property is well suited to multi-class classification problems such as microarray gene expression data. The Softmax Discriminant Classifier is also computationally efficient, as it incorporates regularization techniques to prevent overfitting and improve generalization. Ahmed et al. [27] used the Naive Bayesian Classifier (NBC) to classify head and neck cancer from microarray gene expression data. NBCs are based on probabilistic principles, specifically Bayes' theorem. The classifier computes the probabilities of the different classes based on the feature values and assigns the class label with the highest probability. The NBC framework also facilitates the integration of prior knowledge or expert beliefs through prior probabilities. Moreover, NBCs are computationally efficient, making them particularly suitable for large-volume and sparse datasets like microarray gene expression datasets. Vanitha et al. [28] used a Support Vector Machine (SVM) for gene expression-based data classification. SVM classifiers build on Vapnik's statistical learning theory and thus handle the curse of dimensionality by finding optimal hyperplanes in high-dimensional feature space. This process maximizes the margin between classes and provides robustness to outliers. In SVMs, the mapping to a higher-dimensional feature space is achieved through the kernel trick. SVMs often yield sparse solutions and select the most informative genes contributing to the classification task. The sparse solution reduces the computational burden and improves generalization and classification accuracy. All of these features make SVM a suitable classifier for microarray gene expression datasets. The following sections discuss the classification methodology of each classifier mentioned above.
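As a hedged sketch, the three SVM kernels named above can be compared with scikit-learn on synthetic data mimicking the 150/31 class imbalance; the data and scores are illustrative, not the paper's results or settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic two-class data: 150 "Adeno-like" and 31 "Meso-like" samples
# with 30 features each (matching the imbalance, not the real dataset).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (150, 30)),
               rng.normal(1.5, 1.0, (31, 30))])
y = np.array([0] * 150 + [1] * 31)

# Cross-validated accuracy for the linear, polynomial, and RBF kernels.
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(kernel, round(acc, 3))
```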

3.1. Nonlinear Regression (NLR)

The nonlinear regression (NLR) classifier is a variant of the linear regression classifier, which associates variables in the form y = mx + c. Unlike linear regression models, nonlinear regression models relate the variables through a nonlinear transformation of y. NLR strives to minimize the MSE over the individual observations in the dataset. The data points are fitted to a nonlinear function through multiple iterations to build the classifier model. The Euclidean distance is first calculated between the dataset's target and input data through the following expression, as in Dai et al. [29].
$$y = \left( T_j - X_j \right)^2$$
Here, T represents target data, and X represents input data with j as the sample index. The Euclidean distance is then projected to a three-dimensional space using the following cuboid equation:
$$\text{Minimize: } z = k_1 y + k_2^2 y^2 + k_3^3 y^3$$
$$\text{Subject to: } k_1 > k_2 > k_3 > 0,\quad k_i \in [0,1] \text{ for } i = 1, 2, 3,\quad k_1 - \frac{k_2^2}{2} < 0.5,\quad k_2 = \frac{k_1}{10},\ k_3 = \frac{k_1}{10}$$
Later, the minimum value of the three-dimensional space, f = min (z) is computed.
The threshold function ‘g’ for the NLR classification problem is framed with the help of the minimum value of the three-dimensional space, f = min (z) and d0 as the sum of the squares of mean deviation values as
g = f + d0
Since the calculation of d0 involves the least-squares method, the methodology employed here may also be called a least-squares-based NLR.
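The scoring steps above can be sketched as follows. This is a minimal illustration with made-up target and input vectors; k1 = 0.9 is an assumed tuning constant, with k2 and k3 following the stated constraints.

```python
import numpy as np

# Hypothetical sketch of the least-squares NLR threshold computation:
# squared distance y, cubic projection z, minimum f, and threshold g = f + d0.
def nlr_threshold(T, X, k1=0.9):
    k2, k3 = k1 / 10, k1 / 10                   # constraints from the text
    y = (T - X) ** 2                            # squared distance per sample
    z = k1 * y + k2**2 * y**2 + k3**3 * y**3    # projection to the cubic space
    f = z.min()                                 # minimum of the projected space
    d0 = np.sum((y - y.mean()) ** 2)            # sum of squared mean deviations
    return f + d0                               # threshold g

T = np.array([0.85, 0.85, 0.65])                # toy targets
X = np.array([0.80, 0.90, 0.60])                # toy inputs
g = nlr_threshold(T, X)
print(g > 0)  # True
```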

3.2. Gaussian Mixture Model

The Gaussian Mixture Model (GMM) is a probabilistic model for classification and clustering tasks. The model considers that the data points are generated from a mixture of Gaussian distributions. The probability density function (PDF) of a Gaussian distribution with mean μ and covariance Σ is given by Ge et al. [30].
$$p(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)$$
where x is the data point, n is the dimensionality of the data, and T indicates the transpose operation. The GMM assumes that the observed data are generated from a mixture of K Gaussian distributions. The posterior probability of the kth Gaussian component with mixing coefficient πk is given by:
$$p(k \mid x) = \frac{\pi_k\, \eta(x; \mu_k, \Sigma_k)}{\sum_{i=1}^{K} \pi_i\, \eta(x; \mu_i, \Sigma_i)}$$
μk and Σk are the mean vector and covariance matrix for the kth component, with K as the total number of components in the mixture. In the next step, the parameters of the GMM, including the mixing coefficients, means, and covariances, are estimated using the Maximum Likelihood Estimation (MLE) technique. The MLE objective for the GMM is the log-likelihood of the observed data.
$$\log L(\theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \eta(x_i; \mu_k, \Sigma_k)$$
where θ represents the set of all parameters of the GMM and N is the total number of observed data points. Next, Expectation-Maximization (EM) is performed iteratively by computing the posterior probabilities p(k|x) for each data point (E-step) and updating the parameters θ to maximize the log-likelihood given these posteriors (M-step).
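The posterior (responsibility) computation above can be illustrated for one-dimensional data. The two components, mixing coefficients, means, and variances below are illustrative assumptions, not fitted values.

```python
import numpy as np

# Sketch of the GMM responsibility p(k|x) for 1-D data with two
# assumed components; all parameters are illustrative only.
def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def responsibilities(x, pis, mus, vars_):
    # p(k|x) = pi_k N(x; mu_k, var_k) / sum_i pi_i N(x; mu_i, var_i)
    num = np.array([pi * gaussian_pdf(x, mu, v)
                    for pi, mu, v in zip(pis, mus, vars_)])
    return num / num.sum()

r = responsibilities(0.7, pis=[0.8, 0.2], mus=[0.85, 0.65], vars_=[0.01, 0.01])
print(r)  # posterior over the two components, summing to 1
```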

3.3. Softmax Discriminant Classifier

The Softmax Discriminant Classifier uses the softmax function to transform the discriminant scores into class probabilities. At first, in SDC, the discriminant scores are calculated by taking the dot product between the feature vector x and the weight vector wk for each class k, along with the bias term bk. Mathematically, the discriminant score zk for class k is given by Hastie et al. [31]
  z k = w k T x + b k
For the discriminant scores z = [z1, z2, …, zK] over the K classes, the softmax function calculates the probability of the input x belonging to class k using the following expression:
$$P(y = k \mid x) = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)}$$
The loss function measures the discrepancy between the predicted class probabilities and the true class labels. The typically used loss function for Softmax Discriminant Classification is the cross-entropy loss. For the training dataset with N samples, the cross-entropy loss ‘L’ is defined as
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(i)} \log\left( P(y = k \mid x^{(i)}) \right)$$
where N is the number of samples in the training dataset, yk(i) is the true label for sample i, which is 1 if the sample belongs to class k and 0 otherwise, and P ( y = k / x ( i ) ) is the predicted probability of class k for sample i. Finally, the weight vector w k and bias term b k are updated using gradient descent optimization method with learning rate α as follows.
$$w_k = w_k - \alpha \frac{\partial L}{\partial w_k}$$
$$b_k = b_k - \alpha \frac{\partial L}{\partial b_k}$$
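One training step combining the score, softmax, cross-entropy gradient, and update rules above can be sketched as follows. The toy feature vector, label, and learning rate are assumed values.

```python
import numpy as np

# Sketch of one softmax-classifier gradient-descent step on toy data.
def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

x = np.array([1.0, 2.0])             # one assumed feature vector
y = 0                                # true class index
W = np.zeros((2, 2))                 # weights: one row w_k per class
b = np.zeros(2)                      # biases b_k
alpha = 0.1                          # learning rate

p = softmax(W @ x + b)               # predicted class probabilities
grad = p.copy(); grad[y] -= 1.0      # dL/dz for the cross-entropy loss
W -= alpha * np.outer(grad, x)       # gradient-descent updates
b -= alpha * grad

p_new = softmax(W @ x + b)
print(p_new[y] > p[y])  # True: probability of the true class increased
```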

3.4. Naive Bayesian Classifier (NBC)

The Naive Bayesian Classifier (NBC) is an efficient and straightforward probabilistic classification algorithm based on Bayes' theorem and the feature-independence assumption. First, NBC calculates the posterior probability of a class from the prior probability and the likelihood. For a given class C and features x1, x2, …, xn, the posterior probability p(C | x1, x2, …, xn) is expressed as in Berrar [32].
$$p(C \mid x_1, x_2, \ldots, x_n) = \frac{p(C)\, p(x_1, x_2, \ldots, x_n \mid C)}{p(x_1, x_2, \ldots, x_n)}$$
For the class C, p(C) represents the prior probability, p(x1, x2, …, xn | C) is the likelihood, and p(x1, x2, …, xn) is the evidence probability.
The Naive Bayes assumption is that the features are conditionally independent given the class. This assumption simplifies the calculation of the likelihood, which can be expressed as
$$p(x_1, x_2, \ldots, x_n \mid C) = p(x_1 \mid C)\, p(x_2 \mid C) \cdots p(x_n \mid C)$$
where p(xi | C) is the probability of feature xi given class C, estimated from the fraction of class-C training examples with feature value xi. The prior probability p(C) is then estimated as the fraction of training examples belonging to class C. Finally, to predict the class label for a feature vector, the algorithm calculates the posterior probability for each class and assigns the instance to the class with the highest probability.
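The full NBC prediction pipeline can be sketched with made-up binary features. The priors mirror the 150/31 class split of the LH2 dataset, while the per-feature likelihoods are purely illustrative.

```python
# Sketch of the Naive Bayes posterior with binary features;
# counts and per-feature probabilities are illustrative only.
def nb_posterior(priors, likelihoods, x):
    # Unnormalized posterior p(C) * prod_i p(x_i | C), then normalize.
    scores = {}
    for c in priors:
        s = priors[c]
        for i, xi in enumerate(x):
            p = likelihoods[c][i]
            s *= p if xi == 1 else (1 - p)
        scores[c] = s
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

priors = {"Adeno": 150 / 181, "Meso": 31 / 181}          # LH2 class fractions
likelihoods = {"Adeno": [0.9, 0.2], "Meso": [0.3, 0.8]}  # assumed p(x_i = 1 | C)
post = nb_posterior(priors, likelihoods, x=[1, 0])
print(max(post, key=post.get))  # Adeno
```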

3.5. Support Vector Machine Classifier (Linear)

In machine learning, the Support Vector Machine (SVM) is a robust learning algorithm used for analyzing and classifying data. The SVM with a linear kernel is the type of SVM that can solve linear problems by creating a line or hyperplane that separates the data into classes. The data points that reside closest to the hyperplane are the support vectors. The expression for the hyperplane used in the binary classification problem can be written as in Zhang et al. [33].
f x = w T x + b
Here, f(x) is the objective function with x as the feature vector, w as the weight vector perpendicular to the hyperplane, and b as the bias term determining the offset of the hyperplane from the origin. SVM aims to find the hyperplane that maximizes the margin. The optimization problem for SVM linear classification with margin $\frac{1}{\|w\|}$ is given by:
$$\underset{w,\,b}{\text{minimize}}\ \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i (w^T x_i + b) \ge 1 \text{ for all training data } (x_i, y_i)$$
Here, y i represents the class label (+1 or −1) for the training example with feature vector x i .
From the above primal optimization problem, the dual optimization problem is derived by finding the Lagrange multipliers or dual variables for each training example. The dual problem reduces computational burden using optimal Lagrange multipliers α i . Thus the SVM classification with the decision function flinear(x) is expressed as
f l i n e a r x = i = 1 N α i y i x i   .   x + b
If flinear(x) is positive, the instance is assigned to one class, and if it is negative, it is assigned to the other class.

3.6. Support Vector Machine Classifier (Polynomial)

SVM with a polynomial kernel is a variant of SVM that allows for nonlinear classification by mapping the input features into a higher-dimensional space using a polynomial function. The polynomial kernel function calculates the dot product between the mapped feature vectors in the higher-dimensional space. The polynomial kernel function Kpoly(x,z), where x and z are a pair of feature vectors, is given by Zhang et al. [33].
K p o l y x , z = ( γ   x T z + r ) d
where γ is the coefficient scaling the dot product, r is the coefficient representing the independent term, and d is the degree of the polynomial. Further, the decision function in SVM polynomial classification is defined as the linear combination of the kernel evaluations between the support vectors and the input feature vector with the bias term. The decision function f p o l y x is given by:
f p o l y x = i = 1 N ( α i y i ) K p o l y x i , x + b
Based on fpoly(x) value, the class assignment is performed in a similar way as performed in SVM (Linear) classifier.

3.7. Support Vector Machine Classifier (RBF)

The Radial Basis Function (RBF) can also be used as a kernel while using SVM classifier. Here, nonlinear mapping of the input features into a higher-dimensional space is performed using an RBF. The RBF kernel KRBF(x,z) that is used to compute the similarity between feature vectors in the input space is expressed as Zhang et al. [33].
$$K_{RBF}(x, z) = \exp\left( -\frac{\|x - z\|^2}{2 \sigma^2} \right)$$
where ‖x − z‖ is the Euclidean distance between x and z, and σ is the kernel width parameter that controls the influence of each training sample. As before, the decision function for SVM RBF classification is also defined as a linear combination of the kernel evaluations between the support vectors and the input feature vector, plus the bias term. The decision function fRBF(x) for SVM RBF is given by:
f R B F x = i = 1 N ( α i y i ) K R B F x i , x + b
Based on fRBF(x) value, the class assignment is performed in a similar way as performed in SVM linear and polynomial classifiers.
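The three kernels of Sections 3.5–3.7 and the shared decision function f(x) = Σᵢ αᵢyᵢK(xᵢ, x) + b can be compared in one sketch. The support vectors, multipliers αᵢ, and bias below are assumed toy values rather than solutions of the dual problem.

```python
import numpy as np

# Sketch comparing the linear, polynomial, and RBF kernels under the
# same decision function; alphas and b are assumed, not optimized.
def k_linear(xi, x):                return xi @ x
def k_poly(xi, x, g=1, r=1, d=2):   return (g * (xi @ x) + r) ** d
def k_rbf(xi, x, sigma=1.0):        return np.exp(-np.sum((xi - x) ** 2) / (2 * sigma**2))

def decision(X_sv, y_sv, alphas, b, x, kernel):
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alphas, y_sv, X_sv)) + b

X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])  # two toy support vectors
y_sv = np.array([1, -1])                     # their class labels
alphas = np.array([0.5, 0.5])                # assumed Lagrange multipliers
b = 0.0                                      # assumed bias

x = np.array([0.8, 1.2])                     # query point near the +1 class
for k in (k_linear, k_poly, k_rbf):
    print(k.__name__, decision(X_sv, y_sv, alphas, b, x, k) > 0)
```

All three kernels assign this query point to the positive class; they differ in how the decision boundary bends for less separable data.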

3.8. Selection of Target

This research involves binary classification; hence, two targets, TAdeno and TMeso, for the Adeno and Meso classes, along with mapping constraints, must be selected. The targets and mapping constraints are selected based on the distribution of classes in the dataset and the class imbalance inherent in the microarray gene expression dataset. The target value TAdeno ∈ [0, 1] has the following constraint. For N features, with μi as the average of the ith input feature vector, the mapping constraint is framed as:
$$\frac{1}{N} \sum_{i=1}^{N} \mu_i \le T_{Adeno}$$
TAdeno must be greater than the averages μi, as well as the averages μj of the input feature vectors for the Meso case. For the target value TMeso ∈ [0, 1], with M features in the Meso case, the mapping constraint is framed as:
$$\frac{1}{M} \sum_{j=1}^{M} \mu_j \le T_{Meso}$$
Also, the difference between the target values must follow the following constraint:
$$\left| T_{Adeno} - T_{Meso} \right| \le 0.5$$
Thus, following all of the above constraints, the targets TAdeno and TMeso for the Adeno and Meso output classes are chosen as 0.85 and 0.65, respectively. The performance of the classifiers is evaluated based on the Mean Squared Error (MSE).
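A quick numerical check that the chosen targets satisfy the mapping constraints can be sketched as follows; the feature means μᵢ are randomly generated stand-ins for the real averages.

```python
import numpy as np

# Illustrative check of the target-selection constraints; 0.85 and 0.65
# are the chosen targets, while mu is a random stand-in for feature means.
T_adeno, T_meso = 0.85, 0.65
mu = np.random.default_rng(0).uniform(0.0, 0.6, size=2049)  # assumed means

ok = (mu.mean() <= T_meso) and (mu.mean() <= T_adeno) \
     and (abs(T_adeno - T_meso) <= 0.5)
print(ok)  # True
```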

3.9. Training and Testing of Classifiers

The training data for the dataset are limited, so we perform k-fold cross-validation. K-fold cross-validation is a popular method for estimating the performance of a machine learning model. The process followed by Fushiki et al. [34] is as follows. First, divide the dataset into k equally sized subsets (or "folds"). For each fold i, train the model on all of the data except the i-th fold and test the model on the i-th fold. The process is repeated for all k folds so that each is used once for testing, yielding k performance estimates (one per fold). The average of these k estimates gives an overall estimate of the model's performance. Once the model has been trained and validated using k-fold cross-validation, it can be retrained on the full dataset to predict new, unseen data. The advantage of k-fold cross-validation is that it provides a more reliable estimate of a model's performance than a simple train-test split, as it uses all of the available data. In this paper, k is chosen as 10. This research uses 2049 dimensionally reduced features per patient, with 150 Adenocarcinoma and 31 Mesothelioma patients, and requires multi-trial training of the classifiers. The use of cross-validation removes any dependence on the choice of pattern for the test set. The training process is controlled by monitoring the Mean Square Error (MSE), which is deduced from Fushiki et al. [34] as
$$MSE = \frac{1}{N} \sum_{j=1}^{N} \left( O_j - T_j \right)^2$$
where Oj is the observed value and Tj is the target value for observation j, and N is the total number of observations per epoch; in our case, N = 2049.
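The ten-fold procedure with the MSE criterion can be sketched as follows, with a trivial mean-predictor standing in for the paper's classifiers and targets set to the 0.85/0.65 values chosen earlier.

```python
import numpy as np

# Sketch of k-fold cross-validation with an MSE criterion; the "model"
# here simply predicts the mean training target, for illustration only.
def kfold_mse(X, T, k=10):
    idx = np.arange(len(X))
    np.random.default_rng(1).shuffle(idx)
    folds = np.array_split(idx, k)
    mses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = T[train].mean()                      # stand-in classifier
        mses.append(np.mean((pred - T[test]) ** 2)) # MSE on held-out fold
    return np.mean(mses)                            # average over k folds

T = np.concatenate([np.full(150, 0.85), np.full(31, 0.65)])  # 181 subjects
X = np.zeros((181, 2049))                                    # placeholder features
print(kfold_mse(X, T, k=10))
```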
Table 3 shows the confusion matrix for lung cancer detection.
In the case of lung cancer detection, the following terms can be defined:
  • True Positive (TP): a patient is correctly identified as having Adeno Carcinoma lung cancer.
  • True Negative (TN): a patient is correctly identified as having Meso cancer.
  • False Positive (FP): a patient is incorrectly identified as having Adeno Carcinoma lung cancer when they have Meso Carcinoma.
  • False Negative (FN): a patient is incorrectly identified as having Meso Carcinoma lung cancer when they have Adeno Carcinoma.
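Counting the four confusion-matrix terms from (true, predicted) label pairs can be done directly; the five pairs below are made up for illustration.

```python
# Sketch of the TP/TN/FP/FN counts for Adeno-vs-Meso, with Adeno
# treated as the positive class; the pairs are illustrative only.
pairs = [("Adeno", "Adeno"), ("Adeno", "Meso"),
         ("Meso", "Meso"), ("Meso", "Adeno"), ("Adeno", "Adeno")]

TP = sum(t == "Adeno" and p == "Adeno" for t, p in pairs)
TN = sum(t == "Meso" and p == "Meso" for t, p in pairs)
FP = sum(t == "Meso" and p == "Adeno" for t, p in pairs)
FN = sum(t == "Adeno" and p == "Meso" for t, p in pairs)
print(TP, TN, FP, FN)  # 2 1 1 1
```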
Table 4 explores the training and testing MSE performance of the classifiers without and with feature selection (PSO and Harmonic Search) methods for the STFT dimensionality reduction technique. The standard training MSE always varies between 1 × 10−6 and 1 × 10−8, while the standard testing MSE varies from 1 × 10−4 to 1 × 10−7. Without feature selection, the GMM classifier settled at minimum training and testing MSEs of 3.6 × 10−8 and 7.21 × 10−7, respectively. With the PSO feature selection method, the SVM (RBF) classifier scores minimum training and testing MSEs of 2.25 × 10−9 and 3.6 × 10−8, respectively. Likewise, with the Harmonic Search feature selection method, the SVM (RBF) classifier once again attained minimum training and testing MSEs of 2.56 × 10−8 and 1.96 × 10−7, respectively. A low testing MSE is one indicator of better classifier performance; as shown in Table 4, a higher testing MSE leads to poorer classifier performance, irrespective of the feature selection method.
Table 5 explores the selection of classifier parameters during the training process. The classifier parameters are chosen on a trial-and-error basis for the minimum training MSE condition. The stopping criterion for training is either reaching the minimum standard MSE value or 1000 iterations. It is observed that the training MSE reached 1.0 × 10−10 within 1000 iterations as training progressed.

4. Classification Using CNN Methods

Convolutional Neural Networks (CNNs) are deep learning models commonly used for image classification tasks. The CNN classification methodology adopted in this research is depicted in Figure 7. The CNN architecture employed in this research consists of a convolutional layer, a fully connected layer, and a softmax layer. The convolutional layer is the building block of a CNN that performs a convolution operation on the input data using a set of learnable filters. The convolutional block consists of a convolutional 2D layer, a batch normalization layer, a ReLU layer, and a max pooling 2D layer. The classification methodology in CNNs involves stacking these layers sequentially.
In the convolutional 2D layer, the convolution operation is performed between the filter and local patches of the 12,533 × 181 input data, producing feature maps. Next, the batch normalization layer normalizes the mean and variance of its input over mini-batches of training data, which helps reduce internal covariate shift, improves the stability and convergence of the network during training, and allows the network to learn more efficiently. The ReLU layer applies the Rectified Linear Unit (ReLU) activation function element-wise to the batch normalization layer's output, setting negative values to zero and keeping positive values unchanged. This crucial step introduces nonlinearity into the network, enabling the CNN to learn complex and nonlinear relationships in the dataset. The final step in the convolutional block is a 2 × 2 max pooling 2D layer, which reduces the feature maps' spatial dimensions while retaining the most prominent features: it divides the input into non-overlapping regions and keeps only the maximum value from each region, thereby reducing computational complexity and controlling overfitting. In this research, we employed three convolutional layers with 16, 32, and 64 filters, respectively.
The convolutional layers learn local image features, while the pooling layers reduce spatial dimensions. Batch normalization and ReLU layers help in normalization and nonlinearity, respectively. The fully connected layers capture high-level representations, and the softmax layer produces class probabilities for classification. By combining these layers and training the network using techniques like back propagation and optimization algorithms, CNNs can effectively learn and classify complex patterns in image data.
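The convolution → ReLU → max-pooling building block described above can be sketched in plain numpy for a single channel; the filter values are arbitrary, and real CNN layers learn many such filters.

```python
import numpy as np

# Minimal sketch of one convolution -> ReLU -> 2x2 max-pool stage;
# the input image and filter values are arbitrary illustrations.
def conv2d(img, f):
    h, w = img.shape; fh, fw = f.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+fh, j:j+fw] * f)  # valid convolution
    return out

def maxpool2(x):
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    # keep only the maximum of each non-overlapping 2x2 region
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
feat = np.maximum(conv2d(img, np.ones((3, 3))), 0)  # convolve, then ReLU
pooled = maxpool2(feat)
print(feat.shape, pooled.shape)  # (4, 4) (2, 2)
```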

4.1. CNN Training and Testing Parameters

The optimal training and testing parameter values are set using trial-and-error methods to achieve better classification accuracy and improved CNN performance. The ten-fold cross-validation technique averages the performance across multiple validation sets. The dataset is divided into ten folds, each used as the validation set once, with the remaining nine folds used for training; the process is repeated ten times with a different fold as the validation set. As the dataset is imbalanced, the cross-validation technique prevents overfitting and underfitting by representing the distribution of the minority and majority classes in every fold. The learning rate is selected to prevent overshooting and obtain an optimal set of weights for the CNN model. The mini-batches, which process a subset of training samples simultaneously, provide efficient computation and better generalization, reducing memory requirements and allowing for parallelization. The maximum number of epochs is fine-tuned to ensure the model has sufficient opportunity to learn the complex microarray gene dataset patterns while avoiding the risk of overfitting and underfitting. The number of convolutional and fully connected layers is chosen such that the CNN model captures intricate patterns and achieves higher classification accuracy. The filter size is tuned to capture the spatial relationships and extract different levels of features. The number of filters is selected to learn relevant patterns and representations in the microarray gene expression data. From all of the above observations, the optimal training and testing parameter values are fixed and given in Table 6.
The cross-entropy loss is a popular loss function used with CNNs for training the model. The cross-entropy loss measures the difference between the predicted class probabilities and the true class labels. It enables the CNN to minimize the difference between the predicted and actual distributions. The cross entropy loss ‘L’ in CNN is defined as:
$$L(y, p) = -\sum_{i} y_i \log(p_i)$$
where y is the one-hot encoded true class label vector, and p is the predicted class probability vector. The cross-entropy loss is significant for CNNs as it predicts how well the class probabilities match the ground truth labels. Also, the network learns meaningful representations and makes accurate predictions by minimizing the cross-entropy loss. Also, in CNN, the gradients obtained from the cross-entropy loss facilitate efficient parameter updates during back propagation.
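This one-hot cross-entropy computation can be sketched as follows, showing that a confident correct prediction incurs a lower loss than a wrong-leaning one; the probability vectors are illustrative.

```python
import numpy as np

# Sketch of the one-hot cross-entropy loss L(y, p) = -sum_i y_i log p_i.
def cross_entropy(y, p):
    return -np.sum(y * np.log(p))

y = np.array([1.0, 0.0])      # one-hot true label (e.g., Adeno)
good = np.array([0.9, 0.1])   # confident correct prediction
bad = np.array([0.4, 0.6])    # wrong-leaning prediction
print(cross_entropy(y, good) < cross_entropy(y, bad))  # True
```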

4.2. Accuracy

The accuracy of a classifier is a measure of how well it correctly identifies the class labels of a dataset. It is calculated by dividing the number of correctly classified instances by the total number of instances in the dataset. The equation for accuracy is given by Wilson et al. [35].
A c c = T N + T P T N + F N + T P + F P
A sample of the accuracy and loss function during CNN training and testing progress is shown in Figure 8. The accuracy function measures the percentage of correctly classified samples to determine the CNN model’s performance. It also helps to monitor training progress and evaluate the model’s ability to generalize unseen data. On the other hand, the loss function quantifies the discrepancy between the predicted and actual outputs. By optimizing the loss function, the CNN model can adjust its parameters to minimize errors and learn the underlying patterns in the data.
Table 7 exhibits the training and testing accuracy of the classifiers for raw data and for STFT features with CNN methods. For raw data, the training accuracy of the classifiers varied between 87.235% and 93.631%, while the testing accuracy varied from 83.425% to 90.607%. With STFT features, the Naïve Bayesian classifier reached the maximum training accuracy of 94.452%. The testing accuracy peaked at 91.66% for the SDC and Naïve Bayesian classifiers. Table 7 shows that the nonlinear regression classifier is the poorest performer for both types of input.
To analyze the statistical significance of the training and testing process for CNN with raw data and STFT features, the mean and standard deviation of the error are calculated. For CNN with raw data, the mean standard error of training accuracy is 1.758% with a mean standard deviation of 0.324%, and the mean standard error of testing accuracy is 2.094% with a mean standard deviation of 0.568%. For CNN with STFT features, the mean standard error of training accuracy is 1.565% with a mean standard deviation of 0.3721%, and the mean standard error of testing accuracy is 1.952% with a mean standard deviation of 0.4688%. Overall, Table 7 shows that the training and testing error is less than 2% and the standard deviation is less than 1%; the error is therefore insignificant compared with the training and testing accuracy values.
Evaluating Table 7 with the minimum and maximum accuracies obtained for training and testing under the two CNN methods (raw data and STFT features), by adding and subtracting the standard error values, the following inferences are drawn. For the CNN method with raw data input, SVM (RBF) attained training accuracies of 92.221% and 95.041% at mean accuracy ± standard error. In testing, SVM (RBF) attained an accuracy of 89.28773% at mean accuracy − standard error, while GMM attained the best testing accuracy of 92.24276% at mean accuracy + standard error. Likewise, for the CNN method with STFT features, Naïve Bayesian attained training accuracies of 93.0727% and 95.8327% at mean accuracy ± standard error. In testing, SDC attained an accuracy of 89.64667% at mean accuracy − standard error, while Naïve Bayesian again maintained the best testing accuracy of 94.11667% at mean accuracy + standard error. No change is observed in the nonlinear regression classifier's performance level.

5. Results and Discussion

The research utilizes formal ten-fold testing for the machine learning and CNN classification methodologies. The training uses 85% of the gene expression data, and the remaining 15% is employed for testing the model. Since this research is binary classification, the performance metrics are chosen accordingly. In binary classification problems, a confusion matrix is a valuable tool for evaluating the performance of a machine learning model. As mentioned earlier, the confusion matrix summarizes the predictions made by the model against the true labels of the data. Performance metrics such as Accuracy, Precision, Recall, F1 score, MCC, Error Rate, and Kappa are derived from the values within the confusion matrix. Next, we discuss in detail the performance metrics employed in this research.
  • Precision
P r e c i s i o n = T P ( T P + F P )
  • Recall
R e c a l l = T P ( T P + F N )
  • F1 Score
The F1 score is a measure of a classifier’s accuracy that combines precision and recall into a single metric. It is calculated as the harmonic mean of precision and recall, with values ranging from 0 to 1, where 1 indicates perfect precision and recall. The equation for F1 score is given by Koizumi et al. [36].
F 1 = 2 × T P ( 2 × T P + F P + F N )
Here, precision is the proportion of true positives among all instances classified as positive, and recall is the proportion of true positives among all positive instances. The F1 score is useful when the classes in the dataset are imbalanced, meaning there are more instances of one class than the other. In such cases, accuracy may be a bad metric, as a classifier that predicts the majority class would have high accuracy but low precision and recall. The F1 score provides a more balanced measure of a classifier’s performance.
  • Matthews Correlation Coefficient (MCC)
MCC stands for “Matthews Correlation Coefficient”, which measures the quality of binary (two-class) classification models. It considers true and false positives and negatives and is particularly useful in situations where the classes are imbalanced.
The MCC is defined by the following equation as given in Chicco et al. [37]:
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
The MCC takes on values between −1 and 1, where a coefficient of 1 represents a perfect prediction, 0 represents a random prediction, and −1 represents a perfectly incorrect prediction.
  • Error Rate
As mentioned by Duda et al. [38], the error rate of a classifier is the proportion of misclassified instances. It can be calculated using the following equation:
E r r o r   r a t e = ( F P + F N ) ( T P + T N + F P + F N )
  • Kappa
The kappa statistic, also known as Cohen’s kappa, measures agreement between two raters or between a rater and a classifier. In the context of classification, it is used to evaluate the performance of a classifier on a binary or multi-class classification task. The kappa statistic measures the agreement between the predicted and true classes, considering the possibility of the agreement by chance. Cohen et al. [39] defined kappa as follows:
K a p p a = ( P o P e ) ( 1 P e )
where P o is the observed proportion of agreement, and P e is the proportion of agreement expected by chance. P o and  P e are calculated as follows:
P o = ( T P + T N ) ( T P + T N + F P + F N )
$$P_e = \frac{(TP+FP)(TP+FN) + (FP+TN)(FN+TN)}{(TP+TN+FP+FN)^2}$$
Kappa is always less than or equal to 1. A Kappa coefficient of 1 implies perfect agreement, and values less than 1 can be interpreted as follows: poor agreement (<0.2), fair agreement (0.2–0.4), moderate agreement (0.4–0.6), good agreement (0.6–0.8), and very good agreement (0.8–1). The results are tabulated in the following tables.
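All of the above metrics can be computed from a single confusion matrix; the counts below are hypothetical, chosen only to match the 181-subject dataset size.

```python
import math

# All metrics above from one assumed confusion matrix; TP, TN, FP, FN
# are illustrative counts, not results taken from the paper's tables.
TP, TN, FP, FN = 143, 28, 3, 7
n = TP + TN + FP + FN

acc = (TP + TN) / n
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * TP / (2 * TP + FP + FN)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
error_rate = (FP + FN) / n
po = acc                                               # observed agreement
pe = ((TP + FP) * (TP + FN) + (FP + TN) * (FN + TN)) / n**2  # chance agreement
kappa = (po - pe) / (1 - pe)

print(round(acc, 4), round(f1, 4), round(kappa, 4))  # 0.9448 0.9662 0.8149
```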
Table 8 indicates the performance analysis of the classifiers based on parameters such as Accuracy, Precision, Recall, F1 Score, MCC, Error Rate, and Kappa values for the STFT dimensionality reduction method without feature selection. Table 8 shows that the GMM classifier is the high performer, with an accuracy of 80.66%, Precision of 92.59%, Recall of 83.33%, and an F1 Score of 87.71%, with a low error rate of 19.34%. The GMM classifier also demonstrates a moderate MCC of 0.4419 and a Kappa value of 0.4285. The Softmax Discriminant Classifier is the low performer, with a low accuracy of 59.11%, a high Error Rate of 40.89%, Precision of 90.42%, Recall of 56.66%, and an F1 Score of 69.67%. The MCC and Kappa values of the SD classifier are 0.2083 and 0.1609, respectively.
Table 9 depicts the performance analysis of the classifiers for the STFT dimensionality reduction method with the PSO feature selection method. Table 9 shows that the SVM (RBF) classifier achieved a high accuracy of 94.47%, Precision of 97.94%, Recall of 95.33%, and an F1 Score of 96.62%, with a low error rate of 5.52%. The SVM (RBF) classifier also reached a high MCC of 0.81709 and a Kappa value of 0.81485. The nonlinear regression classifier is placed at the lower edge, with a low accuracy of 59.91%, a high Error Rate of 43.09%, Precision of 88.297%, Recall of 55.34%, and an F1 Score of 68.03%. The MCC and Kappa values of the nonlinear regression classifier are 0.1496 and 0.1156, respectively.
Table 10 exhibits the performance analysis of the classifiers for the STFT dimensionality reduction method with the Harmonic Search feature selection method. As shown in Table 10, the SVM (RBF) classifier achieved a high accuracy of 90.05%, Precision of 97.82%, Recall of 90%, and an F1 Score of 93.75%, with a low error rate of 9.94%. The SVM (RBF) classifier also maintained a high MCC of 0.711 and a Kappa value of 0.6963. The SVM (Poly) classifier is the low performer, with an accuracy of 59.11%, Precision of 88%, Recall of 58.67%, a high Error Rate of 40.89%, and an F1 Score of 70.4%. The MCC and Kappa values of the SVM (Poly) classifier are 0.1512 and 0.1217, respectively.
Figure 9 shows the performance of the classifiers with and without feature selection in terms of the MCC and Kappa parameters. Figure 9 indicates that the SVM (RBF) classifier reached a high MCC of 0.81709 and a Kappa value of 0.81485, while the nonlinear regression classifier with the PSO feature selection method attains low MCC and Kappa values of 0.1496 and 0.1156. The average MCC and Kappa values across the classifiers without feature selection are 0.3212 and 0.27403, respectively. The average MCC and Kappa values across the classifiers for the PSO and Harmonic Search feature selection methods settle at 0.4194 and 0.3878, and 0.4841 and 0.4667, respectively. This signature effect of feature selection shows the improvement of the average MCC and Kappa values across the classifiers.
Figure 10 shows the Performance of Classifiers with and without Feature Selection regarding parameters like Accuracy, F1 Score and Error Rate. As identified by Figure 10, the SVM (RBF) Classifier with the PSO feature selection method achieved higher values in Accuracy of 94.47%, an F1 Score of 96.62% and a lower Error Rate of 5.52% than all other categories of classifiers. In the case of the Harmonic Search feature selection method, once again SVM (RBF) classifier attained good values of Accuracy of 90.05%, an F1 Score of 93.75% and a low Error Rate of 9.944% compared with all other six classifiers. GMM Classifier attained an appreciable value of Accuracy of 80.66%, F1 Score of 87.71% and moderate Error Rate of 19.33% for the STFT inputs without any feature selection methods. The effect of feature selection improves the classifiers’ benchmark parameters and overall performance. The PSO feature selection retained its top position. It maintained its superiority over the harmonic search feature selection method, reflected in the classical improvements of the classifier’s performance.
Table 11 explores the performance analysis of the classifiers for raw microarray gene data with the CNN method. As displayed in Table 11, the SVM (RBF) classifier achieved the highest Accuracy of 90.607%, a Precision of 90.797%, a Recall of 98.67%, and an F1 Score of 94.56%, with a low Error Rate of 9.39%. The SVM (RBF) classifier also maintained a moderate MCC of 0.6329 and a Kappa value of 0.6031. For the CNN method, all seven classifiers maintained an Accuracy above 83%, a Precision of at least 88.554%, a Recall of at least 98%, and an F1 Score above 90%. The Softmax Discriminant classifier attained the minimum MCC and Kappa values of 0.5016 and 0.4616, respectively.
Table 12 exhibits the performance analysis of the classifiers for the STFT dimensionality reduction method combined with the CNN method. Table 12 shows that the Softmax Discriminant Classifier (SDC) achieved a high Accuracy of 91.66%, a Precision of 93.54%, a Recall of 96.67%, and an F1 Score of 95.08%, with a low Error Rate of 8.33%. The SDC also maintained a high MCC of 0.6825 and a Kappa value of 0.6785. All seven classifiers maintained an Accuracy above 86%, a Precision of at least 87%, a Recall of at least 96%, and an F1 Score above 92%. The Naïve Bayesian classifier achieved an Accuracy of 91.66%, a Precision of 90.909%, and a Recall of 100%, and attained MCC and Kappa values of 0.6742 and 0.625, respectively. Table 12 also shows that STFT input to the CNN method enhances the classifiers' performance metrics compared with raw input to the CNN method, as discussed in the paper.
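The arrangement behind Tables 11 and 12, where convolutional features are handed to classical classifiers in place of a softmax layer, can be sketched in miniature. Everything below (the untrained random 1-D kernels, the nearest-centroid stand-in classifier) is hypothetical and for illustration only, not the paper's actual architecture:

```python
import numpy as np

def conv_features(x, kernels):
    """Toy 1-D convolution + ReLU + global average pooling: a stand-in
    for a CNN front end whose softmax output layer has been removed."""
    feats = []
    for k in kernels:
        conv = np.convolve(x, k, mode="valid")     # 1-D convolution
        feats.append(np.maximum(conv, 0).mean())   # ReLU, then global average pool
    return np.array(feats)

def nearest_centroid_fit(F, y):
    """Fit a nearest-centroid classifier on the pooled convolutional features."""
    return {c: F[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, f):
    """Assign a feature vector to the class with the closest centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(f - centroids[c]))
```

In the paper's pipeline, the extracted features would instead feed the seven classifiers under study (NLR, GMM, SDC, NBC, and the SVM variants); the nearest-centroid step here is only the simplest possible replacement for the softmax layer.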
Figure 11 depicts the performance of the classifiers in terms of the MCC and Kappa parameters for raw and STFT inputs to the CNN method. As indicated in Figure 11, the SDC for STFT inputs reached high MCC and Kappa values of 0.6825 and 0.6785, respectively, and the average MCC and Kappa values across the classifiers for STFT inputs are 0.5236 and 0.485. For raw inputs to the CNN method, the SVM (RBF) classifier attained good MCC and Kappa values of 0.6329 and 0.6031, with averages across the classifiers of 0.4794 and 0.4226, respectively. Feeding STFT scalograms to the CNN thus raises the average MCC and Kappa values across the classifiers.
Figure 12 shows the performance of the classifiers in terms of Accuracy, F1 Score, and Error Rate for raw and STFT inputs to the CNN method. As depicted in Figure 12, the SVM (RBF) classifier for raw inputs reached a high Accuracy of 90.6%, an F1 Score of 94.56%, and a low Error Rate of 9.39%. For STFT inputs, the SDC attained a high Accuracy of 91.66%, an F1 Score of 95.08%, and a low Error Rate of 8.33%. Using STFT inputs in the CNN method thus improves the Accuracy, F1 Score, and Error Rate values across the classifiers.
Figure 13 displays the performance of the classifiers in terms of the deviation of the MCC and Kappa parameters from their mean values. MCC and Kappa are the benchmark parameters that indicate the classifiers' outcomes for different inputs. In this research, several categories of inputs are provided to the classifiers: raw microarray genes, STFT, STFT with PSO and Harmonic Search feature selection, STFT with the scalogram method, and finally the CNN methods. The classifiers' performance is observed through the MCC and Kappa values attained for these inputs; the average MCC and Kappa values across the classifiers are 0.4681 and 0.44518, respectively. A methodology is therefore devised that measures the deviation of the MCC and Kappa values from their respective means to characterize classifier performance. As observed in Figure 13, MCC and Kappa values placed in the third quadrant of the graph correspond to classifiers with lower performance metrics, while values in the first quadrant correspond to classifiers whose MCC and Kappa exceed the average. This indicates that classifier performance improves for the STFT inputs combined with the PSO feature selection method. Figure 13 also shows a fitted linear curve with the equation y = 1.062x + 1 × 10−7 and R2 = 0.991.
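The quadrant analysis of Figure 13 amounts to centring each metric on its mean and fitting a least-squares line through the deviations. A minimal sketch (illustrative data, not the paper's values) is:

```python
import numpy as np

def quadrant_analysis(mcc, kappa):
    """Centre MCC/Kappa on their means and report each classifier's quadrant.
    Quadrant 1 (both above average) marks the stronger input/classifier pairs;
    quadrant 3 (both below average) marks the weaker ones."""
    mcc, kappa = np.asarray(mcc, float), np.asarray(kappa, float)
    dm, dk = mcc - mcc.mean(), kappa - kappa.mean()
    quad = np.where((dm >= 0) & (dk >= 0), 1,
           np.where((dm < 0) & (dk >= 0), 2,
           np.where((dm < 0) & (dk < 0), 3, 4)))
    # Least-squares line through the deviations, as in Figure 13's curve fit
    slope, intercept = np.polyfit(dm, dk, 1)
    return dm, dk, quad, slope, intercept
```

Because MCC and Kappa track each other closely, the fitted slope lands near 1 (the paper reports y = 1.062x + 1 × 10−7), and the intercept is essentially zero since both deviation series are mean-centred.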

Computational Complexity

Computational complexity analysis measures the efficiency of the machine learning methods in terms of time and space. In this research, Big O notation represents the computational complexity of the dimensionality reduction, feature selection, and classification methodologies. A complexity of O(n) grows linearly with n, the number of features, while O(log2 n) means the cost grows with the base-2 logarithm of n. The classifiers considered in this research are integrated with dimensionality reduction techniques, feature selection techniques, or a combination of both, so the overall computational complexity is a combination of these hybrid methods. The choice of methodology should therefore consider the trade-off between computational complexity and the desired level of classification performance. Table 13 shows the computational complexity of the classifiers for the STFT dimensionality reduction method with and without feature selection, along with the CNN models.
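To make the notation concrete, the sketch below evaluates a few of the complexity classes from Table 13 as growth shapes, showing how the estimated cost multiplies when the feature count doubles (illustrative only: Big O hides constants, so these are not literal runtimes, and the functional forms are copied from the table as printed):

```python
import math

# Growth shapes for a few complexity classes listed in Table 13.
COMPLEXITY = {
    "O(n log2 n)":     lambda n: n * math.log2(n),
    "O(2n^4 log2 n)":  lambda n: 2 * n**4 * math.log2(n),
    "O(2n^5 log4 n)":  lambda n: 2 * n**5 * math.log(n, 4),
    "O(2n^7 log2 n)":  lambda n: 2 * n**7 * math.log2(n),
}

def growth_factor(expr, n):
    """Factor by which the estimated cost grows when n doubles."""
    f = COMPLEXITY[expr]
    return f(2 * n) / f(n)
```

Doubling n roughly doubles an O(n log2 n) cost, but multiplies an O(2n^7 log2 n) cost by more than a hundredfold, which is why the high-complexity classifier configurations in Table 13 scale poorly with feature count.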
As noted from Table 13, the SVM (Linear), NBC, Nonlinear Regression, Softmax Discriminant, and SVM (RBF) classifiers have low computational complexity for the STFT DR method without feature selection. The GMM classifier attains a moderate complexity of O(2n^4 log2 n) without feature selection and achieves a good accuracy of 80.66%. For the PSO and Harmonic Search feature selection methods, the SVM (RBF) classifier, with a computational complexity of O(2n^5 log4 n), attains high accuracies of 94.47% and 90.05%, respectively. The GMM and SVM (Polynomial) classifiers carry a high computational complexity of O(2n^7 log2 n) for the PSO and Harmonic Search feature selection methods yet perform poorly in terms of accuracy, a shortfall attributable to outliers in the STFT features. To enhance the performance of the classifiers, the STFT scalogram features and CNN methods are included in this research; the SVM (RBF) classifier attained good performance with moderate complexities of O(2n^3 log4 n) and O(2n^3 log8 n) for these methods, respectively. Next, through Table 14, we compare the accuracy of our research work with various methodologies across diverse microarray gene datasets employed for Adenocarcinoma and Mesothelioma lung cancer.
Table 15 compares previous works on lung and other cancer classification from microarray gene datasets, with reference to Fathi et al. [43] and Karmokar et al. [48], in terms of Accuracy, Precision, Recall, and F1 Score. Fathi et al. [43] utilized the PCC–DTCV (Pearson correlation–Decision Tree Cross Validation) model for lung cancer classification and reported accuracies ranging from 93% to 95%. Their model tunes the maximum depth of the decision tree classifier using grid search and employs Pearson correlation for feature selection; the grid search model, used for dimensionality reduction, selects the most informative genes at the cost of higher computational complexity. In our research, we utilize the gene expression intensity values directly, incorporating dimensionality reduction and classification. Unlike Fathi et al. [43], our methodology does not select prominent genes, which would increase the computational overhead; it therefore involves less computational complexity, at a trade-off in accuracy. Comparisons with other cancer types, namely prostate, breast, and leukemia, are also recorded.
Overall, this research evaluates and explores machine learning and CNN classifiers for detecting lung cancer from microarray gene expressions. The investigation shows that although CNNs are good at learning features, their black-box nature prevents them from conveying the physical meaning behind forecasting and categorizing gene expression patterns from the learned features. Further, the LH2 dataset used in this research is imbalanced, and during the analysis the CNNs were often prone to overfitting and underfitting. Measures such as the feature extraction step and the replacement of the softmax layer with suitable classifiers were adopted to rectify these issues; the trade-off is the increased computational complexity of adding multiple stages to the CNN classification methodology.
Our research shows that microarray gene expression data are instrumental for disease classification and contribute significantly to understanding gene expression patterns and their association with various biological processes and diseases. However, microarrays have limitations compared with approaches such as mRNA-Seq. Microarray gene expressions are measured over limited areas with specific genes or tissue regions, so the data generated may not capture the entire transcriptome or provide comprehensive information about gene expression. Microarray experiments may also suffer from background noise and cross-hybridization due to non-specific binding, resulting in decreased sensitivity and specificity, and they are expensive, with limited throughput compared to newer high-throughput sequencing technologies such as mRNA-Seq.

6. Conclusions

The generation of malignant lung tissue is often inconspicuous and does not show symptoms in the early stages. Early detection of lung cancer is therefore significant, as it can lead to improved treatment outcomes and increased chances of survival. Microarray gene expression analysis of lung cancer data effectively detects lung cancer at an early stage, allowing prompt intervention and potentially curative treatment options. This paper therefore uses microarray gene expression combined with machine and deep learning techniques to improve lung cancer data classification. The research employs two levels of pre-processing of the lung cancer data for better classification results: dimensionality reduction and feature selection. The Short-Term Fourier Transform (STFT) dimensionality reduction with PSO feature selection and the SVM (RBF) classifier produced the highest accuracy of 94.47%. Future work will focus on improving the classification accuracy of the CNN classifiers by investigating nonlinear transforms such as wavelets and nature-inspired feature extraction and selection methods incorporated into the stages prior to CNN classification.

Author Contributions

Conceptualization, K.M.S., H.R. and A.R.N.; Methodology, K.M.S., H.R. and A.R.N.; Software, K.M.S. and H.R.; Validation, K.M.S. and H.R.; Formal analysis, H.R. and A.R.N.; Investigation, H.R.; Resources, A.R.N.; Data curation, A.R.N.; Writing—original draft, K.M.S.; Writing—review and editing, K.M.S.; Visualization, H.R. and A.R.N.; Supervision, H.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dubin, S.; Griffin, D. Lung cancer in non-smokers. Mo. Med. 2020, 117, 375. [Google Scholar] [PubMed]
  2. Selman, M.; Pardo, A.; King, T.E., Jr. Hypersensitivity pneumonitis: Insights in diagnosis and pathobiology. Am. J. Respir. Crit. Care Med. 2012, 186, 314–324. [Google Scholar] [CrossRef] [Green Version]
  3. Infante, M.; Lutman, F.R.; Cavuto, S.; Brambilla, G.; Chiesa, G.; Passera, E.; Angeli, E.; Chiarenza, M.; Aranzulla, G.; Cariboni, U.; et al. Lung cancer screening with spiral CT: Baseline results of the randomized DANTE trial. Lung Cancer 2008, 59, 355–363. [Google Scholar] [CrossRef]
  4. Thunnissen, F.B.J.M. Sputum examination for early detection of lung cancer. J. Clin. Pathol. 2003, 56, 805–810. [Google Scholar] [CrossRef] [Green Version]
  5. Andolfi, M.; Potenza, R.; Capozzi, R.; Liparulo, V.; Puma, F.; Yasufuku, K. The role of bronchoscopy in the diagnosis of early lung cancer: A review. J. Thorac. Dis. 2016, 8, 3329. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Zhu, C.-Q.; Pintilie, M.; John, T.; Strumpf, D.; Shepherd, F.A.; Der, S.D.; Jurisica, I.; Tsao, M.-S. Understanding prognostic gene expression signatures in lung cancer. Clin. Lung Cancer 2009, 10, 331–340. [Google Scholar] [CrossRef] [PubMed]
  7. Churchill, G.A. Fundamentals of experimental design for cDNA microarrays. Nat. Genet. 2002, 32, 490–495. [Google Scholar] [CrossRef]
  8. Ross, D.T.; Scherf, U.; Eisen, M.B.; Perou, C.M.; Rees, C.; Spellman, P.; Iyer, V.; Jeffrey, S.S.; Van de Rijn, M.; Waltham, M.; et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. 2000, 24, 227–235. [Google Scholar] [CrossRef] [Green Version]
  9. Reis-Filho, J.S.; Pusztai, L. Gene expression profiling in breast cancer: Classification, prognostication, and prediction. Lancet 2011, 378, 1812–1823. [Google Scholar] [CrossRef]
  10. Lapointe, J.; Li, C.; Higgins, J.P.; van de Rijn, M.; Bair, E.; Montgomery, K.; Ferrari, M.; Egevad, L.; Rayford, W.; Bergerheim, U.; et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc. Natl. Acad. Sci. USA 2004, 101, 811–816. [Google Scholar] [CrossRef]
  11. Dwivedi, A.K. Artificial neural network model for effective cancer classification using microarray gene expression data. Neural Comput. Appl. 2018, 29, 1545–1554. [Google Scholar] [CrossRef]
  12. Rody, A.; Holtrich, U.; Pusztai, L.; Liedtke, C.; Gaetje, R.; Ruckhaeberle, E.; Solbach, C.; Hanker, L.; Ahr, A.; Metzler, D.; et al. T-cell metagene predicts a favorable prognosis in estrogen receptor-negative and HER2-positive breast cancers. Breast Cancer Res. 2009, 11, R15. [Google Scholar] [CrossRef] [PubMed]
  13. Sanchez-Palencia, A.; Gomez-Morales, M.; Gomez-Capilla, J.A.; Pedraza, V.; Boyero, L.; Rosell, R.; Fárez-Vidal, M.E. Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer. Int. J. Cancer 2011, 129, 355–364. [Google Scholar] [CrossRef]
  14. Kerkentzes, K.; Lagani, V.; Tsamardinos, I.; Vyberg, M.; Røe, O.D. Hidden treasures in “ancient” microarrays: Gene-expression portrays biology and potential resistance pathways of major lung cancer subtypes and normal tissue. Front. Oncol. 2014, 4, 251. [Google Scholar] [CrossRef] [PubMed]
  15. Wagner, J.C.; Berry, G.; Skidmore, J.W.; Timbrell, V. The Effects of the Inhalation of Asbestos in Rats. Br. J. Cancer 1974, 29, 252–269. [Google Scholar] [CrossRef] [Green Version]
  16. Weigelt, B.; Baehner, F.L.; Reis-Filho, J.S. The contribution of gene expression profiling to breast cancer classification, prognostication and prediction: A retrospective of the last decade. J. Pathol. J. Pathol. Soc. Great Br. Irel. 2010, 220, 263–280. [Google Scholar] [CrossRef] [PubMed]
  17. Hira, Z.M.; Gillies, D.F. A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinform. 2015, 2015, 198363. [Google Scholar] [CrossRef]
  18. Qi, Y.; Yang, L.; Liu, B.; Liu, L.; Liu, Y.; Zheng, Q.; Liu, D.; Luo, J. Highly accurate diagnosis of lung adenocarcinoma and squamous cell carcinoma tissues by deep learning. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2022, 265, 120400. [Google Scholar] [CrossRef]
  19. Abdelwahab, O.; Awad, N.; Elserafy, M.; Badr, E. A feature selection-based framework to identify biomarkers for cancer diagnosis: A focus on lung adenocarcinoma. PLoS ONE 2022, 17, e0269126. [Google Scholar] [CrossRef]
  20. Shukla, A.K.; Tripathi, D.; Reddy, B.R.; Chandramohan, D. A study on metaheuristics approaches for gene selection in microarray data: Algorithms, applications and open challenges. Evol. Intell. 2020, 13, 309–329. [Google Scholar] [CrossRef]
  21. Gordon, G.J.; Jensen, R.V.; Hsiao, L.-L.; Gullans, S.R.; Blumenstock, J.E.; Ramaswamy, S.; Richards, W.G.; Sugarbaker, D.J.; Bueno, R. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 2002, 62, 4963–4967. [Google Scholar]
  22. Gupta, V.; Mittal, M. QRS Complex Detection Using STFT, Chaos Analysis, and PCA in Standard and Real-Time ECG Databases. J. Inst. Eng. India Ser. B 2019, 100, 489–497. [Google Scholar] [CrossRef]
  23. Martín, C.A.; Pinillo, R.I.; Barasinski, A.; Chinesta, F. Code2vect: An efficient heterogenous data classifier and nonlinear regression technique. Comptes Rendus Mécanique 2019, 347, 754–761. [Google Scholar] [CrossRef]
  24. Dalatu, P.I.; Fitrianto, A.; Mustapha, A. A Comparative Study of Linear and Nonlinear Regression Models for Outlier Detection. In Recent Advances on Soft Computing and Data Mining, Proceedings of the SCDM 2016, Bandung, Indonesia, 18–20 August 2016; Herawan, T., Ghazali, R., Nawi, N.M., Deris, M.M., Eds.; Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2017; Volume 549. [Google Scholar] [CrossRef]
  25. Ficklin, S.P.; Dunwoodie, L.J.; Poehlman, W.L.; Watson, C.; Roche, K.E.; Feltus, F.A. Discovering condition-specific gene co-expression patterns using gaussian mixture models: A cancer case study. Sci. Rep. 2017, 7, 8617. [Google Scholar] [CrossRef]
  26. Shah, S.H.; Iqbal, M.J.; Ahmad, I.; Khan, S.; Rodrigues, J.J.P.C. Optimized gene selection and classification of cancer from microarray gene expression data using deep learning. Neural Comput. Appl. 2020, 1–12. [Google Scholar] [CrossRef]
  27. Ahmed, S.; Shahjaman, M.; Rana, M.; Mollah, N.H. Robustification of Naïve Bayes classifier and its application for microarray gene expression data analysis. BioMed Res. Int. 2017, 2017, 3020627. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  28. Vanitha, C.D.A.; Devaraj, D.; Venkatesulu, M. Gene expression data classification using support vector machine and mutual information-based gene selection. Procedia Comput. Sci. 2015, 47, 13–21. [Google Scholar] [CrossRef] [Green Version]
  29. Dai, W.; Chuang, Y.-Y.; Lu, C.-J. A clustering-based sales forecasting scheme using support vector regression for computer server. Procedia Manuf. 2015, 2, 82–86. [Google Scholar] [CrossRef] [Green Version]
  30. Ge, F.; Ju, Y.; Qi, Z.; Lin, Y. Parameter Estimation of a Gaussian Mixture Model for Wind Power Forecast Error by Riemann L-BFGS Optimization. IEEE Access 2018, 6, 38892–38899. [Google Scholar] [CrossRef]
  31. Hastie, T.; Tibshirani, R.; Buja, A. Flexible discriminant analysis by optimal scoring. J. Am. Stat. Assoc. 1994, 89, 1255–1270. [Google Scholar] [CrossRef]
  32. Berrar, D. Bayes’ theorem and naive Bayes classifier. Encycl. Bioinform. Comput. Biol. ABC Bioinform. 2018, 403, 412. [Google Scholar]
  33. Zhang, R.; Wang, W. Facilitating the applications of support vector machine by using a new kernel. Expert Syst. Appl. 2011, 38, 14225–14230. [Google Scholar] [CrossRef]
  34. Fushiki, T. Estimation of prediction error by using K-fold cross-validation. Stat. Comput. 2011, 21, 137–146. [Google Scholar] [CrossRef]
  35. Wilson, S.W. Classifier fitness based on accuracy. Evol. Comput. 1995, 3, 149–175. [Google Scholar] [CrossRef]
  36. Koizumi, Y.; Saito, S.; Uematsu, H.; Harada, N. Optimizing acoustic feature extractor for anomalous sound detection based on Neyman-Pearson lemma. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017. [Google Scholar]
  37. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef] [Green Version]
  38. Duda, R.O.; Fossum, H. Pattern classification by iteratively determined linear and piecewise linear discriminant functions. IEEE Trans. Electron. Comput. 1966, 2, 220–232. [Google Scholar] [CrossRef]
  39. Cohen, J. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 1968, 70, 213. [Google Scholar] [CrossRef] [PubMed]
  40. Gupta, S.; Gupta, M.K.; Shabaz, M.; Sharma, A. Deep learning techniques for cancer classification using microarray gene expression data. Front. Physiol. 2022, 13, 952709. [Google Scholar] [CrossRef]
  41. Ke, L.; Li, M.; Wang, L.; Deng, S.; Ye, J.; Yu, X. Improved swarm-optimization-based filter-wrapper gene selection from microarray data for gene expression tumor classification. Pattern Anal. Appl. 2022, 26, 455–472. [Google Scholar] [CrossRef]
  42. Morani, F.; Bisceglia, L.; Rosini, G.; Mutti, L.; Melaiu, O.; Landi, S.; Gemignani, F. Identification of overexpressed genes in malignant pleural mesothelioma. Int. J. Mol. Sci. 2021, 22, 2738. [Google Scholar] [CrossRef] [PubMed]
  43. Fathi, H.; AlSalman, H.; Gumaei, A.; Manhrawy, I.I.M.; Hussien, A.G.; El-Kafrawy, P. An efficient cancer classification model using microarray and highdimensional data. Comput. Intell. Neurosci. 2021, 2021, 7231126. [Google Scholar] [CrossRef]
  44. Xia, D.; Leon, A.J.; Cabanero, M.; Pugh, T.J.; Tsao, M.S.; Rath, P.; Siu, L.L.-Y.; Yu, C.; Bedard, P.L.; Shepherd, F.A.; et al. Minimalist approaches to cancer tissue-of-origin classification by DNA methylation. Mod. Pathol. 2020, 33, 1874–1888. [Google Scholar] [CrossRef] [PubMed]
  45. Azzawi, H.; Hou, J.; Xiang, Y.; Alanni, R. Lung cancer prediction from microarray data by gene expression programming. IET Syst. Biol. 2016, 10, 168–178. [Google Scholar] [CrossRef]
  46. Guan, P.; Huang, D.; He, M.; Zhou, B. Lung cancer gene expression database analysis incorporating prior knowledge with support vector machine-based classification method. J. Exp. Clin. Cancer Res. 2009, 28, 103. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  47. Mramor, M.; Leban, G.; Demšar, J.; Zupan, B. Visualization-based cancer microarray data classification analysis. Bioinformatics 2007, 23, 2147–2154. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  48. Karmokar, J.; Islam, M.A.; Uddin, M.; Hassan, M.R.; Yousuf, M.S.I. An assessment of meteorological parameters effects on COVID-19 pandemic in Bangladesh using machine learning models. Environ. Sci. Pollut. Res. 2022, 29, 67103–67114. [Google Scholar] [CrossRef] [PubMed]
  49. Singh, D.; Febbo, P.G.; Ross, K.; Jackson, D.G.; Manola, J.; Ladd, C.; Tamayo, P.; Renshaw, A.A.; D’Amico, A.V.; Richie, J.P.; et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002, 1, 203–209. [Google Scholar] [CrossRef] [Green Version]
  50. Chin, K.; DeVries, S.; Fridlyand, J.; Spellman, P.T.; Roydasgupta, R.; Kuo, W.-L.; Lapuk, A.; Neve, R.M.; Qian, Z.; Ryder, T.; et al. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell. 2006, 10, 529–541. [Google Scholar] [CrossRef] [Green Version]
  51. Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286, 531–537. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Methodology for Dimensionality Reduction and Classification of microarray gene expression data for lung cancer.
Figure 2. Scatter plot for STFT based Dimensionality Reduction Method in Meso and Adeno Carcinoma Cancer Classes.
Figure 3. Normal Probability plot for STFT Dimensionality Reduction Method with PSO Feature Selection in Adeno Carcinoma Cancer Classes.
Figure 4. Normal probability plot for STFT Dimensionality Reduction Method with PSO Feature Selection in Meso Carcinoma Cancer Classes.
Figure 5. Histogram for STFT Dimensionality Reduction Method with Harmonic Search Feature Selection in Adeno Carcinoma Cancer Classes.
Figure 6. Histogram for STFT Dimensionality Reduction Method with Harmonic Search Feature Selection in Meso Carcinoma Cancer Classes.
Figure 7. CNN Classification Methodology.
Figure 8. Training and testing progress of the first two epochs of SVM (RBF) Classifier in CNN Method with Raw Data.
Figure 9. Performance of Classifiers Without and With Feature Selection Methods in terms of MCC and Kappa Parameters.
Figure 10. Performance of Classifiers Without and With Feature Selection Methods in terms of Accuracy, F1 Score and Error Rate Parameters.
Figure 11. Performance of Classifiers in terms of MCC and Kappa Parameters for Raw and STFT Inputs for CNN Methods.
Figure 12. Performance of Classifiers in terms of Accuracy, F1 Score and Error Rate Parameters for Raw and STFT Inputs in CNN Methods.
Figure 13. Performance of Classifiers in terms of Deviation of MCC and Kappa Parameters with mean Values.
Table 1. Average Statistical Features for STFT Dimensionally Reduced Adeno Carcinoma and Meso Cancer Cases.

| Sl. No | Statistical Feature | Adenocarcinoma | Meso Cancer |
|---|---|---|---|
| 1 | Mean | 7527.993 | 9410.937 |
| 2 | Variance | 7,274,816 | 8,102,763 |
| 3 | Skewness | 14.7733 | 14.68975 |
| 4 | Kurtosis | 272.0948 | 270.2702 |
| 5 | Pearson Correlation Coefficient (PCC) | 0.993359 | 0.993084 |
| 6 | f-test | 0.011166 | 0.390122 |
| 7 | t-test | 0.264175 | 0.000004 |
| 8 | p-value < 0.01 | 0.4958 | 0.98 |
| 9 | Permutation Entropy | 0.6931 | 0.6931 |
| 10 | Sample Entropy | 11.0007 | 11.0007 |
| 11 | Canonical Correlation Analysis (CCA) | 0.367333 (single value across both classes) | |
Table 2. Analysis of Friedman Test in Feature Selection Methods on STFT Data.

| Sl. No | Feature Selection Method | Parameter | Adeno Carcinoma Cancer | Meso Cancer |
|---|---|---|---|---|
| 1 | PSO | X2r statistic | 2.326 | 1.376 |
| | | p-value | 0.67604 | 0.84836 |
| 2 | Harmonic Search | X2r statistic | 6.39 | 1.18 |
| | | p-value | 0.17185 | 0.88138 |
Table 3. Confusion matrix for Lung Cancer Detection.

| Truth of Clinical Situation | Predicted: Adeno Carcinoma | Predicted: Meso Cancer |
|---|---|---|
| Actual: Adeno Carcinoma | TP | FN |
| Actual: Meso Cancer | FP | TN |
Table 4. Training and Testing MSE Analysis of Classifiers for STFT Dimensionality Reduction Technique without and with PSO and Harmonic Search Feature Selection.

| Classifiers | Without FS: Training MSE | Without FS: Testing MSE | PSO FS: Training MSE | PSO FS: Testing MSE | Harmonic Search FS: Training MSE | Harmonic Search FS: Testing MSE |
|---|---|---|---|---|---|---|
| Nonlinear Regression | 1.68 × 10−7 | 5.63 × 10−6 | 4.9 × 10−7 | 1.32 × 10−5 | 6.25 × 10−6 | 9.6 × 10−6 |
| GMM | 3.6 × 10−8 | 7.21 × 10−7 | 6.76 × 10−8 | 4.84 × 10−6 | 9 × 10−7 | 1.96 × 10−6 |
| SDC | 3.25 × 10−7 | 8.84 × 10−5 | 2.89 × 10−7 | 3.61 × 10−6 | 2.35 × 10−7 | 1.16 × 10−6 |
| Naïve Bayesian | 5.48 × 10−7 | 6.24 × 10−5 | 1.09 × 10−7 | 1.52 × 10−5 | 1.32 × 10−7 | 6.76 × 10−6 |
| SVM (Linear) | 2.02 × 10−7 | 2.51 × 10−5 | 1.76 × 10−7 | 1.1 × 10−6 | 1.22 × 10−7 | 8.1 × 10−5 |
| SVM (Poly) | 9 × 10−7 | 3.48 × 10−5 | 4.9 × 10−8 | 3.03 × 10−5 | 6.4 × 10−7 | 7.06 × 10−5 |
| SVM (RBF) | 4.84 × 10−7 | 6.56 × 10−5 | 2.25 × 10−9 | 3.6 × 10−8 | 2.56 × 10−8 | 1.96 × 10−7 |
Table 5. Selection of Classifier Parameters.

| Classifiers | Description |
|---|---|
| NLR | Class target T1 = 0.85 and T2 = 0.65, k1, k2, and k3 from Equation (4), Convergence Criteria: MSE |
| GMM | Mean, Covariance of the input samples, and tuning parameters with test point likelihood probability 0.1, cluster probability of 0.5, convergence rate of 0.6, Convergence Criteria: MSE |
| SDC | γ = 0.5, along with mean of each class target values as 0.65 and 0.85, Class Weights (w): 0.4, Bias (b): 0.05, Convergence Criteria: MSE |
| NBC | Smoothing parameter (alpha): 0.06, Prior Probabilities: 0.15, Convergence Criteria: MSE |
| SVM (Linear) | α (Regularization Parameter): 0.85, w: 0.4, b: 0.01, Convergence Criteria: MSE |
| SVM (Polynomial) | α: 0.76, Coefficient of the kernel function (γ): 10, w: 0.5, b: 0.01, Convergence Criteria: MSE |
| SVM (RBF) | α: 1, Kernel width parameter (σ): 100, w: 0.86, b: 0.01, Convergence Criteria: MSE |
Table 6. Training and Testing Parameters of CNN Methodology for Raw Data and STFT Dimensionally reduced inputs.

| Parameter | Value |
|---|---|
| Number of cross-validation folds | 10 |
| Learning rate | 0.01 |
| Mini-batch size | 100 |
| Maximum number of epochs | 10 |
| Number of convolutional layers | 3 |
| Number of fully connected layers | 4 |
| Number of filters in first convolutional layer | 16 |
| Size of filter in first convolutional layer | 3 × 3 |
| Number of filters in second convolutional layer | 32 |
| Size of filter in second convolutional layer | 3 × 3 |
| Number of filters in third convolutional layer | 64 |
| Size of filter in third convolutional layer | 3 × 3 |
| Loss function | Cross-entropy |
Table 7. Training and Testing Accuracy Analysis of various Classifiers in CNN Method with Raw Data and STFT features.

| Classifiers | Raw Data: Training Accuracy | Raw Data: Testing Accuracy | STFT Features: Training Accuracy | STFT Features: Testing Accuracy |
|---|---|---|---|---|
| Nonlinear Regression | 87.235 ± 1.64 | 83.42541 ± 2.41 | 89.327 ± 1.87 | 86.11111 ± 2.18 |
| GMM | 92.752 ± 1.76 | 89.50276 ± 2.74 | 91.458 ± 2.01 | 88.88889 ± 1.96 |
| SDC | 90.516 ± 2.42 | 87.8453 ± 2.25 | 94.372 ± 1.46 | 91.66667 ± 2.02 |
| Naïve Bayesian | 90.023 ± 1.53 | 86.74033 ± 2.68 | 94.4527 ± 1.38 | 91.66667 ± 2.45 |
| SVM (Linear) | 91.358 ± 1.72 | 87.8453 ± 1.74 | 91.753 ± 1.22 | 88.88889 ± 2.39 |
| SVM (Poly) | 90.145 ± 1.83 | 88.39779 ± 1.52 | 91.991 ± 1.94 | 88.88889 ± 1.51 |
| SVM (RBF) | 93.631 ± 1.41 | 90.60773 ± 1.32 | 92.047 ± 1.08 | 88.88889 ± 1.16 |
Table 8. Performance Analysis of Classifiers for STFT Dimensionality Reduction Technique without Feature Selection.

| Classifier | Accuracy | Precision | Recall | F1 Score | MCC | Error Rate | Kappa |
|---|---|---|---|---|---|---|---|
| Nonlinear Regression | 65.19337 | 93.069 | 62.666 | 74.9004 | 0.304098 | 34.80663 | 0.246382 |
| GMM | 80.66298 | 92.592 | 83.33 | 87.7193 | 0.441969 | 19.33702 | 0.428507 |
| SDC | 59.11602 | 90.42 | 56.666 | 69.67213 | 0.208379 | 40.88398 | 0.160987 |
| Naïve Bayesian | 60.77348 | 88.349 | 60.666 | 71.93676 | 0.167045 | 39.22652 | 0.137111 |
| SVM (Linear) | 72.37569 | 93.103 | 72.00 | 81.20301 | 0.362762 | 27.62431 | 0.321894 |
| SVM (Poly) | 71.8232 | 95.412 | 69.33 | 80.30888 | 0.409538 | 28.1768 | 0.348967 |
| SVM (RBF) | 64.64088 | 95.74 | 60.00 | 73.77049 | 0.355135 | 35.35912 | 0.274367 |
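Tables 8 through 12 report the same seven metrics, all of which derive from the binary confusion matrix. A minimal sketch (accuracy, precision, recall, and F1 as percentages, matching the tables):

```python
import math

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 (as %), MCC, error rate (%), and
    Cohen's kappa from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Cohen's kappa: observed agreement vs. chance agreement.
    pe = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (acc - pe) / (1 - pe)
    return {"accuracy": 100 * acc, "precision": 100 * prec,
            "recall": 100 * rec, "f1": 100 * f1, "mcc": mcc,
            "error_rate": 100 * (1 - acc), "kappa": kappa}
```

Note that accuracy and error rate always sum to 100%, which is why those two columns are complementary in every row of the tables.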
Table 9. Performance Analysis of Classifiers for STFT Dimensionality Reduction Technique with PSO Feature Selection.

| Classifier | Accuracy | Precision | Recall | F1 Score | MCC | Error Rate | Kappa |
|---|---|---|---|---|---|---|---|
| Nonlinear Regression | 56.90608 | 88.297 | 55.34 | 68.03279 | 0.149676 | 43.09392 | 0.115635 |
| GMM | 86.18785 | 96.296 | 86.66 | 91.22807 | 0.610382 | 13.81215 | 0.591791 |
| SDC | 81.76796 | 89.795 | 88.00 | 88.88889 | 0.382089 | 18.23204 | 0.381485 |
| Naïve Bayesian | 80.1105 | 96.29 | 86.67 | 86.95652 | 0.496771 | 19.8895 | 0.463968 |
| SVM (Linear) | 89.50276 | 95.172 | 92.00 | 93.55932 | 0.655197 | 10.49724 | 0.652451 |
| SVM (Poly) | 69.61326 | 90.59 | 70.67 | 79.40075 | 0.277252 | 30.38674 | 0.247373 |
| SVM (RBF) | 94.47514 | 97.94 | 95.33 | 96.62162 | 0.817097 | 5.524862 | 0.814853 |
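The PSO feature selection behind the best result in Table 9 can be sketched as a binary particle swarm with a sigmoid transfer function. This is an illustrative sketch only: the paper's actual swarm size, iteration count, and inertia/acceleration coefficients are not reproduced here, and the fitness function would in practice wrap a classifier's cross-validated accuracy.

```python
import numpy as np

def binary_pso(fitness, n_features, n_particles=20, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal binary PSO for feature selection (illustrative settings).
    `fitness` scores a 0/1 mask of selected features; higher is better."""
    rng = np.random.default_rng(seed)
    X = (rng.random((n_particles, n_features)) < 0.5).astype(float)  # masks
    V = rng.normal(0.0, 1.0, (n_particles, n_features))              # velocities
    pbest = X.copy()
    pbest_f = np.array([fitness(x) for x in X])
    gbest = pbest[pbest_f.argmax()].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random(V.shape), rng.random(V.shape)
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)
        # Sigmoid transfer maps each velocity to a bit-flip probability.
        X = (rng.random(V.shape) < 1.0 / (1.0 + np.exp(-V))).astype(float)
        f = np.array([fitness(x) for x in X])
        better = f > pbest_f
        pbest[better], pbest_f[better] = X[better], f[better]
        gbest = pbest[pbest_f.argmax()].copy()
    return gbest
```

A hypothetical fitness for demonstration might trade classifier accuracy against mask size, e.g. `lambda m: cv_accuracy(m) - 0.01 * m.sum()`, where `cv_accuracy` is a stand-in for a cross-validated classifier score.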
Table 10. Performance Analysis of Classifiers for STFT Dimensionality Reduction Technique with Harmonic Search Feature Selection.

| Classifier | Accuracy | Precision | Recall | F1 Score | MCC | Error Rate | Kappa |
|---|---|---|---|---|---|---|---|
| Nonlinear Regression | 61.32597 | 94.44 | 56.66 | 70.83333 | 0.305452 | 38.67403 | 0.229319 |
| GMM | 88.95028 | 96.428 | 90.00 | 93.10345 | 0.664882 | 11.04972 | 0.654909 |
| SDC | 80.1105 | 93.1818 | 82.00 | 87.23404 | 0.44911 | 19.8895 | 0.430519 |
| Naïve Bayesian | 80.1105 | 90.14 | 85.33 | 87.67123 | 0.368107 | 19.8895 | 0.364417 |
| SVM (Linear) | 61.32597 | 93.47 | 57.34 | 71.07438 | 0.286204 | 38.67403 | 0.217998 |
| SVM (Poly) | 59.11602 | 88.00 | 58.67 | 70.4 | 0.151209 | 40.88398 | 0.121705 |
| SVM (RBF) | 90.05525 | 97.82 | 90.00 | 93.75 | 0.711034 | 9.944751 | 0.696309 |
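The Harmonic Search selection of Table 10 follows a different metaheuristic: candidate masks are improvised note-by-note from a harmony memory, with occasional pitch adjustments (bit flips). The memory size, iteration count, and the HMCR/PAR rates below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def harmony_search(fitness, n_features, memory_size=10, n_iters=200,
                   hmcr=0.9, par=0.3, seed=1):
    """Minimal binary Harmony Search for feature selection (sketch).
    hmcr: harmony-memory considering rate; par: pitch-adjusting rate."""
    rng = np.random.default_rng(seed)
    HM = (rng.random((memory_size, n_features)) < 0.5).astype(float)
    scores = np.array([fitness(h) for h in HM])
    for _ in range(n_iters):
        new = np.empty(n_features)
        for j in range(n_features):
            if rng.random() < hmcr:               # draw bit from memory
                new[j] = HM[rng.integers(memory_size), j]
                if rng.random() < par:            # pitch adjustment: flip
                    new[j] = 1.0 - new[j]
            else:                                 # random consideration
                new[j] = float(rng.random() < 0.5)
        s = fitness(new)
        worst = scores.argmin()
        if s > scores[worst]:                     # replace worst harmony
            HM[worst], scores[worst] = new, s
    return HM[scores.argmax()]
```

Unlike PSO, Harmony Search carries no velocities; exploration comes entirely from memory recombination and the random flips, which is why its tuning surface is smaller.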
Table 11. Performance Analysis of Classifiers for Raw Data without Dimensionality Reduction with the CNN Method.

| Classifier | Accuracy | Precision | Recall | F1 Score | MCC | Error Rate | Kappa |
|---|---|---|---|---|---|---|---|
| Nonlinear Regression | 83.42541 | 83.70 | 99.34 | 90.85366 | 0.170709 | 16.5745856 | 0.090147 |
| GMM | 89.50276 | 90.184 | 98.00 | 93.92971 | 0.583974 | 10.4972376 | 0.55643 |
| SDC | 87.8453 | 88.554 | 98.00 | 93.03797 | 0.501658 | 12.1546961 | 0.461601 |
| Naïve Bayesian | 86.74033 | 86.20 | 100 | 92.59259 | 0.441204 | 13.2596685 | 0.325885 |
| SVM (Linear) | 87.8453 | 88.095 | 98.67 | 93.08176 | 0.498308 | 12.1546961 | 0.443699 |
| SVM (Poly) | 88.39779 | 88.622 | 98.67 | 93.37539 | 0.52711 | 11.6022099 | 0.477669 |
| SVM (RBF) | 90.60773 | 90.797 | 98.67 | 94.56869 | 0.632977 | 9.39226519 | 0.603121 |
Table 12. Performance Analysis of Classifiers for STFT Dimensionality Reduction Technique with the CNN Method.

| Classifier | Accuracy | Precision | Recall | F1 Score | MCC | Error Rate | Kappa |
|---|---|---|---|---|---|---|---|
| Nonlinear Regression | 86.11111 | 87.8787 | 96.67 | 92.06349 | 0.40452 | 13.88889 | 0.375 |
| GMM | 88.88889 | 90.625 | 96.67 | 93.54839 | 0.553399 | 11.11111 | 0.538462 |
| SDC | 91.66667 | 93.54 | 96.67 | 95.08197 | 0.6825 | 8.333333 | 0.678571 |
| Naïve Bayesian | 91.66667 | 90.909 | 100 | 95.2381 | 0.6742 | 8.333333 | 0.625 |
| SVM (Linear) | 88.88889 | 90.625 | 96.67 | 93.54839 | 0.553399 | 11.11111 | 0.538462 |
| SVM (Poly) | 88.88889 | 90.625 | 96.67 | 93.54839 | 0.553399 | 11.11111 | 0.538462 |
| SVM (RBF) | 88.88889 | 90.625 | 96.67 | 93.54839 | 0.553399 | 11.11111 | 0.538462 |
Table 13. Computational Complexity of the Classifiers for the STFT Dimensionality Reduction Method without and with Feature Selection Methods and CNN Models.

| Classifier | Without Feature Selection | With PSO Feature Selection | With Harmonic Search | CNN with Raw Data | CNN with STFT DR |
|---|---|---|---|---|---|
| Nonlinear Regression | O(2n³ log₂n) | O(2n⁶ log₂n) | O(2n⁶ log₂n) | O(2n⁴ log₂n) | O(2n⁴ log₄n) |
| GMM | O(2n⁴ log₂n) | O(2n⁷ log₂n) | O(2n⁷ log₂n) | O(2n⁵ log₂n) | O(2n⁵ log₄n) |
| Softmax Discriminant | O(2n³ log₂n) | O(2n⁶ log₂n) | O(2n⁶ log₂n) | O(2n⁴ log₂n) | O(2n⁴ log₄n) |
| Naïve Bayesian | O(2n³ log₂n) | O(2n⁶ log₂n) | O(2n⁶ log₂n) | O(2n⁴ log₂n) | O(2n⁴ log₄n) |
| SVM (Linear) | O(2n³ log₂n) | O(2n⁶ log₂n) | O(2n⁶ log₂n) | O(2n⁴ log₂n) | O(2n⁴ log₄n) |
| SVM (Poly) | O(2n⁴ log₂n) | O(2n⁷ log₂n) | O(2n⁷ log₂n) | O(2n⁵ log₂n) | O(2n⁵ log₄n) |
| SVM (RBF) | O(2n² log₄n) | O(2n⁵ log₄n) | O(2n⁵ log₄n) | O(2n³ log₄n) | O(2n³ log₈n) |
Table 14. Comparison with Existing Works in Adenocarcinoma and Mesothelioma Lung Cancer Classification from Microarray Gene Datasets.

| Serial No. | Author | Dataset | Methodology | Accuracy (%) |
|---|---|---|---|---|
| 1 | This research work | LH2 Dataset | STFT with PSO feature selection and SVM (RBF) classifier | 94.47 |
| 2 | Gupta et al. (2022) [40] | Cancer Genome Atlas Dataset | Deep learning with CNN methodology | 92 |
| 3 | Lin Ke et al. (2022) [41] | LH2 Dataset | Decision Tree C4.5 classification | 93.2 |
| 4 | Federica Morani et al. (2021) [42] | Cancer Genome Atlas and Curated Gene Expression Datasets | Multivariate approach using Cox | 90 |
| 5 | Fathi et al. (2021) [43] | LH2 Dataset | Feature combination from different network layers using Decision Tree | 85 |
| 6 | Daniel Xia et al. (2020) [44] | LH2 Dataset | Minimalist approach for classification | 90.6 |
| 7 | Azzavi et al. (2015) [45] | Kent Ridge Bio-medical Dataset | Multilayer perceptron (MLP), RBF, and SVM | 91.39 / 91.72 / 89.82 |
| 8 | Guan et al. (2009) [46] | HG-U95 Dataset | SVM with RBF kernel | 94 |
| 9 | Mramor et al. (2007) [47] | Whitehead Institute Database | SVM, NBC, k-nearest neighbor (KNN), and Decision Tree | 94.67 / 90.35 / 75.28 / 91.21 |
| 10 | Gordon (2002) [21] | LH2 Dataset | Translation of microarray data to gene expressions | 90 |
Table 15. Comparison of Previous Works Involving Lung and Other Types of Cancer Classification from Microarray Gene Datasets.

| Serial No. | Type of Cancer and Dataset | Methodology | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| 1 | Lung cancer—LH2 Dataset [21] | GMM classifier with STFT DR and without feature selection | 80.66 | 92.592 | 83.33 | 87.7193 |
| | | SVM (RBF) classifier with STFT DR and PSO feature selection | 94.475 | 97.94 | 95.33 | 96.62162 |
| | | SVM (RBF) classifier with STFT DR and HS feature selection | 90.055 | 97.82 | 90 | 93.75 |
| | | SVM (RBF) classifier with CNN and Raw Data | 90.607 | 90.797 | 98.67 | 94.56869 |
| | | Softmax Discriminant classifier with CNN and STFT DR | 91.666 | 93.54 | 96.67 | 95.08197 |
| 2 | Lung cancer—Gordon (2002) [21] | PCC-DTCV model with optimized DT and PPC ≥ 0.4 | 93 | 84 | 84 | 84 |
| | | PCC-DTCV model with optimized DT and PPC ≥ 0.5 | 94 | 77 | 77 | 77 |
| | | PCC-DTCV model with optimized DT and PPC ≥ 0.6 | 95 | 92.4 | 77 | 84 |
| 3 | Prostate cancer—Singh (2002) [49] | SVM classifier with Elastic Net | 95.9936 | 96.4286 | 96.131 | 96.27957 |
| | | SVM classifier with Elastic Net CV | 99.4565 | 96.4286 | 96.131 | 96.27957 |
| | | MLP classifier with Lasso | 95.1122 | 96.4286 | 96.131 | 96.27957 |
| | | MLP classifier with Elastic Net | 96.0737 | 98.2143 | 98.2143 | 98.2143 |
| 4 | Breast cancer—Chin (2006) [50] | SVM classifier with Elastic Net | 86.6071 | 87.3106 | 72.5 | 79.2190067 |
| | | SVM classifier with Elastic Net CV | 89.881 | 87.3106 | 95.9722 | 91.4367345 |
| | | MLP classifier with Lasso | 87.4405 | 79.1667 | 89.3687 | 83.9589198 |
| | | MLP classifier with Elastic Net | 88.9881 | 89.5076 | 79.1667 | 84.0201657 |
| 5 | Leukemia—Golub (1999) [51] | SVM classifier with Elastic Net | 97.2222 | 96.875 | 97.9167 | 97.3930646 |
| | | SVM classifier with Elastic Net CV | 97.2222 | 96.875 | 97.9167 | 97.3930646 |
| | | MLP classifier with Lasso | 95.8333 | 100 | 100 | 100 |
| | | MLP classifier with Elastic Net | 97.2222 | 95.8333 | 93.75 | 94.7802035 |

M S, K.; Rajaguru, H.; Nair, A.R. Evaluation and Exploration of Machine Learning and Convolutional Neural Network Classifiers in Detection of Lung Cancer from Microarray Gene—A Paradigm Shift. Bioengineering 2023, 10, 933. https://doi.org/10.3390/bioengineering10080933