Gene selection from microarray data for cancer classification—a machine learning approach

doi:10.1016/j.compbiolchem.2004.11.001

Computational Biology and Chemistry

Volume 29, Issue 1, February 2005, Pages 37-46

Abstract

A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contain a small number of samples, each with a large number of gene expression levels as features. Selecting the relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper that discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.

Introduction

Accurate cancer diagnosis is vital for the successful application of specific therapies. Although cancer classification has improved over the last decade, there is still a need for a fully automated and less subjective method for cancer diagnosis. Recent studies demonstrated that DNA microarrays could provide useful information for cancer classification at the gene expression level due to their ability to measure the abundance of messenger ribonucleic acid (mRNA) transcripts for thousands of genes simultaneously.

Several machine learning algorithms have already been applied to classifying tumors using microarray data. Voting machines and self-organising maps (SOM) were used to analyse acute leukemia (Golub et al., 1999). Support vector machines (SVMs) were applied to multi-class cancer diagnosis by Ramaswamy et al. (2001). Hierarchical clustering was used to analyse colon tumors (Alon et al., 1999). The best classification results are reported by Li et al. (2003) and Antonov et al. (2004). Li et al. employed a rule discovery method, and Antonov et al. used maximal margin linear programming (MAMA).

Given the nature of cancer microarray data, which usually consists of a few hundred samples with thousands of genes as features, the analysis has to be carried out carefully. Work in such a high-dimensional space is extremely difficult if not impossible. One straightforward approach to selecting relevant genes is the application of standard parametric tests such as the t-test (Thomas et al., 2001; Tsai et al., 2003) or of a non-parametric test such as the Wilcoxon score test (Thomas et al., 2001; Antoniadis et al., 2003). Wilks's Lambda score was proposed by Hwang et al. (2002) to assess the discriminatory power of individual genes. A new procedure (Antonov et al., 2004) was designed to detect groups of genes that are strongly associated with a particular cancer type.
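As an illustration of the test-based approach described above, the following minimal sketch ranks genes by the absolute value of a two-sample t-statistic. It is not taken from the cited studies: the choice of Welch's unequal-variance form, the function names `welch_t` and `rank_genes`, and the toy expression values are all assumptions made for the example.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0  # both groups constant: treat the gene as uninformative
    return (mean(a) - mean(b)) / se

def rank_genes(expr, labels):
    """Return gene indices sorted by |t| between class 0 and class 1, largest first.

    expr   : list of samples, each a list of m expression values
    labels : list of 0/1 class labels, one per sample
    """
    m = len(expr[0])
    scores = []
    for g in range(m):
        a = [row[g] for row, y in zip(expr, labels) if y == 0]
        b = [row[g] for row, y in zip(expr, labels) if y == 1]
        scores.append((abs(welch_t(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)]

# Hypothetical toy data: 6 samples, 3 genes; only gene 0 separates the classes.
expr = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 9.0],
    [0.9, 4.9, 4.0],
    [3.0, 5.0, 3.0],
    [3.1, 5.2, 8.0],
    [2.9, 4.8, 5.0],
]
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_genes(expr, labels)
```

Such a ranking scores each gene in isolation, so two highly redundant genes can both appear near the top of the list.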

In this paper we consider two general approaches to feature subset selection, more specifically, wrapper and filter approaches, for gene selection. Wrappers and filters differ in how they evaluate feature subsets. Filter approaches remove irrelevant features according to general characteristics of the data. Wrapper approaches, by contrast, apply machine learning algorithms to feature subsets and use cross-validation to evaluate the score of feature subsets. Most methods of gene selection for microarray data analysis focus on filter approaches, although there are a few publications on applying wrapper approaches (Inza et al., 2004; Xiong et al., 2001; Xing et al., 2001). Nevertheless, in theory, wrappers should provide more accurate classification results than filters (Langley, 1994). Wrappers use classifiers to estimate the usefulness of feature subsets. The use of “tailor-made” feature subsets should provide better classification accuracy for the corresponding classifiers, since the features are selected according to their contribution to the classification accuracy of those classifiers. The disadvantage of the wrapper approach is its computational cost when combined with sophisticated algorithms such as support vector machines.
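The wrapper idea can be sketched as a greedy forward search in which each candidate subset is scored by the cross-validated accuracy of a classifier. The sketch below is illustrative only: it assumes leave-one-out cross-validation and uses a simple nearest-centroid classifier as a stand-in, not the decision trees, naïve Bayes or SVMs studied in this paper. All function names and the toy data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the class whose mean vector is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(expr, labels, genes):
    """Leave-one-out cross-validated accuracy using only the given gene indices."""
    hits = 0
    for i in range(len(expr)):
        train_x = [[row[g] for g in genes] for j, row in enumerate(expr) if j != i]
        train_y = [y for j, y in enumerate(labels) if j != i]
        x = [expr[i][g] for g in genes]
        hits += nearest_centroid_predict(train_x, train_y, x) == labels[i]
    return hits / len(expr)

def wrapper_forward_selection(expr, labels, max_genes=3):
    """Greedily add the gene that most improves cross-validated accuracy."""
    selected, best_acc = [], 0.0
    candidates = set(range(len(expr[0])))
    while candidates and len(selected) < max_genes:
        scored = [(loocv_accuracy(expr, labels, selected + [g]), g)
                  for g in sorted(candidates)]
        acc, gene = max(scored, key=lambda t: t[0])  # ties go to the lowest index
        if acc <= best_acc:
            break  # no candidate improves the score: stop
        selected.append(gene)
        candidates.remove(gene)
        best_acc = acc
    return selected, best_acc

# Hypothetical toy data: 6 samples, 3 genes; gene 0 alone separates the classes.
expr = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 9.0],
    [0.9, 4.9, 4.0],
    [3.0, 5.0, 3.0],
    [3.1, 5.2, 8.0],
    [2.9, 4.8, 5.0],
]
labels = [0, 0, 0, 1, 1, 1]
selected, best_acc = wrapper_forward_selection(expr, labels)
```

Every candidate subset requires a full round of cross-validation, which is why the cost grows quickly with the number of genes and with the training cost of the classifier.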

As a filter approach, correlation-based feature selection (CFS) was proposed by Hall (1999). The rationale behind this algorithm is “a good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other.” It has been shown in Hall (1999) that CFS gives results comparable to the wrapper and executes many times faster. It will be shown later in this paper that combining CFS with decision trees, the naïve Bayes algorithm and SVM provides classification accuracy on cancer microarray data that is similar to or better than published results. The rest of this paper is organised as follows. We begin with a brief introduction to feature subset selection, followed by a description of feature wrappers, filters and CFS, which is essentially a filter algorithm. We discuss the advantages and disadvantages of using wrappers and filters to select feature subsets. Thereafter, we present the experimental results on acute leukemia and lymphoma microarray data. The last section discusses the results and concludes this paper.
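The CFS rationale quoted above is commonly expressed through a subset merit of the form $k\,\bar{r}_{cf} / \sqrt{k + k(k-1)\,\bar{r}_{ff}}$, where $k$ is the subset size, $\bar{r}_{cf}$ the mean feature-class correlation and $\bar{r}_{ff}$ the mean feature-feature correlation. The sketch below evaluates this merit inside a greedy forward search. It rests on simplifying assumptions: absolute Pearson correlation is used as the correlation measure, which may differ from the measure used in Hall (1999), and the names `pearson`, `cfs_merit`, `cfs_forward` and the toy data are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cfs_merit(columns, labels, subset):
    """Merit of a gene subset: high class correlation, low inter-gene correlation.

    columns : list of genes, each a list of expression values across samples
    """
    k = len(subset)
    r_cf = sum(abs(pearson(columns[g], labels)) for g in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward(columns, labels):
    """Greedy forward search: add genes while the merit strictly increases."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        scored = [(cfs_merit(columns, labels, selected + [g]), g) for g in remaining]
        merit, gene = max(scored, key=lambda t: t[0])  # ties go to the first gene
        if merit <= best:
            break
        selected.append(gene)
        remaining.remove(gene)
        best = merit
    return selected, best

# Hypothetical toy data: gene 1 duplicates gene 0; gene 2 is complementary to gene 0.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 0, 1, 1],
]
selected, best = cfs_forward(columns, labels)
```

On this toy data the duplicated gene adds nothing to the merit, whereas the complementary gene raises it, which is the behaviour the quoted rationale asks for.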

Section snippets

Feature subset selection

We now define the basic notions used in the paper. Given a microarray cancer data set $D$, which contains $n$ samples from different cancer types or subtypes, we have to build a mathematical model which can map the samples to their classes. Each sample has $m$ genes as its features. The assumption here is that not all genes measured by a microarray are related to cancer classification. Some genes are irrelevant and some are redundant from the machine learning point of view. It is well-known that the

Analysis of acute leukemia data

The acute leukemia data of Golub et al. (1999) consists of samples from two different types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The training data set has 38 bone marrow samples (27 ALL and 11 AML). Each sample has expression patterns of 7129 genes measured by the Affymetrix oligonucleotide microarray. The test data set consists of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).

Feature-ranking filters provide a natural way

Discussion

We have shown in this paper that feature subset selection algorithms, namely wrappers, filters and CFS, can be very useful in extracting relevant information in microarray data analysis. Wrapper approaches can choose the best genes for building classifiers, while filters can provide a useful overview by ranking the genes for the particular problem at hand. CFS can choose genes which are highly correlated with the cancer classes yet uncorrelated with each other.

When the methods agree and select the same

Acknowledgement

We would like to thank Dr. Marco Zaffalon for proofreading the manuscript and validating our results with his algorithm, Dr. Francesco Bertoni for his advice on lymphoma data analysis, and Annina Neumann for proofreading the manuscript. We are also grateful for the comments given by reviewers, which have significantly improved this paper.

References (38)

A. Crawford et al. Purification and characterization of zyxin, an 82,000-dalton component of adherens junctions. J. Biol. Chem. (1991)

I. Inza et al. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif. Intell. Med. (2004)

K. Kira et al. A practical approach for feature selection

R. Salgia et al. p130CAS forms a signaling complex with the adapter protein CRKL in hematopoietic cells transformed by the BCR/ABL oncogene. J. Biol. Chem. (1996)

S. Tavor et al. Restoration of C/EBPalpha expression in a BCR-ABL+ cell line induces terminal granulocytic differentiation. J. Biol. Chem. (2003)

E. van der Gaag et al. Role of zyxin in differential cell spreading and proliferation of melanoma cells and melanocytes. J. Invest. Dermatol. (2002)

Y. Wang et al. Zyxin and paxillin proteins: focal adhesion plaque LIM domain proteins go nuclear. Biochim. Biophys. Acta (2003)

T. Yagi et al. Identification of a gene expression signature associated with pediatric AML prognosis. Blood (2003)

J. Yi et al. Members of the zyxin family of LIM proteins interact with members of the p130Cas family of signal transducers. J. Biol. Chem. (2002)

A. Agathanggelou et al. Identification of novel gene expression targets for the Ras association domain family 1 (RASSF1A) tumor suppressor gene in non-small cell lung cancer and neuroblastoma. Cancer Res. (2003)

A.A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000)

U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. (1999)

A. Antoniadis et al. Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics (2003)

A.V. Antonov et al. Optimization models for cancer classification: extracting gene interaction information from microarray expression data. Bioinformatics (2004)

U. Fayyad et al. Multi-interval discretization of continuous-valued attributes for classification learning

E. Frank et al. Data mining in bioinformatics using Weka. Bioinformatics (2004)

T.S. Furey et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics (2000)

T. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)

Hall, M.A., 1999. Correlation-based feature selection for machine learning. Ph.D. Thesis. Department of Computer...
