A review of microarray datasets and applied feature selection methods
Introduction
During the last two decades, the advent of DNA microarray datasets has stimulated a new line of research both in bioinformatics and in machine learning. This type of data is used to collect information from tissue and cell samples regarding gene expression differences that could be useful for disease diagnosis or for distinguishing specific types of tumor. Although sample sizes are usually very small (often fewer than 100 patients) for training and testing, the number of features in the raw data ranges from 6000 to 60,000, since the technology measures gene expression en masse. A typical classification task is to separate healthy patients from cancer patients based on their gene expression “profile” (binary approach). There are also datasets in which the goal is to distinguish among different types of tumors (multiclass approach), making the task even more complicated.
Therefore, microarray data pose a serious challenge for machine learning researchers. Having so many features relative to so few samples creates a high likelihood of finding “false positives” due to chance (both in finding relevant genes and in building predictive models) [94]. It becomes necessary to find robust methods to validate the models and assess their likelihood. Furthermore, additional experimental complications (like noise and variability) render the analysis of microarray data an exciting domain [98].
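The risk of chance findings can be illustrated with a small simulation. The sketch below is not taken from the reviewed studies: the sample size (20), the number of genes (2000), the correlation threshold (0.5) and the random seed are all assumptions chosen for illustration. Every simulated gene is pure noise, yet some of them still correlate noticeably with the class labels.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)            # fixed seed so the run is repeatable
n_samples, n_genes = 20, 2000     # assumed sizes: few samples, many genes
labels = [0] * 10 + [1] * 10      # balanced two-class problem

# Each gene is independent Gaussian noise, unrelated to the labels.
noise_genes = [[rng.gauss(0, 1) for _ in range(n_samples)]
               for _ in range(n_genes)]

abs_corr = [abs(pearson(gene, labels)) for gene in noise_genes]
best = max(abs_corr)
n_strong = sum(c > 0.5 for c in abs_corr)

print(f"strongest |correlation| among noise genes: {best:.2f}")
print(f"noise genes with |correlation| > 0.5: {n_strong}")
```

Because none of these genes carries any signal, any gene that looks "relevant" here is a false positive, which is why selection and validation must be kept strictly separate.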
Several studies have shown that most genes measured in a DNA microarray experiment are not relevant for accurately discriminating between the classes of the problem [46]. To avoid the “curse of dimensionality” [62], feature (gene) selection plays a crucial role in DNA microarray analysis. Feature selection is defined as the process of identifying and removing irrelevant features from the training data, so that the learning algorithm focuses only on those aspects of the training data useful for analysis and future prediction [50]. Feature selection methods are usually divided into three varieties: filters, wrappers and embedded methods. While wrapper models involve optimizing a predictor as part of the selection process, filter models rely on the general characteristics of the training data to select features independently of any predictor. Embedded methods generally use machine learning models for classification, and the optimal subset of features is then built by the classifier algorithm itself. Of course, the interaction with the classifier required by wrapper and embedded methods comes with a significant computational burden (heavier in the case of wrappers). In addition to this classification, feature selection methods may also be divided into univariate and multivariate types. Univariate methods consider each feature independently of the others, a drawback that can be overcome by multivariate techniques, which incorporate feature dependencies to some degree at the cost of demanding more computational resources [26].
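As an illustration of the univariate filter idea, the following minimal sketch scores each gene independently of any classifier and keeps the top-ranked ones. The signal-to-noise score, the function names `signal_to_noise` and `rank_genes`, and the toy data are assumptions made for this example; they are not a specific method evaluated in the review.

```python
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """Score one gene as |mean_pos - mean_neg| / (std_pos + std_neg)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = stdev(pos) + stdev(neg)
    # A gene that is constant within both classes gets a score of 0.
    return abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0

def rank_genes(X, labels, k):
    """Return the column indices of the k highest-scoring genes."""
    n_genes = len(X[0])
    scores = [signal_to_noise([row[j] for row in X], labels)
              for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:k]

# Toy data: 6 samples x 3 genes. Only gene 0 differs between the classes.
X = [
    [5.1, 1.0, 2.0],
    [4.9, 2.0, 2.1],
    [5.0, 1.5, 1.9],
    [1.0, 1.2, 2.0],
    [1.2, 1.9, 2.2],
    [0.9, 1.4, 1.8],
]
y = [1, 1, 1, 0, 0, 0]

top = rank_genes(X, y, k=1)
print(top)
```

Since every gene is scored in isolation, this kind of filter is cheap even for tens of thousands of genes, but it cannot detect redundancy or interactions between genes, which is the limitation that multivariate methods address.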
Feature selection as a preprocessing step to tackle microarray data has rapidly become indispensable among researchers, not only to remove redundant and irrelevant features, but also to help biologists identify the underlying mechanism that relates gene expression to diseases. This research area has received significant attention in recent years (most of the work has been published in the last decade), and new algorithms have emerged as alternatives to the existing ones. However, when a new method is proposed, there is a lack of standard state-of-the-art results against which to perform a fair comparative study. Furthermore, there is a broad suite of microarray datasets available for experiments, some of which even share the same name while the number of samples or features differs from one study to another, which makes a fair comparison even more complicated.
The main goal of the research presented here is to provide a review of the existing feature selection methods developed to be applied to DNA microarray data. In addition, we pay attention to the datasets used, their intrinsic data characteristics and the behavior of classical feature selection algorithms available in data mining software tools used for microarray data. In this manner, the reader can become aware of the particularities of this type of data as well as its problems, such as class imbalance, data complexity, the presence of overlapping and outliers, or the so-called dataset shift. These problems render the analysis of microarray data an interesting domain.
We have designed an experimental study in such a way that we can draw well-founded conclusions. We use a set of nine two-class microarray datasets, which suffer from problems such as class imbalance, overlapping or dataset shift. Some of these datasets were originally divided into training and test datasets, so a holdout validation is performed on them. For the remaining datasets, we choose to evaluate them with a k-fold cross-validation, since it is a common choice in the literature [81], [107], [86], [101], [31], [105], [125]. However, it has been shown that cross-validation can potentially introduce dataset shift, so we include another strategy to create the partitioning, called Distribution optimally balanced stratified cross-validation (DOB-SCV) [84]. We consider C4.5, Support Vector Machine (SVM) and naive Bayes as classifiers, and we use classification accuracy, sensitivity and specificity on the test partitions as the evaluation criteria.
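The evaluation criteria named above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `evaluate` and `stratified_folds` are hypothetical, the partitioning shown is plain stratified k-fold (not DOB-SCV), and the labels and predictions are made-up toy values rather than results from the study.

```python
def evaluate(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Fraction of positive samples that were detected.
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Fraction of negative samples that were correctly rejected.
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

def stratified_folds(labels, k):
    """Split sample indices into k folds, dealing each class out in turn
    so that every fold keeps roughly the original class proportions."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

# Toy imbalanced problem: 3 positive and 7 negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a classifier biased to the majority

metrics = evaluate(y_true, y_pred)
folds = stratified_folds(y_true, k=3)
print(metrics)
print(folds)
```

In this toy case the accuracy is 0.8 while the sensitivity is only 1/3, which shows why accuracy alone can be misleading on imbalanced datasets and why sensitivity and specificity are reported alongside it.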
The remainder of the paper is organized as follows: Section 2 introduces the background and the first attempts to deal with microarray datasets. In Section 3 we review the state of the art on feature selection methods applied to this type of data, including the classical techniques (filters, embedded and wrappers) as well as other more recent approaches. Next, Section 4 focuses on the particularities of the datasets, from a summary of the characteristics of the most famous datasets used in the literature and of the existing repositories to an analysis of the inherent problems of microarray data, such as the small sample size, class imbalance, dataset shift or the presence of outliers. In Section 5 we present an experimental study of the most significant algorithms and evaluation techniques, together with an in-depth analysis of its findings. Finally, in Section 6, we make our concluding remarks.
Background: the problem and first attempts
All cells have a nucleus, and inside this nucleus there is DNA, which encodes the “program” for future organisms. DNA has coding and non-coding segments. The coding segments, also known as genes, specify the structure of proteins, which do the essential work in every organism. Genes make proteins in two steps: DNA is transcribed into mRNA and then mRNA is translated into proteins. Advances in molecular genetics technologies, such as DNA microarrays, allow us to obtain a global view of the cell,
Algorithms for feature selection on microarray data: a review
Feature selection methods are constantly emerging and, for this reason, there is a wide suite of methods that deal with microarray gene data. The aim of this section is to present those methods developed in the last few years. Traditionally, the most employed gene selection methods fall into the filter approach (see Section 3.1). Most of the novel filter methods proposed are based on information theory, although issues such as robustness or division in multiple binary problems are emerging
Microarray datasets
After reviewing the most up-to-date feature selection methods dealing with microarray data, this section focuses on the particularities of the datasets. First, Section 4.1 enumerates the existing microarray data repositories, whilst Section 4.2 provides a summary of the characteristics of the most famous binary and multiclass datasets used in the literature. Finally, Section 4.3 is devoted to an analysis of the problems inherent to microarray data, such as the small-sample size, the
An experimental study in binary classification: analysis of results
The goal of performing feature selection on microarray data can be twofold: class prediction or biomarker identification. If the goal is class prediction, there is a demand for machine learning techniques such as supervised classification. However, if the objective is to find informative genes, the classification performance is ignored and the selected genes have to be individually evaluated. The experiments that will be presented in this section are focused on class prediction, which is an
Conclusions
This article reviews the up-to-date contributions of feature selection research applied to the field of DNA microarray data analysis, as well as the datasets used. The advent of this type of data has posed a big challenge for machine learning researchers, because of the large input dimensionality and small sample size. Since the infancy of microarray data classification, feature selection has become an imperative step, in order to reduce the number of features (genes).
Since the end of the
Acknowledgments
This research has been economically supported in part by the Secretaría de Estado de Investigación of the Spanish Government through the research Projects TIN 2011-28488 and TIN 2012-37954; by the Consellería de Industria of the Xunta de Galicia through the research Projects CN2011/007 and CN2012/211; and by the regional Project P11-TIC-9704; all of them partially funded by FEDER funds of the European Union. V. Bolón-Canedo acknowledges the support of Xunta de Galicia under Plan I2C Grant
References (128)
- et al. An ensemble of filters and classifiers for microarray data classification. Pattern Recognit. (2012)
- et al. Data classification using an ensemble of filters. Neurocomputing (2014)
- et al. A hybrid feature selection method for DNA microarray data. Comput. Biol. Med. (2011)
- et al. An unsupervised approach to feature discretization and selection. Pattern Recognit. (2012)
- et al. EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognit. (2013)
- et al. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif. Intell. Med. (2004)
- et al. A novel hybrid feature selection method for microarray data analysis. Appl. Soft Comput. (2011)
- et al. An extensive comparison of recent classification tools applied to microarray data. Comput. Statist. Data Anal. (2005)
- et al. An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. (2013)
- et al. Analysis of complexity indices for classification problems: cancer gene expression data. Neurocomputing (2012)
- Simultaneous feature selection and classification using kernel-penalized support vector machines. Inf. Sci.
- Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet
- A unifying view on dataset shift in classification. Pattern Recognit.
- Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors. Artif. Intell. Med.
- Use of proteomic patterns in serum to identify ovarian cancer. Lancet
- Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognit.
- Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification. Pattern Recognit.
- Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics
- KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput.
- Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature
- Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci.
- Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Nat. Acad. Sci.
- Feature selection of imbalanced gene expression microarray data
- MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet.
- KNIME: The Konstanz Information Miner
- Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Nat. Acad. Sci.
- Evaluation of SMOTE for high-dimensional class-imbalanced microarray data
- Gene selection for cancer classification using wrapper approaches. Int. J. Pattern Recognit. Artif. Intell.
- On the effectiveness of discretization on gene selection of microarray data
- A review of feature selection methods on synthetic data. Knowl. Inf. Syst.
- Fads and fallacies in the name of small-sample microarray classification-a highlight of misunderstanding and erroneous usage in the applications of genomic signal processing. IEEE Signal Process. Mag.
- Is cross-validation valid for small-sample microarray classification? Bioinformatics
- Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Nat. Acad. Sci.
- Iterative feature perturbation as a gene selector for microarray data. Int. J. Pattern Recognit. Artif. Intell.
- SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res.
- Minimum redundancy feature selection from microarray gene expression data. J. Bioinformatics Comput. Biol.
- Small sample issues for microarray-based classification. Comp. Funct. Genom.
- Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc.