Elsevier

Information Sciences

Volume 282, 20 October 2014, Pages 111-135

A review of microarray datasets and applied feature selection methods

https://doi.org/10.1016/j.ins.2014.05.042

Highlights

  • Feature selection is often required for microarray data classification.

  • We review the most up-to-date feature selection methods in this field.

  • We discuss the problematic characteristics of microarray data.

  • We present an experimental evaluation on the most representative methods.

Abstract

Microarray data classification is a difficult challenge for machine learning researchers due to the high number of features and the small sample sizes. Feature selection soon came to be considered a de facto standard in this field, and a huge number of feature selection methods have been applied in the attempt to reduce the input dimensionality while improving the classification performance. This paper is devoted to reviewing the most up-to-date feature selection methods developed in this field and the microarray databases most frequently used in the literature. We also make the interested reader aware of the problematic characteristics of data in this domain, such as class imbalance, data complexity, and the so-called dataset shift. Finally, an experimental evaluation of well-known feature selection methods on the most representative datasets is presented, bearing in mind that the aim is not to single out the best feature selection method, but to facilitate their comparative study by the research community.

Introduction

During the last two decades, the advent of DNA microarray datasets has stimulated a new line of research both in bioinformatics and in machine learning. This type of data is used to collect information from tissue and cell samples regarding gene expression differences that could be useful for disease diagnosis or for distinguishing specific types of tumor. Although the available samples are usually very few (often fewer than 100 patients) for training and testing, the number of features in the raw data ranges from 6000 to 60,000, since gene expression is measured en masse. A typical classification task is to separate healthy patients from cancer patients based on their gene expression “profile” (binary approach). There are also datasets in which the goal is to distinguish among different types of tumors (multiclass approach), making the task even more complicated.

Therefore, microarray data pose a serious challenge for machine learning researchers. Having so many features relative to so few samples creates a high likelihood of finding “false positives” due to chance, both in finding relevant genes and in building predictive models [94]. It becomes necessary to find robust methods to validate the models and assess how likely their findings are to generalize. Furthermore, additional experimental complications (such as noise and variability) render the analysis of microarray data an exciting domain [98].
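To make this risk concrete, the following sketch (ours, not from the paper; it assumes NumPy and uses an illustrative t-like score) shows how many purely random "genes" look class-correlated when features vastly outnumber samples:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 50, 6000   # sample/feature scale typical of microarray data
X = rng.standard_normal((n_samples, n_features))  # pure noise: no feature is truly informative
y = np.repeat([0, 1], n_samples // 2)             # two arbitrary class labels

# A crude t-like score per "gene": difference of class means over the pooled standard error
half = n_samples // 2
diff = X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)
se = np.sqrt(X[y == 0].var(axis=0, ddof=1) / half + X[y == 1].var(axis=0, ddof=1) / half)
scores = np.abs(diff / se)

# Even though every feature is noise, hundreds look strongly class-related by chance
false_hits = int((scores > 2.0).sum())
print(f"{false_hits} of {n_features} noise features score above 2.0")
```

With a threshold of 2.0 (roughly a 5% two-sided level), around 5% of the 6000 noise features are expected to pass, which is exactly why the validation strategies discussed in this paper matter.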

Several studies have shown that most genes measured in a DNA microarray experiment are not relevant for accurately distinguishing the different classes of the problem [46]. To avoid the problem of the “curse of dimensionality” [62], feature (gene) selection plays a crucial role in DNA microarray analysis. It is defined as the process of identifying and removing irrelevant features from the training data, so that the learning algorithm focuses only on those aspects of the training data that are useful for analysis and future prediction [50]. Feature selection methods usually fall into three varieties: filters, wrappers and embedded methods. While wrapper models involve optimizing a predictor as part of the selection process, filter models rely on the general characteristics of the training data to select features independently of any predictor. Embedded methods carry out the selection as part of training the classification model, so that an optimal subset of features is built while the classifier is learned. Of course, the interaction with the classifier required by wrapper and embedded methods comes with an important computational burden (heavier in the case of wrappers). In addition to this classification, feature selection methods may also be divided into univariate and multivariate types. Univariate methods consider each feature independently of the others, a drawback that can be overcome by multivariate techniques, which incorporate feature dependencies to some degree at the cost of demanding more computational resources [26].
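As a concrete illustration of the univariate filter idea, the sketch below (ours; it assumes NumPy, and the function name is hypothetical) ranks each feature independently by its absolute Pearson correlation with the class label, with no classifier in the loop:

```python
import numpy as np

def univariate_filter(X, y, k):
    """Keep the k features with the highest |Pearson correlation| to the label.

    Each feature is scored on its own (univariate) and no classifier is
    involved (filter approach), so the cost is linear in the number of genes."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return np.argsort(corr)[::-1][:k]  # indices of the k highest-scoring features

# Toy data: only features 0 and 1 carry the class signal, the rest are noise
rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0], 30)
X = rng.standard_normal((60, 20))
X[:, 0] += 3 * y
X[:, 1] -= 3 * y
print(sorted(int(i) for i in univariate_filter(X, y, 2)))  # → [0, 1]
```

A wrapper, by contrast, would wrap this search around a classifier (training and evaluating it on each candidate subset), which is far more expensive but can account for interactions with the predictor.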

Feature selection as a preprocessing step to tackle microarray data has rapidly become indispensable among researchers, not only to remove redundant and irrelevant features, but also to help biologists identify the underlying mechanisms that relate gene expression to diseases. This research area has received significant attention in recent years (most of the work has been published in the last decade), and new algorithms keep emerging as alternatives to the existing ones. However, when a new method is proposed, there is a lack of standard state-of-the-art results against which to perform a fair comparative study. Furthermore, there is a broad suite of microarray datasets used in experiments, some of which even share the same name yet differ in the number of samples or features from one study to another, which makes the task more complicated.

The main goal of the research presented here is to provide a review of the existing feature selection methods developed for DNA microarray data. In addition, we pay attention to the datasets used, their intrinsic data characteristics and the behavior of classical feature selection algorithms available in the data mining software tools used for microarray data. In this manner, the reader can become aware of the particularities of this type of data as well as its problematic characteristics, such as class imbalance, data complexity, the presence of overlapping classes and outliers, or the so-called dataset shift. These characteristics make the analysis of microarray data an interesting domain.

We have designed the experimental study in such a way that we can draw well-founded conclusions. We use a set of nine two-class microarray datasets, which suffer from problems such as class imbalance, overlapping or dataset shift. Some of these datasets were originally divided into training and test sets, so a holdout validation is performed on them. The remaining datasets are evaluated with k-fold cross-validation, a common choice in the literature [81], [107], [86], [101], [31], [105], [125]. However, it has been shown that cross-validation can itself introduce dataset shift, so we include another partitioning strategy, Distribution Optimally Balanced Stratified Cross-Validation (DOB-SCV) [84]. We consider C4.5, Support Vector Machine (SVM) and naive Bayes as classifiers, and we use classification accuracy, sensitivity and specificity on the test partitions as the evaluation criteria.
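The partitioning idea behind DOB-SCV [84] can be sketched as follows (a minimal NumPy sketch of ours under simplifying assumptions, not the reference implementation): within each class, a random unassigned example and its k-1 nearest unassigned same-class neighbours are spread across the k folds, so every fold samples each local region of the class and the per-fold distributions stay close to each other.

```python
import numpy as np

def dob_scv_folds(X, y, k, seed=0):
    """Assign each example to one of k folds following the DOB-SCV idea:
    neighbouring examples of the same class end up in different folds."""
    rng = np.random.default_rng(seed)
    folds = np.full(len(y), -1)
    for label in np.unique(y):
        unassigned = set(np.flatnonzero(y == label).tolist())
        while unassigned:
            e = int(rng.choice(sorted(unassigned)))   # random seed example
            unassigned.discard(e)
            pool = np.array(sorted(unassigned), dtype=int)
            if len(pool):
                # nearest unassigned same-class neighbours by Euclidean distance
                order = np.argsort(np.linalg.norm(X[pool] - X[e], axis=1))
                group = [e] + pool[order][:k - 1].tolist()
            else:
                group = [e]
            for fold, example in enumerate(group):    # spread the group over the folds
                folds[example] = fold
                unassigned.discard(example)
    return folds

# 20 examples, two balanced classes, five folds: each fold gets 2 examples per class
rng = np.random.default_rng(3)
X = rng.standard_normal((20, 4))
y = np.repeat([0, 1], 10)
print(np.bincount(dob_scv_folds(X, y, 5)))  # → [4 4 4 4 4]
```

The last group of a class may have fewer than k members; this sketch simply assigns them to the first folds, which is close to, but may differ in detail from, the original formulation.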

The remainder of the paper is organized as follows: Section 2 introduces the background and the first attempts to deal with microarray datasets. In Section 3 we review the state of the art on feature selection methods applied to this type of data, including the classical techniques (filters, embedded methods and wrappers) as well as more recent approaches. Next, Section 4 focuses on the particularities of the datasets, from a summary of the characteristics of the most famous datasets used in the literature and the existing repositories, to an analysis of the inherent problems of microarray data, such as the small sample size, class imbalance, dataset shift or the presence of outliers. In Section 5 we present an experimental study of the most significant algorithms and evaluation techniques, together with a deep analysis of its findings. Finally, in Section 6, we make our concluding remarks.


Background: the problem and first attempts

All cells have a nucleus, and inside this nucleus there is DNA, which encodes the “program” for future organisms. DNA has coding and non-coding segments. The coding segments, also known as genes, specify the structure of proteins, which do the essential work in every organism. Genes make proteins in two steps: DNA is transcribed into mRNA and then mRNA is translated into proteins. Advances in molecular genetics technologies, such as DNA microarrays, allow us to obtain a global view of the cell,

Algorithms for feature selection on microarray data: a review

Feature selection methods are constantly emerging and, for this reason, there is a wide suite of methods that deal with microarray data. The aim of this section is to present those methods developed in the last few years. Traditionally, the most commonly employed gene selection methods fall into the filter approach (see Section 3.1). Most of the novel filter methods proposed are based on information theory, although issues such as robustness or the division into multiple binary problems are emerging

Microarray datasets

After reviewing the most up-to-date feature selection methods dealing with microarray data, this section focuses on the particularities of the datasets. First, Section 4.1 enumerates the existing microarray data repositories, whilst Section 4.2 provides a summary of the characteristics of the most famous binary and multiclass datasets used in the literature. Finally, Section 4.3 is devoted to an analysis of the problems of microarray data, such as the small-sample size, the

An experimental study in binary classification: analysis of results

The goal of performing feature selection on microarray data can be twofold: class prediction or biomarker identification. If the goal is class prediction, there is a demand for machine learning techniques such as supervised classification. However, if the objective is to find informative genes, classification performance is set aside and the selected genes have to be evaluated individually. The experiments presented in this section are focused on class prediction, which is an
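The evaluation criteria used in the experimental study (accuracy, sensitivity and specificity) can be computed from the binary confusion counts; the sketch below (ours, plain Python, hypothetical function name) also shows why accuracy alone is misleading on imbalanced test sets:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true-positive rate) and specificity
    (true-negative rate) for a binary task whose positive class is 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity

# Imbalanced test set: 2 positives, 8 negatives; the classifier misses one positive
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))  # → (0.9, 0.5, 1.0)
```

Accuracy reaches 90% even though half of the minority (positive) class is missed; sensitivity exposes this, which is why all three metrics are reported on imbalanced microarray datasets.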

Conclusions

This article reviews the up-to-date contributions of feature selection research applied to the field of DNA microarray data analysis, as well as the datasets used. The advent of this type of data has posed a big challenge for machine learning researchers, because of the large input dimensionality and small sample size. Since the infancy of microarray data classification, feature selection has become an imperative step, in order to reduce the number of features (genes).

Since the end of the

Acknowledgments

This research has been economically supported in part by the Secretaría de Estado de Investigación of the Spanish Government through the research Projects TIN 2011-28488 and TIN 2012-37954; by the Consellería de Industria of the Xunta de Galicia through the research Projects CN2011/007 and CN2012/211; and by the regional Project P11-TIC-9704; all of them partially funded by FEDER funds of the European Union. V. Bolón-Canedo acknowledges the support of Xunta de Galicia under Plan I2C Grant

References (128)

  • S. Maldonado et al., Simultaneous feature selection and classification using kernel-penalized support vector machines, Inf. Sci. (2011)
  • S. Michiels et al., Prediction of cancer outcome with microarrays: a multiple random validation strategy, Lancet (2005)
  • J.G. Moreno-Torres et al., A unifying view on dataset shift in classification, Pattern Recognit. (2012)
  • O. Okun et al., Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors, Artif. Intell. Med. (2009)
  • E. Petricoin et al., Use of proteomic patterns in serum to identify ovarian cancer, Lancet (2002)
  • R. Ruiz et al., Incremental wrapper-based gene selection from microarray data for cancer classification, Pattern Recognit. (2006)
  • J.A. Saez et al., Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recognit. (2013)
  • Arrayexpress – Functional Genomics Data. <http://www.ebi.ac.uk/arrayexpress/> (accessed January,...
  • Broad institute. Cancer Program Data Sets. <http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi> (accessed...
  • Dataset Repository, Bioinformatics Research Group. <http://www.upo.es/eps/bigs/datasets.html> (accessed January,...
  • Feature Selection Algorithms at Arizona State University. <http://featureselection.asu.edu/software.php> (accessed...
  • Feature Selection Datasets at Arizona State University. <http://featureselection.asu.edu/datasets.php> (accessed...
  • Gene Expression Omnibus. <http://www.ncbi.nlm.nih.gov/geo/> (accessed January,...
  • Gene Expression Project, Princeton University. <http://genomics-pubs.princeton.edu/oncology/> (accessed January,...
  • Kent Ridge Bio-Medical Dataset. <http://datam.i2r.a-star.edu.sg/datasets/krbd> (accessed January,...
  • Microarray Cancers, Plymouth University....
  • Stanford Microarray Database. <http://smd.stanford.edu/> (accessed January,...
  • T. Abeel et al., Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics (2010)
  • J. Alcalá-Fdez et al., KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Comput. (2009)
  • A. Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature (2000)
  • U. Alon et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Nat. Acad. Sci. (1999)
  • C. Ambroise et al., Selection bias in gene extraction on the basis of microarray gene-expression data, Proc. Nat. Acad. Sci. (2002)
  • A. Anaissi et al., Feature selection of imbalanced gene expression microarray data
  • S. Armstrong et al., MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia, Nat. Genet. (2002)
  • V. Barnett et al. (1994)
  • M.R. Berthold et al., KNIME: The Konstanz Information Miner
  • A. Bhattacharjee et al., Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses, Proc. Nat. Acad. Sci. (2001)
  • R. Blagus et al., Evaluation of SMOTE for high-dimensional class-imbalanced microarray data
  • R. Blanco et al., Gene selection for cancer classification using wrapper approaches, Int. J. Pattern Recognit. Artif. Intell. (2004)
  • V. Bolón-Canedo et al., On the effectiveness of discretization on gene selection of microarray data
  • V. Bolón-Canedo et al., A review of feature selection methods on synthetic data, Knowl. Inf. Syst. (2013)
  • V. Bolón-Canedo, S. Seth, A. Sánchez-Maroño, N. Alonso-Betanzos, J. Principe, Statistical dependence measure for...
  • U. Braga-Neto, Fads and fallacies in the name of small-sample microarray classification: a highlight of misunderstanding and erroneous usage in the applications of genomic signal processing, IEEE Signal Process. Mag. (2007)
  • U. Braga-Neto et al., Is cross-validation valid for small-sample microarray classification?, Bioinformatics (2004)
  • M. Brown et al., Knowledge-based analysis of microarray gene expression data by using support vector machines, Proc. Nat. Acad. Sci. (2000)
  • J. Canul-Reich et al., Iterative feature perturbation as a gene selector for microarray data, Int. J. Pattern Recognit. Artif. Intell. (2012)
  • N. Chawla et al., SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. (2002)
  • C. Ding et al., Minimum redundancy feature selection from microarray gene expression data, J. Bioinformatics Comput. Biol. (2005)
  • E. Dougherty, Small sample issues for microarray-based classification, Comp. Funct. Genom. (2001)
  • S. Dudoit et al., Comparison of discrimination methods for the classification of tumors using gene expression data, J. Am. Stat. Assoc. (2002)