Gene selection from microarray data for cancer classification—a machine learning approach

https://doi.org/10.1016/j.compbiolchem.2004.11.001

Abstract

A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contain a small number of samples, each with a large number of gene expression levels as features. Selecting the relevant genes involved in different types of cancer remains a challenge. To extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that the combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper to discuss both computational and biological evidence for the involvement of zyxin in leukaemogenesis.

Introduction

Accurate cancer diagnosis is vital for the successful application of specific therapies. Although cancer classification has improved over the last decade, there is still a need for a fully automated and less subjective method for cancer diagnosis. Recent studies demonstrated that DNA microarrays could provide useful information for cancer classification at the gene expression level due to their ability to measure the abundance of messenger ribonucleic acid (mRNA) transcripts for thousands of genes simultaneously.

Several machine learning algorithms have already been applied to classifying tumors using microarray data. Voting machines and self-organising maps (SOM) were used to analyse acute leukemia (Golub et al., 1999). Support vector machines (SVMs) were applied to multi-class cancer diagnosis by Ramaswamy et al. (2001). Hierarchical clustering was used to analyse colon tumors (Alon et al., 1999). The best classification results are reported by Li et al. (2003) and Antonov et al. (2004): Li et al. employed a rule discovery method, while Antonov et al. used maximal margin linear programming (MAMA).

Given the nature of cancer microarray data, which usually consists of a few hundred samples with thousands of genes as features, the analysis has to be carried out carefully. Working in such a high-dimensional space is extremely difficult, if not impossible. One straightforward approach to selecting relevant genes is to apply standard parametric tests such as the t-test (Thomas et al., 2001; Tsai et al., 2003) or non-parametric tests such as the Wilcoxon score test (Thomas et al., 2001; Antoniadis et al., 2003). Wilks's Lambda score was proposed by Hwang et al. (2002) to assess the discriminatory power of individual genes. A new procedure (Antonov et al., 2004) was designed to detect groups of genes that are strongly associated with a particular cancer type.
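The t-test ranking mentioned above can be illustrated with a minimal sketch. It assumes Welch's unequal-variance form of the t-statistic and a two-class problem; the function names and the toy expression values are invented for illustration and are not taken from the cited studies.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values (each needs >= 2 values)."""
    diff = mean(a) - mean(b)
    # Standard error from the sample variances (n - 1 denominator).
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Both groups are constant: no difference scores 0, any difference scores +/- infinity.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def rank_genes(expr, labels, pos):
    """Return (gene indices sorted by decreasing |t|, list of |t| per gene).

    expr   : list of samples, each a list of m expression values
    labels : class label per sample
    pos    : label of the first class; all other samples form the second class
    """
    m = len(expr[0])
    scores = []
    for g in range(m):
        a = [s[g] for s, y in zip(expr, labels) if y == pos]
        b = [s[g] for s, y in zip(expr, labels) if y != pos]
        scores.append(abs(welch_t(a, b)))
    order = sorted(range(m), key=lambda g: scores[g], reverse=True)
    return order, scores


# Toy data (hypothetical values): 6 samples, 3 genes; only gene 0 separates the classes.
toy_expr = [
    [5.1, 2.0, 7.0],
    [4.9, 2.2, 6.5],
    [5.3, 1.9, 7.2],
    [1.0, 2.1, 6.9],
    [1.2, 2.0, 7.1],
    [0.9, 2.3, 6.6],
]
toy_labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

ranking, t_scores = rank_genes(toy_expr, toy_labels, pos="ALL")
print("top-ranked gene index:", ranking[0])
```

Such a ranking scores each gene on its own, so it cannot detect redundancy between genes or groups of genes that are only informative together.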

In this paper we consider two general approaches to feature subset selection for gene selection: wrappers and filters. The two differ in how they evaluate feature subsets. Filter approaches remove irrelevant features according to general characteristics of the data. Wrapper approaches, by contrast, apply machine learning algorithms to feature subsets and use cross-validation to score each subset. Most methods of gene selection for microarray data analysis focus on filter approaches, although there are a few publications on applying wrapper approaches (Inza et al., 2004; Xiong et al., 2001; Xing et al., 2001). Nevertheless, in theory, wrappers should provide more accurate classification results than filters (Langley, 1994). Because wrappers use the classifier itself to estimate the usefulness of feature subsets, the resulting “tailor-made” subsets should give better classification accuracy for that classifier: the features are selected according to their contribution to its accuracy. The disadvantage of the wrapper approach is its computational cost when combined with sophisticated algorithms such as support vector machines.
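The wrapper idea described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: the classifier is a simple nearest-centroid rule, the search is greedy forward selection, the score is leave-one-out cross-validated accuracy, and all names and toy values are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean distance)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(expr, labels, genes):
    """Leave-one-out cross-validated accuracy using only the given gene indices."""
    proj = [[s[g] for g in genes] for s in expr]
    hits = 0
    for i in range(len(proj)):
        train_x = proj[:i] + proj[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, proj[i]) == labels[i]:
            hits += 1
    return hits / len(proj)


def wrapper_forward_selection(expr, labels, max_genes=3):
    """Greedy forward search: repeatedly add the gene that most improves LOOCV accuracy."""
    selected, best_acc = [], 0.0
    candidates = set(range(len(expr[0])))
    while candidates and len(selected) < max_genes:
        scored = [(loocv_accuracy(expr, labels, selected + [g]), g)
                  for g in sorted(candidates)]
        acc, gene = max(scored, key=lambda t: t[0])  # ties go to the lowest gene index
        if acc <= best_acc:
            break  # no candidate improves the classifier, so stop
        selected.append(gene)
        candidates.remove(gene)
        best_acc = acc
    return selected, best_acc


# Toy data (hypothetical values): 6 samples, 3 genes; only gene 0 separates the classes.
toy_expr = [
    [5.1, 2.0, 7.0],
    [4.9, 2.2, 6.5],
    [5.3, 1.9, 7.2],
    [1.0, 2.1, 6.9],
    [1.2, 2.0, 7.1],
    [0.9, 2.3, 6.6],
]
toy_labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

chosen, accuracy = wrapper_forward_selection(toy_expr, toy_labels)
print(chosen, accuracy)
```

Every candidate subset triggers a full cross-validation run of the classifier, which is why the cost grows quickly with thousands of genes and with more expensive learners.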

As a filter approach, correlation-based feature selection (CFS) was proposed by Hall (1999). The rationale behind this algorithm is that “a good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other.” Hall (1999) showed that CFS gives results comparable to the wrapper while executing many times faster. It will be shown later in this paper that combining CFS with decision trees, the naïve Bayes algorithm and SVM provides classification accuracy on cancer microarray data that is similar to or better than published results. The rest of this paper is organised as follows. We begin with a brief introduction to feature subset selection, followed by a description of feature wrappers, filters and CFS, which is essentially a filter algorithm. We discuss the advantages and disadvantages of using wrappers and filters to select feature subsets. Thereafter, we present the experimental results on acute leukemia and lymphoma microarray data. The last section discusses the results and concludes this paper.
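The rationale quoted above is usually expressed through the CFS merit of a subset S of k features, Merit_S = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below evaluates this merit under one loud assumption: absolute Pearson correlation stands in for the correlation measure, chosen here only for brevity, and the toy values and function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(columns, target, subset):
    """CFS merit k * r_cf / sqrt(k + k * (k - 1) * r_ff) for the gene indices in subset.

    columns : one list of expression values per gene (gene-major layout)
    target  : numeric class label per sample (0/1 in this sketch)
    Assumption: |Pearson r| is used as the correlation measure.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[g], target)) for g in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Toy data (hypothetical values): gene 1 is an exact rescaling of gene 0, gene 2 is weakly informative.
target = [0, 0, 0, 1, 1, 1]
gene0 = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
gene1 = [2.0 * v for v in gene0]
gene2 = [2.0, 2.1, 1.9, 2.2, 2.0, 2.1]
cols = [gene0, gene1, gene2]

merit_single = cfs_merit(cols, target, [0])
merit_redundant = cfs_merit(cols, target, [0, 1])
merit_weak = cfs_merit(cols, target, [2])
print(merit_single, merit_redundant, merit_weak)
```

In this toy case, adding the perfectly redundant gene 1 to gene 0 leaves the merit unchanged up to rounding, so a search guided by the merit has no reason to keep both. Because the merit needs only pairwise correlations and never trains a classifier, evaluating a subset is far cheaper than a cross-validated wrapper score.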

Section snippets

Feature subset selection

We now define the basic notions used in the paper. Given a microarray cancer data set D, which contains n samples from different cancer types or subtypes, we have to build a mathematical model which can map the samples to their classes. Each sample has m genes as its features. The assumption here is that not all genes measured by a microarray are related to cancer classification. Some genes are irrelevant and some are redundant from the machine learning point of view. It is well-known that the

Analysis of acute leukemia data

The acute leukemia data of Golub et al. (1999) consists of samples from two different types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The training data set has 38 bone marrow samples (27 ALL and 11 AML). Each sample has expression patterns of 7129 genes measured by the Affymetrix oligonucleotide microarray. The test data set consists of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).

Feature-ranking filters provide a natural way

Discussion

We have shown in this paper that feature subset selection algorithms, namely wrappers, filters and CFS, can be very useful in extracting relevant information in microarray data analysis. Wrapper approaches can choose the best genes for building classifiers, while filters can provide a useful overview by ranking the genes for the particular problem at hand. CFS can choose genes which are highly correlated with the cancer classes yet uncorrelated with each other.

When the methods agree and select the same

Acknowledgement

We would like to thank Dr. Marco Zaffalon for proofreading the manuscript and validating our results with his algorithm, Dr. Francesco Bertoni for his advice on lymphoma data analysis, and Annina Neumann for proofreading the manuscript. We are also grateful for the reviewers' comments, which have significantly improved this paper.

References (38)

  • A.A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000)
  • U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. (1999)
  • A. Antoniadis et al. Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics (2003)
  • A.V. Antonov et al. Optimization models for cancer classification: extracting gene interaction information from microarray expression data. Bioinformatics (2004)
  • U. Fayyad et al. Multi-interval discretization of continuous-valued attributes for classification learning
  • E. Frank et al. Data mining in bioinformatics using Weka. Bioinformatics (2004)
  • T.S. Furey et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics (2000)
  • T. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)
  • Hall, M.A., 1999. Correlation-based feature selection for machine learning. Ph.D. Thesis. Department of Computer...