Elsevier

Neurocomputing

Volume 101, 4 February 2013, Pages 309-318

ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data

https://doi.org/10.1016/j.neucom.2012.08.018

Abstract

In DNA microarray data, the class imbalance problem occurs frequently, causing poor prediction performance for minority classes. Moreover, other characteristics of such data, including high dimensionality, small sample size and high noise, intensify this damage. In this study, we propose ACOSampling, a novel undersampling method based on the idea of ant colony optimization (ACO), to address this problem. The algorithm first applies feature selection to eliminate noisy genes. Then the original training set is randomly and repeatedly divided into two parts: a training set and a validation set. For each division, a modified ACO algorithm, a variant of our previous work, is run to filter out less informative majority samples and search for the corresponding optimal subset of training samples. Finally, the statistics over all local optimal training subsets are summarized in a frequency list, where each frequency indicates the importance of the corresponding majority sample. We extract only the high-frequency samples and combine them with all minority samples to construct the final balanced training set. We evaluated the method on four benchmark skewed DNA microarray datasets with a support vector machine (SVM) classifier; the results show that the proposed method outperforms many other sampling approaches, indicating its superiority.

Highlights

► The ACO algorithm is modified for undersampling skewed DNA microarray data.
► The significance of each majority sample is estimated via a ranked frequency list.
► ACOSampling improves classification performance but costs more time.
► Selecting a few feature genes helps to improve classification performance.
► Class imbalance harms some classification tasks but not others.

Introduction

In the past decade, DNA microarray has been one of the most important molecular biology technologies of the post-genomic era. It allows biologists and medical experts to monitor the activity of thousands of genes in a cell simultaneously. At present, DNA microarray has been widely applied to predict gene functions [1], investigate gene regulatory mechanisms [2], [3], provide invaluable information for drug discovery [4], classify cancers [5], [6] and mine new subtypes of a specific tumor [7], [8], [9]. Among these applications, cancer classification has attracted the most attention. However, it is well known that microarray data generally exhibit some particular characteristics, such as high dimensionality, small sample size, high noise and, most importantly, imbalanced class distributions. A skewed class distribution greatly degrades the prediction performance for minority classes and yields inaccurate estimates of overall classification performance, while the other characteristics of microarray data further intensify this damage [10]. Therefore, it is necessary to remedy this bias with effective strategies.

In fact, class imbalance learning has drawn a significant amount of interest from the artificial intelligence, data mining and machine learning communities since 2000, as reflected by the launch of several major workshops and special issues [11], including AAAI'00 [12], ICML'03 [13] and ACM SIGKDD Explorations'04 [14]. There are two major approaches to the class imbalance problem: sampling-based strategies and cost-sensitive learning. Sampling, which includes oversampling and undersampling, deals with class imbalance by inserting samples into the minority class or discarding samples from the majority class [15], [16], [17], [18], [19], [20], whereas cost-sensitive learning treats class imbalance by assigning different misclassification costs to different classes [21], [22], [23], [24], [25], [26], [27], [28], [29]. Recently, some research has also focused on ensemble learning built on multiple differently sampled or weighted data sets, showing excellent performance and generalization ability [30], [31], [32], [33], [34], [35]. More details about class imbalance learning methods are presented in Section 2.
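As a minimal, generic illustration of the two sampling directions (a Python sketch, not the paper's algorithm; samples are assumed to be stored in plain lists):

```python
import random

def random_undersample(majority, minority, seed=0):
    """Balance classes by discarding randomly chosen majority samples."""
    rng = random.Random(seed)
    kept = rng.sample(majority, k=len(minority))
    return kept, minority

def random_oversample(majority, minority, seed=0):
    """Balance classes by duplicating randomly chosen minority samples."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra
```

Random undersampling risks discarding informative majority samples; ACOSampling targets exactly this weakness by searching for the most informative majority subset instead of sampling blindly.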

In this study, we introduce a novel undersampling method based on the idea of ant colony optimization (ACO), named ACOSampling, to classify skewed DNA microarray data. This method is a modified version of our previous work [36]; the difference is that it moves the selection from feature space to sample space. First, the original training set is randomly and repeatedly divided into two parts: a training set and a validation set. Then, for each partition, ACOSampling is run to find the corresponding optimal subset of majority class samples. Unlike the traditional ACO algorithm, ACOSampling makes each ant leave the nest and pass all majority class samples one by one, taking either pathway 0 or pathway 1, before reaching the food source; pathway 0 indicates that the corresponding sample is useless and should be filtered out, while pathway 1 indicates that it is important and should be selected. Given the particularity of the classification tasks in this study, overall accuracy is a poor fitness function, so we construct the fitness as a weighted combination of three indicative metrics: F-measure, G-mean and AUC. Iterating over partitions yields many local optimal majority class sample subsets, so the significance of each majority sample can be estimated by its selection frequency: the higher the selection frequency, the more information the corresponding sample provides. Next, a globally optimal balanced sample set is created by combining the highly ranked majority class samples with all minority class examples. Finally, we train an SVM classifier on this balanced training set to recognize future unlabeled samples.
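The procedure above can be sketched in Python. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function names (`aco_select`, `selection_frequencies`, `top_k`) are ours, and the `fitness` callback stands in for the paper's weighted combination of F-measure, G-mean and AUC, which would be recomputed on a fresh validation split for every partition:

```python
import random

def aco_select(n_majority, fitness, n_ants=10, n_iter=20, rho=0.1, seed=0):
    """One ACO run over one train/validation partition.

    Each ant walks past every majority sample and takes pathway 1 (keep)
    or pathway 0 (discard) with probability proportional to the pheromone
    on that pathway; the globally best ant reinforces its own pathways.
    """
    rng = random.Random(seed)
    tau = [[1.0, 1.0] for _ in range(n_majority)]  # pheromone on pathways 0 and 1
    best_mask, best_fit = None, float("-inf")
    for _ in range(n_iter):
        for _ in range(n_ants):
            mask = [1 if rng.random() < t1 / (t0 + t1) else 0 for t0, t1 in tau]
            f = fitness(mask)
            if f > best_fit:
                best_fit, best_mask = f, mask
        for i in range(n_majority):
            tau[i][0] *= 1.0 - rho  # evaporation on both pathways
            tau[i][1] *= 1.0 - rho
            tau[i][best_mask[i]] += rho * max(best_fit, 0.0)  # elitist deposit
    return best_mask

def selection_frequencies(n_majority, fitness, n_partitions=30):
    """Frequency list: how often each majority sample survives across runs."""
    counts = [0] * n_majority
    for p in range(n_partitions):
        for i, kept in enumerate(aco_select(n_majority, fitness, seed=p)):
            counts[i] += kept
    return [c / n_partitions for c in counts]

def top_k(freqs, k):
    """Indices of the k most frequently selected majority samples."""
    return sorted(range(len(freqs)), key=lambda i: -freqs[i])[:k]
```

Combining the `top_k(freqs, n_minority)` majority samples with all minority samples then yields the balanced training set on which the SVM is trained.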

The remainder of this paper is organized as follows. Section 2 reviews previous work related to the class imbalance problem. In Section 3, the idea and procedure of the ACOSampling method are described in detail. Experimental results and discussion are presented in Section 4. Finally, we conclude the paper in Section 5.

Section snippets

Previous work

As mentioned in Section 1, existing class imbalance learning methods can be roughly categorized into two major groups: sampling strategies and cost-sensitive learning. Here, we pay special attention to sampling strategies because they are more closely related to our study.

Sampling is actually a re-balancing process for a given imbalanced data set. It can be divided into oversampling and undersampling. Oversampling, as its name indicates, adds samples belonging to the minority

Undersampling based on ant colony optimization

The ant colony optimization (ACO) algorithm, developed by Colorni et al. [38], is an important member of the swarm intelligence family. ACO simulates the foraging behavior of real ant colonies and has in recent years been successfully applied to various practical optimization problems, including the TSP [39], parameter optimization [40], path planning [41] and protein folding [42].
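For reference, the transition rule shared by most ACO variants picks the next solution component j with probability proportional to τ_j^α · η_j^β (pheromone times heuristic desirability). A minimal roulette-wheel sketch of that generic rule, with our own naming, not code from the paper:

```python
import random

def choose_component(pheromone, heuristic, alpha=1.0, beta=2.0, rng=random):
    """Roulette-wheel choice: component j is picked with probability
    proportional to pheromone[j]**alpha * heuristic[j]**beta."""
    weights = [t ** alpha * h ** beta for t, h in zip(pheromone, heuristic)]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for j, w in enumerate(weights):
        acc += w
        if r < acc:
            return j
    return len(weights) - 1  # guard against floating-point rounding
```

In the undersampling variant used in this paper, each ant's walk reduces to one such binary choice (pathway 0 or pathway 1) per majority sample.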

In previous work, we designed an ACO algorithm to select tumor-related marker genes in DNA

Datasets

Four benchmark imbalanced microarray datasets are used in our experiments: the Colon dataset [5], the CNS (Central Nervous System) dataset [54], the Lung cancer dataset [8] and the Glioma dataset [9]. The first three datasets consist of binary-class samples, while the Glioma dataset consists of four subtypes: cancer glioblastomas (CG), non-cancer glioblastomas (NG), cancer oligodendrogliomas (CO) and non-cancer oligodendrogliomas (NO). For the Glioma dataset, CG is used as the positive class with 14

Conclusions

In this paper, we present a novel heuristic undersampling method named ACOSampling to address the imbalanced DNA microarray data classification problem. Extensive experiments demonstrate that the proposed method is effective and can automatically extract the so-called "informative samples" from the majority class. However, since its sampling procedure is more time-consuming than typical sampling approaches, it is best suited to small-sample classification tasks.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China under Grant No. 60873036, the Ph.D. Programs Foundation for Young Researchers of the Ministry of Education of China under Grant No. 20070217051 and the Ph.D. Foundation of Jiangsu University of Science and Technology under Grant No. 35301002.


References (56)

  • W.E. Evans et al., Gene expression as a drug discovery tool, Nat. Genet. (2004)
  • U. Alon et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array, Proc. Nat. Acad. Sci. U.S.A. (1999)
  • T.P. Conrads et al., Cancer diagnosis using proteomic patterns, Expert Rev. Mol. Diagn. (2003)
  • T.R. Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)
  • D.A. Wigle et al., Molecular profiling of non-small cell lung cancer and correlation with disease-free survival, Cancer Res. (2002)
  • C.L. Nutt et al., Gene expression-based classification of malignant gliomas correlates better with survival than histological classification, Cancer Res. (2003)
  • R. Blagus et al., Class prediction for high-dimensional class-imbalanced data, BMC Bioinf. (2010)
  • H. He et al., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. (2009)
  • N. Japkowicz, Workshop on learning from imbalanced data sets, in: Proceedings of the 17th American Association for...
  • N.V. Chawla, N. Japkowicz, A. Kolcz, Workshop on learning from imbalanced data sets II, in: Proceedings of the 20th...
  • N.V. Chawla et al., Editorial: special issue on learning from imbalanced data sets, ACM SIGKDD Explor. Newsl. (2004)
  • C. Ling, C. Li, Data mining for direct marketing problems and solutions, in: Proceedings of the 4th ACM SIGKDD...
  • N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. (2002)
  • H. Han, W.Y. Wang, B.H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in:...
  • M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection, in: Proceedings of the 14th...
  • H. He, Y. Bai, E.A. Garcia, et al., ADASYN: adaptive synthetic sampling approach for imbalanced learning, in:...
  • C. Elkan, The foundations of cost-sensitive learning, in: Proceedings of the 17th International Joint Conference of...
  • B. Zadrozny, J. Langford, N. Abe, Cost-sensitive learning by cost-proportionate example weighting, in: Proceedings of...

    Hualong Yu received the B.S. degree from Heilongjiang University, Harbin, China, in 2005, and the M.S. and Ph.D. degrees from the College of Computer Science and Technology, Harbin Engineering University, Harbin, China, in 2008 and 2010, respectively. Since 2010, he has been a lecturer and master supervisor at Jiangsu University of Science and Technology, Zhenjiang, China. He is the author or co-author of over 20 research papers and 3 books, and a program committee member for ICICSE2012. His research interests mainly include pattern recognition, machine learning and bioinformatics.

    Jun Ni received the B.S. degree from Harbin Engineering University, Harbin, China, the M.S. degree from Shanghai Jiaotong University, Shanghai, China, and the Ph.D. degree from the University of Iowa, IA, USA. He is currently an associate professor and director of the Medical Imaging HPC and Informatics Lab, Department of Radiology, Carver College of Medicine, the University of Iowa, Iowa City, IA, USA. He has also been a visiting professor at Harbin Engineering University and Shanghai University, China, since 2006 and 2009, respectively. He has edited or co-edited 34 books or proceedings and authored or co-authored 115 peer-reviewed journal and conference papers. In addition, he is editor-in-chief of the International Journal of Computational Medicine and Healthcare, associate editor of the IEEE Systems Journal and an editorial board member for 15 other professional journals. Since 2003, he has served as General/Program Chair for over 50 international conferences. His current research interests include distributed computation, parallel computing, medical imaging informatics, computational biology and bioinformatics.

    Jing Zhao received the Ph.D. degree from Harbin Institute of Technology, Harbin, China, in 2005. She is currently a professor and Ph.D. supervisor in the College of Computer Science and Technology, Harbin Engineering University, Harbin, China, and a senior visiting scholar at Duke University, USA. She is the author or co-author of over 20 research papers. Her research interests include software reliability, mobile computing and image processing.
