ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data
Highlights
► The ACO algorithm is modified for undersampling skewed DNA microarray data. ► The significance of each majority class sample is estimated from a ranked frequency list. ► ACOSampling improves classification performance but requires more running time. ► Selecting a few feature genes helps to improve classification performance. ► Some majority class samples are harmful to classification, while the others are harmless.
Introduction
In the past decade, the DNA microarray has been one of the most important molecular biology technologies of the post-genomic era. It allows biologists and medical experts to monitor the activity of thousands of genes in a cell simultaneously. At present, DNA microarrays are widely applied to predict gene functions [1], investigate gene regulatory mechanisms [2], [3], provide invaluable information for drug discovery [4], classify cancers [5], [6] and mine new subtypes of a specific tumor [7], [8], [9], etc. Among these applications, cancer classification has attracted the most attention. However, it is well known that microarray data generally exhibit some particular characteristics: high dimensionality, small sample size, high noise and, most importantly, imbalanced class distributions. A skewed class distribution greatly degrades prediction performance on the minority class and leads to misleading evaluation of overall classification performance, while the other characteristics of microarray data further intensify this damage [10]. Therefore, it is necessary to remedy this bias with effective strategies.
In fact, class imbalance learning has drawn a significant amount of interest since 2000 from the artificial intelligence, data mining and machine learning communities, as reflected by the launch of several major workshops and special issues [11], including AAAI’00 [12], ICML’03 [13] and ACM SIGKDD Explorations’04 [14], etc. There are two major approaches to the class imbalance problem: sampling-based strategies and cost-sensitive learning. Sampling, which includes oversampling and undersampling, deals with class imbalance by inserting samples into the minority class or discarding samples from the majority class [15], [16], [17], [18], [19], [20], while cost-sensitive learning handles class imbalance by assigning different misclassification costs to different classes [21], [22], [23], [24], [25], [26], [27], [28], [29]. Recently, some research has also focused on ensemble learning built on multiple differently sampled or weighted data sets, presenting excellent performance and generalization ability [30], [31], [32], [33], [34], [35]. More details about class imbalance learning methods are presented in Section 2.
In this study, we introduce a novel undersampling method based on the idea of ant colony optimization (ACO), named ACOSampling, for classifying skewed DNA microarray data. The method is a modified version of our previous work [36]; the difference is that it shifts the selection of information from feature space to sample space. First, the original training set is randomly and repeatedly divided into two parts: a training set and a validation set. For each partition, ACOSampling then searches for the corresponding locally optimal majority class sample subset. Unlike the traditional ACO algorithm, ACOSampling makes each ant leave the nest and pass all majority class samples one by one, choosing either pathway 0 or pathway 1 at each sample before reaching the food source, where pathway 0 indicates that the corresponding sample is useless and should be filtered out, while pathway 1 indicates that it is important and should be retained. Considering the particularity of the classification tasks in this study, overall accuracy is a poor fitness function, so we construct the fitness as a weighted combination of three indicative metrics: F-measure, G-mean and AUC. Iterating over partitions yields many locally optimal majority class sample subsets, so the significance of each majority sample can be estimated from its selection frequency: the higher the selection frequency, the more information the corresponding sample provides. Next, a globally optimal balanced sample set is created by combining the highest-ranked majority class samples with all minority class samples. Finally, we train an SVM classifier on the balanced training set to recognize future unlabeled samples.
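The workflow described above can be sketched in code. The following is a simplified illustration, not the authors' exact implementation: the number of splits, ants, iterations, the evaporation rate and the equal fitness weights are all assumed values chosen for readability, and a linear SVM stands in for whatever kernel the paper actually uses.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)

def fitness(y_true, y_pred, y_score, w=(1/3, 1/3, 1/3)):
    # Weighted sum of F-measure, G-mean and AUC; equal weights are illustrative.
    f = f1_score(y_true, y_pred, zero_division=0)
    sens = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
    spec = recall_score(y_true, y_pred, pos_label=0, zero_division=0)
    return w[0] * f + w[1] * np.sqrt(sens * spec) + w[2] * roc_auc_score(y_true, y_score)

def aco_sampling(X_maj, X_min, n_splits=3, n_ants=8, n_iter=5, rho=0.1):
    """Estimate, for each majority class sample, how often it appears in the
    locally optimal subsets found across random training/validation partitions."""
    n_maj = len(X_maj)
    X = np.vstack([X_maj, X_min])
    y = np.array([0] * n_maj + [1] * len(X_min))   # minority = positive class 1
    freq = np.zeros(n_maj)
    count = np.zeros(n_maj)
    for split in range(n_splits):
        idx_tr, idx_va = train_test_split(np.arange(len(y)), test_size=0.3,
                                          stratify=y, random_state=split)
        maj_tr = idx_tr[y[idx_tr] == 0]            # majority samples the ants pass
        min_tr = idx_tr[y[idx_tr] == 1]
        tau = np.full((len(maj_tr), 2), 0.5)       # pheromone: pathway 0 (drop) / 1 (keep)
        best_path, best_fit = np.ones(len(maj_tr), dtype=int), -np.inf
        for _ in range(n_iter):
            for _ in range(n_ants):
                p1 = tau[:, 1] / tau.sum(axis=1)   # probability of taking pathway 1
                path = (rng.random(len(maj_tr)) < p1).astype(int)
                if path.sum() == 0:                # keep at least one majority sample
                    path[rng.integers(len(maj_tr))] = 1
                keep = np.concatenate([maj_tr[path == 1], min_tr])
                clf = SVC(kernel="linear").fit(X[keep], y[keep])
                fit = fitness(y[idx_va], clf.predict(X[idx_va]),
                              clf.decision_function(X[idx_va]))
                if fit > best_fit:
                    best_fit, best_path = fit, path
            tau *= 1 - rho                         # pheromone evaporation
            tau[np.arange(len(maj_tr)), best_path] += best_fit  # reinforce best path
        freq[maj_tr] += best_path
        count[maj_tr] += 1
    return freq / np.maximum(count, 1)             # per-sample selection frequency
```

A balanced training set would then be formed by taking the majority samples with the highest returned frequencies (as many as there are minority samples) together with all minority samples, and fitting the final SVM on that set.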
The remainder of this paper is organized as follows. Section 2 reviews previous work related to the class imbalance problem. Section 3 describes the idea and procedure of the ACOSampling method in detail. Experimental results and discussions are presented in Section 4. Finally, we conclude the paper in Section 5.
Section snippets
Previous work
As mentioned in Section 1, existing class imbalance learning methods can be roughly categorized into two major groups: sampling strategies and cost-sensitive learning. Here, we pay special attention to sampling strategies because they are more closely related to our study.
Sampling is essentially a re-balancing process for a given imbalanced data set. It can be divided into oversampling and undersampling. Oversampling, as its name indicates, adds samples belonging to the minority
Undersampling based on ant colony optimization
The ant colony optimization (ACO) algorithm, developed by Colorni et al. [38], is an important member of the swarm intelligence family. ACO simulates the foraging behavior of real ant colonies and, in recent years, has been successfully applied to various practical optimization problems, including the TSP [39], parameter optimization [40], path planning [41], protein folding [42], etc.
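As background, the core ACO loop (construct candidate solutions from pheromone trails, evaporate, then reinforce the trail of the best solution) can be illustrated on a toy bit-string maximization problem. This is a generic sketch for intuition only, not the algorithm of this paper or of [38]; all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def aco_onemax(n_bits=20, n_ants=10, n_iter=60, rho=0.2):
    # Pheromone table: tau[i, b] is the desirability of setting bit i to value b.
    tau = np.full((n_bits, 2), 0.5)
    best_sol, best_fit = None, -1
    for _ in range(n_iter):
        for _ in range(n_ants):
            p1 = tau[:, 1] / tau.sum(axis=1)       # probability of choosing bit 1
            sol = (rng.random(n_bits) < p1).astype(int)
            fit = int(sol.sum())                   # fitness: number of 1-bits
            if fit > best_fit:
                best_fit, best_sol = fit, sol
        tau *= 1 - rho                             # evaporation forgets old trails
        tau[np.arange(n_bits), best_sol] += best_fit / n_bits  # reinforce best solution
    return best_sol, best_fit

sol, fit = aco_onemax()
```

The same construct/evaporate/reinforce cycle underlies the undersampling method of this paper, with a bit per majority class sample and a classifier-based fitness instead of a bit count.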
In previous work, we designed an ACO algorithm to select tumor-related marker genes in DNA
Datasets
Four benchmark imbalanced microarray datasets are used in our experiments: the Colon dataset [5], the CNS (Central Nervous System) dataset [54], the Lung cancer dataset [8] and the Glioma dataset [9]. The first three consist of binary-class samples, while the Glioma dataset comprises four subtypes: cancer glioblastomas (CG), non-cancer glioblastomas (NG), cancer oligodendrogliomas (CO) and non-cancer oligodendrogliomas (NO). For the Glioma dataset, CG is used as the positive class with 14
Conclusions
In this paper, we present a novel heuristic undersampling method, named ACOSampling, to address the problem of classifying imbalanced DNA microarray data. Extensive experiments demonstrate that the proposed method is effective and can automatically extract the so-called “information samples” from the majority class. However, since its sampling procedure is more time-consuming than typical sampling approaches, it is best suited to small-sample classification tasks.
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China under Grant No. 60873036, the Ph.D. Programs Foundation for Young Researchers of the Ministry of Education of China under Grant No. 20070217051, and the Ph.D. Foundation of Jiangsu University of Science and Technology under Grant No. 35301002.
References (56)
- et al., Cluster-based under-sampling approaches for imbalanced data distributions, Expert Syst. Appl. (2009)
- et al., Cost-sensitive boosting for classification of imbalanced data, Pattern Recognit. (2007)
- et al., A modified ant colony optimization algorithm for tumor marker gene selection, Genomics Proteomics Bioinf. (2009)
- et al., An interactive simulation and analysis software for solving TSP using ant colony optimization algorithms, Adv. Eng. Software (2009)
- et al., An ACO-based algorithm for parameter optimization of support vector machines, Expert Syst. Appl. (2010)
- et al., Three-dimension path planning for UCAV using hybrid meta-heuristic ACO-DE algorithm, Simul. Modell. Pract. Theory (2010)
- et al., Simultaneous genes and training samples selection by modified particle swarm optimization for gene expression data classification, Comput. Biol. Med. (2009)
- et al., From the cover: transitive functional annotation by shortest-path analysis of gene expression data, Proc. Nat. Acad. Sci. U.S.A. (2002)
- Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks, Bioinformatics (2003)
- et al., Module networks: discovering regulatory modules and their condition-specific regulators from gene expression data, Nat. Genet. (2003)
- Gene expression as a drug discovery tool, Nat. Genet.
- Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array, Proc. Nat. Acad. Sci. U.S.A.
- Cancer diagnosis using proteomic patterns, Expert Rev. Mol. Diagn.
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science
- Molecular profiling of non-small cell lung cancer and correlation with disease-free survival, Cancer Res.
- Gene expression-based classification of malignant gliomas correlates better with survival than histological classification, Cancer Res.
- Class prediction for high-dimensional class-imbalanced data, BMC Bioinf.
- Learning from imbalanced data, IEEE Trans. Knowl. Data Eng.
- Editorial: special issue on learning from imbalanced data sets, ACM SIGKDD Explor. Newsl.
- SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res.
Hualong Yu received the B.S. degree from Heilongjiang University, Harbin, China, in 2005, and the M.S. and Ph.D. degrees from the College of Computer Science and Technology, Harbin Engineering University, Harbin, China, in 2008 and 2010, respectively. Since 2010, he has been a lecturer and master supervisor at Jiangsu University of Science and Technology, Zhenjiang, China. He is the author or co-author of over 20 research papers and 3 books, and a program committee member for ICICSE2012. His research interests mainly include pattern recognition, machine learning and bioinformatics.
Jun Ni received the B.S. degree from Harbin Engineering University, Harbin, China, the M.S. degree from Shanghai Jiaotong University, Shanghai, China, and the Ph.D. degree from the University of Iowa, IA, USA. He is currently an associate professor and director of the Medical Imaging HPC and Informatics Lab, Department of Radiology, Carver College of Medicine, the University of Iowa, Iowa City, IA, USA. He has also been a visiting professor at Harbin Engineering University and Shanghai University, China, since 2006 and 2009, respectively. He has edited or co-edited 34 books or proceedings and authored or co-authored 115 peer-reviewed journal and conference papers. In addition, he is editor-in-chief of the International Journal of Computational Medicine and Healthcare, an associate editor of the IEEE Systems Journal and an editorial board member for 15 other professional journals. Since 2003, he has served as General/Program Chair for over 50 international conferences. His current research interests include distributed computation, parallel computing, medical imaging informatics, computational biology and bioinformatics.
Jing Zhao received the Ph.D. degree from Harbin Institute of Technology, Harbin, China, in 2005. She is currently a professor and Ph.D. supervisor in the College of Computer Science and Technology, Harbin Engineering University, Harbin, China, and a senior visiting scholar at Duke University, USA. She is the author or co-author of over 20 research papers. Her research interests include software reliability, mobile computing and image processing.