Genetic programming in classifying large-scale data: an ensemble method

https://doi.org/10.1016/j.ins.2003.03.028

Abstract

This study demonstrated the potential of genetic programming (GP) as a base-classifier algorithm for building ensembles in the context of large-scale data classification. An ensemble built upon base classifiers trained with GP was found to significantly outperform counterparts built upon base classifiers trained with decision trees or logistic regression. The superiority of the GP ensemble was partly attributed to the greater diversity, both in the functional form and in the variables defining the models, among the base classifiers upon which it was built. Implications of GP as a useful tool for other data mining problems, such as feature selection, were also discussed.

Introduction

The classification problem of assigning observations to one of several groups plays a key role in decision-making. The binary classification problem, where each observation belongs to one of two groups, has wide applicability in problems ranging from credit scoring, default prediction and direct marketing to applications in biology and medicine. It has been studied extensively by statisticians, and in more recent years a number of machine learning and data mining approaches have been proposed. Amongst the latter are techniques that can broadly be categorized as "soft computing" methods, which trade off optimality and precision for representational power and applicability under wider conditions. The form of the solution varies with the technique employed, and is expressed as some form of classification rule or function based on the multivariate data defining each observation.

Techniques from the statistics realm, beginning with Fisher's seminal work, include linear, quadratic and logistic discriminant models, and are amongst the most commonly used [1]. Logistic regression remains, even today, one of the most widely used data mining methods in industry. The major drawback of these methods arises from the fact that real-world data often do not satisfy their underlying assumptions. Non-parametric methods are less restrictive; popular amongst them are the k-nearest neighbor and various mathematical programming methods [2], [3]. Bradley et al. [4] provide a comprehensive review of mathematical programming approaches for data mining. Various inductive learning approaches have been developed for classification, including decision trees, neural networks and support vector machines. More recently, evolutionary computation techniques like genetic algorithms have been shown to be attractive for classification and data mining.

Genetic algorithms (GAs), modeled loosely on principles of natural evolution, provide a powerful general-purpose search approach and have been noted to offer advantages for data mining. These advantages stem from the representational flexibility allowed in the form that a classification function may take, and from the largely open formulation of the search objective (fitness function). Classification functions studied range from sets of data-attribute weights as in traditional regression models [5], [6], through condition-action rules with conjunction/disjunction of terms [7], [8], to the parse-tree expressions of genetic programming [9]. The flexibility in fitness function formulation allows the development of classification models tailored to specific domain constraints and objectives. Bhattacharyya [6], [10], for example, shows how GA-based models developed with a fitness function that explicitly seeks the direct marketing objective of maximizing lift at specific mailing depths can yield significant improvements over the traditional classification objective of minimizing errors. Various recent studies report on the application of evolutionary search in data mining [11], [12], [13].
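To make the fitness-function flexibility concrete, the following minimal sketch (our illustration, not the cited authors' implementation; numpy is assumed, and all names are hypothetical) scores a candidate linear scoring model by its lift at a given mailing depth rather than by error rate, in the spirit of [6], [10]:

    import numpy as np

    def lift_at_depth(weights, X, y, depth=0.1):
        """Hypothetical GA fitness: lift achieved among the top
        `depth` fraction of observations ranked by model score."""
        scores = X @ weights                  # score each observation
        cutoff = int(len(y) * depth)          # number of cases at this mailing depth
        top = np.argsort(-scores)[:cutoff]    # indices of the highest-scored cases
        return y[top].mean() / y.mean()       # response rate at depth vs. base rate

A GA would then evolve the weight vector to maximize this value directly, instead of minimizing classification error.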

Given the massive amounts of data accumulated by organizations today, a key challenge facing the data mining field is the development of methods for large-scale data mining. A focus on problems arising from large data is, in fact, a defining characteristic of data mining, defined as the process of gleaning useful and actionable knowledge from large data [14]. Most traditional classification methods, however, require the simultaneous use of all the training data, posing a severe challenge to their direct application to large data problems: it is, after all, desirable that all available data be used in developing the classifier, even when the data cannot be accommodated in memory. Further, many learning algorithms are iterative in nature, necessitating several passes through the data, so attempting to develop models on the entire dataset may take too long. It may also not be possible to have all the data in the same place at the same time, for example for security or privacy reasons, or because the data reside in naturally distributed settings. In short, traditional classification methods are not readily amenable to handling large-scale, distributed data.

In recent years, a number of approaches have been suggested for overcoming such large-data limitations. While some seek to incrementally develop a model from subsets of the data [15], others obtain classifications from a team or ensemble of classifiers. Various studies have shown ensembling to be a useful method for addressing the problem of large-scale and potentially distributed data [16], [17], [18], [19]. Ideally, different base classifiers in an ensemble capture different patterns, or different aspects of a pattern, embedded in the whole data; through ensembling, these are then incorporated into a final prediction. A number of studies have examined techniques like bagging and boosting for combining predictions from multiple classifiers, and these have been shown to improve the performance of ensembles over individual classifiers; comparative evaluations of different variants are given in [20], [21]. Note that data may be 'large' not only in the number of observations, but also in carrying a large number of attributes; various dimensionality reduction and feature selection methods are applicable there. The focus of this paper is on the classification of data having a large number of observations, and the term large-scale is used only in this context.
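As a minimal sketch of the combination step (our illustration, not a procedure from the paper; binary 0/1 labels and base classifiers exposing a scikit-learn-style predict method are assumed), an unweighted majority vote over the base classifiers can serve as the final prediction:

    import numpy as np

    def majority_vote(classifiers, X_test):
        """Combine base-classifier predictions by unweighted majority vote."""
        votes = np.stack([clf.predict(X_test) for clf in classifiers])
        # For 0/1 labels, the majority label is the rounded mean vote.
        return (votes.mean(axis=0) >= 0.5).astype(int)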

This paper proposes genetic programming (GP) as a technique for developing base classifiers for an ensemble. A key advantage of GP arises from the ability of its parse-tree representation to model almost arbitrary non-linear forms; a GP model can thus seek to capture various non-linear relationships in the data. The genetic search procedure is also well suited to exploring the large and, given the usually scant prior knowledge of data relationships, often poorly understood search spaces that typify real-life data mining problems. Different models developed on data subsets can thus incorporate a diversity of data patterns, which can then be combined as an ensemble for final classification. The utility of GP as an ensembling approach is examined using a real-life large dataset of about five million observations from the KDD-Cup 1999 competition. The experimental study compares the performance of GP ensembling to ensembles based on decision trees as well as logistic regression. Results demonstrate the advantages of the proposed GP-based ensembling scheme. As expected, the GP models are found to capture diverse patterns in the data, while the developed decision tree models look similar to one another. The ensemble of GP models yields significantly superior performance over the decision tree ensemble. Our findings also show that GP is able to discern data patterns using a small subset of the variables in the data, indicating its utility for feature selection and exploratory data analysis.
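To illustrate the parse-tree representation (a toy, hand-rolled sketch; the paper's actual GP system is not specified here), a tree can encode a non-linear expression over the data attributes, with classification obtained by thresholding the expression's value:

    import numpy as np

    # A parse tree as nested tuples: ("add"|"sub"|"mul", left, right),
    # ("var", column_index), or ("const", value).
    OPS = {"add": np.add, "sub": np.subtract, "mul": np.multiply}

    def evaluate(node, X):
        """Recursively evaluate a parse tree over data matrix X (rows = observations)."""
        kind = node[0]
        if kind == "var":
            return X[:, node[1]]
        if kind == "const":
            return np.full(X.shape[0], float(node[1]))
        return OPS[kind](evaluate(node[1], X), evaluate(node[2], X))

    def classify(tree, X, threshold=0.0):
        """Label an observation positive when the evolved expression exceeds threshold."""
        return (evaluate(tree, X) > threshold).astype(int)

    # Example: the rule x0 * x2 - 3.5 > 0
    tree = ("sub", ("mul", ("var", 0), ("var", 2)), ("const", 3.5))

Crossover and mutation operate on subtrees of such expressions, letting the search explore widely different functional forms.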

The paper is organized as follows: the next section provides a brief background on genetic search and the non-linear representations of genetic programming, and presents some relevant literature on ensembling approaches for data mining. Section 3 details the data used for the experimental study, and presents the experiments and results. Concluding remarks and future research directions are given in the last section.


Ensembles for large-scale classification

The notion of aggregating information from multiple models has been addressed in the statistics and decision-making literature (see, for example, [22], [23], [24]), mostly in the context of combining forecasts from multiple models based on different methods but using the same data. For data mining on large-scale data, the focus is instead on using smaller data subsets so as to ease the computational burden. The essence of ensembling is to take small subsets from a large data set, build a number of base classifiers on them, and combine their individual predictions into a final classification, as sketched below.
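A minimal sketch of this scheme (our illustration; scikit-learn decision trees stand in for any base learner with fit/predict):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_ensemble(X, y, n_subsets=10, seed=0):
        """Train one base classifier per disjoint subset of a large dataset."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        return [DecisionTreeClassifier().fit(X[part], y[part])
                for part in np.array_split(idx, n_subsets)]

Each base model sees only a manageable slice of the data; the models' predictions are then combined, for example by the majority vote sketched earlier.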

Data and experiments

The dataset used in this study is from the KDD Cup'99 competition and consists of connection records from a simulated US Air Force LAN. The dataset is large: the training set contains about five million examples, and the test set about 300,000. Each example is labeled "normal" or as one of 25 attack types. Since multi-class classification is not the focus of this study, different attack types are not differentiated; each example is classified only as "normal" or "attack".
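For illustration, a sketch of this binary relabeling (assumptions on our part: pandas, the standard comma-separated KDD Cup'99 file layout with the label in the last column, and labels terminated by a period, as in "normal."):

    import pandas as pd

    df = pd.read_csv("kddcup.data", header=None)   # KDD Cup'99 training file
    labels = df.iloc[:, -1].str.rstrip(".")        # strip the trailing period
    y = (labels != "normal").astype(int)           # 1 = attack, 0 = normal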

Conclusion

This study demonstrates the utility of GP as a method for creating and using ensembles in data mining. Given its representational power in modeling complex non-linearities in the data, GP is seen to be effective at learning diverse patterns in the data. With different models capturing varied data relationships, GP models are ideally suited for combination in ensembles. Experimental results show that different GP models are dissimilar both in terms of their functional form as well as with respect to the variables defining the models.

References (43)

  • R. Clemen, Combining forecasts: a review and annotated bibliography, Journal of Forecasting (1989)
  • R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics (1936)
  • N. Freed et al., A linear programming approach to the discriminant problem, Decision Sciences (1981)
  • D.-J. Hand, Discrimination and Classification (1981)
  • P.S. Bradley et al., Mathematical programming for data mining: formulations and challenges, INFORMS Journal on Computing (1999)
  • G.J. Koehler, Linear discriminant functions determined through genetic search, ORSA Journal on Computing (1991)
  • S. Bhattacharyya, Direct marketing performance modeling using genetic algorithms, INFORMS Journal on Computing (1999)
  • D.-P. Greene et al., A genetic system for learning models of consumer choice
  • K. DeJong et al., Using genetic algorithms for concept learning, Machine Learning (1993)
  • J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (1992)
  • S. Bhattacharyya, Direct marketing response models using genetic search, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York (1998)
  • Y. Kim, W. Street, F. Menczer, Feature selection in unsupervised learning via evolutionary search, in: Proceedings of...
  • Y. Kim, W. Street, F. Menczer, An evolutionary multi-objective local selection algorithm for customer targeting, in:...
  • S. Bhattacharyya, Evolutionary algorithms in data mining: multi-objective performance modeling for direct marketing
  • U.M. Fayyad et al., Advances in Knowledge Discovery and Data Mining (1996)
  • P. Domingos et al., Mining high-speed data streams
  • L. Breiman, Pasting bites together for prediction in large data sets and on-line, Machine Learning (1999)
  • P. Chan et al., On the accuracy of meta-learning for scalable data mining, Journal of Intelligent Information Systems (1997)
  • W. Street, Y. Kim, A streaming ensemble algorithm (SEA) for large-scale classification, in: Proceedings of the Seventh...
  • L. Hall et al., Distributed learning on very large data sets
  • E. Bauer et al., An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Machine Learning (1999)