Genetic programming in classifying large-scale data: an ensemble method

https://doi.org/10.1016/j.ins.2003.03.028

Abstract

This study demonstrated the potential of genetic programming (GP) as a base-classifier algorithm for building ensembles in the context of large-scale data classification. An ensemble built upon base classifiers trained with GP was found to significantly outperform counterparts built upon base classifiers trained with decision trees or logistic regression. The superiority of the GP ensemble was partly attributed to the greater diversity, both in the functional form and in the variables defining the models, among the base classifiers upon which it was built. Implications of GP as a useful tool for other data mining problems, such as feature selection, were also discussed.

Introduction

The classification problem of assigning observations to one of several groups plays a key role in decision-making. The binary classification problem, where each observation belongs to one of two groups, has wide applicability in problems ranging from credit scoring, default prediction and direct marketing to applications in biology and medicine. It has been studied extensively by statisticians, and in more recent years a number of machine learning and data mining approaches have been proposed. Amongst the latter are techniques that can broadly be categorized as "soft computing" methods, which trade off optimality and precision for representational power and applicability under wider conditions. The form of the solution varies with the technique employed, and is expressed as some form of classification rule or function based on the multivariate data defining each observation.

Techniques from the statistics realm, beginning with Fisher's seminal work, include linear, quadratic and logistic discriminant models, and are amongst the most commonly used [1]. Logistic regression remains, even today, one of the most widely used data mining methods in industry. The major drawback of these methods arises from the fact that real-world data often do not satisfy their underlying assumptions. Non-parametric methods are less restrictive; popular amongst them are the k-nearest neighbor and various mathematical programming methods [2], [3]. Bradley et al. [4] provide a comprehensive review of mathematical programming approaches for data mining. Various inductive learning approaches have been developed for classification, including decision trees, neural networks and support vector machines. More recently, evolutionary computation techniques like genetic algorithms have been shown to be attractive for classification and data mining.

Genetic algorithms (GAs), modeled loosely on principles of natural evolution, provide a powerful general-purpose search approach and have been noted to offer advantages for data mining. These advantages stem from the representational flexibility allowed in the form that a classification function may take, and from the largely open formulation of the search objective (fitness function). Classification functions studied range from sets of data-attribute weights as in traditional regression models [5], [6], through condition-action rules with conjunction/disjunction of terms [7], [8], to the parse-tree expressions of genetic programming [9]. The flexibility in fitness function formulation allows the development of classification models tailored to specific domain constraints and objectives. Bhattacharyya [6], [10], for example, shows how GA-based models developed with a fitness function that explicitly seeks the direct marketing objective of maximizing lift at specific mailing depths can yield significant improvements over the traditional classification objective of minimizing errors. Various recent studies report on the application of evolutionary search in data mining [11], [12], [13].
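To make the fitness-function flexibility concrete, the following minimal sketch (our illustration, not the cited authors' implementation; numpy is assumed, and all names are hypothetical) scores a candidate linear scoring model by its lift at a given mailing depth rather than by error rate, in the spirit of [6], [10]:

    import numpy as np

    def lift_at_depth(weights, X, y, depth=0.1):
        """Hypothetical GA fitness: lift achieved among the top
        `depth` fraction of observations ranked by model score."""
        scores = X @ weights                  # score each observation
        cutoff = int(len(y) * depth)          # number of cases at this mailing depth
        top = np.argsort(-scores)[:cutoff]    # indices of the highest-scored cases
        return y[top].mean() / y.mean()       # response rate at depth vs. base rate

A GA would then evolve the weight vector to maximize this value directly, instead of minimizing classification error.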

Given the massive amounts of data accumulated by organizations today, a key challenge facing the data mining field is the development of methods for large-scale data mining. A focus on problems arising from large data is, in fact, a defining characteristic of data mining, defined as the process of gleaning useful and actionable knowledge from large data [14]. Most traditional classification methods, however, require the simultaneous use of all the training data, posing a severe challenge to their direct application to large data problems: it is, after all, desirable that all available data be used in developing the classifier, even when the data cannot be accommodated in memory. Further, many learning algorithms are iterative in nature, necessitating several passes through the data, so attempting to develop models on the entire dataset may take too long. It may also not be possible to have all the data in the same place at the same time, for example for security or privacy reasons, or because the data reside in naturally distributed settings. In short, traditional classification methods are not readily amenable to handling large-scale, distributed data.

In recent years, a number of approaches have been suggested for overcoming such large-data limitations. While some seek to incrementally develop a model from subsets of the data [15], others obtain classifications from a team or ensemble of classifiers. Various studies have shown ensembling to be a useful method for addressing the problem of large-scale and potentially distributed data [16], [17], [18], [19]. Ideally, different base classifiers in an ensemble capture different patterns, or different aspects of a pattern, embedded in the whole data; through ensembling, these are then incorporated into a final prediction. A number of studies have examined techniques like bagging and boosting for combining predictions from multiple classifiers, and these have been shown to improve the performance of ensembles over individual classifiers; comparative evaluations of different variants are given in [20], [21]. Note that data may be 'large' not only in the number of observations, but also in carrying a large number of attributes; various dimensionality reduction and feature selection methods are applicable there. The focus of this paper is on the classification of data having a large number of observations, and the term large-scale is used only in this context.
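As a minimal sketch of the combination step (our illustration, not a procedure from the paper; binary 0/1 labels and base classifiers exposing a scikit-learn-style predict method are assumed), an unweighted majority vote over the base classifiers can serve as the final prediction:

    import numpy as np

    def majority_vote(classifiers, X_test):
        """Combine base-classifier predictions by unweighted majority vote."""
        votes = np.stack([clf.predict(X_test) for clf in classifiers])
        # For 0/1 labels, the majority label is the rounded mean vote.
        return (votes.mean(axis=0) >= 0.5).astype(int)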

This paper proposes genetic programming (GP) as a technique for developing base classifiers for an ensemble. A key advantage of GP arises from the ability of its parse-tree representation to model almost arbitrary non-linear forms; a GP model can thus seek to capture various non-linear relationships in the data. The genetic search procedure is also well suited to exploring the large and, given the usually scant prior knowledge of data relationships, often poorly understood search spaces that typify real-life data mining problems. Different models developed on data subsets can thus incorporate a diversity of data patterns, which can then be combined as an ensemble for final classification. The utility of GP as an ensembling approach is examined using a real-life large dataset of about five million observations from the KDD-Cup 1999 competition. The experimental study compares the performance of GP ensembling to ensembles based on decision trees as well as logistic regression. Results demonstrate the advantages of the proposed GP-based ensembling scheme. As expected, the GP models are found to capture diverse patterns in the data, while the developed decision tree models look similar to one another. The ensemble of GP models yields significantly superior performance over the decision tree ensemble. Our findings also show that GP is able to discern data patterns using a small subset of the variables in the data, indicating its utility for feature selection and exploratory data analysis.
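To illustrate the parse-tree representation (a toy, hand-rolled sketch; the paper's actual GP system is not specified here), a tree can encode a non-linear expression over the data attributes, with classification obtained by thresholding the expression's value:

    import numpy as np

    # A parse tree as nested tuples: ("add"|"sub"|"mul", left, right),
    # ("var", column_index), or ("const", value).
    OPS = {"add": np.add, "sub": np.subtract, "mul": np.multiply}

    def evaluate(node, X):
        """Recursively evaluate a parse tree over data matrix X (rows = observations)."""
        kind = node[0]
        if kind == "var":
            return X[:, node[1]]
        if kind == "const":
            return np.full(X.shape[0], float(node[1]))
        return OPS[kind](evaluate(node[1], X), evaluate(node[2], X))

    def classify(tree, X, threshold=0.0):
        """Label an observation positive when the evolved expression exceeds threshold."""
        return (evaluate(tree, X) > threshold).astype(int)

    # Example: the rule x0 * x2 - 3.5 > 0
    tree = ("sub", ("mul", ("var", 0), ("var", 2)), ("const", 3.5))

Crossover and mutation operate on subtrees of such expressions, letting the search explore widely different functional forms.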

The paper is organized as follows: the next section provides a brief background on genetic search and the non-linear representations of genetic programming, and presents some relevant literature on ensembling approaches for data mining. Section 3 details the data used for the experimental study, and presents the experiments and results. Concluding remarks and future research directions are given in the last section.


Ensembles for large-scale classification

The notion of aggregating information from multiple models has been addressed in the statistics and decision-making literature (see, for example, [22], [23], [24]), mostly in the context of combining forecasts from multiple models based on different methods but using the same data. For data mining on large-scale data, the focus is instead on using smaller data subsets so as to ease the computational burden. The essence of ensembling is to take small subsets from a large data set, build a number of base classifiers on them, and combine their individual predictions into a final classification, as sketched below.
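A minimal sketch of this scheme (our illustration; scikit-learn decision trees stand in for any base learner with fit/predict):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_ensemble(X, y, n_subsets=10, seed=0):
        """Train one base classifier per disjoint subset of a large dataset."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        return [DecisionTreeClassifier().fit(X[part], y[part])
                for part in np.array_split(idx, n_subsets)]

Each base model sees only a manageable slice of the data; the models' predictions are then combined, for example by the majority vote sketched earlier.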

Data and experiments

The dataset used in this study is from the KDD Cup'99 competition and consists of connection records from a simulated US Air Force LAN. The dataset is large: the training set contains about five million examples, and the test set about 300,000. Each example is labeled "normal" or as one of 25 attack types. Since multi-class classification is not the focus of this study, different attack types are not differentiated; each example is classified only as "normal" or "attack".
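For illustration, a sketch of this binary relabeling (assumptions on our part: pandas, the standard comma-separated KDD Cup'99 file layout with the label in the last column, and labels terminated by a period, as in "normal."):

    import pandas as pd

    df = pd.read_csv("kddcup.data", header=None)   # KDD Cup'99 training file
    labels = df.iloc[:, -1].str.rstrip(".")        # strip the trailing period
    y = (labels != "normal").astype(int)           # 1 = attack, 0 = normal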

Conclusion

This study demonstrates the utility of GP as a method for creating and using ensembles in data mining. Given its representational power in modeling complex non-linearities in the data, GP is seen to be effective at learning diverse patterns in the data. With different models capturing varied data relationships, GP models are ideally suited for combination in ensembles. Experimental results show that different GP models are dissimilar both in terms of their functional form as well as with respect to the variables defining the models.

References (43)

  • R. Clemen, Combining forecasts: a review and annotated bibliography, Journal of Forecasting (1989)
  • R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics (1936)
  • N. Freed et al., A linear programming approach to the discriminant problem, Decision Sciences (1981)
  • D.-J. Hand, Discrimination and Classification (1981)
  • P.S. Bradley et al., Mathematical programming for data mining: formulations and challenges, INFORMS Journal on Computing (1999)
  • G.J. Koehler, Linear discriminant functions determined through genetic search, ORSA Journal on Computing (1991)
  • S. Bhattacharyya, Direct marketing performance modeling using genetic algorithms, INFORMS Journal on Computing (1999)
  • D.-P. Greene et al., A genetic system for learning models of consumer choice
  • K. DeJong et al., Using genetic algorithms for concept learning, Machine Learning (1993)
  • J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (1992)
  • S. Bhattacharyya, Direct marketing response models using genetic search, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York (1998)
  • Y. Kim, W. Street, F. Menczer, Feature selection in unsupervised learning via evolutionary search, in: Proceedings of...
  • Y. Kim, W. Street, F. Menczer, An evolutionary multi-objective local selection algorithm for customer targeting, in:...
  • S. Bhattacharyya, Evolutionary algorithms in data mining: multi-objective performance modeling for direct marketing
  • U.M. Fayyad et al., Advances in Knowledge Discovery and Data Mining (1996)
  • P. Domingos et al., Mining high-speed data streams
  • L. Breiman, Pasting bites together for prediction in large data sets and on-line, Machine Learning (1999)
  • P. Chan et al., On the accuracy of meta-learning for scalable data mining, Journal of Intelligent Information Systems (1997)
  • W. Street, Y. Kim, A streaming ensemble algorithm (SEA) for large-scale classification, in: Proceedings of the Seventh...
  • L. Hall et al., Distributed learning on very large data sets
  • E. Bauer et al., An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Machine Learning (1999)