Studying bloat control and maintenance of effective code in linear genetic programming for symbolic regression

doi:10.1016/j.neucom.2015.10.109

Neurocomputing

Volume 180, 5 March 2016, Pages 79-93

https://doi.org/10.1016/j.neucom.2015.10.109 Get rights and content

Highlights

•
We studied tournament procedures for bloat control in LGP.
•
We proposed seven variations of known tournament procedures.
•
The variations were tested in 35 symbolic regression problems.
•
The proposed procedures show competitive or better solution quality than the original procedure, but with smaller solutions.

Abstract

Linear Genetic Programming (LGP) is an Evolutionary Computation algorithm, inspired in the Genetic Programming (GP) algorithm. Instead of using the standard tree representation of GP, LGP evolves a linear program, which causes a graph-based data flow with code reuse. LGP has been shown to outperform GP in several problems, including Symbolic Regression (SReg), and to produce simpler solutions. In this paper, we propose several LGP variants and compare them with a traditional LGP algorithm on a set of benchmark SReg functions from the literature. The main objectives of the variants were to both control bloat and privilege useful code in the population. Here we evaluate their effects during the evolution process and in the quality of the final solutions. Analysis of the results showed that bloat control and effective code maintenance worked, but they did not guarantee improvement in solution quality.

Introduction

Regression analysis is a statistical process of estimating relationships among variables and using that information to perform predictions. Usually, there is a set of independent explanatory variables and a single dependent response variable. Regression techniques are used to build a model that allows the correct/approximate prediction of the response variable using the explanatory variables as input. In fact, a model is assumed a priori, determined by the researcher, and the regression technique estimates the coefficients that minimize a certain criterion, for instance, the least-squared error. On the other hand, symbolic regression (SReg) consists of, without the a priori model, automatically finding a mathematical expression that adjusts as best as possible to these data [15]. That is, given the explanatory and response variables, an SReg technique has to find both the model and the coefficients that minimize it. Therefore, SReg is a harder problem than regular regression. Since one wishes to find the best model, a global optimization technique is supposed to be used, and Evolutionary Computation methods are widely employed for this task.

In order to solve SReg problems, Koza [15] proposed the Genetic Programming (GP) technique, an extension of the Genetic Algorithm (GA) technique [27]. While GA operates on a fixed-length array that is a codification of a problem׳s solution, GP manipulates a variable-length tree data-structure, that is a code to be used to solve the problem. Thus, techniques such as GP and similar approaches are proper for solving SReg problems and have been widely investigated [26], [22], [13], [10], [33], [7].

In this work, we used the Linear Genetic Programming (LGP, [5]) technique for solving SReg problems. Instead of working with trees, LGP evolves a linear program in an array to solve a particular task, being the solution directly convertible to an imperative programming language. LGP has been used to efficiently solve various problems in the literature such as symbolic regression, classification [5], [2], estimation models [14], and time series modeling [12]. Researches that compare LGP with basic GP [23], [5] show that LGP outperforms GP in several tasks, including SReg. All those results suggest that LGP has a high potential as an efficient Genetic Programming variant.

A well-known issue in GP and similar techniques that employ variable-length representations is called bloat: an excessive code growth without a corresponding improvement in fitness [25], [28], [29]. Bloat may cause several problems: (1) solutions may grow to a point that fills the system memory; (2) huge solutions are slow to be evaluated; (3) spurious parts of the solution may hinder its improvement because modifications of those parts do not change the fitness; (4) bloated solutions are black-boxes. Therefore, bloat is a major and active topic of study in GP [32], [1], [31], [17], [4], [18], [24], [9], [28].

Usually, researchers evaluate their bloat control methods in a few benchmark problems of distinct tasks, for instance, symbolic regression of benchmark functions, artificial ant [16], even-parity [15], and multiplexer [15]. On the other hand, the objective of the present work is to compare LGP variants, mostly in terms of selection mechanisms for bloat control, focusing on the SReg task. For such purpose, many distinct benchmark functions were selected. Here we proposed and developed a few variants of a basic LGP implementation, analyzed them from two points of view, and tried to establish a link between them to understand:

•
their behavior in the evolution process, in terms of the proportion of effective code in the individuals;
•
their effects in the quality of the solutions generated.

Those variants, explained in detail in a later section in this paper, include: (1) the usage of bloat control techniques to increase the presence of smaller (simpler) individuals in the population, (2) the usage of techniques to provide a higher percentage of effective code in the population, and (3) an operator that joins two successful individuals into one, intending to treat them as subfunctions. Furthermore, we applied the items 1 and 2 to 3, that is, using the operator that joins two solutions alongside with the techniques to both control bloat and increase the percentage of effective code. The study performed herein is an experimental one, trying to verify the usefulness of the proposed variants. A theory to understand and predict LGP behavior when using such variants is yet to be developed.

The remaining of this paper is structured as follows. Section 2 describes the Methodology of the work, presenting each variant in detail and its pseudocode. The experimental setup is introduced in Section 3. Results and the discussion are presented in Section 4. We end this paper with Conclusions and Future Works in Section 5.

Section snippets

Methods

In this section, we explain the basic LGP implementation we used and also present the variations of this basic implementation.

Experimental setup

As stated before, we will analyze the results of the experiments based on two main aspects: the effects of the proposed technique for the evolution process (individuals and population) and the effects of the proposed technique for the quality of the results.

For the first one, we monitor the average size or percentage of effective code in the population along a run of the algorithm being analyzed. That information may help to understand what the technique causes in the population and what may be

Results and discussion

This section presents the results and discussion related to the experiments. Section 4.1 concerns the first experiment, where the techniques to control bloat and increase the percentage of effective code are tested, and their results are compared with those of effmut. We also tested the regular union procedure. It is important to remind that besides union and its variants, the other methods can only grow solutions when the insertion operator (macro-mutation) is used, as there is no crossover.

Conclusions and future work

In this work, we used a baseline LGP algorithm that employs macro and micro-mutations (dubbed effmut), and implemented several non-parametric procedures, based on double tournament and proportional tournament, to control bloat and to increase the percentage of effective code. Experiments were conducted on 40 well-known benchmark functions to investigate the effects of those factors, with statistical analysis of the results.

We also proposed an operator – union – that borrows the idea of

Acknowledgments

This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, Grant 2013/20606-0) to L.F.D.P.S., Comissão Nacional de Desenvolvimento Científico e Tecnológico (CNPq Universal, Grant 486950/2013-1) and Coordenadoria de Apoio a Pesquisa e Ensino Superior (CAPES Science without Borders, Grant 12180-13-0) to V.V.M.

Léo Françoso dal Piccol Sotto is an undergraduate student of Computer Science at the Institute of Science and Technology UNIFESP, São Paulo, Brazil. He has published papers in international peer-reviewed conferences. His current research interests are evolutionary computation and machine learning.

References (34)

Lee Altenberg
The evolution of evolvability in genetic programming
Adv. Genet. Programm.
(1994)
Markus Brameier, Wolfgang Banzhaf, Effective Linear Genetic Programming, Technical Report, Department of Computer...
Markus Brameier, Wolfgang Banzhaf, Explicit control of diversity and effective variation distance in linear genetic...
Markus Brameier, Wolfgang Banzhaf, Neutral variations cause bloat in linear gp, in: Proceedings of the 6th European...
Markus Brameier et al.
Linear Genetic Programming
(2007)
Raphael Crawford-Marks, Lee Spector, Size control via size fair genetic operators in the pushgp genetic programming...
Vinícius Veloso De Melo, Kaizen programming, in: Proceedings of the 2014 Conference on Genetic and Evolutionary...
Stephen Dignum, Riccardo Poli, Generalisation of the limiting distribution of program sizes in tree-based genetic...
Stephen Dignum and Riccardo Poli, Operator equalisation and bloat free gp, 11th European Conference, EuroGP, Naples,...
Candida Ferreira
Gene expression programming in problem solving
Soft Computing and Industry
(2002)

Leo Francoso dal Piccol Sotto, Vinicius Veloso de Melo, Investigation of linear genetic programming techniques for...

Aytac Guven

Linear genetic programming for time-series modelling of daily flow rate

J. Earth Syst. Sci.

(2009)

Nguyen Xuan Hoai, Robert Ian McKay, D. Essam, R. Chau, Solving the symbolic regression problem with tree-adjunct...

B. Baran et al.

Estimation models generation using linear genetic programming

Clei Electron. J.

(2009)

R. John Koza

Genetic programming as a means for programming computers by natural selection

Stat. Comput. (UK)

(1994)

W.B. Langdon and R. Poli, Why ants are hard, Proceedings of the Third Annual Conference, pages 193–201, University of...

William B. Langdon, Quadratic bloat in genetic programming, in: GECCO, vol. 2000, Las Vegas, Nevada, USA. 2000, pp....

Cited by (15)

A similarity measure for Straight Line Programs and its application to control diversity in Genetic Programming
2022, Expert Systems with Applications
Citation Excerpt :
As the encodings used in GP have a non-linear structure, such as trees, it is harder to tackle the control of diversity (Burke et al., 2004). The problem of tree uncontrolled growth, known as the bloating problem in GP, leads to premature convergence (dal Piccol Sotto & de Melo, 2016). Preventing this problem is an implicit goal for researchers in GP, and different authors have proposed to modify genetic operators, fitness evaluation or selection schemes (Alfaro-Cid et al., 2008; de Jong et al., 2001; Liu et al., 2007) to solve the bloating problem while maintaining diversity in the population.
Finding a balance between diversity and convergence plays an important role in evolutionary algorithms to avoid premature convergence and to perform a better exploration of the search space. In the case of Genetic Programming, and more specifically for symbolic regression problems, different mechanisms have been devised to control diversity, ranging from novel crossover and/or mutation procedures to the design of distance measures that help genetic operators to increase diversity in the population. In this paper, we start from previous works where Straight Line Programs are used as an alternative representation to expression trees for symbolic regression, and develop a similarity measure based on edit distance in order to determine how different the Straight Line Programs in the population are. This measure is used in combination with the CHC algorithm strategy to control diversity in the population, and therefore to avoid local optima to solve symbolic regression problems. The proposal is first validated in a controlled scenario of benchmark datasets and it is compared with previous approaches to promote diversity in Genetic Programming. After that, the approach is also evaluated in a real world dataset of energy consumption data from a set of buildings of the University of Granada.
Block building programming for symbolic regression
2018, Neurocomputing
Symbolic regression that aims to detect underlying data-driven models has become increasingly important for industrial data analysis. For most existing algorithms such as genetic programming (GP), the convergence speed might be too slow for large-scale problems with a large number of variables. This situation may become even worse with increasing problem size. The aforementioned difficulty makes symbolic regression limited in practical applications. Fortunately, in many engineering problems, the independent variables in target models are separable or partially separable. This feature inspires us to develop a new approach, block building programming (BBP). BBP divides the original target function into several blocks, and further into factors. The factors are then modeled by an optimization engine (e.g. GP). Under such circumstances, BBP can make large reductions to the search space. The partition of separability is based on a special method, block and factor detection. Two different optimization engines are applied to test the performance of BBP on a set of symbolic regression problems. Numerical results show that BBP has a good capability of structure and coefficient optimization with high computational efficiency.
Semantic Linear Genetic Programming for Symbolic Regression
2024, IEEE Transactions on Cybernetics
Graph-based Linear Genetic Programming: A Case Study of Dynamic Scheduling
2022, GECCO 2022 - Proceedings of the 2022 Genetic and Evolutionary Computation Conference
Intelligent fitting global real-time task scheduling strategy for high-performance multi-core systems
2022, CAAI Transactions on Intelligence Technology
A Further Investigation to Improve Linear Genetic Programming in Dynamic Job Shop Scheduling
2022, Proceedings of the 2022 IEEE Symposium Series on Computational Intelligence, SSCI 2022

View all citing articles on Scopus

Vinícius Veloso de Melo is an associate professor at the Institute of Science and Technology-UNIFESP, São Paulo, Brazil. He obtained his BSc in Computer Science from PUC-MG, Brazil, in 2002; his MSc and PhD in Computer Science from University of São Paulo, Brazil, in 2005 and 2009, respectively. He has published papers in international peer-reviewed journals and conferences. His current research interests are evolutionary computation, metaheuristics, and machine learning.

View full text