Elsevier

Neurocomputing

Volume 180, 5 March 2016, Pages 79-93
Neurocomputing

Studying bloat control and maintenance of effective code in linear genetic programming for symbolic regression

https://doi.org/10.1016/j.neucom.2015.10.109Get rights and content

Highlights

  • We studied tournament procedures for bloat control in LGP.

  • We proposed seven variations of known tournament procedures.

  • The variations were tested in 35 symbolic regression problems.

  • The proposed procedures show competitive or better solution quality than the original procedure, but with smaller solutions.

Abstract

Linear Genetic Programming (LGP) is an Evolutionary Computation algorithm, inspired in the Genetic Programming (GP) algorithm. Instead of using the standard tree representation of GP, LGP evolves a linear program, which causes a graph-based data flow with code reuse. LGP has been shown to outperform GP in several problems, including Symbolic Regression (SReg), and to produce simpler solutions. In this paper, we propose several LGP variants and compare them with a traditional LGP algorithm on a set of benchmark SReg functions from the literature. The main objectives of the variants were to both control bloat and privilege useful code in the population. Here we evaluate their effects during the evolution process and in the quality of the final solutions. Analysis of the results showed that bloat control and effective code maintenance worked, but they did not guarantee improvement in solution quality.

Introduction

Regression analysis is a statistical process of estimating relationships among variables and using that information to perform predictions. Usually, there is a set of independent explanatory variables and a single dependent response variable. Regression techniques are used to build a model that allows the correct/approximate prediction of the response variable using the explanatory variables as input. In fact, a model is assumed a priori, determined by the researcher, and the regression technique estimates the coefficients that minimize a certain criterion, for instance, the least-squared error. On the other hand, symbolic regression (SReg) consists of, without the a priori model, automatically finding a mathematical expression that adjusts as best as possible to these data [15]. That is, given the explanatory and response variables, an SReg technique has to find both the model and the coefficients that minimize it. Therefore, SReg is a harder problem than regular regression. Since one wishes to find the best model, a global optimization technique is supposed to be used, and Evolutionary Computation methods are widely employed for this task.

In order to solve SReg problems, Koza [15] proposed the Genetic Programming (GP) technique, an extension of the Genetic Algorithm (GA) technique [27]. While GA operates on a fixed-length array that is a codification of a problem׳s solution, GP manipulates a variable-length tree data-structure, that is a code to be used to solve the problem. Thus, techniques such as GP and similar approaches are proper for solving SReg problems and have been widely investigated [26], [22], [13], [10], [33], [7].

In this work, we used the Linear Genetic Programming (LGP, [5]) technique for solving SReg problems. Instead of working with trees, LGP evolves a linear program in an array to solve a particular task, being the solution directly convertible to an imperative programming language. LGP has been used to efficiently solve various problems in the literature such as symbolic regression, classification [5], [2], estimation models [14], and time series modeling [12]. Researches that compare LGP with basic GP [23], [5] show that LGP outperforms GP in several tasks, including SReg. All those results suggest that LGP has a high potential as an efficient Genetic Programming variant.

A well-known issue in GP and similar techniques that employ variable-length representations is called bloat: an excessive code growth without a corresponding improvement in fitness [25], [28], [29]. Bloat may cause several problems: (1) solutions may grow to a point that fills the system memory; (2) huge solutions are slow to be evaluated; (3) spurious parts of the solution may hinder its improvement because modifications of those parts do not change the fitness; (4) bloated solutions are black-boxes. Therefore, bloat is a major and active topic of study in GP [32], [1], [31], [17], [4], [18], [24], [9], [28].

Usually, researchers evaluate their bloat control methods in a few benchmark problems of distinct tasks, for instance, symbolic regression of benchmark functions, artificial ant [16], even-parity [15], and multiplexer [15]. On the other hand, the objective of the present work is to compare LGP variants, mostly in terms of selection mechanisms for bloat control, focusing on the SReg task. For such purpose, many distinct benchmark functions were selected. Here we proposed and developed a few variants of a basic LGP implementation, analyzed them from two points of view, and tried to establish a link between them to understand:

  • their behavior in the evolution process, in terms of the proportion of effective code in the individuals;

  • their effects in the quality of the solutions generated.

Those variants, explained in detail in a later section in this paper, include: (1) the usage of bloat control techniques to increase the presence of smaller (simpler) individuals in the population, (2) the usage of techniques to provide a higher percentage of effective code in the population, and (3) an operator that joins two successful individuals into one, intending to treat them as subfunctions. Furthermore, we applied the items 1 and 2 to 3, that is, using the operator that joins two solutions alongside with the techniques to both control bloat and increase the percentage of effective code. The study performed herein is an experimental one, trying to verify the usefulness of the proposed variants. A theory to understand and predict LGP behavior when using such variants is yet to be developed.

The remaining of this paper is structured as follows. Section 2 describes the Methodology of the work, presenting each variant in detail and its pseudocode. The experimental setup is introduced in Section 3. Results and the discussion are presented in Section 4. We end this paper with Conclusions and Future Works in Section 5.

Section snippets

Methods

In this section, we explain the basic LGP implementation we used and also present the variations of this basic implementation.

Experimental setup

As stated before, we will analyze the results of the experiments based on two main aspects: the effects of the proposed technique for the evolution process (individuals and population) and the effects of the proposed technique for the quality of the results.

For the first one, we monitor the average size or percentage of effective code in the population along a run of the algorithm being analyzed. That information may help to understand what the technique causes in the population and what may be

Results and discussion

This section presents the results and discussion related to the experiments. Section 4.1 concerns the first experiment, where the techniques to control bloat and increase the percentage of effective code are tested, and their results are compared with those of effmut. We also tested the regular union procedure. It is important to remind that besides union and its variants, the other methods can only grow solutions when the insertion operator (macro-mutation) is used, as there is no crossover.

In

Conclusions and future work

In this work, we used a baseline LGP algorithm that employs macro and micro-mutations (dubbed effmut), and implemented several non-parametric procedures, based on double tournament and proportional tournament, to control bloat and to increase the percentage of effective code. Experiments were conducted on 40 well-known benchmark functions to investigate the effects of those factors, with statistical analysis of the results.

We also proposed an operator – union – that borrows the idea of

Acknowledgments

This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, Grant 2013/20606-0) to L.F.D.P.S., Comissão Nacional de Desenvolvimento Científico e Tecnológico (CNPq Universal, Grant 486950/2013-1) and Coordenadoria de Apoio a Pesquisa e Ensino Superior (CAPES Science without Borders, Grant 12180-13-0) to V.V.M.

Léo Françoso dal Piccol Sotto is an undergraduate student of Computer Science at the Institute of Science and Technology UNIFESP, São Paulo, Brazil. He has published papers in international peer-reviewed conferences. His current research interests are evolutionary computation and machine learning.

References (34)

  • Lee Altenberg

    The evolution of evolvability in genetic programming

    Adv. Genet. Programm.

    (1994)
  • Markus Brameier, Wolfgang Banzhaf, Effective Linear Genetic Programming, Technical Report, Department of Computer...
  • Markus Brameier, Wolfgang Banzhaf, Explicit control of diversity and effective variation distance in linear genetic...
  • Markus Brameier, Wolfgang Banzhaf, Neutral variations cause bloat in linear gp, in: Proceedings of the 6th European...
  • Markus Brameier et al.

    Linear Genetic Programming

    (2007)
  • Raphael Crawford-Marks, Lee Spector, Size control via size fair genetic operators in the pushgp genetic programming...
  • Vinícius Veloso De Melo, Kaizen programming, in: Proceedings of the 2014 Conference on Genetic and Evolutionary...
  • Stephen Dignum, Riccardo Poli, Generalisation of the limiting distribution of program sizes in tree-based genetic...
  • Stephen Dignum and Riccardo Poli, Operator equalisation and bloat free gp, 11th European Conference, EuroGP, Naples,...
  • Candida Ferreira

    Gene expression programming in problem solving

    Soft Computing and Industry

    (2002)
  • Leo Francoso dal Piccol Sotto, Vinicius Veloso de Melo, Investigation of linear genetic programming techniques for...
  • Aytac Guven

    Linear genetic programming for time-series modelling of daily flow rate

    J. Earth Syst. Sci.

    (2009)
  • Nguyen Xuan Hoai, Robert Ian McKay, D. Essam, R. Chau, Solving the symbolic regression problem with tree-adjunct...
  • B. Baran et al.

    Estimation models generation using linear genetic programming

    Clei Electron. J.

    (2009)
  • R. John Koza

    Genetic programming as a means for programming computers by natural selection

    Stat. Comput. (UK)

    (1994)
  • W.B. Langdon and R. Poli, Why ants are hard, Proceedings of the Third Annual Conference, pages 193–201, University of...
  • William B. Langdon, Quadratic bloat in genetic programming, in: GECCO, vol. 2000, Las Vegas, Nevada, USA. 2000, pp....
  • Cited by (15)

    • A similarity measure for Straight Line Programs and its application to control diversity in Genetic Programming

      2022, Expert Systems with Applications
      Citation Excerpt :

      As the encodings used in GP have a non-linear structure, such as trees, it is harder to tackle the control of diversity (Burke et al., 2004). The problem of tree uncontrolled growth, known as the bloating problem in GP, leads to premature convergence (dal Piccol Sotto & de Melo, 2016). Preventing this problem is an implicit goal for researchers in GP, and different authors have proposed to modify genetic operators, fitness evaluation or selection schemes (Alfaro-Cid et al., 2008; de Jong et al., 2001; Liu et al., 2007) to solve the bloating problem while maintaining diversity in the population.

    • Graph-based Linear Genetic Programming: A Case Study of Dynamic Scheduling

      2022, GECCO 2022 - Proceedings of the 2022 Genetic and Evolutionary Computation Conference
    • A Further Investigation to Improve Linear Genetic Programming in Dynamic Job Shop Scheduling

      2022, Proceedings of the 2022 IEEE Symposium Series on Computational Intelligence, SSCI 2022
    View all citing articles on Scopus

    Léo Françoso dal Piccol Sotto is an undergraduate student of Computer Science at the Institute of Science and Technology UNIFESP, São Paulo, Brazil. He has published papers in international peer-reviewed conferences. His current research interests are evolutionary computation and machine learning.

    Vinícius Veloso de Melo is an associate professor at the Institute of Science and Technology-UNIFESP, São Paulo, Brazil. He obtained his BSc in Computer Science from PUC-MG, Brazil, in 2002; his MSc and PhD in Computer Science from University of São Paulo, Brazil, in 2005 and 2009, respectively. He has published papers in international peer-reviewed journals and conferences. His current research interests are evolutionary computation, metaheuristics, and machine learning.

    View full text