Studying bloat control and maintenance of effective code in linear genetic programming for symbolic regression
Introduction
Regression analysis is a statistical process of estimating relationships among variables and using that information to perform predictions. Usually, there is a set of independent explanatory variables and a single dependent response variable. Regression techniques are used to build a model that allows the correct/approximate prediction of the response variable using the explanatory variables as input. In fact, a model is assumed a priori, determined by the researcher, and the regression technique estimates the coefficients that minimize a certain criterion, for instance, the least-squared error. On the other hand, symbolic regression (SReg) consists of, without the a priori model, automatically finding a mathematical expression that adjusts as best as possible to these data [15]. That is, given the explanatory and response variables, an SReg technique has to find both the model and the coefficients that minimize it. Therefore, SReg is a harder problem than regular regression. Since one wishes to find the best model, a global optimization technique is supposed to be used, and Evolutionary Computation methods are widely employed for this task.
In order to solve SReg problems, Koza [15] proposed the Genetic Programming (GP) technique, an extension of the Genetic Algorithm (GA) technique [27]. While GA operates on a fixed-length array that is a codification of a problem׳s solution, GP manipulates a variable-length tree data-structure, that is a code to be used to solve the problem. Thus, techniques such as GP and similar approaches are proper for solving SReg problems and have been widely investigated [26], [22], [13], [10], [33], [7].
In this work, we used the Linear Genetic Programming (LGP, [5]) technique for solving SReg problems. Instead of working with trees, LGP evolves a linear program in an array to solve a particular task, being the solution directly convertible to an imperative programming language. LGP has been used to efficiently solve various problems in the literature such as symbolic regression, classification [5], [2], estimation models [14], and time series modeling [12]. Researches that compare LGP with basic GP [23], [5] show that LGP outperforms GP in several tasks, including SReg. All those results suggest that LGP has a high potential as an efficient Genetic Programming variant.
A well-known issue in GP and similar techniques that employ variable-length representations is called bloat: an excessive code growth without a corresponding improvement in fitness [25], [28], [29]. Bloat may cause several problems: (1) solutions may grow to a point that fills the system memory; (2) huge solutions are slow to be evaluated; (3) spurious parts of the solution may hinder its improvement because modifications of those parts do not change the fitness; (4) bloated solutions are black-boxes. Therefore, bloat is a major and active topic of study in GP [32], [1], [31], [17], [4], [18], [24], [9], [28].
Usually, researchers evaluate their bloat control methods in a few benchmark problems of distinct tasks, for instance, symbolic regression of benchmark functions, artificial ant [16], even-parity [15], and multiplexer [15]. On the other hand, the objective of the present work is to compare LGP variants, mostly in terms of selection mechanisms for bloat control, focusing on the SReg task. For such purpose, many distinct benchmark functions were selected. Here we proposed and developed a few variants of a basic LGP implementation, analyzed them from two points of view, and tried to establish a link between them to understand:
- •
their behavior in the evolution process, in terms of the proportion of effective code in the individuals;
- •
their effects in the quality of the solutions generated.
Those variants, explained in detail in a later section in this paper, include: (1) the usage of bloat control techniques to increase the presence of smaller (simpler) individuals in the population, (2) the usage of techniques to provide a higher percentage of effective code in the population, and (3) an operator that joins two successful individuals into one, intending to treat them as subfunctions. Furthermore, we applied the items 1 and 2 to 3, that is, using the operator that joins two solutions alongside with the techniques to both control bloat and increase the percentage of effective code. The study performed herein is an experimental one, trying to verify the usefulness of the proposed variants. A theory to understand and predict LGP behavior when using such variants is yet to be developed.
The remaining of this paper is structured as follows. Section 2 describes the Methodology of the work, presenting each variant in detail and its pseudocode. The experimental setup is introduced in Section 3. Results and the discussion are presented in Section 4. We end this paper with Conclusions and Future Works in Section 5.
Section snippets
Methods
In this section, we explain the basic LGP implementation we used and also present the variations of this basic implementation.
Experimental setup
As stated before, we will analyze the results of the experiments based on two main aspects: the effects of the proposed technique for the evolution process (individuals and population) and the effects of the proposed technique for the quality of the results.
For the first one, we monitor the average size or percentage of effective code in the population along a run of the algorithm being analyzed. That information may help to understand what the technique causes in the population and what may be
Results and discussion
This section presents the results and discussion related to the experiments. Section 4.1 concerns the first experiment, where the techniques to control bloat and increase the percentage of effective code are tested, and their results are compared with those of effmut. We also tested the regular union procedure. It is important to remind that besides union and its variants, the other methods can only grow solutions when the insertion operator (macro-mutation) is used, as there is no crossover.
In
Conclusions and future work
In this work, we used a baseline LGP algorithm that employs macro and micro-mutations (dubbed effmut), and implemented several non-parametric procedures, based on double tournament and proportional tournament, to control bloat and to increase the percentage of effective code. Experiments were conducted on 40 well-known benchmark functions to investigate the effects of those factors, with statistical analysis of the results.
We also proposed an operator – union – that borrows the idea of
Acknowledgments
This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, Grant 2013/20606-0) to L.F.D.P.S., Comissão Nacional de Desenvolvimento Científico e Tecnológico (CNPq Universal, Grant 486950/2013-1) and Coordenadoria de Apoio a Pesquisa e Ensino Superior (CAPES Science without Borders, Grant 12180-13-0) to V.V.M.
Léo Françoso dal Piccol Sotto is an undergraduate student of Computer Science at the Institute of Science and Technology UNIFESP, São Paulo, Brazil. He has published papers in international peer-reviewed conferences. His current research interests are evolutionary computation and machine learning.
References (34)
The evolution of evolvability in genetic programming
Adv. Genet. Programm.
(1994)- Markus Brameier, Wolfgang Banzhaf, Effective Linear Genetic Programming, Technical Report, Department of Computer...
- Markus Brameier, Wolfgang Banzhaf, Explicit control of diversity and effective variation distance in linear genetic...
- Markus Brameier, Wolfgang Banzhaf, Neutral variations cause bloat in linear gp, in: Proceedings of the 6th European...
- et al.
Linear Genetic Programming
(2007) - Raphael Crawford-Marks, Lee Spector, Size control via size fair genetic operators in the pushgp genetic programming...
- Vinícius Veloso De Melo, Kaizen programming, in: Proceedings of the 2014 Conference on Genetic and Evolutionary...
- Stephen Dignum, Riccardo Poli, Generalisation of the limiting distribution of program sizes in tree-based genetic...
- Stephen Dignum and Riccardo Poli, Operator equalisation and bloat free gp, 11th European Conference, EuroGP, Naples,...
Gene expression programming in problem solving
Soft Computing and Industry
(2002)
Linear genetic programming for time-series modelling of daily flow rate
J. Earth Syst. Sci.
Estimation models generation using linear genetic programming
Clei Electron. J.
Genetic programming as a means for programming computers by natural selection
Stat. Comput. (UK)
Cited by (15)
A similarity measure for Straight Line Programs and its application to control diversity in Genetic Programming
2022, Expert Systems with ApplicationsCitation Excerpt :As the encodings used in GP have a non-linear structure, such as trees, it is harder to tackle the control of diversity (Burke et al., 2004). The problem of tree uncontrolled growth, known as the bloating problem in GP, leads to premature convergence (dal Piccol Sotto & de Melo, 2016). Preventing this problem is an implicit goal for researchers in GP, and different authors have proposed to modify genetic operators, fitness evaluation or selection schemes (Alfaro-Cid et al., 2008; de Jong et al., 2001; Liu et al., 2007) to solve the bloating problem while maintaining diversity in the population.
Block building programming for symbolic regression
2018, NeurocomputingSemantic Linear Genetic Programming for Symbolic Regression
2024, IEEE Transactions on CyberneticsGraph-based Linear Genetic Programming: A Case Study of Dynamic Scheduling
2022, GECCO 2022 - Proceedings of the 2022 Genetic and Evolutionary Computation ConferenceIntelligent fitting global real-time task scheduling strategy for high-performance multi-core systems
2022, CAAI Transactions on Intelligence TechnologyA Further Investigation to Improve Linear Genetic Programming in Dynamic Job Shop Scheduling
2022, Proceedings of the 2022 IEEE Symposium Series on Computational Intelligence, SSCI 2022
Léo Françoso dal Piccol Sotto is an undergraduate student of Computer Science at the Institute of Science and Technology UNIFESP, São Paulo, Brazil. He has published papers in international peer-reviewed conferences. His current research interests are evolutionary computation and machine learning.
Vinícius Veloso de Melo is an associate professor at the Institute of Science and Technology-UNIFESP, São Paulo, Brazil. He obtained his BSc in Computer Science from PUC-MG, Brazil, in 2002; his MSc and PhD in Computer Science from University of São Paulo, Brazil, in 2005 and 2009, respectively. He has published papers in international peer-reviewed journals and conferences. His current research interests are evolutionary computation, metaheuristics, and machine learning.