
2016 | Book

Genetic Programming Theory and Practice XIII


About this book

These contributions, written by the foremost international researchers and practitioners of Genetic Programming (GP), explore the synergy between theoretical and empirical results on real-world problems, producing a comprehensive view of the state of the art in GP. Topics in this volume include: multi-objective genetic programming, learning heuristics, Kaizen programming, Evolution of Everything (EvE), lexicase selection, behavioral program synthesis, symbolic regression with noisy training data, graph databases, and multidimensional clustering. It also includes several chapters on best practices and lessons learned from hands-on experience. Additional application areas include financial operations, genetic analysis, and predicting product choice. Readers will discover large-scale, real-world applications of GP to a variety of problem domains via in-depth presentations of the latest and most significant results.

Table of Contents

Frontmatter
Evolving Simple Symbolic Regression Models by Multi-Objective Genetic Programming
Abstract
In this chapter we examine how multi-objective genetic programming can be used to perform symbolic regression and compare its performance to single-objective genetic programming. Multi-objective optimization is implemented by using a slightly adapted version of NSGA-II, where the optimization objectives are the model’s prediction accuracy and its complexity. As the model complexity is explicitly defined as an objective, the evolved symbolic regression models are simpler and more parsimonious when compared to models generated by a single-objective algorithm. Furthermore, we define a new complexity measure that includes syntactical and semantic information about the model, while still being efficiently computed, and demonstrate its performance on several benchmark problems. As a result of the multi-objective approach the appropriate model length and the functions included in the models are automatically determined without the necessity to specify them a-priori.
Michael Kommenda, Gabriel Kronberger, Michael Affenzeller, Stephan M. Winkler, Bogdan Burlacu
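The multi-objective selection described in this abstract rests on Pareto dominance over (prediction error, complexity) pairs, both minimized. As a minimal illustration of that dominance relation and the resulting non-dominated front (the function names are ours, not the chapter's, and this is a sketch of the general NSGA-II building block, not the authors' implementation):

```python
def dominates(a, b):
    """a dominates b if a is no worse on every objective
    (error, complexity) and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Non-dominated (error, complexity) pairs: the first NSGA-II front."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

With complexity as an explicit second objective, a shorter model with slightly higher error can survive on the front alongside a longer, more accurate one, which is why the evolved regression models tend to be more parsimonious.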
Learning Heuristics for Mining RNA Sequence-Structure Motifs
Abstract
The computational identification of conserved motifs in RNA molecules is a major, yet largely unsolved, problem. Structural conservation serves as strong evidence for important RNA functionality. Thus, comparative structure analysis is the gold standard for the discovery and interpretation of functional RNAs. In this paper we focus on one of the functional RNA motif types, sequence-structure motifs in RNA molecules, which mark the molecule as a target to be recognized by other molecules. We present a new approach for the detection of RNA structure (including pseudoknots) that is conserved among a set of unaligned RNA sequences. Our method extends previous approaches to this problem, which were based on first identifying conserved stems and then assembling them into complex structural motifs. The novelty of our approach lies in simultaneously performing both the identification and the assembly of these stems. We believe this unified approach offers a more informative model for deciphering the evolution of functional RNAs, where the sets of stems comprising a conserved motif co-evolve as a correlated functional unit. Since the task of mining RNA sequence-structure motifs can be addressed by solving the maximum weighted clique problem in an n-partite graph, we translate the maximum weighted clique problem into a state graph. Then, we gather and define domain knowledge and low-level heuristics for this domain. Finally, we learn hyper-heuristics for this domain, which can be used with heuristic search algorithms (e.g., A*, IDA*) for the mining task. The hyper-heuristics are evolved using HH-Evolver, a tool for domain-specific hyper-heuristic evolution. Our approach is designed to overcome the computational limitations of current algorithms and to remove the necessity of the previous assumptions that were used for sparsifying the graph. This is still work in progress and as yet we have no results to report. However, given the interest in the methodology and its previous success in other domains, we are hopeful that these shall be forthcoming soon.
Achiya Elyasaf, Pavel Vaks, Nimrod Milo, Moshe Sipper, Michal Ziv-Ukelson
Kaizen Programming for Feature Construction for Classification
Abstract
A data set for classification is commonly composed of a set of features defining the data space representation and one attribute corresponding to the instances’ class. A classification tool has to discover how to separate classes based on features, but the discovery of useful knowledge may be hampered by inadequate or insufficient features. Pre-processing steps for the automatic construction of new high-level features have been proposed to discover hidden relationships among features and to improve classification quality. Here we present a new tool for high-level feature construction: Kaizen Programming. This tool can construct many complementary/dependent high-level features simultaneously. We show that our approach outperforms related methods on well-known binary-class medical data sets using a decision-tree classifier, achieving greater accuracy and smaller trees.
Vinícius Veloso de Melo, Wolfgang Banzhaf
GP As If You Meant It: An Exercise for Mindful Practice
Abstract
In this contribution I present a kata called “GP As If You Meant It”, aimed at advanced users of genetic programming. Inspired by code katas that are popular among software developers, it’s an exercise designed to help participants hone their skills through mindful practice. Its intent is to surface certain unquestioned habits common in our field: to make the participants painfully aware of the tacit justification for certain GP algorithm design decisions they may otherwise take for granted. In the exercise, the human players are charged with trying to “rescue” an ineffectual but unstoppable GP system (which is the other “player”), which has been set up to only use “random guessing”—but they must do so by incrementally modifying the search process without interrupting it. The exercise is a game for two players, plus a Facilitator who acts as a referee. The human “User” player examines the state of the GP run in order to make amendments to its rules, using a very limited toolkit. The other “player” is the automated GP System itself, which adds to a growing population of solutions by applying the search operators and evaluation functions specified by the User player. The User’s goal is to convince the System to produce “good enough” answers to a target supervised learning problem chosen by the Facilitator. To further complicate the task, the User must also provide the Facilitator with convincing justifications, or warrants, which explain each move she makes. The Facilitator chooses the initial search problem, provides training data, and most importantly is empowered to disqualify any of the User’s moves if unconvinced by the accompanying warrants. As a result, the User is forced to work around our field’s most insidious habit: that of “stopping it and starting over again with different parameters”. 
In the process of working within these constraints, the participants—Facilitator and User—are made mindful of the habits they have already developed, tacitly or explicitly, for coping with “pathologies” and “symptoms” encountered in their more typical work with GP.
William A. Tozier
nPool: Massively Distributed Simultaneous Evolution and Cross-Validation in EC-Star
Abstract
We introduce a cross-validation algorithm called nPool that can be applied in a distributed fashion. Unlike classic k-fold cross-validation, the data segments are mutually exclusive, and training takes place only on one segment. This system is well suited to run in concert with the EC-Star distributed Evolutionary system, cross-validating solution candidates during a run. The system is tested with different numbers of validation segments using a real-world problem of classifying ICU blood-pressure time series.
Babak Hodjat, Hormoz Shahrzad
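The key structural idea in this abstract, mutually exclusive data segments with training confined to a single segment and validation on the rest, can be sketched as follows. This is an illustrative reading of the abstract, not EC-Star's actual API; the round-robin segment assignment and all function names are our assumptions:

```python
def npool_split(data, n):
    """Partition data into n mutually exclusive segments
    (round-robin assignment is an assumption, not specified above)."""
    return [data[i::n] for i in range(n)]

def cross_validate(candidate, segments, train_idx, score):
    """Unlike k-fold CV, the candidate is trained only on
    segments[train_idx]; it is scored on every other segment."""
    return [score(candidate, seg)
            for j, seg in enumerate(segments) if j != train_idx]
```

Because each worker's training segment is disjoint from every validation segment, candidates can be cross-validated during a run without any data being shared between training and validation.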
Highly Accurate Symbolic Regression with Noisy Training Data
Abstract
As symbolic regression (SR) has advanced into the early stages of commercial exploitation, the poor accuracy of SR, still plaguing even the most advanced commercial packages, has become an issue for early adopters. Users expect to have the correct formula returned, especially in cases with zero noise and only one basis function with minimally complex grammar depth.
At a minimum, users expect the response surface of the SR tool to be easily understood, so that the user can know a priori on what classes of problems to expect excellent, average, or poor accuracy. Poor or unknown accuracy is a hindrance to greater academic and industrial acceptance of SR tools.
In two previous papers, we published a complex algorithm for modern symbolic regression which is extremely accurate for a large class of symbolic regression problems. The class of problems on which SR is extremely accurate is described in detail in those two papers. The algorithm is extremely accurate, in reasonable time on a single processor, for problems with 25 up to 3000 features (columns).
Extensive, statistically rigorous out-of-sample training and testing demonstrated the extreme-accuracy algorithm’s advantages over a previously published baseline Pareto algorithm in cases where the training and testing data contained zero noise.
While the algorithm’s extreme accuracy for deep problems with a large number of features on noiseless training data is an impressive advance, there are many very important academic and industrial SR problems where the training data is very noisy.
In this chapter we test the extreme-accuracy algorithm and compare the results with the previously published baseline Pareto algorithm. The performance of both algorithms is compared on a set of complex representative problems (from 25 to 3000 features): on noiseless training data, on noisy training data, and on noisy training data with range-shifted testing data.
The enhanced algorithm is shown to be robust, with definite advantages over the baseline Pareto algorithm, performing well even in the face of noisy training data and range-shifted testing data.
Michael F. Korns
Using Genetic Programming for Data Science: Lessons Learned
Abstract
In this chapter we present a case study to demonstrate how the current state-of-the-art Genetic Programming (GP) fares as a tool for the emerging field of Data Science. Data Science refers to the practice of extracting knowledge from data, often Big Data, to glean insights useful for predicting business, political or societal outcomes. Data Science tools are important to the practice as they allow Data Scientists to be productive and accurate. GP has many features that make it amenable as a tool for Data Science, but GP is not yet widely considered a Data Science method. Thus, we performed a real-world comparison of GP with a popular Data Science method to understand its strengths and weaknesses. GP proved to find equally strong solutions, leveraged the new Big Data infrastructure, and was able to provide several benefits like direct feature importance and solution confidence. GP lacked the ability to quickly build and test models, required much more intensive computing power, and, due to its lack of commercial maturity, created some challenges for productization as well as integration with data management and visualization capabilities. The lessons learned lead to several recommendations that provide a path for future research to focus on key areas to improve GP as a Data Science tool.
Steven Gustafson, Ram Narasimhan, Ravi Palla, Aisha Yousuf
The Evolution of Everything (EvE) and Genetic Programming
Abstract
The Internet is entering a new period of growth driven by an increasing number of processors connected at the edge of the Internet. Many of these processors are sensors that continuously collect data. By 2020, it is projected that there may be more than 20 billion devices connected to the Internet. Collectively these devices are called the Internet of Things (IoT) or the Internet of Everything (IoE). The sheer volume of the data that will be gathered creates new problems for an economy that is increasingly driven by data analytics. It is likely that the devices at the edge of the Internet will take part in the processing of data for analytics by using distributed computing among edge devices. Genetic Programming could play a unique role in this environment because of its ability not only to gather and analyze data, but to control the evolution and use of other machine learning algorithms. The confluence of unimaginable streams of real-world data and emergent behaviors may give rise to the question of whether the evolution of intelligence in the natural world can be recreated using evolutionary tools.
W. P. Worzel
Lexicase Selection for Program Synthesis: A Diversity Analysis
Abstract
Lexicase selection is a selection method for evolutionary computation in which individuals are selected by filtering the population according to performance on test cases, considered in random order. When used as the parent selection method in genetic programming, lexicase selection has been shown to provide significant improvements in problem-solving power. In this chapter we investigate the reasons for the success of lexicase selection, focusing on measures of population diversity. We present data from eight program synthesis problems and compare lexicase selection to tournament selection and selection based on implicit fitness sharing. We conclude that lexicase selection does indeed produce more diverse populations, which helps to explain the utility of lexicase selection for program synthesis.
Thomas Helmuth, Nicholas Freitag McPhee, Lee Spector
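The filtering mechanism described in this abstract, shuffling the test cases and keeping only the individuals that are best on each case in turn, can be sketched in a few lines. This is a minimal illustration under the common convention that lower error is better; the function and variable names are ours:

```python
import random

def lexicase_select(population, errors):
    """Select one parent by lexicase selection.

    errors[i][t] is individual i's error on test case t (lower is
    better). Cases are considered in random order; at each step only
    the individuals best on the current case survive the filter."""
    candidates = list(range(len(population)))
    cases = list(range(len(errors[0])))
    random.shuffle(cases)
    for t in cases:
        best = min(errors[i][t] for i in candidates)
        candidates = [i for i in candidates if errors[i][t] == best]
        if len(candidates) == 1:
            break
    return population[random.choice(candidates)]
```

Because a different case ordering can favor different specialists, repeated selections tend to preserve individuals that excel on distinct subsets of cases, which is consistent with the diversity results the chapter reports.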
Behavioral Program Synthesis: Insights and Prospects
Abstract
Genetic programming (GP) is a stochastic, iterative generate-and-test approach to synthesizing programs from tests, i.e. examples of the desired input-output mapping. The number of passed tests, or the total error in continuous domains, is a natural objective measure of a program’s performance and a common yardstick when experimentally comparing algorithms. In GP, it is also by default used to guide the evolutionary search process. An assumption that an objective function should also be an efficient ‘search driver’ is common for all metaheuristics, such as the evolutionary algorithms which GP is a member of. Programs are complex combinatorial structures that exhibit even more complex input-output behavior, and in this chapter we discuss why this complexity cannot be effectively reflected by a single scalar objective. In consequence, GP algorithms are systemically ‘underinformed’ about the characteristics of programs they operate on, and pay for this with unsatisfactory performance and limited scalability. This chapter advocates behavioral program synthesis, where programs are characterized by informative execution traces that enable multifaceted evaluation and substantially change the roles of components in an evolutionary infrastructure. We provide a unified perspective on past work in this area, discuss the consequences of the behavioral viewpoint, outlining the future avenues for program synthesis and the wider application areas that lie beyond.
Krzysztof Krawiec, Jerry Swan, Una-May O’Reilly
Using Graph Databases to Explore the Dynamics of Genetic Programming Runs
Abstract
For both practical reasons and those of habit, most evolutionary computation research is presented in highly summary form. These summaries, however, often obscure or completely mask the profusion of specific selections, crossovers, and mutations that are ultimately responsible for the aggregate behaviors we’re interested in. In this chapter we take a different approach and use the Neo4j graph database system to record and analyze the entire genealogical history of a set of genetic programming runs. We then explore a few of these runs in detail, discovering important properties of lexicase selection; these may in turn help us better understand the dynamics of lexicase selection, and the ways in which it differs from tournament selection. More broadly, we illustrate the value of recording and analyzing this level of detail, both as a means of understanding the dynamics of particular runs, and as a way of generating questions and ideas for subsequent, broader study.
Nicholas Freitag McPhee, David Donatucci, Thomas Helmuth
Predicting Product Choice with Symbolic Regression and Classification
Abstract
Market researchers often conduct surveys to measure how much value consumers place on the various features of a product. The resulting data should enable managers to combine these utility values in different ways to predict the market share of a product with a new configuration of features. Researchers assess the accuracy of these choice models by measuring the extent to which the summed utilities can predict actual market shares when respondents choose from sets of complete products. The current paper includes data from 201 consumers who gave ratings to 18 cell phone features and then ranked eight complete cell phones. A simple summing of the utility values predicted the correct product on the ranking task for 22.8 % of respondents. Another accuracy measurement is to compare the market shares for each product from the ranking task with the estimated market shares based on summed utilities. This produced a mean absolute difference between ranked and estimated market shares of 7.8 %. The current paper applied two broad strategies to improve these prediction methods. Various evolutionary search methods were used to classify the data for each respondent to predict one of eight discrete choices. The fitness measure of the classification approach seeks to reduce the Classification Error Percent (CEP), which minimizes the percent of incorrect classifications. This produced a significantly better fit, with the hit rate rising from 22.8 to 35.8 %. The mean absolute deviation between actual and estimated market shares declined from 7.8 to 6.1 % (p < 0.01). A simple language specification for defining symbolic regression and classification searches is also illustrated.
Philip Truscott, Michael F. Korns
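The Classification Error Percent (CEP) fitness measure named in this abstract is simply the percentage of incorrect classifications; a minimal sketch (the function name is ours):

```python
def classification_error_percent(predicted, actual):
    """CEP: percentage of incorrect classifications (lower is better)."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)
```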
Multiclass Classification Through Multidimensional Clustering
Abstract
Classification is one of the most important machine learning tasks in science and engineering. However, it can be a difficult task, in particular when a high number of classes is involved. Genetic Programming, despite its recognized successfulness in so many different domains, is one of the machine learning methods that typically struggles, and often fails, to provide accurate solutions for multiclass classification problems. We present a novel algorithm for tree based GP that incorporates some ideas on the representation of the solution space in higher dimensions, and can be generalized to other types of GP. We test three variants of this new approach on a large set of benchmark problems from several different sources, and observe their competitiveness against the most successful state-of-the-art classifiers like Random Forests, Random Subspaces and Multilayer Perceptron.
Sara Silva, Luis Muñoz, Leonardo Trujillo, Vijay Ingalalli, Mauro Castelli, Leonardo Vanneschi
Prime-Time: Symbolic Regression Takes Its Place in the Real World
Abstract
In this chapter we review a number of real-world applications where symbolic regression was used recently and with great success. Industrial-scale symbolic regression, armed with the power to select the right variables and variable combinations, build robust and trustworthy predictions, and guide experimentation, has undoubtedly earned its place in industrial process optimization, business forecasting, product design, and now complex systems modeling and policy making.
Sean Stijven, Ekaterina Vladislavleva, Arthur Kordon, Lander Willem, Mark E. Kotanchek
Backmatter
Metadata
Title
Genetic Programming Theory and Practice XIII
Editors
Rick Riolo
W.P. Worzel
Mark Kotanchek
Arthur Kordon
Copyright Year
2016
Electronic ISBN
978-3-319-34223-8
Print ISBN
978-3-319-34221-4
DOI
https://doi.org/10.1007/978-3-319-34223-8
