nach oben

2014 | Buch

Kapitel lesen Erstes Kapitel lesen

Genetic Programming Theory and Practice XI

herausgegeben von: Rick Riolo, Jason H. Moore, Mark Kotanchek

Verlag: Springer New York

Buchreihe : Genetic and Evolutionary Computation

Enthalten in: Springer Professional "Wirtschaft+Technik" , Springer Professional "Technik" , Springer Professional "Wirtschaft"

Einloggen, um Zugang zu erhalten

Über dieses Buch

These contributions, written by the foremost international researchers and practitioners of Genetic Programming (GP), explore the synergy between theoretical and empirical results on real-world problems, producing a comprehensive view of the state of the art in GP. Topics in this volume include: evolutionary constraints, relaxation of selection mechanisms, diversity preservation strategies, flexing fitness evaluation, evolution in dynamic environments, multi-objective and multi-modal selection, foundations of evolvability, evolvable and adaptive evolutionary operators, foundation of injecting expert knowledge in evolutionary search, analysis of problem difficulty and required GP algorithm complexity, foundations in running GP on the cloud – communication, cooperation, flexible implementation, and ensemble methods. Additional focal points for GP symbolic regression are: (1) The need to guarantee convergence to solutions in the function discovery mode; (2) Issues on model validation; (3) The need for model analysis workflows for insight generation based on generated GP solutions – model exploration, visualization, variable selection, dimensionality analysis; (4) Issues in combining different types of data. Readers will discover large-scale, real-world applications of GP to a variety of problem domains via in-depth presentations of the latest and most significant results.

Inhaltsverzeichnis

Frontmatter

Chapter 1. Extreme Accuracy in Symbolic Regression

Abstract

Although recent advances in symbolic regression (SR) have promoted the field into the early stages of commercial exploitation, the poor accuracy of SR is still plaguing even the most advanced commercial packages today. Users expect to have the correct formula returned, especially in cases with zero noise and only one basis function with minimally complex grammar depth. Poor accuracy is a hindrance to greater academic and industrial acceptance of SR tools.

In a previous paper, the poor accuracy of Symbolic Regression was explored, and several classes of test formulas, which prove intractable for SR, were examined. An understanding of why these test problems prove intractable was developed. In another paper a baseline Symbolic Regression algorithm was developed with specific techniques for optimizing embedded real numbers constants. These previous steps have placed us in a position to make an attempt at vanquishing the SR accuracy problem.

In this chapter we develop a complex algorithm for modern symbolic regression which is extremely accurate for a large class of Symbolic Regression problems. The class of problems, on which SR is extremely accurate, is described in detail. A definition of extreme accuracy is provided, and an informal argument of extreme SR accuracy is outlined in this chapter. Given the critical importance of accuracy in SR, it is our suspicion that in the future all commercial Symbolic Regression packages will use this algorithm or a substitute for this algorithm.

Michael F. Korns

Chapter 2. Exploring Interestingness in a Computational Evolution System for the Genome-Wide Genetic Analysis of Alzheimer’s Disease

Abstract

Susceptibility to Alzheimer’s disease is likely due to complex interaction among many genetic and environmental factors. Identifying complex genetic effects in large data sets will require computational methods that extend beyond what parametric statistical methods such as logistic regression can provide. We have previously introduced a computational evolution system (CES) that uses genetic programming (GP) to represent genetic models of disease and to search for optimal models in a rugged fitness landscape that is effectively infinite in size. The CES approach differs from other GP approaches in that it is able to learn how to solve the problem by generating its own operators. A key feature is the ability for the operators to use expert knowledge to guide the stochastic search. We have previously shown that CES is able to discover nonlinear genetic models of disease susceptibility in both simulated and real data. The goal of the present study was to introduce a measure of interestingness into the modeling process. Here, we define interestingness as a measure of non-additive gene-gene interactions. That is, we are more interested in those CES models that include attributes that exhibit synergistic effects on disease risk. To implement this new feature we first pre-processed the data to measure all pairwise gene-gene interaction effects using entropy-based methods. We then provided these pre-computed measures to CES as expert knowledge and as one of three fitness criteria in three-dimensional Pareto optimization. We applied this new CES algorithm to an Alzheimer’s disease data set with approximately 520,000 genetic attributes. We show that this approach discovers more interesting models with the added benefit of improving classification accuracy. This study demonstrates the applicability of CES to genome-wide genetic analysis using expert knowledge derived from measures of interestingness.

Jason H. Moore, Douglas P. Hill, Andrew Saykin, Li Shen

Chapter 3. Optimizing a Cloud Contract Portfolio Using Genetic Programming-Based Load Models

Abstract

Infrastructure-as-a-Service (IaaS) cloud providers offer a number of different tariff structures. The user has to balance the flexibility of the often quoted pay-by-the-hour, fixed price (“on demand”) model against the lower-cost-per-hour rate of a “reserved contract”. These tariff structures offer a significantly reduced cost per server hour (up to 50 %), in exchange for an up-front payment by the consumer. In order to reduce costs using these reserved contracts, a user has to make an estimation of its future compute demands, and purchase reserved contracts accordingly. The key to optimizing these cost benefits is to have an accurate model of the customer’s future compute load – where that load can have a variety of trends and cyclic behaviour on multiple time scales. In this chapter, we use genetic programming to develop load models for a number of large-scale web sites based on real-world data. The predicted future load is subsequently used by a resource manager to optimize the amount of IaaS servers a consumer should allocate at a cloud provider, and the optimal tariff plans (from a cost perspective) for that allocation. Our results illustrate the benefits of load forecasting for cost-efficient IaaS portfolio selection. They also might be of interest for the Genetic Programming (GP) community as a demonstration that GP symbolic regression can be successfully used for modelling discrete time series and has a tremendous potential for time lag identification and model structure discovery.

Sean Stijven, Ruben Van den Bossche, Ekaterina Vladislavleva, Kurt Vanmechelen, Jan Broeckhove, Mark Kotanchek

Chapter 4. Maintenance of a Long Running Distributed Genetic Programming System for Solving Problems Requiring Big Data

Abstract

We describe a system, ECStar, that outstrips many scaling aspects of extant genetic programming systems. One instance in the domain of financial strategies has executed for extended durations (months to years) on nodes distributed around the globe. ECStar system instances are almost never stopped and restarted, though they are resource elastic. Instead they are interactively redirected to different parts of the problem space and updated with up-to-date learning. Their non-reproducibility (i.e. single “play of the tape” process) due to their complexity makes them similar to real biological systems. In this contribution we focus upon how ECStar introduces a provocative, important, new paradigm for GP by its sheer size and complexity. ECStar’s scale, volunteer compute nodes and distributed hub-and-spoke design have implications on how a multi-node instance is managed. We describe the set up, deployment, operation and update of an instance of such a large, distributed and long running system. Moreover, we outline how ECStar is designed to allow manual guidance and re-alignment of its evolutionary search trajectory.

Babak Hodjat, Erik Hemberg, Hormoz Shahrzad, Una-May O’Reilly

Chapter 5. Grounded Simulation: Using Simulated Evolution to Guide Embodied Evolution

Abstract

Flash memory is a type of non-volatile memory that is being used at an ever-increasing rate in enterprise level storage devices. It is extremely fast (compared to magnetic disks), requires low power and generates very little heat. Unfortunately, it has a relatively short life span, as it wears out over time, compromising its ability to retain data. Previous work has used GAs to tune the control parameters of Flash to trade retention for endurance and, although successful, was prohibitively costly in terms of time, as all testing had to be done in hardware. This chapter describes the next stages in this work, which use a combination of simulated (faster, but less reliable) and embodied (slower, but more reliable) evolution to produce internal parameter sets for Flash memory that increase the endurance by an order of magnitude while still maintaining an industry accepted level of retention.

Conor Ryan, Joe Sullivan, Barry Fitzgerald

Chapter 6. Applying Genetic Programming in Business Forecasting

Abstract

Since the global recession of 2008–2009, it has been much more widely understood that reliable economic forecasting is needed in business decision-making. Of special interest are the forecasting methods based on explanatory variables (economic drivers), the most popular of which is the Auto-Regressive Integrated Moving-Average with eXplanatory variables (ARIMAX) model. A limitation of this approach, however, is the assumption of linear relationships between the explanatory variables and the target variable. Genetic programming is a potential solution for representing nonlinearity and a hybrid scheme of integrating static and dynamic nonlinear transforms into the ARIMAX models is proposed in the chapter. From an implementation point of the view the proposed solution has several advantages, such as: optimal synergy between two well-known approaches like GP and ARIMAX, avoiding the need for developing a solid theoretical alternative for nonlinear time series modeling, using available forecasting software, and low efforts to train the final user. The proposed approach is illustrated with two examples from real business applications in the area of raw materials forecasting.

Arthur K. Kordon

Chapter 7. Explaining Unemployment Rates with Symbolic Regression

Abstract

Much of the research on the accuracy of symbolic regression (SR) has focused on artificially constructed search problems where there is zero noise in the data. Such problems admit of exact solutions but cannot tell us how accurate the search process is in a noisy real world domain. To explore this question symbolic regression is applied here to an area of research which has been well-travelled by regression modelers: the prediction of unemployment rates. A respected dataset was selected, the CEP-OECD Labor Market Institutions Database, to provide a testing environment for a variety of searches. Metrics of success for this paper went beyond the normal yardsticks of statistical significance to demand “plausibility”. Here it is assumed that a plausible model must be able to predict unemployment rates out of the sample period for six future years: this metric is referred to as the “out of sample R²”. We conclude that the two packages tested, Eureqa and ARC, can produce models that go beyond the power of traditional stepwise regression. ARC, in particular, is able to replicate the format of published economic research because ARC contains a high level Regression Query Language (RQL). This research produced a number of models that are consistent with published economic research, have in sample R² values over 0.80, no negative unemployment rates, and out of sample R² values above 0.45. It is argued that SR offers significant new advantages to social science researchers.

Philip Truscott, Michael F. Korns

Chapter 8. Uniform Linear Transformation with Repair and Alternation in Genetic Programming

Abstract

Several genetic programming researchers have argued for the utility of genetic operators that act uniformly. By “act uniformly” we mean two specific things: that the probability of an inherited program component being modified during inheritance is independent of the size and shape of the parent programs beyond the component in question; and that pairs of parents are combined in ways that allow arbitrary combinations of components from each parent to appear in the child. Uniform operators described in previous work have had limited utility, however, because of a mismatch between the relevant notions of uniformity and the hierarchical structure and variable sizes of many genetic programming representations. In this chapter we describe a new genetic operator, ULTRA, which incorporates aspects of both mutation and crossover and acts approximately uniformly across programs of variable sizes and structures. ULTRA treats hierarchical programs as linear sequences and includes a repair step to ensure that syntax constraints are satisfied after variation. We show that on the drug bioavailability and Pagie-1 benchmark problems ULTRA produces significant improvements both in problem-solving power and in program size relative to standard operators. Experiments with factorial regression and with the boolean 6-multiplexer problem demonstrate that ULTRA can manipulate programs that make use of hierarchical structure, but also that it is not always beneficial. The demonstrations evolve programs in the Push programming language, which makes repair particularly simple, but versions of the technique should be applicable in other genetic programming systems as well.

Lee Spector, Thomas Helmuth

Chapter 9. A Deterministic and Symbolic Regression Hybrid Applied to Resting-State fMRI Data

Abstract

Symbolic regression (SR) is one the most popular applications of genetic programming (GP) and an attractive alternative to the standard deterministic regression approaches due to its flexibility in generating free-form mathematical models from observed data without any domain knowledge. However, GP suffers from various issues hindering the applicability of the technique to real-life problems. In this paper, we show that a hybrid deterministic regression (DR)/genetic programming based symbolic regression (GP-SR) algorithm outperforms GP-SR alone on a brain imaging dataset.

Ilknur Icke, Nicholas A. Allgaier, Christopher M. Danforth, Robert A. Whelan, Hugh P. Garavan, Joshua C. Bongard

Chapter 10. Gaining Deeper Insights in Symbolic Regression

Abstract

A distinguishing feature of symbolic regression using genetic programming is its ability to identify complex nonlinear white-box models. This is especially relevant in practice where models are extensively scrutinized in order to gain knowledge about underlying processes. This potential is often diluted by the ambiguity and complexity of the models produced by genetic programming. In this contribution we discuss several analysis methods with the common goal to enable better insights in the symbolic regression process and to produce models that are more understandable and show better generalization. In order to gain more information about the process we monitor and analyze the progresses of population diversity, building block information, and even more general genealogy information. Regarding the analysis of results, several aspects such as model simplification, relevance of variables, node impacts, and variable network analysis are presented and discussed.

Michael Affenzeller, Stephan M. Winkler, Gabriel Kronberger, Michael Kommenda, Bogdan Burlacu, Stefan Wagner

Chapter 11. Geometric Semantic Genetic Programming for Real Life Applications

Abstract

In a recent contribution we have introduced a new implementation of geometric semantic operators for Genetic Programming. Thanks to this implementation, we are now able to deeply investigate their usefulness and study their properties on complex real-life applications. Our experiments confirm that these operators are more effective than traditional ones in optimizing training data, due to the fact that they induce a unimodal fitness landscape. Furthermore, they automatically limit overfitting, something we had already noticed in our recent contribution, and that is further discussed here. Finally, we investigate the influence of some parameters on the effectiveness of these operators, and we show that tuning their values and setting them “a priori” may be wasted effort. Instead, if we randomly modify the values of those parameters several times during the evolution, we obtain a performance that is comparable with the one obtained with the best setting, both on training and test data for all the studied problems.

Leonardo Vanneschi, Sara Silva, Mauro Castelli, Luca Manzoni

Chapter 12. Evaluation of Parameter Contribution to Neural Network Size and Fitness in ATHENA for Genetic Analysis

Abstract

The vast amount of available genomics data provides us an unprecedented ability to survey the entire genome and search for the genetic determinants of complex diseases. Until now, Genome-wide association studies have been the predominant method to associate DNA variations to disease traits. GWAS have successfully uncovered many genetic variants associated with complex diseases when the effect loci are strongly associated with the trait. However, methods for studying interaction effects among multiple loci are still lacking. Established machine learning methods such as the grammatical evolution neural networks (GENN) can be adapted to help us uncover the missing interaction effects that are not captured by GWAS studies. We used an implementation of GENN distributed in the software package ATHENA (Analysis Tool for Heritable and Environmental Network Associations) to investigate the effects of multiple GENN parameters and data noise levels on model detection and network structure. We concluded that the models produced by GENN were greatly affected by algorithm parameters and data noise levels. We also produced complex, multi-layer networks that were not produced in the previous study. In summary, GENN can produce complex, multi-layered networks when the data require it for higher fitness and when the parameter settings allow for a wide search of the complex model space.

Ruowang Li, Emily R. Holzinger, Scott M. Dudek, Marylyn D. Ritchie

Backmatter

Titel: Genetic Programming Theory and Practice XI
herausgegeben von: Rick Riolo
Jason H. Moore
Mark Kotanchek
Verlag: Springer New York
Electronic ISBN: 978-1-4939-0375-7
Print ISBN: 978-1-4939-0374-0
DOI: https://doi.org/10.1007/978-1-4939-0375-7