Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science

Authors:
Randal S. Olson, University of Pennsylvania, Philadelphia, PA, USA
Nathan Bartley, University of Chicago, Chicago, IL, USA
Ryan J. Urbanowicz, University of Pennsylvania, Philadelphia, PA, USA
Jason H. Moore, University of Pennsylvania, Philadelphia, PA, USA

GECCO '16: Proceedings of the Genetic and Evolutionary Computation Conference 2016, July 2016, Pages 485–492. https://doi.org/10.1145/2908812.2908918

Published: 20 July 2016

ABSTRACT

As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement an open-source Tree-based Pipeline Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a series of simulated and real-world benchmark data sets. In particular, we show that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input or prior knowledge from the user. We also address the tendency for TPOT to design overly complex pipelines by integrating Pareto optimization, which produces compact pipelines without sacrificing classification accuracy. As such, this work represents an important step toward fully automating machine learning pipeline design.
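
The Pareto optimization idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the objectives or the selection procedure, so this example assumes two objectives, classification accuracy (maximized) and the number of operators in a pipeline (minimized, as a stand-in for complexity). The `Candidate` type, the pipeline labels, and all scores are hypothetical values made up for illustration; this is not TPOT's implementation.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str          # hypothetical pipeline label
    accuracy: float    # assumed objective 1: higher is better
    n_operators: int   # assumed objective 2: lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.n_operators <= b.n_operators
    strictly_better = a.accuracy > b.accuracy or a.n_operators < b.n_operators
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up population of candidate pipelines.
population = [
    Candidate("A", accuracy=0.90, n_operators=12),
    Candidate("B", accuracy=0.90, n_operators=3),
    Candidate("C", accuracy=0.85, n_operators=1),
    Candidate("D", accuracy=0.80, n_operators=5),
]

front = pareto_front(population)
print([c.name for c in front])  # prints ['B', 'C']
```

In this toy population, pipeline A is discarded because B reaches the same accuracy with fewer operators, which mirrors the abstract's claim that compact pipelines can be preferred without giving up accuracy.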

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Bio-inspired approaches
        Genetic programming
2. Software and its engineering
  1. Software creation and management
    1. Search-based software engineering

Published in
GECCO '16: Proceedings of the Genetic and Evolutionary Computation Conference 2016
July 2016
1196 pages
ISBN: 9781450342063
DOI: 10.1145/2908812
Editor: Tobias Friedrich, Hasso Plattner Institute
General Chair: Frank Neumann, University of Adelaide
Program Chair: Andrew M. Sutton, Hasso Plattner Institute
Copyright © 2016 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
Pareto optimization
data science
genetic programming
hyperparameter optimization
machine learning
pipeline optimization
python
Qualifiers
- research-article
Acceptance Rates
GECCO '16 Paper Acceptance Rate: 137 of 381 submissions, 36%
Overall Acceptance Rate: 1,669 of 4,410 submissions, 38%