ABSTRACT
As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a series of simulated and real-world benchmark data sets. In particular, we show that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input or prior knowledge from the user. We also address the tendency for TPOT to design overly complex pipelines by integrating Pareto optimization, which produces compact pipelines without sacrificing classification accuracy. As such, this work represents an important step toward fully automating machine learning pipeline design.
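To make the Pareto optimization idea concrete, the sketch below filters a set of candidate pipelines down to those that are not dominated on two objectives: higher classification accuracy and fewer pipeline operators. It is a minimal illustration only, not TPOT's implementation. The candidate names, accuracy values, operator counts, and the helper functions `dominates` and `pareto_front` are all hypothetical.

```python
from typing import List, Tuple

# Each candidate is (name, accuracy, n_operators).
# All values here are made up for illustration; they are not results from the paper.
Candidate = Tuple[str, float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives and strictly better on one.

    Objectives: maximize accuracy, minimize the number of pipeline operators.
    """
    _, acc_a, size_a = a
    _, acc_b, size_b = b
    no_worse = acc_a >= acc_b and size_a <= size_b
    strictly_better = acc_a > acc_b or size_a < size_b
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


pipelines = [
    ("A", 0.90, 7),  # same accuracy as B but larger, so B dominates it
    ("B", 0.90, 3),
    ("C", 0.85, 1),
    ("D", 0.80, 2),  # less accurate and larger than C, so C dominates it
    ("E", 0.92, 9),
]

front = pareto_front(pipelines)
front_names = sorted(name for name, _, _ in front)
print(front_names)
```

Under these assumed values, the front keeps a trade-off curve from the smallest pipeline to the most accurate one. A compact pipeline is dropped only when another candidate is at least as accurate and no larger, which is the sense in which compactness does not cost accuracy.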