DOI: 10.1145/2908812.2908918 (GECCO Conference Proceedings)
research-article
Open Access

Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science

Published: 20 July 2016

ABSTRACT

As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a series of simulated and real-world benchmark data sets. In particular, we show that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input or prior knowledge from the user. We also address the tendency of TPOT to design overly complex pipelines by integrating Pareto optimization, which produces compact pipelines without sacrificing classification accuracy. As such, this work represents an important step toward fully automating machine learning pipeline design.
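
To illustrate the Pareto idea mentioned in the abstract, the following is a minimal sketch of non-dominated filtering over two objectives: classification accuracy (maximized) and pipeline size (minimized). It is not TPOT's implementation; the `Candidate` type, the `dominates` and `pareto_front` helpers, and the example scores are all hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical summary of one candidate pipeline (not a TPOT type)."""
    name: str        # illustrative label
    accuracy: float  # classification accuracy, higher is better
    size: int        # number of pipeline operators, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.size <= b.size
    strictly_better = a.accuracy > b.accuracy or a.size < b.size
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores, for illustration only.
candidates = [
    Candidate("A", 0.91, 7),
    Candidate("B", 0.91, 3),
    Candidate("C", 0.88, 1),
    Candidate("D", 0.85, 2),
    Candidate("E", 0.93, 9),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy example, pipeline A is dropped because B reaches the same accuracy with fewer operators, and D is dropped because C is both smaller and more accurate. The remaining candidates represent different trade-offs between compactness and accuracy, which is the sense in which Pareto optimization can favor compact pipelines without giving up accuracy.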

