DOI: 10.1145/3395363.3397366
Research Article · Public Access

Detecting flaky tests in probabilistic and machine learning applications

Published: 18 July 2020

ABSTRACT

Probabilistic programming systems and machine learning frameworks such as Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. However, these operations are inherently non-deterministic. As a result, it is challenging for developers to write tests for applications that depend on such frameworks, which often results in flaky tests – tests that fail non-deterministically when run on the same version of the code.
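
To make this concrete, here is a hypothetical flaky test of the kind the paper targets. This example is ours, not from the paper; it assumes PyTorch and a pytest-style test function:

    # Hypothetical flaky test (illustrative only, not from the paper).
    # No random seed is fixed, so each run draws a different sequence of
    # random numbers from PyTorch's global generator.
    import torch

    def test_mean_estimate_close_to_zero():
        samples = torch.randn(1000)  # 1,000 draws from N(0, 1)
        # The sample mean has standard deviation ~0.032, so this bound
        # holds in most runs but is exceeded by chance in roughly 11% of
        # them: the test passes or fails non-deterministically.
        assert samples.mean().abs() < 0.05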

In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports and commits that deal with flaky tests, and we categorize their common causes and fixes. This study provides developers with useful insights for dealing with flaky tests in this domain.

Motivated by our study, we develop a technique, FLASH, to systematically detect flaky tests whose assertions pass in some runs and fail in others on the same code. These assertions fail due to differences in the sequence of random numbers across runs of the same test. FLASH exposes such failures; our evaluation on 20 projects detects 11 previously unknown flaky tests, which we reported to developers.
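
To sketch the underlying idea (a much-simplified version of FLASH, whose actual technique is more sophisticated, e.g., in deciding how many reruns are needed), one can rerun a test body under many distinct random seeds and flag the test as flaky if its assertion passes in some runs and fails in others. The helper names below are hypothetical:

    # Minimal sketch of seed-driven flaky-test detection, in the spirit
    # of FLASH but much simplified. All names here are hypothetical.
    import torch

    def run_with_seed(test_fn, seed):
        torch.manual_seed(seed)  # fix the sequence of random numbers
        try:
            test_fn()
            return True   # assertion passed for this seed
        except AssertionError:
            return False  # assertion failed for this seed

    def is_flaky(test_fn, num_runs=100):
        outcomes = [run_with_seed(test_fn, s) for s in range(num_runs)]
        # Flaky if both passing and failing runs were observed.
        return 0 < sum(outcomes) < num_runs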


Published in
ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2020, 591 pages
ISBN: 9781450380089
DOI: 10.1145/3395363
Publisher: Association for Computing Machinery, New York, NY, United States
Copyright © 2020 ACM
