ABSTRACT
Probabilistic programming systems and machine learning frameworks like Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. However, many of these operations are inherently non-deterministic. Hence, it is challenging for developers to write tests for applications that depend on such frameworks, often resulting in flaky tests – tests that fail non-deterministically when run on the same version of the code.
In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study the projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports/commits that deal with flaky tests, and we categorize the common causes and fixes for them. This study provides developers with useful insights on dealing with flaky tests in this domain.
Motivated by our study, we develop a technique, FLASH, to systematically detect flaky tests due to assertions passing and failing in different runs on the same code. These assertions fail due to differences in the sequence of random numbers in different runs of the same test. FLASH exposes such failures, and our evaluation on 20 projects results in 11 previously-unknown flaky tests that we reported to developers.
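The core phenomenon described above can be illustrated with a minimal sketch: a test that asserts a tight tolerance on a stochastic computation will pass or fail depending on the sequence of random numbers, and rerunning it under many seeds exposes the flakiness. The function names below (`train_step`, `flaky_test`, `detect_flakiness`) are hypothetical illustrations, not FLASH's actual implementation, and the standard-library `random` module stands in for a probabilistic framework.

```python
import random
import statistics

def train_step(seed):
    """Toy stand-in for a stochastic training/inference routine (hypothetical)."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(100)]
    return statistics.mean(samples)

def flaky_test(seed):
    # Assertion with a tight tolerance: the sample mean of 100 standard
    # normals has standard deviation 0.1, so this passes for some seeds
    # and fails for others.
    return abs(train_step(seed)) < 0.1

def detect_flakiness(test, n_runs=50):
    """Rerun the test under many seeds; it is flaky if outcomes differ."""
    outcomes = {test(seed) for seed in range(n_runs)}
    return len(outcomes) > 1

print(detect_flakiness(flaky_test))
```

A developer who pins a single seed in such a test merely hides the problem; varying the seed across runs, as sketched here, is what surfaces the assertion's sensitivity to the random-number sequence.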