ABSTRACT
Probabilistic programming systems and machine learning frameworks like Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. However, many of these operations are inherently non-deterministic. Hence, it is challenging for developers to write tests for applications that depend on such frameworks, often resulting in flaky tests – tests that fail non-deterministically when run on the same version of the code.
In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study the projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports/commits that deal with flaky tests, and we categorize the common causes and fixes for them. This study provides developers with useful insights on dealing with flaky tests in this domain.
Motivated by our study, we develop a technique, FLASH, to systematically detect flaky tests due to assertions passing and failing in different runs on the same code. These assertions fail due to differences in the sequence of random numbers in different runs of the same test. FLASH exposes such failures, and our evaluation on 20 projects results in 11 previously-unknown flaky tests that we reported to developers.
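The core phenomenon described above can be illustrated with a minimal sketch: a test that asserts a tight tolerance on a stochastic computation will pass or fail depending on the sequence of random numbers, and rerunning it under many seeds exposes the flakiness. The function names below (`train_step`, `flaky_test`, `detect_flakiness`) are hypothetical illustrations, not FLASH's actual implementation, and the standard-library `random` module stands in for a probabilistic framework.

```python
import random
import statistics

def train_step(seed):
    """Toy stand-in for a stochastic training/inference routine (hypothetical)."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(100)]
    return statistics.mean(samples)

def flaky_test(seed):
    # Assertion with a tight tolerance: the sample mean of 100 standard
    # normals has standard deviation 0.1, so this passes for some seeds
    # and fails for others.
    return abs(train_step(seed)) < 0.1

def detect_flakiness(test, n_runs=50):
    """Rerun the test under many seeds; it is flaky if outcomes differ."""
    outcomes = {test(seed) for seed in range(n_runs)}
    return len(outcomes) > 1

print(detect_flakiness(flaky_test))
```

A developer who pins a single seed in such a test merely hides the problem; varying the seed across runs, as sketched here, is what surfaces the assertion's sensitivity to the random-number sequence.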