DOI: 10.1145/3472674.3473981

Toward static test flakiness prediction: a feasibility study

Published: 23 August 2021

ABSTRACT

Flaky tests are tests that exhibit both passing and failing behavior when run against the same code. While the research community has attempted to define automated approaches for detecting and addressing test flakiness, most of them suffer from scalability issues and uncertainty because they require test cases to be run multiple times. This limitation has recently been addressed with machine learning solutions that predict test flakiness from a combination of static and dynamic metrics, thus avoiding test re-execution. Building on this effort, this paper takes the first steps toward an orthogonal view of the problem, namely the classification of flaky tests using only statically computable software metrics. We conduct a feasibility study on 72 projects of the iDFlakies dataset and investigate the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells. First, we statistically assess those differences. Second, we build a logistic regression model to verify the extent to which the observed differences remain significant when the metrics are considered together. The results show a relation between test flakiness and a number of test and production code factors, indicating that it is possible to build classification approaches that exploit those factors to predict test flakiness.
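To make the two analysis steps concrete, the sketch below shows how such a study could be set up in Python: a non-parametric test compares each metric between flaky and non-flaky tests, and a logistic regression, preceded by a variance inflation factor (VIF) check for multicollinearity, evaluates the metrics jointly. This is a minimal illustration rather than the authors' pipeline; the metric names and the synthetic data are hypothetical, whereas the actual study uses 25 metrics computed on the iDFlakies projects.

# Minimal sketch (not the authors' code) of the two analysis steps described in
# the abstract: (1) statistically compare each static metric between flaky and
# non-flaky tests, and (2) fit a logistic regression to check whether the
# metrics remain significant when considered together. Metric names and the
# synthetic data below are purely illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import mannwhitneyu
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 400

# Hypothetical per-test dataset: a few statically computable metrics plus a
# binary flakiness label (1 = flaky, 0 = non-flaky).
data = pd.DataFrame({
    "test_loc": rng.poisson(30, n),
    "assertion_count": rng.poisson(4, n),
    "cyclomatic_complexity": rng.poisson(3, n),
    "is_flaky": rng.integers(0, 2, n),
})
metrics = ["test_loc", "assertion_count", "cyclomatic_complexity"]

# Step 1: per-metric comparison between flaky and non-flaky tests
# (here with a Mann-Whitney U test, as one possible non-parametric choice).
flaky = data[data["is_flaky"] == 1]
non_flaky = data[data["is_flaky"] == 0]
for m in metrics:
    stat, p = mannwhitneyu(flaky[m], non_flaky[m], alternative="two-sided")
    print(f"{m}: U={stat:.1f}, p={p:.3f}")

# Step 2: logistic regression with all metrics together, after checking
# multicollinearity through the variance inflation factor.
X = sm.add_constant(data[metrics].astype(float))
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"VIF({col}) = {variance_inflation_factor(X.values, i):.2f}")

model = sm.Logit(data["is_flaky"], X).fit(disp=False)
print(model.summary())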

References

  1. A. Alshammari, C. Morris, M. Hilton, and J. Bell. 2021. Flakeflagger: Predicting flakiness without rerunning tests. In ICSE 2021. 1572–1584. https://doi.org/10.1109/ICSE43902.2021.00140
  2. J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, and D. Marinov. 2018. DeFlaker: Automatically detecting flaky tests. In ICSE 2018. 433–444. https://doi.org/10.1145/3180155.3180164
  3. B. Camara, M. Silva, A. Endo, and S. Vergilio. 2021. What is the Vocabulary of Flaky Tests? An Extended Replication. arXiv preprint arXiv:2103.12670.
  4. G. Catolino, F. Palomba, A. Zaidman, and F. Ferrucci. 2019. How the experience of development teams relates to assertion density of test classes. In ICSME 2019. 223–234. https://doi.org/10.1109/ICSME.2019.00034
  5. S. Chidamber and C. Kemerer. 1994. A metrics suite for object oriented design. IEEE TSE, 20, 6 (1994), 476–493. https://doi.org/10.1109/32.295895
  6. B. Daniel, V. Jagannath, D. Dig, and D. Marinov. 2009. ReAssert: Suggesting repairs for broken unit tests. In ASE 2009. 433–444. https://doi.org/10.1109/ASE.2009.17
  7. M. Eck, F. Palomba, M. Castelluccio, and A. Bacchelli. 2019. Understanding flaky tests: The developer’s perspective. In ESEC/FSE 2019. 830–840. https://doi.org/10.1145/3338906.3338945
  8. M. Fowler. 2011. Eradicating non-determinism in tests. Martin Fowler Personal Blog.
  9. M. Fowler. 2018. Refactoring: improving the design of existing code. Addison-Wesley Professional. ISBN 9788131734667.
  10. G. Garson. 2012. Testing statistical assumptions. Asheboro, NC: Statistical Associates Publishing.
  11. G. Grano, C. De Iaco, F. Palomba, and H. Gall. 2020. Pizza versus Pinsa: On the Perception and Measurability of Unit Test Code Quality. In ICSME 2020. 336–347. https://doi.org/10.1109/ICSME46990.2020.00040
  12. G. Grano, F. Palomba, and H. Gall. 2019. Lightweight assessment of test-case effectiveness using source-code-quality indicators. IEEE TSE. https://doi.org/10.1109/TSE.2019.2903057
  13. G. Haben, S. Habchi, M. Papadakis, M. Cordy, and Y. Le Traon. 2021. A Replication Study on the Usability of Code Vocabulary in Predicting Flaky Tests. In MSR 2021. https://doi.org/10.1109/MSR52588.2021.00034
  14. J. Han, M. Kamber, and J. Pei. 2011. Data mining concepts and techniques, third edition. The Morgan Kaufmann Series in Data Management Systems, 5, 4 (2011), 83–124.
  15. F. Lacoste. 2009. Killing the gatekeeper: Introducing a continuous integration system. In Agile Conference 2009. 387–392. https://doi.org/10.1109/AGILE.2009.35
  16. W. Lam, R. Oei, A. Shi, D. Marinov, and T. Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. In ICST 2019. 312–322. https://doi.org/10.1109/ICST.2019.00038
  17. W. Lam, S. Winter, A. Astorga, V. Stodden, and D. Marinov. 2020. Understanding Reproducibility and Characteristics of Flaky Tests Through Test Reruns in Java Projects. In ISSRE 2020. 403–413. https://doi.org/10.1109/ISSRE5003.2020.00045
  18. W. Lam, S. Winter, A. Wei, T. Xie, D. Marinov, and J. Bell. 2020. A large-scale longitudinal study of flaky tests. Proceedings of the ACM on Programming Languages, 4, OOPSLA (2020), 1–29. https://doi.org/10.1145/3428270
  19. Q. Luo, F. Hariri, L. Eloussi, and D. Marinov. 2014. An empirical analysis of flaky tests. In ESEC/FSE 2014. 643–653. https://doi.org/10.1145/2635868.2635920
  20. T. McCabe. 1976. A Complexity Measure. IEEE TSE, SE-2, 4 (1976), 308–320. https://doi.org/10.1109/TSE.1976.233837
  21. A. Memon and M. Cohen. 2013. Automated testing of GUI applications: models, tools, and controlling flakiness. In ICSE 2013. 1479–1480. https://doi.org/10.1109/ICSE.2013.6606750
  22. J. Micco. 2017. The state of continuous integration testing @ Google.
  23. N. Moha, Y. Guéhéneuc, L. Duchien, and A. Le Meur. 2009. Decor: A method for the specification and detection of code and design smells. IEEE TSE, 36, 1 (2009), 20–36. https://doi.org/10.1109/TSE.2009.50
  24. J. Murillo-Morera and M. Jenkins. 2015. A Software Defect-Proneness Prediction Framework: A new approach using genetic algorithms to generate learning schemes. In SEKE 2015. 445–450. https://doi.org/10.18293/SEKE2015-099
  25. J. Nelder and R. Wedderburn. 1972. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135, 3 (1972), 370–384. https://doi.org/10.2307/2344614
  26. R. O’Brien. 2007. A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41, 5 (2007), 673–690. https://doi.org/10.1007/s11135-006-9018-6
  27. F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, and A. De Lucia. 2018. On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation. Empirical Software Engineering, 23, 3 (2018), 1188–1221. https://doi.org/10.1145/3180155.3182532
  28. F. Pecorelli, G. Di Lillo, F. Palomba, and A. De Lucia. 2020. VITRuM: A Plug-In for the Visualization of Test-Related Metrics. In AVI 2020. 1–3. https://doi.org/10.1145/3399715.3399954
  29. F. Pecorelli, F. Palomba, and A. De Lucia. 2021. The Relation of Test-Related Factors to Software Quality: A Case Study on Apache Systems. Empirical Software Engineering, 26, 2 (2021). https://doi.org/10.1007/s10664-020-09891-y
  30. A. Perez, R. Abreu, and A. van Deursen. 2017. A test-suite diagnosability metric for spectrum-based fault localization approaches. In ICSE 2017. 654–664. https://doi.org/10.1109/ICSE.2017.66
  31. M. Pezze and M. Young. 2008. Software testing and analysis: process, principles, and techniques. John Wiley & Sons.
  32. G. Pinto, B. Miranda, S. Dissanayake, M. D’Amorim, C. Treude, and A. Bertolino. 2020. What is the vocabulary of flaky tests? In MSR 2020. 492–502. https://doi.org/10.1145/3379597.3387482
  33. V. Pontillo, F. Palomba, and F. Ferrucci. 2021. Toward Static Test Flakiness Prediction: A Feasibility Study. https://doi.org/10.6084/m9.figshare.14645895.v3
  34. V. Terragni, P. Salza, and F. Ferrucci. 2020. A container-based infrastructure for fuzzy-driven root causing of flaky tests. In ICSE 2020. 69–72.
  35. S. Thorve, C. Sreshtha, and N. Meng. 2018. An empirical study of flaky tests in android apps. In ICSME 2018. 534–538. https://doi.org/10.1109/ICSME.2018.00062
  36. A. van Deursen, L. Moonen, A. Van Den Bergh, and G. Kok. 2001. Refactoring test code. In XP 2001. 92–95.
  37. R. Verdecchia, E. Cruciani, B. Miranda, and A. Bertolino. 2021. Know You Neighbor: Fast Static Prediction of Test Flakiness. IEEE Access, 9 (2021), 76119–76134. https://doi.org/10.1109/ACCESS.2021.3082424
  38. S. Zhang, D. Jalali, J. Wuttke, K. Muşlu, W. Lam, M. Ernst, and D. Notkin. 2014. Empirically revisiting the test independence assumption. In ISSTA 2014. 385–396. https://doi.org/10.1145/2610384.2610404

      • Published in

        MaLTESQuE 2021: Proceedings of the 5th International Workshop on Machine Learning Techniques for Software Quality Evolution
        August 2021
        36 pages
        ISBN: 9781450386258
        DOI: 10.1145/3472674

        Copyright © 2021 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States
