ABSTRACT
Flaky tests are tests that exhibit both passing and failing behavior when run against the same code. While the research community has proposed automated approaches for detecting and addressing test flakiness, most of them suffer from scalability issues and uncertainty, as they require test cases to be run multiple times. This limitation has recently been targeted by machine learning solutions that predict the flakiness of tests from a set of static and dynamic metrics, thereby avoiding the re-execution of tests. Building on the effort spent so far, this paper takes the first steps toward an orthogonal view of the problem, namely the classification of flaky tests using only statically computable software metrics. We propose a feasibility study on 72 projects of the iDFlakies dataset, and investigate the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells. First, we statistically assess those differences. Second, we build a logistic regression model to verify the extent to which the observed differences remain significant when the metrics are considered together. The results show a relation between test flakiness and a number of test and production code factors, indicating the possibility of building classification approaches that exploit those factors to predict test flakiness.
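To illustrate the second analysis step described above, the sketch below fits a logistic regression on statically computable test metrics and inspects the resulting odds ratios. It is a toy example on synthetic data: the three metric names, the data-generating coefficients, and the use of scikit-learn are illustrative assumptions, not the paper's actual 25 metrics or results.

```python
# Toy sketch of the paper's second step: relate static test metrics to
# flakiness via logistic regression. All data here is synthetic; the
# metrics (LOC, complexity, assertions) are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200

# Hypothetical static metrics, one row per test case.
loc = rng.integers(5, 200, n)          # test lines of code
complexity = rng.integers(1, 15, n)    # cyclomatic complexity
assertions = rng.integers(0, 10, n)    # number of assertions
X = np.column_stack([loc, complexity, assertions]).astype(float)

# Synthetic ground truth: larger, more complex tests are more often flaky.
logits = 0.02 * loc + 0.3 * complexity - 0.2 * assertions - 4.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) gives the odds ratio: how the odds of a test being
# flaky change per unit increase of each metric, holding the others fixed.
odds_ratios = np.exp(model.coef_[0])
for name, ratio in zip(["LOC", "complexity", "assertions"], odds_ratios):
    print(f"{name}: odds ratio = {ratio:.3f}")
```

In the paper's setting, a study would also check modeling assumptions before interpreting the coefficients, for instance screening for multicollinearity among the 25 metrics (e.g. via variance inflation factors, as discussed by O'Brien).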
- A. Alshammari, C. Morris, M. Hilton, and J. Bell. 2021. FlakeFlagger: Predicting flakiness without rerunning tests. In ICSE 2021. 1572–1584. https://doi.org/10.1109/ICSE43902.2021.00140
- J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, and D. Marinov. 2018. DeFlaker: Automatically detecting flaky tests. In ICSE 2018. 433–444. https://doi.org/10.1145/3180155.3180164
- B. Camara, M. Silva, A. Endo, and S. Vergilio. 2021. What is the Vocabulary of Flaky Tests? An Extended Replication. arXiv preprint arXiv:2103.12670.
- G. Catolino, F. Palomba, A. Zaidman, and F. Ferrucci. 2019. How the experience of development teams relates to assertion density of test classes. In ICSME 2019. 223–234. https://doi.org/10.1109/ICSME.2019.00034
- S. Chidamber and C. Kemerer. 1994. A metrics suite for object oriented design. IEEE TSE, 20, 6 (1994), 476–493. https://doi.org/10.1109/32.295895
- B. Daniel, V. Jagannath, D. Dig, and D. Marinov. 2009. ReAssert: Suggesting repairs for broken unit tests. In ASE 2009. 433–444. https://doi.org/10.1109/ASE.2009.17
- M. Eck, F. Palomba, M. Castelluccio, and A. Bacchelli. 2019. Understanding flaky tests: The developer’s perspective. In ESEC/FSE 2019. 830–840. https://doi.org/10.1145/3338906.3338945
- M. Fowler. 2011. Eradicating non-determinism in tests. Martin Fowler Personal Blog.
- M. Fowler. 2018. Refactoring: Improving the design of existing code. Addison-Wesley Professional. isbn:9788131734667
- G. Garson. 2012. Testing statistical assumptions. Asheboro, NC: Statistical Associates Publishing.
- G. Grano, C. De Iaco, F. Palomba, and H. Gall. 2020. Pizza versus Pinsa: On the Perception and Measurability of Unit Test Code Quality. In ICSME 2020. 336–347. https://doi.org/10.1109/ICSME46990.2020.00040
- G. Grano, F. Palomba, and H. Gall. 2019. Lightweight assessment of test-case effectiveness using source-code-quality indicators. IEEE TSE, https://doi.org/10.1109/TSE.2019.2903057
- G. Haben, S. Habchi, M. Papadakis, M. Cordy, and Y. Le Traon. 2021. A Replication Study on the Usability of Code Vocabulary in Predicting Flaky Tests. In MSR 2021. https://doi.org/10.1109/MSR52588.2021.00034
- J. Han, M. Kamber, and J. Pei. 2011. Data mining: Concepts and techniques (3rd ed.). The Morgan Kaufmann Series in Data Management Systems, 5, 4 (2011), 83–124.
- F. Lacoste. 2009. Killing the gatekeeper: Introducing a continuous integration system. In 2009 Agile Conference. 387–392. https://doi.org/10.1109/AGILE.2009.35
- W. Lam, R. Oei, A. Shi, D. Marinov, and T. Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. In ICST 2019. 312–322. https://doi.org/10.1109/ICST.2019.00038
- W. Lam, S. Winter, A. Astorga, V. Stodden, and D. Marinov. 2020. Understanding Reproducibility and Characteristics of Flaky Tests Through Test Reruns in Java Projects. In ISSRE 2020. 403–413. https://doi.org/10.1109/ISSRE5003.2020.00045
- W. Lam, S. Winter, A. Wei, T. Xie, D. Marinov, and J. Bell. 2020. A large-scale longitudinal study of flaky tests. Proceedings of the ACM on Programming Languages, 4, OOPSLA (2020), 1–29. https://doi.org/10.1145/3428270
- Q. Luo, F. Hariri, L. Eloussi, and D. Marinov. 2014. An empirical analysis of flaky tests. In ESEC/FSE 2014. 643–653. https://doi.org/10.1145/2635868.2635920
- T. McCabe. 1976. A Complexity Measure. IEEE TSE, SE-2, 4 (1976), 308–320. https://doi.org/10.1109/TSE.1976.233837
- A. Memon and M. Cohen. 2013. Automated testing of GUI applications: models, tools, and controlling flakiness. In ICSE 2013. 1479–1480. https://doi.org/10.1109/ICSE.2013.6606750
- J. Micco. 2017. The state of continuous integration testing @ Google.
- N. Moha, Y. Guéhéneuc, L. Duchien, and A. Le Meur. 2009. DECOR: A method for the specification and detection of code and design smells. IEEE TSE, 36, 1 (2009), 20–36. https://doi.org/10.1109/TSE.2009.50
- J. Murillo-Morera and M. Jenkins. 2015. A Software Defect-Proneness Prediction Framework: A new approach using genetic algorithms to generate learning schemes. In SEKE. 445–450. https://doi.org/10.18293/SEKE2015-099
- J. Nelder and R. Wedderburn. 1972. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135, 3 (1972), 370–384. https://doi.org/10.2307/2344614
- R. O’Brien. 2007. A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41, 5 (2007), 673–690. https://doi.org/10.1007/s11135-006-9018-6
- F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, and A. De Lucia. 2018. On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation. Empirical Software Engineering, 23, 3 (2018), 1188–1221. https://doi.org/10.1145/3180155.3182532
- F. Pecorelli, G. Di Lillo, F. Palomba, and A. De Lucia. 2020. VITRuM: A Plug-In for the Visualization of Test-Related Metrics. In AVI 2020. 1–3. https://doi.org/10.1145/3399715.3399954
- F. Pecorelli, F. Palomba, and A. De Lucia. 2021. The Relation of Test-Related Factors to Software Quality: A Case Study on Apache Systems. Empirical Software Engineering, 26, 2 (2021), https://doi.org/10.1007/s10664-020-09891-y
- A. Perez, R. Abreu, and A. van Deursen. 2017. A test-suite diagnosability metric for spectrum-based fault localization approaches. In ICSE 2017. 654–664. https://doi.org/10.1109/ICSE.2017.66
- M. Pezzè and M. Young. 2008. Software testing and analysis: Process, principles, and techniques. John Wiley & Sons.
- G. Pinto, B. Miranda, S. Dissanayake, M. D’Amorim, C. Treude, and A. Bertolino. 2020. What is the vocabulary of flaky tests? In MSR 2020. 492–502. https://doi.org/10.1145/3379597.3387482
- V. Pontillo, F. Palomba, and F. Ferrucci. 2021. Toward Static Test Flakiness Prediction: A Feasibility Study. https://doi.org/10.6084/m9.figshare.14645895.v3
- V. Terragni, P. Salza, and F. Ferrucci. 2020. A container-based infrastructure for fuzzy-driven root causing of flaky tests. In ICSE 2020. 69–72.
- S. Thorve, C. Sreshtha, and N. Meng. 2018. An empirical study of flaky tests in Android apps. In ICSME 2018. 534–538. https://doi.org/10.1109/ICSME.2018.00062
- A. van Deursen, L. Moonen, A. Van Den Bergh, and G. Kok. 2001. Refactoring test code. In XP 2001. 92–95.
- R. Verdecchia, E. Cruciani, B. Miranda, and A. Bertolino. 2021. Know You Neighbor: Fast Static Prediction of Test Flakiness. IEEE Access, 9 (2021), 76119–76134. https://doi.org/10.1109/ACCESS.2021.3082424
- S. Zhang, D. Jalali, J. Wuttke, K. Muşlu, W. Lam, M. Ernst, and D. Notkin. 2014. Empirically revisiting the test independence assumption. In ISSTA 2014. 385–396. https://doi.org/10.1145/2610384.2610404