ABSTRACT
Flaky tests are tests that exhibit both passing and failing behavior when run against the same code. While the research community has proposed automated approaches for detecting and addressing test flakiness, most of them suffer from scalability issues and uncertainty, as they require test cases to be run multiple times. This limitation has recently been targeted by machine learning solutions that predict the flakiness of tests from a set of static and dynamic metrics, thereby avoiding the re-execution of tests. Building on the effort spent so far, this paper takes the first steps toward an orthogonal view of the problem, namely the classification of flaky tests using only statically computable software metrics. We propose a feasibility study on 72 projects of the iDFlakies dataset, and investigate the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells. First, we statistically assess those differences. Second, we build a logistic regression model to verify the extent to which the observed differences remain significant when the metrics are considered together. The results show a relation between test flakiness and a number of test and production code factors, indicating the possibility of building classification approaches that exploit those factors to predict test flakiness.
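To illustrate the second analysis step described above, the sketch below fits a logistic regression on statically computable test metrics and inspects the resulting odds ratios. It is a toy example on synthetic data: the three metric names, the data-generating coefficients, and the use of scikit-learn are illustrative assumptions, not the paper's actual 25 metrics or results.

```python
# Toy sketch of the paper's second step: relate static test metrics to
# flakiness via logistic regression. All data here is synthetic; the
# metrics (LOC, complexity, assertions) are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200

# Hypothetical static metrics, one row per test case.
loc = rng.integers(5, 200, n)          # test lines of code
complexity = rng.integers(1, 15, n)    # cyclomatic complexity
assertions = rng.integers(0, 10, n)    # number of assertions
X = np.column_stack([loc, complexity, assertions]).astype(float)

# Synthetic ground truth: larger, more complex tests are more often flaky.
logits = 0.02 * loc + 0.3 * complexity - 0.2 * assertions - 4.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) gives the odds ratio: how the odds of a test being
# flaky change per unit increase of each metric, holding the others fixed.
odds_ratios = np.exp(model.coef_[0])
for name, ratio in zip(["LOC", "complexity", "assertions"], odds_ratios):
    print(f"{name}: odds ratio = {ratio:.3f}")
```

In the paper's setting, a study would also check modeling assumptions before interpreting the coefficients, for instance screening for multicollinearity among the 25 metrics (e.g. via variance inflation factors, as discussed by O'Brien).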
- A. Alshammari, C. Morris, M. Hilton, and J. Bell. 2021. FlakeFlagger: Predicting flakiness without rerunning tests. In ICSE 2021. 1572–1584. https://doi.org/10.1109/ICSE43902.2021.00140
- J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, and D. Marinov. 2018. DeFlaker: Automatically detecting flaky tests. In ICSE 2018. 433–444. https://doi.org/10.1145/3180155.3180164
- B. Camara, M. Silva, A. Endo, and S. Vergilio. 2021. What is the Vocabulary of Flaky Tests? An Extended Replication. arXiv preprint arXiv:2103.12670.
- G. Catolino, F. Palomba, A. Zaidman, and F. Ferrucci. 2019. How the experience of development teams relates to assertion density of test classes. In ICSME 2019. 223–234. https://doi.org/10.1109/ICSME.2019.00034
- S. Chidamber and C. Kemerer. 1994. A metrics suite for object oriented design. IEEE TSE, 20, 6 (1994), 476–493. https://doi.org/10.1109/32.295895
- B. Daniel, V. Jagannath, D. Dig, and D. Marinov. 2009. ReAssert: Suggesting repairs for broken unit tests. In ASE 2009. 433–444. https://doi.org/10.1109/ASE.2009.17
- M. Eck, F. Palomba, M. Castelluccio, and A. Bacchelli. 2019. Understanding flaky tests: The developer’s perspective. In ESEC/FSE 2019. 830–840. https://doi.org/10.1145/3338906.3338945
- M. Fowler. 2011. Eradicating non-determinism in tests. Martin Fowler Personal Blog.
- M. Fowler. 2018. Refactoring: Improving the design of existing code. Addison-Wesley Professional. isbn:9788131734667
- G. Garson. 2012. Testing statistical assumptions. Asheboro, NC: Statistical Associates Publishing.
- G. Grano, C. De Iaco, F. Palomba, and H. Gall. 2020. Pizza versus Pinsa: On the Perception and Measurability of Unit Test Code Quality. In ICSME 2020. 336–347. https://doi.org/10.1109/ICSME46990.2020.00040
- G. Grano, F. Palomba, and H. Gall. 2019. Lightweight assessment of test-case effectiveness using source-code-quality indicators. IEEE TSE, https://doi.org/10.1109/TSE.2019.2903057
- G. Haben, S. Habchi, M. Papadakis, M. Cordy, and Y. Le Traon. 2021. A Replication Study on the Usability of Code Vocabulary in Predicting Flaky Tests. In MSR 2021. https://doi.org/10.1109/MSR52588.2021.00034
- J. Han, M. Kamber, and J. Pei. 2011. Data mining: Concepts and techniques (3rd ed.). The Morgan Kaufmann Series in Data Management Systems, 5, 4 (2011), 83–124.
- F. Lacoste. 2009. Killing the gatekeeper: Introducing a continuous integration system. In 2009 Agile Conference. 387–392. https://doi.org/10.1109/AGILE.2009.35
- W. Lam, R. Oei, A. Shi, D. Marinov, and T. Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. In ICST 2019. 312–322. https://doi.org/10.1109/ICST.2019.00038
- W. Lam, S. Winter, A. Astorga, V. Stodden, and D. Marinov. 2020. Understanding Reproducibility and Characteristics of Flaky Tests Through Test Reruns in Java Projects. In ISSRE 2020. 403–413. https://doi.org/10.1109/ISSRE5003.2020.00045
- W. Lam, S. Winter, A. Wei, T. Xie, D. Marinov, and J. Bell. 2020. A large-scale longitudinal study of flaky tests. Proceedings of the ACM on Programming Languages, 4, OOPSLA (2020), 1–29. https://doi.org/10.1145/3428270
- Q. Luo, F. Hariri, L. Eloussi, and D. Marinov. 2014. An empirical analysis of flaky tests. In ESEC/FSE 2014. 643–653. https://doi.org/10.1145/2635868.2635920
- T. McCabe. 1976. A Complexity Measure. IEEE TSE, SE-2, 4 (1976), 308–320. https://doi.org/10.1109/TSE.1976.233837
- A. Memon and M. Cohen. 2013. Automated testing of GUI applications: models, tools, and controlling flakiness. In ICSE 2013. 1479–1480. https://doi.org/10.1109/ICSE.2013.6606750
- J. Micco. 2017. The state of continuous integration testing @ Google.
- N. Moha, Y. Guéhéneuc, L. Duchien, and A. Le Meur. 2009. DECOR: A method for the specification and detection of code and design smells. IEEE TSE, 36, 1 (2009), 20–36. https://doi.org/10.1109/TSE.2009.50
- J. Murillo-Morera and M. Jenkins. 2015. A Software Defect-Proneness Prediction Framework: A new approach using genetic algorithms to generate learning schemes. In SEKE. 445–450. https://doi.org/10.18293/SEKE2015-099
- J. Nelder and R. Wedderburn. 1972. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135, 3 (1972), 370–384. https://doi.org/10.2307/2344614
- R. O’Brien. 2007. A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41, 5 (2007), 673–690. https://doi.org/10.1007/s11135-006-9018-6
- F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, and A. De Lucia. 2018. On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation. Empirical Software Engineering, 23, 3 (2018), 1188–1221. https://doi.org/10.1145/3180155.3182532
- F. Pecorelli, G. Di Lillo, F. Palomba, and A. De Lucia. 2020. VITRuM: A Plug-In for the Visualization of Test-Related Metrics. In AVI 2020. 1–3. https://doi.org/10.1145/3399715.3399954
- F. Pecorelli, F. Palomba, and A. De Lucia. 2021. The Relation of Test-Related Factors to Software Quality: A Case Study on Apache Systems. Empirical Software Engineering, 26, 2 (2021), https://doi.org/10.1007/s10664-020-09891-y
- A. Perez, R. Abreu, and A. van Deursen. 2017. A test-suite diagnosability metric for spectrum-based fault localization approaches. In ICSE 2017. 654–664. https://doi.org/10.1109/ICSE.2017.66
- M. Pezzè and M. Young. 2008. Software testing and analysis: Process, principles, and techniques. John Wiley & Sons.
- G. Pinto, B. Miranda, S. Dissanayake, M. D’Amorim, C. Treude, and A. Bertolino. 2020. What is the vocabulary of flaky tests? In MSR 2020. 492–502. https://doi.org/10.1145/3379597.3387482
- V. Pontillo, F. Palomba, and F. Ferrucci. 2021. Toward Static Test Flakiness Prediction: A Feasibility Study. https://doi.org/10.6084/m9.figshare.14645895.v3
- V. Terragni, P. Salza, and F. Ferrucci. 2020. A container-based infrastructure for fuzzy-driven root causing of flaky tests. In ICSE 2020. 69–72.
- S. Thorve, C. Sreshtha, and N. Meng. 2018. An empirical study of flaky tests in Android apps. In ICSME 2018. 534–538. https://doi.org/10.1109/ICSME.2018.00062
- A. van Deursen, L. Moonen, A. Van Den Bergh, and G. Kok. 2001. Refactoring test code. In XP 2001. 92–95.
- R. Verdecchia, E. Cruciani, B. Miranda, and A. Bertolino. 2021. Know You Neighbor: Fast Static Prediction of Test Flakiness. IEEE Access, 9 (2021), 76119–76134. https://doi.org/10.1109/ACCESS.2021.3082424
- S. Zhang, D. Jalali, J. Wuttke, K. Muşlu, W. Lam, M. Ernst, and D. Notkin. 2014. Empirically revisiting the test independence assumption. In ISSTA 2014. 385–396. https://doi.org/10.1145/2610384.2610404