Detecting false alarms from automatic static analysis tools: how far are we?

Research article. DOI: 10.1145/3510003.3510214. Published: 5 July 2022.
ABSTRACT

Automatic static analysis tools (ASATs), such as Findbugs, have a high false alarm rate, and the large number of false alarms produced poses a barrier to adoption. Researchers have proposed the use of machine learning to prune false alarms and present only actionable warnings to developers. The state-of-the-art study identified a set of "Golden Features" based on metrics computed over the characteristics and history of the file, code, and warning. Recent studies show that machine learning using these features is extremely effective, achieving almost perfect performance.

We perform a detailed analysis to better understand the strong performance of the "Golden Features". We find that several studies used an experimental procedure that results in data leakage and data duplication, two subtle issues with significant implications. First, the ground-truth labels leak into features that measure the proportion of actionable warnings in a given context. Second, many warnings in the testing dataset also appear in the training dataset. Next, we demonstrate limitations of the warning oracle that determines the ground-truth labels, a heuristic that compares the warnings in a given revision to those in a future reference revision. We show that the choice of reference revision influences the warning distribution, and that the heuristic produces labels that do not agree with human oracles. Hence, the strong performance previously reported for these techniques is an overoptimistic estimate of their true performance if adopted in practice. Our results convey several lessons and provide guidelines for evaluating false alarm detectors.
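To make the warning oracle concrete, the sketch below illustrates the closed-warning heuristic described in the abstract: a warning is labeled actionable if it has disappeared by a future reference revision, and a false alarm if it persists. This is a minimal sketch, not the authors' implementation; the warning representation, the matching rule, and the example revisions are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above) of a reference-revision warning oracle.
from dataclasses import dataclass

@dataclass(frozen=True)
class Warning:
    rule_id: str      # e.g. a Findbugs bug pattern such as "NP_NULL_ON_SOME_PATH"
    file_path: str
    method: str

def label_warnings(warnings_at_revision, warnings_at_reference):
    """Label warnings from the analyzed revision against a future reference revision."""
    still_open = set(warnings_at_reference)
    labels = {}
    for w in warnings_at_revision:
        # Heuristic: a warning that disappeared by the reference revision is presumed
        # to have been acted on; one that persists is presumed a false alarm.
        labels[w] = "actionable" if w not in still_open else "false alarm"
    return labels

# Usage: the resulting label distribution depends directly on which reference
# revision is chosen, which is one of the limitations the paper highlights.
rev = [Warning("NP_NULL_ON_SOME_PATH", "Foo.java", "bar")]
ref = []  # the warning is gone in the reference revision, so it is labeled actionable
print(label_warnings(rev, ref))
```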


Published in

ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022, 2508 pages
ISBN: 9781450392211
DOI: 10.1145/3510003
Copyright © 2022 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


