ABSTRACT
Automatic static analysis tools (ASATs), such as FindBugs, have a high false alarm rate, and the large number of false alarms they produce poses a barrier to adoption. Researchers have proposed using machine learning to prune false alarms and present only actionable warnings to developers. A state-of-the-art study identified a set of "Golden Features" based on metrics computed over the characteristics and history of the file, code, and warning. Recent studies report that machine learning using these features is extremely effective, achieving almost perfect performance.
We perform a detailed analysis to better understand the strong performance of the "Golden Features". We found that several studies used an experimental procedure that results in data leakage and data duplication, which are subtle issues with significant implications. First, the ground-truth labels leaked into features that measure the proportion of actionable warnings in a given context. Second, many warnings in the testing dataset also appear in the training dataset. Next, we demonstrate limitations of the warning oracle that determines the ground-truth labels: a heuristic that compares warnings in a given revision to those in a reference revision from the future. We show that the choice of reference revision influences the warning distribution, and that the heuristic produces labels that do not agree with human oracles. Hence, the strong performance previously reported for these techniques is overoptimistic of their true performance if adopted in practice. Our results convey several lessons and provide guidelines for evaluating false alarm detectors.
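The two pitfalls described above can be sketched with a toy example. This is a minimal illustration with hypothetical data and feature names, not the paper's actual features or dataset: a "context" feature computed over the entire dataset (test labels included) encodes the label itself, and duplicate warnings shared between the train and test splits let a pure memorizer look accurate.

```python
# Toy illustration of two evaluation pitfalls (hypothetical data).

# Pitfall 1: label leakage through a "context" feature.
# Each file holds two warnings sharing a ground-truth label. The leaked
# feature, "fraction of actionable warnings in this file", is computed
# over the WHOLE dataset (test labels included), so it equals the label.
n_files = 50
labels = [k % 2 for k in range(n_files) for _ in range(2)]  # 100 warnings
file_of = [i // 2 for i in range(2 * n_files)]

def leaked_feature(i):
    # Uses every label in the dataset, including held-out test labels.
    same_file = [labels[j] for j in range(len(labels)) if file_of[j] == file_of[i]]
    return sum(same_file) / len(same_file)

def honest_feature(i, train_idx):
    # Only training labels may be used; unseen files get an uninformative 0.5.
    same_file = [labels[j] for j in train_idx if file_of[j] == file_of[i]]
    return sum(same_file) / len(same_file) if same_file else 0.5

test_idx = range(80, 100)  # last 10 files held out
acc_leaked = sum((leaked_feature(i) >= 0.5) == bool(labels[i])
                 for i in test_idx) / len(test_idx)
acc_honest = sum((honest_feature(i, range(80)) >= 0.5) == bool(labels[i])
                 for i in test_idx) / len(test_idx)
print("leaked feature:", acc_leaked, "honest feature:", acc_honest)  # 1.0 vs 0.5

# Pitfall 2: duplicate warnings across the split.
# Warnings persist across revisions, so a naive split places identical
# (signature, label) pairs in both train and test; a memorizer then
# looks strong on the contaminated test set.
sigs = [f"warning-{i}" for i in range(100)]
train = range(80)
memory = {sigs[i]: labels[i] for i in train}

def memorize(i):
    # Look up the signature; fall back to "not actionable" (labels are balanced).
    return memory.get(sigs[i], 0)

test_dup = range(60, 100)    # 20 of 40 test warnings also appear in train
acc_dup = sum(memorize(i) == labels[i] for i in test_dup) / len(test_dup)
test_clean = range(80, 100)  # no overlap with train
acc_clean = sum(memorize(i) == labels[i] for i in test_clean) / len(test_clean)
print("with duplicates:", acc_dup, "after dedup:", acc_clean)  # 0.75 vs 0.5
```

In both cases the labels carry no signal a legitimate model could exploit, yet the contaminated setup reports inflated accuracy; deduplicating the split and computing features from training labels only restores chance-level performance.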
Detecting false alarms from automatic static analysis tools: how far are we?