ABSTRACT
Automatic static analysis tools (ASATs), such as FindBugs, have a high false alarm rate, and the large number of false alarms they produce poses a barrier to adoption. Researchers have proposed using machine learning to prune false alarms and present only actionable warnings to developers. A state-of-the-art study identified a set of "Golden Features" based on metrics computed over the characteristics and history of the file, code, and warning. Recent studies report that machine learning using these features is extremely effective, achieving almost perfect performance.
We perform a detailed analysis to better understand the strong performance of the "Golden Features". We found that several studies used an experimental procedure that results in data leakage and data duplication, which are subtle issues with significant implications. First, the ground-truth labels leaked into features that measure the proportion of actionable warnings in a given context. Second, many warnings in the testing dataset also appear in the training dataset. Next, we demonstrate limitations of the warning oracle that determines the ground-truth labels: a heuristic that compares warnings in a given revision to those in a reference revision from the future. We show that the choice of reference revision influences the warning distribution, and that the heuristic produces labels that do not agree with human oracles. Hence, the strong performance previously reported for these techniques is overoptimistic of their true performance if adopted in practice. Our results convey several lessons and provide guidelines for evaluating false alarm detectors.
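The two pitfalls described above can be sketched with a toy example. This is a minimal illustration with hypothetical data and feature names, not the paper's actual features or dataset: a "context" feature computed over the entire dataset (test labels included) encodes the label itself, and duplicate warnings shared between the train and test splits let a pure memorizer look accurate.

```python
# Toy illustration of two evaluation pitfalls (hypothetical data).

# Pitfall 1: label leakage through a "context" feature.
# Each file holds two warnings sharing a ground-truth label. The leaked
# feature, "fraction of actionable warnings in this file", is computed
# over the WHOLE dataset (test labels included), so it equals the label.
n_files = 50
labels = [k % 2 for k in range(n_files) for _ in range(2)]  # 100 warnings
file_of = [i // 2 for i in range(2 * n_files)]

def leaked_feature(i):
    # Uses every label in the dataset, including held-out test labels.
    same_file = [labels[j] for j in range(len(labels)) if file_of[j] == file_of[i]]
    return sum(same_file) / len(same_file)

def honest_feature(i, train_idx):
    # Only training labels may be used; unseen files get an uninformative 0.5.
    same_file = [labels[j] for j in train_idx if file_of[j] == file_of[i]]
    return sum(same_file) / len(same_file) if same_file else 0.5

test_idx = range(80, 100)  # last 10 files held out
acc_leaked = sum((leaked_feature(i) >= 0.5) == bool(labels[i])
                 for i in test_idx) / len(test_idx)
acc_honest = sum((honest_feature(i, range(80)) >= 0.5) == bool(labels[i])
                 for i in test_idx) / len(test_idx)
print("leaked feature:", acc_leaked, "honest feature:", acc_honest)  # 1.0 vs 0.5

# Pitfall 2: duplicate warnings across the split.
# Warnings persist across revisions, so a naive split places identical
# (signature, label) pairs in both train and test; a memorizer then
# looks strong on the contaminated test set.
sigs = [f"warning-{i}" for i in range(100)]
train = range(80)
memory = {sigs[i]: labels[i] for i in train}

def memorize(i):
    # Look up the signature; fall back to "not actionable" (labels are balanced).
    return memory.get(sigs[i], 0)

test_dup = range(60, 100)    # 20 of 40 test warnings also appear in train
acc_dup = sum(memorize(i) == labels[i] for i in test_dup) / len(test_dup)
test_clean = range(80, 100)  # no overlap with train
acc_clean = sum(memorize(i) == labels[i] for i in test_clean) / len(test_clean)
print("with duplicates:", acc_dup, "after dedup:", acc_clean)  # 0.75 vs 0.5
```

In both cases the labels carry no signal a legitimate model could exploit, yet the contaminated setup reports inflated accuracy; deduplicating the split and computing features from training labels only restores chance-level performance.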
Detecting false alarms from automatic static analysis tools: how far are we?