ABSTRACT
Context: Defect prediction research is based on a small number of defect datasets, most of which are at the class rather than the method level. Consequently, our knowledge of defects is limited. Identifying defect datasets for prediction is not easy, and extracting quality data from identified datasets is even more difficult. Goal: Identify open source Java systems suitable for defect prediction and extract high-quality fault data from them. Method: We used Boa to identify candidate open source systems. We reduced 50,000 potential candidates down to 23 suitable for defect prediction using selection criteria based on each system's software repository and defect tracking system. We used an enhanced SZZ algorithm to extract fault information and calculated metrics using JHawk. Result: We have produced 138 fault and metrics datasets for the 23 identified systems. We make these datasets (the ELFF datasets) and our data extraction tools freely available to future researchers. Conclusions: The data we provide enables future studies to proceed with minimal effort. Our datasets significantly increase the pool of systems currently used in defect analysis studies.
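To make the SZZ-based extraction step concrete, the sketch below shows the classic first stage of SZZ: linking fix commits to bug reports by matching issue IDs in commit messages and validating them against the issue tracker. This is a minimal illustration under assumed data (the `BUG_ID` pattern, commit hashes, and messages are hypothetical), not the ELFF extraction tooling itself, which enhances SZZ beyond this step.

```python
import re

# Illustrative pattern for issue references such as "Fixes #101" or "bug 202".
# Real SZZ implementations tune this per project and tracker.
BUG_ID = re.compile(r"(?:fix(?:es|ed)?|bug|issue)\s*#?(\d+)", re.IGNORECASE)

def link_fix_commits(commits, known_bug_ids):
    """Return {bug_id: [commit_hash, ...]} for commits whose message
    references a bug id that actually exists in the issue tracker."""
    links = {}
    for sha, message in commits:
        for match in BUG_ID.finditer(message):
            bug_id = int(match.group(1))
            # Validating against the tracker reduces false-positive links,
            # a known bias source in bug-fix datasets.
            if bug_id in known_bug_ids:
                links.setdefault(bug_id, []).append(sha)
    return links

# Hypothetical sample data for demonstration.
commits = [
    ("a1b2c3", "Fixes #101: NPE in parser"),
    ("d4e5f6", "Refactor build scripts"),
    ("0aa9bb", "bug 202 - off-by-one in tokenizer"),
]
print(link_fix_commits(commits, known_bug_ids={101, 202}))
# → {101: ['a1b2c3'], 202: ['0aa9bb']}
```

The full SZZ algorithm then traces each fix's deleted lines back (e.g. via version-control annotate/blame) to the earlier changes that introduced them, marking those as fault-inducing.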