ABSTRACT
Software defect prediction is one of the most active research areas in software engineering. A prediction model can be built with defect data collected from a software project and used to predict defects in the same project, i.e., within-project defect prediction (WPDP). Researchers have also proposed cross-project defect prediction (CPDP), which predicts defects for new projects that lack defect data by reusing prediction models built from other projects. Recent studies have shown CPDP to be feasible. However, existing CPDP techniques require the source and target projects to share an identical metric set, so they are difficult to apply across projects with heterogeneous metric sets. To address this limitation, we propose heterogeneous defect prediction (HDP) to predict defects across projects with heterogeneous metric sets. Our HDP approach conducts metric selection and metric matching to build a prediction model between such projects. Our empirical study on 28 subjects shows that about 68% of predictions using our approach outperform or are comparable to WPDP with statistical significance.
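The metric-matching idea can be illustrated with a minimal, hypothetical sketch: pair each source metric with the target metric whose value distribution looks most similar, here measured by the two-sample Kolmogorov–Smirnov statistic with a greedy one-to-one assignment. The similarity measure, cutoff, and matching strategy below are simplifying assumptions for illustration; the paper's actual matching analyzers and thresholds may differ.

```python
def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at x."""
    return sum(v <= x for v in sample) / len(sample)

def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in list(a) + list(b))

def match_metrics(source, target, cutoff=0.5):
    """Greedily pair source metrics with target metrics by distribution
    similarity (1 - KS statistic), keeping pairs above `cutoff`.

    `source` and `target` map metric names to lists of observed values."""
    candidates = []
    for s_name, s_vals in source.items():
        for t_name, t_vals in target.items():
            sim = 1.0 - ks_statistic(s_vals, t_vals)
            if sim >= cutoff:
                candidates.append((sim, s_name, t_name))
    candidates.sort(reverse=True)  # best-matching pairs first
    used_s, used_t, matched = set(), set(), {}
    for sim, s_name, t_name in candidates:
        if s_name not in used_s and t_name not in used_t:
            matched[s_name] = t_name
            used_s.add(s_name)
            used_t.add(t_name)
    return matched

# Toy example with hypothetical metric names: distributions that coincide
# are matched even though the metric names differ across projects.
src = {"loc": [10, 20, 30, 40], "cc": [1, 1, 2, 3]}
tgt = {"size": [10, 20, 30, 40], "complexity": [1, 1, 2, 3]}
print(match_metrics(src, tgt))  # {'loc': 'size', 'cc': 'complexity'}
```

Once metrics are matched, the source project's data can be expressed over the matched metric pairs and a standard classifier trained on it can score the target project's instances.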
Index Terms
- Heterogeneous defect prediction