
04-09-2021 | Research Article-Computer Engineering and Computer Science

An Improved Method for Training Data Selection for Cross-Project Defect Prediction

Authors: Nayeem Ahmad Bhat, Sheikh Umar Farooq

Published in: Arabian Journal for Science and Engineering | Issue 2/2022


Abstract

The selection of relevant training data significantly improves the quality of the cross-project defect prediction (CPDP) process. We propose a training data selection approach and compare its performance against the Burak filter and the Peter filter on the Bug Prediction Dataset. In our approach (BurakMHD), a data transformation is first applied to the datasets. Then, for each instance of the target project, the k instances at minimum Hamming distance are added to the filtered training dataset (filtered TDS), selected separately from the transformed multi-source defective and non-defective data instances. Compared to using all the cross-project data, the false positive rate decreases by 10.6%, with a 2.6% decrease in the defect detection rate; the overall performance measures nMCC, Balance, and G-measure increase by 2.9%, 5.7%, and 6.6%, respectively. Compared to the Burak filter and the Peter filter, the defect detection rate increases by 1.5% and 1.8%, respectively, and the false positive rate decreases by 6.4%; nMCC, Balance, and G-measure increase by 3%, 5.3%, and 6.8% over the Burak filter and by 3.2%, 5.5%, and 7.1% over the Peter filter. Compared to within-project predictions, nMCC, Balance, and G-measure increase by 1.1%, 3.4%, and 4%, respectively, while the defect detection rate and false positive rate decrease by 9.2% and 13.1%, respectively. Overall, our approach improves performance significantly compared to the Burak filter, the Peter filter, cross-project prediction, and within-project prediction. We therefore conclude that applying a data transformation and filtering the training data separately from the defective and non-defective instances of the cross-project data helps select relevant data for CPDP.
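The per-instance filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the specific data transformation used in the paper is not reproduced here, so the median-threshold binarization, the function names, and the value of k are assumptions made purely for illustration.

```python
import numpy as np

def binarize(X, thresholds):
    """Illustrative transformation: binarize each software metric at a
    per-feature threshold (here, assumed to be the source medians) so
    that Hamming distance between instances is well defined."""
    return (X > thresholds).astype(int)

def hamming_distance(a, b):
    """Number of feature positions in which two binary vectors differ."""
    return int(np.count_nonzero(a != b))

def filter_training_data(target_X, source_X, source_y, k=10):
    """For every target-project instance, add the k source instances at
    minimum Hamming distance, chosen separately from the defective and
    the non-defective source pools, to the filtered training dataset."""
    thresholds = np.median(source_X, axis=0)
    t_bin = binarize(target_X, thresholds)
    s_bin = binarize(source_X, thresholds)

    def_idx = np.where(source_y == 1)[0]      # defective source instances
    nondef_idx = np.where(source_y == 0)[0]   # non-defective source instances
    selected = set()

    for t in t_bin:
        d_def = np.array([hamming_distance(t, s_bin[i]) for i in def_idx])
        d_non = np.array([hamming_distance(t, s_bin[i]) for i in nondef_idx])
        selected.update(def_idx[np.argsort(d_def)[:k]])
        selected.update(nondef_idx[np.argsort(d_non)[:k]])

    keep = sorted(selected)
    return source_X[keep], source_y[keep]
```

In such a sketch, the returned filtered training set would then be used to train a defect prediction model for the target project, in place of the full multi-source cross-project data.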


Metadata
Title
An Improved Method for Training Data Selection for Cross-Project Defect Prediction
Authors
Nayeem Ahmad Bhat
Sheikh Umar Farooq
Publication date
04-09-2021
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 2/2022
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-021-06088-3
