2020 | OriginalPaper | Book chapter

Software Defect Prediction on Unlabelled Datasets: A Comparative Study

Written by: Elisabetta Ronchieri, Marco Canaparo, Mauro Belgiovine

Published in: Computational Science and Its Applications – ICCSA 2020

Publisher: Springer International Publishing

Abstract

Background: Defect prediction on unlabelled datasets is a challenging and widespread problem in software engineering. Machine learning is of great value in this context because it provides unsupervised techniques, which are applicable to unlabelled datasets. Objective: This study compares various approaches employed over the years on unlabelled datasets to predict defective modules, i.e. those that need more attention in the testing phase. Our comparison is based on performance metrics and on the actual defect information derived from software archives. Our work leverages a new dataset obtained by extracting and preprocessing metrics from C++ software. Method: Our empirical study applies CLAMI and its improvement CLAMI+ to high energy physics software datasets. Furthermore, we have used clustering techniques such as the K-means algorithm to find potentially critical modules. Results: Our experimental analysis has been carried out on 1 open source project with 34 software releases. We have applied 17 ML techniques to the labelled datasets obtained by following the CLAMI and CLAMI+ approaches. The two approaches have been evaluated using different performance metrics; our results show that CLAMI+ performs better than CLAMI. The average predictive accuracy is around 95% for 4 of the 17 ML techniques, which also show a Kappa statistic greater than 0.80. We applied K-means to the same dataset and obtained 2 clusters, labelled according to the output of CLAMI and CLAMI+. Conclusion: Based on the results of the different statistical tests, we conclude that no significant performance differences have been found among the selected classification techniques.
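
The abstract reports two evaluation figures, average accuracy and the Kappa statistic. As a minimal sketch of how these relate, assuming a binary defective/non-defective confusion matrix, the following Python computes both. The function name and the example counts are hypothetical illustrations and are not taken from the chapter.

```python
import math


def accuracy_and_kappa(tp, fp, fn, tn):
    """Return (accuracy, Cohen's kappa) for a binary confusion matrix.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Observed agreement between predicted and actual labels (accuracy).
    observed = (tp + tn) / total

    # Agreement expected by chance, from the marginal label frequencies.
    predicted_pos = (tp + fp) / total
    actual_pos = (tp + fn) / total
    expected = (predicted_pos * actual_pos
                + (1 - predicted_pos) * (1 - actual_pos))

    if math.isclose(expected, 1.0):
        # Kappa is undefined when every label falls in a single class.
        return observed, float("nan")

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts, chosen only to illustrate the calculation.
acc, kappa = accuracy_and_kappa(tp=40, fp=3, fn=2, tn=55)
print(f"accuracy={acc:.2f} kappa={kappa:.3f}")
```

Kappa corrects accuracy for agreement expected by chance, which is why a high accuracy on an imbalanced dataset does not by itself imply a high Kappa.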

Metadata
Title
Software Defect Prediction on Unlabelled Datasets: A Comparative Study
Written by
Elisabetta Ronchieri
Marco Canaparo
Mauro Belgiovine
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-58802-1_25