2020 | OriginalPaper | Book chapter

Software Defect Prediction on Unlabelled Datasets: A Comparative Study

Written by: Elisabetta Ronchieri, Marco Canaparo, Mauro Belgiovine

Published in: Computational Science and Its Applications – ICCSA 2020

Publisher: Springer International Publishing


Abstract

Background: Defect prediction on unlabelled datasets is a challenging and widespread problem in software engineering. Machine learning is of great value in this context because it provides techniques, called unsupervised, that are applicable to unlabelled datasets. Objective: This study compares various approaches employed over the years on unlabelled datasets to predict the defective modules, i.e., the ones that need more attention in the testing phase. Our comparison is based on the measurement of performance metrics and on the real defect information derived from software archives. Our work leverages a new dataset obtained by extracting and preprocessing metrics from C++ software. Method: Our empirical study relies on CLAMI and its improvement, CLAMI+, which we have applied to high energy physics software datasets. Furthermore, we have used clustering techniques such as the K-means algorithm to find potentially critical modules. Results: Our experimental analysis has been carried out on 1 open source project with 34 software releases. We have applied 17 ML techniques to the labelled datasets obtained by following the CLAMI and CLAMI+ approaches. The two approaches have been evaluated using different performance metrics; our results show that CLAMI+ performs better than CLAMI. The average predictive accuracy is around 95% for the 4 ML techniques (out of 17) that show a Kappa statistic greater than 0.80. We applied K-means to the same dataset and obtained 2 clusters, labelled according to the output of CLAMI and CLAMI+. Conclusion: Based on the results of the different statistical tests, we conclude that no significant performance differences have been found among the selected classification techniques.
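The abstract does not spell out how CLAMI assigns labels or how the Kappa statistic is computed, so the following is only a minimal sketch under stated assumptions. It illustrates the clustering-and-labelling idea usually associated with CLAMI (count, per module, how many metrics exceed their median, then label the groups with the higher counts as defect-prone) and omits CLAMI's metric-selection and instance-selection steps as well as anything specific to CLAMI+. The function names, the metric values, and the `truth` vector are hypothetical and are not taken from the chapter.

```python
from statistics import median


def cla_labels(rows):
    """Label each module 1 (defect-prone) or 0 (clean) without ground truth.

    Simplified clustering-and-labelling step; not the authors' implementation.
    """
    n_metrics = len(rows[0])
    # Per-metric cutoff: the median of that metric over all modules.
    cutoffs = [median(r[j] for r in rows) for j in range(n_metrics)]
    # K = number of metrics on which a module exceeds the cutoff.
    k = [sum(v > c for v, c in zip(r, cutoffs)) for r in rows]
    # Modules sharing the same K form one cluster; the clusters in the
    # upper half of the sorted distinct K values are labelled defect-prone.
    distinct = sorted(set(k))
    top_half = set(distinct[len(distinct) // 2:])
    return [1 if ki in top_half else 0 for ki in k]


def accuracy(pred, truth):
    """Fraction of modules whose predicted label matches the reference."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def cohen_kappa(pred, truth):
    """Cohen's kappa for binary labels: (p_o - p_e) / (1 - p_e)."""
    n = len(truth)
    p_o = accuracy(pred, truth)
    # Agreement expected by chance, from the marginal label frequencies.
    p_e = sum((pred.count(c) / n) * (truth.count(c) / n) for c in (0, 1))
    if p_e == 1:
        return 0.0  # kappa is undefined here; 0.0 is an arbitrary convention
    return (p_o - p_e) / (1 - p_e)


# Hypothetical metric values (one row per module, one column per metric).
rows = [
    [10, 1, 2],
    [12, 2, 3],
    [50, 9, 8],
    [45, 8, 9],
    [11, 1, 1],
    [60, 7, 7],
]
# Hypothetical reference labels, standing in for archive-derived defect data.
truth = [0, 0, 1, 1, 0, 0]

labels = cla_labels(rows)
print(labels, accuracy(labels, truth), cohen_kappa(labels, truth))
```

Kappa corrects raw accuracy for agreement expected by chance, which is presumably why the abstract reports both: on imbalanced defect data a high accuracy alone can be reached by predicting the majority class.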



Title: Software Defect Prediction on Unlabelled Datasets: A Comparative Study
Written by: Elisabetta Ronchieri, Marco Canaparo, Mauro Belgiovine
Publisher: Springer International Publishing
Book: Computational Science and Its Applications – ICCSA 2020
Print ISBN: 978-3-030-58801-4
Electronic ISBN: 978-3-030-58802-1
Copyright year: 2020
DOI: https://doi.org/10.1007/978-3-030-58802-1_25
