Published in: Automated Software Engineering 1/2017

22 March 2016

Label propagation based semi-supervised learning for software defect prediction

Authors: Zhi-Wu Zhang, Xiao-Yuan Jing, Tie-Jian Wang


Abstract

Software defect prediction automatically identifies defect-prone software modules so that testing effort can be allocated efficiently. When few modules have known defect labels, predicting the defect-prone ones becomes a challenging problem. Static software defect data exhibit two properties: similarity among modules, in that a module can be approximated by a sparse representation of a subset of the other modules, and class imbalance, in that defect-free modules far outnumber defective ones. In this paper, we propose a graph-based semi-supervised learning technique for software defect prediction. First, we apply a Laplacian score sampling strategy to the labeled defect-free modules to construct a class-balanced labeled training set. Next, we use a nonnegative sparse algorithm to compute the nonnegative sparse weights of a relationship graph, which serve as clustering indicators. Finally, on the nonnegative sparse graph, we use a label propagation algorithm to iteratively predict the labels of the unlabeled modules. The resulting approach, nonnegative sparse graph based label propagation (NSGLP), uses not only the few labeled modules but also the abundant unlabeled ones to improve generalization capability. We vary the proportion of labeled modules from 10% to 30% on datasets from the widely used NASA projects. Experimental results show that NSGLP outperforms several representative state-of-the-art semi-supervised software defect prediction methods, and that it can fully exploit the characteristics of static code metrics and improve the generalization capability of the software defect prediction model.
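The final step of the abstract, iterative label propagation over a weighted graph, can be illustrated with a minimal sketch. This is not the authors' NSGLP implementation: the graph here is a dense Gaussian affinity used as a stand-in for the nonnegative sparse graph, the Laplacian score sampling step is omitted, and the names `gaussian_affinity`, `propagate_labels`, `sigma`, and the toy module metrics are all hypothetical. Labels are assumed to be 1 for defect-prone, 0 for defect-free, and `None` for unlabeled.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Stand-in graph (assumption): dense Gaussian weights with a zero diagonal.

    The paper builds a nonnegative sparse graph instead; any nonnegative
    weight matrix can be passed to propagate_labels below.
    """
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights[i][j] = math.exp(-dist2 / (2.0 * sigma ** 2))
    return weights


def propagate_labels(weights, labels, n_classes=2, max_iter=1000, tol=1e-9):
    """Iteratively spread label scores along graph edges.

    Labeled rows are clamped to their known class; unlabeled rows take the
    weighted average of their neighbours' scores until the scores stabilise.
    """
    n = len(weights)
    # Row-normalise the weights so each row acts as a transition distribution.
    trans = []
    for row in weights:
        total = sum(row)
        trans.append([w / total if total > 0 else 0.0 for w in row])

    # Start unlabeled modules at a uniform score, labeled ones at one-hot.
    scores = [[1.0 / n_classes] * n_classes for _ in range(n)]
    for i, y in enumerate(labels):
        if y is not None:
            scores[i] = [1.0 if c == y else 0.0 for c in range(n_classes)]

    for _ in range(max_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(scores[i][:])  # clamp known labels
            else:
                updated.append([
                    sum(trans[i][j] * scores[j][c] for j in range(n))
                    for c in range(n_classes)
                ])
        delta = max(
            abs(updated[i][c] - scores[i][c])
            for i in range(n) for c in range(n_classes)
        )
        scores = updated
        if delta < tol:
            break

    # Predicted class is the one with the highest propagated score.
    return [max(range(n_classes), key=lambda c: scores[i][c]) for i in range(n)]


# Toy data (hypothetical): two well-separated groups of module metric vectors,
# with one labeled module per group.
modules = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
known = [0, None, None, 1, None, None]
predicted = propagate_labels(gaussian_affinity(modules), known)
```

In this toy setup each unlabeled module inherits the label of the labeled module in its own group, because cross-group edge weights are negligible. Swapping in a sparser, nonnegative weight matrix would change only the `weights` argument, not the propagation loop.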


Metadata
Title
Label propagation based semi-supervised learning for software defect prediction
Authors
Zhi-Wu Zhang
Xiao-Yuan Jing
Tie-Jian Wang
Publication date
22 March 2016
Publisher
Springer US
Published in
Automated Software Engineering / Issue 1/2017
Print ISSN: 0928-8910
Electronic ISSN: 1573-7535
DOI
https://doi.org/10.1007/s10515-016-0194-x
