Skip to main content
Top
Published in: Soft Computing 10/2018

08-03-2018 | Methodologies and Application

Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning

Authors: Xiao Yu, Man Wu, Yiheng Jian, Kwabena Ebo Bennin, Mandi Fu, Chuanxiang Ma

Published in: Soft Computing | Issue 10/2018

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Cross-company defect prediction (CCDP) is a practical way that trains a prediction model by exploiting one or multiple projects of a source company and then applies the model to a target company. Unfortunately, larger irrelevant cross-company (CC) data usually make it difficult to build a prediction model with high performance. On the other hand, brute force leveraging of CC data poorly related to within-company data may decrease the prediction model performance. To address such issues, we aim to provide an effective solution for CCDP. First, we propose a novel semi-supervised clustering-based data filtering method (i.e., SSDBSCAN filter) to filter out irrelevant CC data. Second, based on the filtered CC data, we for the first time introduce multi-source TrAdaBoost algorithm, an effective transfer learning method, into CCDP to import knowledge not from one but from multiple sources to avoid negative transfer. Experiments on 15 public datasets indicate that: (1) our proposed SSDBSCAN filter achieves better overall performance than compared data filtering methods; (2) our proposed CCDP approach achieves the best overall performance among all tested CCDP approaches; and (3) our proposed CCDP approach performs significantly better than with-company defect prediction models.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literature
go back to reference Arar Ömer Faruk, Ayan Kürşat (2015) Software defect prediction using cost-sensitive neural network. Appl Soft Comput 33:263–277CrossRef Arar Ömer Faruk, Ayan Kürşat (2015) Software defect prediction using cost-sensitive neural network. Appl Soft Comput 33:263–277CrossRef
go back to reference Bennin K, Keung J, Monden A, et al (2017) The significant effects of data sampling approaches on software defect prioritization and classification. In:11th International symposium on empirical software engineering and measurement, ESEM 2017 Bennin K, Keung J, Monden A, et al (2017) The significant effects of data sampling approaches on software defect prioritization and classification. In:11th International symposium on empirical software engineering and measurement, ESEM 2017
go back to reference Briand LC, Melo WL, Wust J (2002) Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Trans Softw Eng 28(7):706–720CrossRef Briand LC, Melo WL, Wust J (2002) Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Trans Softw Eng 28(7):706–720CrossRef
go back to reference Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357MATH Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357MATH
go back to reference Chen L, Fang B, Shang Z et al (2015) Negative samples reduction in cross-company software defects prediction. Inf Softw Technol 62:67–77CrossRef Chen L, Fang B, Shang Z et al (2015) Negative samples reduction in cross-company software defects prediction. Inf Softw Technol 62:67–77CrossRef
go back to reference Dai W et al (2007) Boosting for transfer learning. In: 24th International conference on Machine learning, pp 193–200 Dai W et al (2007) Boosting for transfer learning. In: 24th International conference on Machine learning, pp 193–200
go back to reference Dhanajayan RCG, Pillai SA (2016) SLMBC: spiral life cycle model-based Bayesian classification technique for efficient software fault prediction and classification, Soft Computing, 1-13 Dhanajayan RCG, Pillai SA (2016) SLMBC: spiral life cycle model-based Bayesian classification technique for efficient software fault prediction and classification, Soft Computing, 1-13
go back to reference Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. Softw J Syst Softw 81(5):649–660CrossRef Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. Softw J Syst Softw 81(5):649–660CrossRef
go back to reference Erturk Ezgi, Sezer Ebru Akcapinar (2016) Iterative software fault prediction with a hybrid approach. Appl Soft Comput 49:1020–1033CrossRef Erturk Ezgi, Sezer Ebru Akcapinar (2016) Iterative software fault prediction with a hybrid approach. Appl Soft Comput 49:1020–1033CrossRef
go back to reference Field AP (2001) Discovering statistics using SPSS for windows: advanced techniques for beginners, pp 551–552 Field AP (2001) Discovering statistics using SPSS for windows: advanced techniques for beginners, pp 551–552
go back to reference Fukunaga K, Narendra PM (1975) A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans Comput 100(7):750–753CrossRefMATH Fukunaga K, Narendra PM (1975) A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans Comput 100(7):750–753CrossRefMATH
go back to reference Gray D, Bowes D, Davey N, et al (2009) Using the support vector machine as a classification method for software defect prediction with static code metrics. In: International conference on engineering applications of neural networks. Springer, Berlin, pp 223–234 Gray D, Bowes D, Davey N, et al (2009) Using the support vector machine as a classification method for software defect prediction with static code metrics. In: International conference on engineering applications of neural networks. Springer, Berlin, pp 223–234
go back to reference Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304CrossRef Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304CrossRef
go back to reference Hosmer DW, Lemeshow S (2000) Introduction to the logistic regression model. Appl Logist Regres 1–30 Hosmer DW, Lemeshow S (2000) Introduction to the logistic regression model. Appl Logist Regres 1–30
go back to reference Jain K (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666CrossRef Jain K (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666CrossRef
go back to reference Jing X et al (2015) Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning. In Proceedings of the 10th joint meeting on foundations of software engineering, pp 496–507 Jing X et al (2015) Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning. In Proceedings of the 10th joint meeting on foundations of software engineering, pp 496–507
go back to reference Jing XY, Ying S, Zhang ZW, Wu SS, Liu J (2014) Dictionary learning based software defect prediction. In: Proceedings of the 36th International Conference on Software Engineering, pp 414–423 Jing XY, Ying S, Zhang ZW, Wu SS, Liu J (2014) Dictionary learning based software defect prediction. In: Proceedings of the 36th International Conference on Software Engineering, pp 414–423
go back to reference Kampenes V By et al (2007) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11):1073–1086CrossRef Kampenes V By et al (2007) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11):1073–1086CrossRef
go back to reference Kawata K, Amasaki S, Yokogawa T (2016) Improving relevancy filter methods for cross-project defect prediction, applied computing & information technology, pp 1–12 Kawata K, Amasaki S, Yokogawa T (2016) Improving relevancy filter methods for cross-project defect prediction, applied computing & information technology, pp 1–12
go back to reference Laradji IH, Alshayeb M, Ghouti L (2015) Software defect prediction using ensemble learning on selected features. Inf. Softw. Technol. 58:388–402CrossRef Laradji IH, Alshayeb M, Ghouti L (2015) Software defect prediction using ensemble learning on selected features. Inf. Softw. Technol. 58:388–402CrossRef
go back to reference Lelis L, Sander J (2009) Semi-supervised density-based clustering. In: 9th IEEE international conference on data mining, pp 842–847 Lelis L, Sander J (2009) Semi-supervised density-based clustering. In: 9th IEEE international conference on data mining, pp 842–847
go back to reference Lewis DD (1998) Naive (Bayes) at forty the independence assumption in information retrieval. In: European conference on machine learning, pp 4–15 Lewis DD (1998) Naive (Bayes) at forty the independence assumption in information retrieval. In: European conference on machine learning, pp 4–15
go back to reference Ma Y, Luo G, Zeng X, Chen A (2012) Transfer learning for cross-company software defect prediction. Inf Softw Technol 54(3):248–256CrossRef Ma Y, Luo G, Zeng X, Chen A (2012) Transfer learning for cross-company software defect prediction. Inf Softw Technol 54(3):248–256CrossRef
go back to reference Malhotra Ruchika (2015) A systematic review of machine learning techniques for software fault prediction. Appl Soft Comput 27:504–518CrossRef Malhotra Ruchika (2015) A systematic review of machine learning techniques for software fault prediction. Appl Soft Comput 27:504–518CrossRef
go back to reference Mesquita PP Diego et al (2016) Classification with reject option for software defect prediction. Appl Soft Comput 49:1085–1093CrossRef Mesquita PP Diego et al (2016) Classification with reject option for software defect prediction. Appl Soft Comput 49:1085–1093CrossRef
go back to reference Nam J, Pan SJ, Kim S (2013) Transfer defect learning. In: Proceedings of the 2013 international conference on software engineering. IEEE Press, pp 382–391 Nam J, Pan SJ, Kim S (2013) Transfer defect learning. In: Proceedings of the 2013 international conference on software engineering. IEEE Press, pp 382–391
go back to reference Peng L, Yang B, Chen Y, Abraham A (2009) Data gravitation based classification. Inf Sci 179(6):809–819CrossRefMATH Peng L, Yang B, Chen Y, Abraham A (2009) Data gravitation based classification. Inf Sci 179(6):809–819CrossRefMATH
go back to reference Peters F, Menzies T, Marcus A (2013) Better cross company defect prediction. In: Proceedings of the 10th international workshop on mining software repositories, pp 409–418 Peters F, Menzies T, Marcus A (2013) Better cross company defect prediction. In: Proceedings of the 10th international workshop on mining software repositories, pp 409–418
go back to reference Ryu D, Jang JI, Baik J (2015) A hybrid instance selection using nearest-neighbor for cross-project defect prediction. J Comput Sci Technol 30(5):969–980CrossRef Ryu D, Jang JI, Baik J (2015) A hybrid instance selection using nearest-neighbor for cross-project defect prediction. J Comput Sci Technol 30(5):969–980CrossRef
go back to reference Seliya N, Khoshgoftaar TM (2011) The use of decision trees for cost- sensitive classification: an empirical study in software quality prediction. Wiley Interdiscip Rev Data Min Knowl Discov 1(5):448–459CrossRef Seliya N, Khoshgoftaar TM (2011) The use of decision trees for cost- sensitive classification: an empirical study in software quality prediction. Wiley Interdiscip Rev Data Min Knowl Discov 1(5):448–459CrossRef
go back to reference Shepperd M, Bowes D, Hall T (2014) Researcher bias The use of machine learning in software defect prediction. IEEE Trans Softw Eng 40(6):603–616CrossRef Shepperd M, Bowes D, Hall T (2014) Researcher bias The use of machine learning in software defect prediction. IEEE Trans Softw Eng 40(6):603–616CrossRef
go back to reference Shukla S, Radhakrishnan T, Muthukumaran K, et al (2016) Multi-objective cross-version defect prediction, Soft Computing 1-22 Shukla S, Radhakrishnan T, Muthukumaran K, et al (2016) Multi-objective cross-version defect prediction, Soft Computing 1-22
go back to reference Siers MJ, Islam MZ (2015) Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Inf. Syst. 51:62–71CrossRef Siers MJ, Islam MZ (2015) Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Inf. Syst. 51:62–71CrossRef
go back to reference Song Q, Jia Z, Shepperd M et al (2011) A general software defect-proneness prediction framework. IEEE Trans Softw Eng 37(3):356–370CrossRef Song Q, Jia Z, Shepperd M et al (2011) A general software defect-proneness prediction framework. IEEE Trans Softw Eng 37(3):356–370CrossRef
go back to reference Sun Z, Song Q, Zhu X (2012) Using coding-based ensemble learning to improve software defect prediction. IEEE Trans Syst Man Cybern Part C (Appl Rev) 42(6):1806–1817CrossRef Sun Z, Song Q, Zhu X (2012) Using coding-based ensemble learning to improve software defect prediction. IEEE Trans Syst Man Cybern Part C (Appl Rev) 42(6):1806–1817CrossRef
go back to reference Turhan B, Menzies T, Bener AB, Di Stefano J (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Softw Eng 14(5):540–578CrossRef Turhan B, Menzies T, Bener AB, Di Stefano J (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Softw Eng 14(5):540–578CrossRef
go back to reference Turhan B, Tosun Mısırlı A, Bener A (2013) Empirical evaluation of the effects of mixed project data on learning defect predictors. Inf Softw Technol 55(6):1101–1118CrossRef Turhan B, Tosun Mısırlı A, Bener A (2013) Empirical evaluation of the effects of mixed project data on learning defect predictors. Inf Softw Technol 55(6):1101–1118CrossRef
go back to reference Vashisht V, Lal M, Sureshchandar GS et al (2015) A framework for software defect prediction using neural networks. J Softw Eng Appl 8(8):384CrossRef Vashisht V, Lal M, Sureshchandar GS et al (2015) A framework for software defect prediction using neural networks. J Softw Eng Appl 8(8):384CrossRef
go back to reference Wang J, Shen B, Chen Y (2012) Compressed C4. 5 models for software defect prediction. In: 12th international conference on quality software, pp 13–16 Wang J, Shen B, Chen Y (2012) Compressed C4. 5 models for software defect prediction. In: 12th international conference on quality software, pp 13–16
go back to reference Wilcoxon Frank (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83CrossRef Wilcoxon Frank (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83CrossRef
go back to reference Yan Z, Chen X, Guo P (2010) Software defect prediction using fuzzy support vector regression. In: International Symposium on Neural Networks. Springer, Berlin Heidelberg, pp 17–24 Yan Z, Chen X, Guo P (2010) Software defect prediction using fuzzy support vector regression. In: International Symposium on Neural Networks. Springer, Berlin Heidelberg, pp 17–24
go back to reference Yao Y, Doretto G (2010) Boosting for transfer learning with multiple sources. In: IEEE conference on computer vision and pattern recognition, pp 1855–1862 Yao Y, Doretto G (2010) Boosting for transfer learning with multiple sources. In: IEEE conference on computer vision and pattern recognition, pp 1855–1862
go back to reference Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pp 91–100 Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pp 91–100
Metadata
Title
Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning
Authors
Xiao Yu
Man Wu
Yiheng Jian
Kwabena Ebo Bennin
Mandi Fu
Chuanxiang Ma
Publication date
08-03-2018
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 10/2018
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-018-3093-1

Other articles of this Issue 10/2018

Soft Computing 10/2018 Go to the issue

Premium Partner