Skip to main content
Top
Published in: Empirical Software Engineering 6/2017

14-04-2017

Data Transformation in Cross-project Defect Prediction

Authors: Feng Zhang, Iman Keivanloo, Ying Zou

Published in: Empirical Software Engineering | Issue 6/2017

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Software metrics rarely follow a normal distribution. Therefore, software metrics are usually transformed prior to building a defect prediction model. To the best of our knowledge, the impact that the transformation has on cross-project defect prediction models has not been thoroughly explored. A cross-project model is built from one project and applied on another project. In this study, we investigate if cross-project defect prediction is affected by applying different transformations (i.e., log and rank transformations, as well as the Box-Cox transformation). The Box-Cox transformation subsumes log and other power transformations (e.g., square root), but has not been studied in the defect prediction literature. We propose an approach, namely Multiple Transformations (MT), to utilize multiple transformations for cross-project defect prediction. We further propose an enhanced approach MT+ to use the parameter of the Box-Cox transformation to determine the most appropriate training project for each target project. Our experiments are conducted upon three publicly available data sets (i.e., AEEEM, ReLink, and PROMISE). Comparing to the random forest model built solely using the log transformation, our MT+ approach improves the F-measure by 7, 59 and 43% for the three data sets, respectively. As a summary, our major contributions are three-fold: 1) conduct an empirical study on the impact that data transformation has on cross-project defect prediction models; 2) propose an approach to utilize the various information retained by applying different transformation methods; and 3) propose an unsupervised approach to select the most appropriate training project for each target project.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Appendix
Available only for authorised users
Literature
go back to reference Bettenburg N, Nagappan M, Hassan AE (2012) Think locally, act globally: improving defect and effort prediction models Proceedings of the 9th IEEE working conference on mining software repositories, MSR ’12, pp 60–69CrossRef Bettenburg N, Nagappan M, Hassan AE (2012) Think locally, act globally: improving defect and effort prediction models Proceedings of the 9th IEEE working conference on mining software repositories, MSR ’12, pp 60–69CrossRef
go back to reference Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc Ser B Methodol 26(2):211–252MATH Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc Ser B Methodol 26(2):211–252MATH
go back to reference Breslow NE, Day NE (1980) Statistical methods in cancer research. vol. 1. the analysis of case-control studies. International Agency for Research on Cancer Scientific Publications 1(32):338 Breslow NE, Day NE (1980) Statistical methods in cancer research. vol. 1. the analysis of case-control studies. International Agency for Research on Cancer Scientific Publications 1(32):338
go back to reference Canfora G, De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2013) Multi-objective cross-project defect prediction 2013 IEEE sixth international conference on software testing, verification and validation (ICST), pp 252–261CrossRef Canfora G, De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2013) Multi-objective cross-project defect prediction 2013 IEEE sixth international conference on software testing, verification and validation (ICST), pp 252–261CrossRef
go back to reference Cohen J, Cohen P, West S, Aiken L (2003) Applied multiple Regression/Correlation analysis for the behavioral sciences, 3rd edn. Lawrence Erlbaum, Mahwah, NY, USA Cohen J, Cohen P, West S, Aiken L (2003) Applied multiple Regression/Correlation analysis for the behavioral sciences, 3rd edn. Lawrence Erlbaum, Mahwah, NY, USA
go back to reference Concas G, Marchesi M, Pinna S, Serra N (2007) Power-laws in a large object-oriented software system. IEEE Trans Softw Eng 33(10):687–708CrossRef Concas G, Marchesi M, Pinna S, Serra N (2007) Power-laws in a large object-oriented software system. IEEE Trans Softw Eng 33(10):687–708CrossRef
go back to reference Cruz A, Ochimizu K (2009) Towards logistic regression models for predicting fault-prone code across software projects ESEM 2009. 3rd international symposium on empirical software engineering and measurement 2009, pp 460–463CrossRef Cruz A, Ochimizu K (2009) Towards logistic regression models for predicting fault-prone code across software projects ESEM 2009. 3rd international symposium on empirical software engineering and measurement 2009, pp 460–463CrossRef
go back to reference D’Ambros M, Lanza M, Robbes R (2010) An extensive comparison of bug prediction approaches Proceedings of the 7th IEEE working conference on mining software repositories, MSR’10, pp 31– 41 D’Ambros M, Lanza M, Robbes R (2010) An extensive comparison of bug prediction approaches Proceedings of the 7th IEEE working conference on mining software repositories, MSR’10, pp 31– 41
go back to reference Fukushima T, Kamei Y, McIntosh S, Yamashita K, Ubayashi N (2014) An empirical study of just-in-time defect prediction using cross-project models Proceedings of the working conference on mining software repositories, ACM, MSR’14, pp 172–181 Fukushima T, Kamei Y, McIntosh S, Yamashita K, Ubayashi N (2014) An empirical study of just-in-time defect prediction using cross-project models Proceedings of the working conference on mining software repositories, ACM, MSR’14, pp 172–181
go back to reference Guo W (2014) A unified approach to data transformation and outlier detection using penalized assessment. PhD thesis University of Cincinnati, Arts and Sciences: Mathematical Sciences Guo W (2014) A unified approach to data transformation and outlier detection using penalized assessment. PhD thesis University of Cincinnati, Arts and Sciences: Mathematical Sciences
go back to reference Han J, Kamber M, Pei J (2012) Data Mining: concepts and techniques, 3rd edn. Morgan Kaufmann , BostonMATH Han J, Kamber M, Pei J (2012) Data Mining: concepts and techniques, 3rd edn. Morgan Kaufmann , BostonMATH
go back to reference He Z, Shu F, Yang Y, Li M, Wang Q (2012) An investigation on the feasibility of cross-project defect prediction. Autom Softw Eng 19(2):167–199CrossRef He Z, Shu F, Yang Y, Li M, Wang Q (2012) An investigation on the feasibility of cross-project defect prediction. Autom Softw Eng 19(2):167–199CrossRef
go back to reference He Z, Peters F, Menzies T, Yang Y (2013) Learning from open-source projects: an empirical study on defect prediction 2013 ACM/IEEE international symposium on empirical software engineering and measurement, pp 45–54CrossRef He Z, Peters F, Menzies T, Yang Y (2013) Learning from open-source projects: an empirical study on defect prediction 2013 ACM/IEEE international symposium on empirical software engineering and measurement, pp 45–54CrossRef
go back to reference Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press, New York, NY, USACrossRefMATH Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press, New York, NY, USACrossRefMATH
go back to reference Jiang Y, Cukic B, Menzies T (2008) Can data transformation help in the detection of fault-prone modules? Proceedings of the 2008 workshop on defects in large software systems, DEFECTS ’08, pp 16–20CrossRef Jiang Y, Cukic B, Menzies T (2008) Can data transformation help in the detection of fault-prone modules? Proceedings of the 2008 workshop on defects in large software systems, DEFECTS ’08, pp 16–20CrossRef
go back to reference Jing X, Wu F, Dong X, Qi F, Xu B (2015) Heterogeneous cross-company defect prediction by unified metric representation and cca-based transfer learning Proceedings of the 2015 10th joint meeting on foundations of software engineering, ACM, New York, NY, USA, ESEC/FSE 2015, pp 496– 507CrossRef Jing X, Wu F, Dong X, Qi F, Xu B (2015) Heterogeneous cross-company defect prediction by unified metric representation and cca-based transfer learning Proceedings of the 2015 10th joint meeting on foundations of software engineering, ACM, New York, NY, USA, ESEC/FSE 2015, pp 496– 507CrossRef
go back to reference Jing XY, Wu F, Dong X, Xu B (2016) An improved sda based defect prediction framework for both within-project and cross-project class-imbalance problems. IEEE Trans Soft Eng PP(99):1–1 Jing XY, Wu F, Dong X, Xu B (2016) An improved sda based defect prediction framework for both within-project and cross-project class-imbalance problems. IEEE Trans Soft Eng PP(99):1–1
go back to reference Jureczko M, Madeyski L (2010) Towards identifying software project clusters with regard to defect prediction Proceedings of the 6th international conference on predictive models in software engineering, PROMISE ’10, pp 9:1–9:10 Jureczko M, Madeyski L (2010) Towards identifying software project clusters with regard to defect prediction Proceedings of the 6th international conference on predictive models in software engineering, PROMISE ’10, pp 9:1–9:10
go back to reference Keren G, Lewis C (1993) A handbook for data analysis in the behavioral sciences: statistical issues. Lawrence Erlbaum Hillsdale, NY, USAMATH Keren G, Lewis C (1993) A handbook for data analysis in the behavioral sciences: statistical issues. Lawrence Erlbaum Hillsdale, NY, USAMATH
go back to reference Kim S, Zhang H, Wu R, Gong L (2011) Dealing with noise in defect prediction Proceedings of the 33rd international conference on software engineering, ICSE ’11, pp 481–490 Kim S, Zhang H, Wu R, Gong L (2011) Dealing with noise in defect prediction Proceedings of the 33rd international conference on software engineering, ICSE ’11, pp 481–490
go back to reference Kuhn M, Johnson K (2013) Data pre-processing Applied predictive modeling. Springer, New York, pp 27–59CrossRef Kuhn M, Johnson K (2013) Data pre-processing Applied predictive modeling. Springer, New York, pp 27–59CrossRef
go back to reference Louridas P, Spinellis D, Vlachos V (2008) Power laws in software. ACM Trans Softw Eng Methodol 18(1):2:1–2:26CrossRef Louridas P, Spinellis D, Vlachos V (2008) Power laws in software. ACM Trans Softw Eng Methodol 18(1):2:1–2:26CrossRef
go back to reference Ma Y, Luo G, Zeng X, Chen A (2012) Transfer learning for cross-company software defect prediction. Inf Softw Technol 54(3):248–256CrossRef Ma Y, Luo G, Zeng X, Chen A (2012) Transfer learning for cross-company software defect prediction. Inf Softw Technol 54(3):248–256CrossRef
go back to reference Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng (TSE) 33(1):2–13CrossRef Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng (TSE) 33(1):2–13CrossRef
go back to reference Menzies T, Butcher A, Cok D, Marcus A, Layman L, Shull F, Turhan B, Zimmermann T (2013) Local versus global lessons for defect prediction and effort estimation. IEEE Trans Softw Eng 39(6):822–834CrossRef Menzies T, Butcher A, Cok D, Marcus A, Layman L, Shull F, Turhan B, Zimmermann T (2013) Local versus global lessons for defect prediction and effort estimation. IEEE Trans Softw Eng 39(6):822–834CrossRef
go back to reference Misirli AT, Bener AB, Turhan B (2011) An industrial case study of classifier ensembles for locating software defects. Softw Qual J 19(3):515–536CrossRef Misirli AT, Bener AB, Turhan B (2011) An industrial case study of classifier ensembles for locating software defects. Softw Qual J 19(3):515–536CrossRef
go back to reference Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction Proceedings of the 30th international conference on software engineering, ICSE ’08, pp 181–190 Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction Proceedings of the 30th international conference on software engineering, ICSE ’08, pp 181–190
go back to reference Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures Proceedings of the 28th international conference on software engineering, ACM, ICSE ’06, pp 452–461CrossRef Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures Proceedings of the 28th international conference on software engineering, ACM, ICSE ’06, pp 452–461CrossRef
go back to reference Nam J, Kim S (2015) Heterogeneous defect prediction Proceedings of the 2015 10th joint meeting on foundations of software engineering, ACM, New York, NY, USA, ESEC/FSE, 2015, pp 508–519CrossRef Nam J, Kim S (2015) Heterogeneous defect prediction Proceedings of the 2015 10th joint meeting on foundations of software engineering, ACM, New York, NY, USA, ESEC/FSE, 2015, pp 508–519CrossRef
go back to reference Nam J, Pan SJ, Kim S (2013) Transfer defect learning Proceedings of the 2013 international conference on software engineering, ICSE ’13, pp 382–391 Nam J, Pan SJ, Kim S (2013) Transfer defect learning Proceedings of the 2013 international conference on software engineering, ICSE ’13, pp 382–391
go back to reference Osborne JW (2008) 13 best practices in data transformation: the overlooked effect of minimum values, 0 edn, SAGE Publications, Inc., pp 197–205 Osborne JW (2008) 13 best practices in data transformation: the overlooked effect of minimum values, 0 edn, SAGE Publications, Inc., pp 197–205
go back to reference Osborne JW (2010) Improving your data transformations: applying the box-cox transformation. Practical Assessment Research & Evaluation 15(12) Osborne JW (2010) Improving your data transformations: applying the box-cox transformation. Practical Assessment Research & Evaluation 15(12)
go back to reference Panichella A, Oliveto R, De Lucia A (2014) Cross-project defect prediction models: L’union fait la force 2014 software evolution week - IEEE conference on software maintenance, reengineering and reverse engineering (CSMR-WCRE), pp 164–173CrossRef Panichella A, Oliveto R, De Lucia A (2014) Cross-project defect prediction models: L’union fait la force 2014 software evolution week - IEEE conference on software maintenance, reengineering and reverse engineering (CSMR-WCRE), pp 164–173CrossRef
go back to reference Rahman F, Posnett D, Devanbu P (2012) Recalling the “imprecision” of cross-project defect prediction Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering, FSE ’12, pp 61:1–61:11 Rahman F, Posnett D, Devanbu P (2012) Recalling the “imprecision” of cross-project defect prediction Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering, FSE ’12, pp 61:1–61:11
go back to reference Romano J, Kromrey JD, Coraggio J, Skowronek J (2006) Appropriate statistics for ordinal level data: should we really be using t-test and cohen’s d for evaluating group differences on the nsse and other surveys? meeting of the Florida association of institutional research, pp 1–33 Romano J, Kromrey JD, Coraggio J, Skowronek J (2006) Appropriate statistics for ordinal level data: should we really be using t-test and cohen’s d for evaluating group differences on the nsse and other surveys? meeting of the Florida association of institutional research, pp 1–33
go back to reference Selim G, Barbour L, Shang W, Adams B, Hassan A, Zou Y (2010) Studying the impact of clones on software defects Proceeddings of the 17th working conference on reverse engineering, pp 13–21 Selim G, Barbour L, Shang W, Adams B, Hassan A, Zou Y (2010) Studying the impact of clones on software defects Proceeddings of the 17th working conference on reverse engineering, pp 13–21
go back to reference Shang H (2014) Selection of the optimal box–cox transformation parameter for modelling and forecasting age-specific fertility. J Popul Res pp 1–11 Shang H (2014) Selection of the optimal box–cox transformation parameter for modelling and forecasting age-specific fertility. J Popul Res pp 1–11
go back to reference Sheskin DJ (2007) Handbook of parametric and nonparametric statistical procedures, 4th edn. Chapman & Hall/CRC Sheskin DJ (2007) Handbook of parametric and nonparametric statistical procedures, 4th edn. Chapman & Hall/CRC
go back to reference Song Q, Jia Z, Shepperd M, Ying S, Liu J (2011) A general software defect-proneness prediction framework. IEEE Trans Softw Eng 37(3):356–370CrossRef Song Q, Jia Z, Shepperd M, Ying S, Liu J (2011) A general software defect-proneness prediction framework. IEEE Trans Softw Eng 37(3):356–370CrossRef
go back to reference Succi G, Pedrycz W, Djokic S, Zuliani P, Russo B (2005) An empirical exploration of the distributions of the chidamber and kemerer object-oriented metrics suite. Empir Softw Eng 10(1):81–104CrossRef Succi G, Pedrycz W, Djokic S, Zuliani P, Russo B (2005) An empirical exploration of the distributions of the chidamber and kemerer object-oriented metrics suite. Empir Softw Eng 10(1):81–104CrossRef
go back to reference Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Automated parameter optimization of classification techniques for defect prediction models Proceedings of the 38th international conference on software engineering, ACM, ICSE’16, pp 321–332 Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Automated parameter optimization of classification techniques for defect prediction models Proceedings of the 38th international conference on software engineering, ACM, ICSE’16, pp 321–332
go back to reference Triola M (2004) Elementary statistics. Pearson/Addison-Wesley Triola M (2004) Elementary statistics. Pearson/Addison-Wesley
go back to reference Turhan B, Misirli AT, Bener AB (2013) Empirical evaluation of the effects of mixed project data on learning defect predictors. Inf Softw Technol 55(6):1101–1118CrossRef Turhan B, Misirli AT, Bener AB (2013) Empirical evaluation of the effects of mixed project data on learning defect predictors. Inf Softw Technol 55(6):1101–1118CrossRef
go back to reference Wu R, Zhang H, Kim S, Cheung SC (2011) Relink: recovering links between bugs and changes Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering, ESEC/FSE ’11, pp 15–25 Wu R, Zhang H, Kim S, Cheung SC (2011) Relink: recovering links between bugs and changes Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering, ESEC/FSE ’11, pp 15–25
go back to reference Xia X, Lo D, Shihab E, Wang X, Yang X (2015) Elblocker: predicting blocking bugs with ensemble imbalance learning. Inf Softw Technol 61:93–106CrossRef Xia X, Lo D, Shihab E, Wang X, Yang X (2015) Elblocker: predicting blocking bugs with ensemble imbalance learning. Inf Softw Technol 61:93–106CrossRef
go back to reference Yin RK (2002) Case study research: design and methods, 3rd edn. SAGE Publications Yin RK (2002) Case study research: design and methods, 3rd edn. SAGE Publications
go back to reference Zhang F, Mockus A, Zou Y, Khomh F, Hassan AE (2013) How does context affect the distribution of software maintainability metrics? Proceedings of the 29th IEEE international conference on software maintainability, ICSM ’13, pp 350–359 Zhang F, Mockus A, Zou Y, Khomh F, Hassan AE (2013) How does context affect the distribution of software maintainability metrics? Proceedings of the 29th IEEE international conference on software maintainability, ICSM ’13, pp 350–359
go back to reference Zhang F, Mockus A, Keivanloo I, Zou Y (2014) Towards building a universal defect prediction model Proceedings of the 11th working conference on mining software repositories, MSR ’14, pp 41–50 Zhang F, Mockus A, Keivanloo I, Zou Y (2014) Towards building a universal defect prediction model Proceedings of the 11th working conference on mining software repositories, MSR ’14, pp 41–50
go back to reference Zhang F, Mockus A, Keivanloo I, Zou Y (2015) Towards building a universal defect prediction model with rank transformed predictors. Empir Soft Eng pp 1–39 Zhang F, Mockus A, Keivanloo I, Zou Y (2015) Towards building a universal defect prediction model with rank transformed predictors. Empir Soft Eng pp 1–39
go back to reference Zhang F, Zheng Q, Zou Y, Hassan AE (2016) Cross-project defect prediction using a connectivity-based unsupervised classifier Proceedings of the 38th international conference on software engineering, ICSE ’16, pp 309–320 Zhang F, Zheng Q, Zou Y, Hassan AE (2016) Cross-project defect prediction using a connectivity-based unsupervised classifier Proceedings of the 38th international conference on software engineering, ICSE ’16, pp 309–320
go back to reference Zhang H (2009) Discovering power laws in computer programs. Inf Process Manag 45(4):477–483CrossRef Zhang H (2009) Discovering power laws in computer programs. Inf Process Manag 45(4):477–483CrossRef
go back to reference Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE ’09, pp 91–100 Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE ’09, pp 91–100
Metadata
Title
Data Transformation in Cross-project Defect Prediction
Authors
Feng Zhang
Iman Keivanloo
Ying Zou
Publication date
14-04-2017
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 6/2017
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-017-9516-2

Other articles of this Issue 6/2017

Empirical Software Engineering 6/2017 Go to the issue

Premium Partner