Published in: Empirical Software Engineering 4/2013

01.08.2013

On the value of outlier elimination on software effort estimation research

Authors: Yeong-Seok Seo, Doo-Hwan Bae


Abstract

Producing accurate and reliable software effort estimates has always been a challenge for both academic research and the software industry. Data quality is an important factor affecting the accuracy of effort estimation methods. To assess its impact, we investigated the effect of eliminating outliers on the estimation accuracy of commonly used software effort estimation methods. Guided by three research questions, we analyzed the influence of outlier elimination on estimation accuracy by applying five outlier elimination methods (least trimmed squares, Cook’s distance, K-means clustering, box plot, and the Mantel leverage metric) and two effort estimation methods (least squares regression and estimation by analogy with varied parameters). Empirical experiments were performed on industrial data sets (ISBSG Release 9, the Bank and Stock data sets collected from financial companies, and the Desharnais data set in the PROMISE repository). The effect of the outlier elimination methods was evaluated with statistical tests (the Friedman test and the Wilcoxon signed rank test). The experimental results derived from the evaluation criteria showed no substantial difference between the effort estimation results with and without outlier elimination. However, statistical analysis indicated that outlier elimination leads to a significant improvement in estimation accuracy on the Stock data set for some combinations of outlier elimination and effort estimation methods. In addition, although outlier elimination did not significantly improve estimation accuracy on the other data sets, our graphical analysis of errors showed that it can increase the likelihood of producing more accurate effort estimates for new software projects. Therefore, from a practical point of view, it is necessary to consider outlier elimination and to analyze the effort estimation results in detail in order to improve the accuracy of software effort estimation in software organizations.
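
As a rough illustration of the kind of pipeline the abstract describes, the following Python sketch removes outliers with a box-plot rule and then fits a least squares line of effort on size. The project data, the function names (`boxplot_filter`, `fit_least_squares`), the 1.5 × IQR fence, and the choice to screen on effort alone are assumptions made for this example; they are not the authors' procedure or data.

```python
from statistics import quantiles

# Hypothetical (size, effort) pairs; the last project is an obvious outlier.
projects = [
    (100.0, 1000.0), (120.0, 1200.0), (150.0, 1500.0), (180.0, 1800.0),
    (200.0, 2000.0), (220.0, 2200.0), (250.0, 2500.0), (130.0, 20000.0),
]

def boxplot_filter(data, k=1.5):
    """Keep projects whose effort lies within the box-plot fences.

    Assumption: fences are Q1 - k*IQR and Q3 + k*IQR on effort only.
    """
    efforts = [effort for _, effort in data]
    q1, _, q3 = quantiles(efforts, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(size, effort) for size, effort in data if low <= effort <= high]

def fit_least_squares(data):
    """Fit effort = slope * size + intercept by ordinary least squares."""
    n = len(data)
    mean_x = sum(size for size, _ in data) / n
    mean_y = sum(effort for _, effort in data) / n
    sxx = sum((size - mean_x) ** 2 for size, _ in data)
    sxy = sum((size - mean_x) * (effort - mean_y) for size, effort in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

kept = boxplot_filter(projects)
print("with outliers:   ", fit_least_squares(projects))
print("without outliers:", fit_least_squares(kept))
```

Comparing the two fitted lines on held-out projects, rather than on the training data, is closer in spirit to the question the study asks.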

Footnotes
1
The situation in Fig. 2 may be caused by data points with very different types of solutions, which does not necessarily have to do with outliers. However, the situation can be recognized as part of the outlier problem according to the definition of outliers used in software organizations. That is, because the definition of outliers is subjective and usually differs between software organizations, data points with very different types of solutions can be identified as outliers.
 
2
CMMI is awarded by Carnegie Mellon University’s Software Engineering Institute (SEI) and is a software development process improvement approach for which the goal is to help organizations improve their performance. At maturity level 3, the organization’s set of standard processes is well established and improved over time. Projects establish their defined processes by tailoring the organization’s set of standard processes according to tailoring guidelines (Chrissis et al. 2003).
 
3
Note that, when K is equal to 1 and any similarity function is selected, all of the calculations for the final effort estimate (mean, median, and weighted mean) give the same results. Moreover, when K is equal to 2 and any similarity function is selected, the mean and the median give the same results.
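
The note above can be checked with a small Python sketch of estimation by analogy. The Euclidean distance, the inverse-distance weights used for the weighted mean, the function name `estimate_by_analogy`, and the sample projects are assumptions chosen for illustration; the paper's similarity functions and weighting scheme may differ.

```python
from statistics import mean, median

def euclidean(a, b):
    """Euclidean distance between two feature tuples (assumed similarity)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def estimate_by_analogy(history, target, k, combine="mean"):
    """Estimate effort from the K most similar past projects.

    history: list of (features, effort) pairs; target: feature tuple.
    """
    nearest = sorted(history, key=lambda p: euclidean(p[0], target))[:k]
    efforts = [effort for _, effort in nearest]
    if combine == "mean":
        return mean(efforts)
    if combine == "median":
        return median(efforts)
    if combine == "weighted_mean":
        # Assumed inverse-distance weights; epsilon avoids division by zero.
        weights = [1.0 / (euclidean(f, target) + 1e-9) for f, _ in nearest]
        return sum(w * e for w, e in zip(weights, efforts)) / sum(weights)
    raise ValueError(f"unknown combination rule: {combine}")

# Hypothetical past projects: ((size, team size), effort).
history = [
    ((100.0, 5.0), 1200.0),
    ((120.0, 6.0), 1500.0),
    ((300.0, 12.0), 4000.0),
]
target = (110.0, 5.0)

for k in (1, 2):
    for rule in ("mean", "median", "weighted_mean"):
        print(k, rule, estimate_by_analogy(history, target, k, rule))
```

With K = 1 only one analogue contributes, so every combination rule returns its effort; with K = 2 the median of two values is their average, which is the mean.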
 
References
Agulló J, Croux C, Van Aelst S (2008) The multivariate least-trimmed squares estimator. J Multivar Anal 99(3):311–338
Alpaydin E (2004) Introduction to machine learning. MIT Press, Cambridge
Barret V, Lewis T (1994) Outliers in statistical data, 3rd edn. Wiley, New York
Cartwright MH, Shepperd MJ, Song Q (2003) Dealing with missing software project data. In: Proc 9th international software metrics symposium (METRICS ’03), pp 154–165
Chan V, Wong W (2007) Outlier elimination in construction of software metric models. In: Proc the 22nd ACM symposium on applied computing (SAC ’07), pp 1484–1488
Chiu NH, Huang SJ (2007) The adjusted analogy-based software effort estimation based on similarity distances. J Syst Softw 80(4):628–640
Chrissis MB, Konrad M, Shrum S (2003) CMMI: guidelines for process integration and product improvement. Addison-Wesley Professional
Conte S, Dunsmore H, Shen V (1986) Software engineering metrics and models. Benjamin/Cummings Publishing Company
de Barcelos Tronto I, da Silva J, Sant’Anna N (2007) Comparison of artificial neural network and regression models in software effort estimation. In: Proc 2007 international joint conference on neural networks (IJCNN ’07), pp 771–776
Desharnais J (1989) Analyse statistique de la productivitie des projets informatique a partie de la technique des point des fonction. Masters thesis, University of Montreal
Field A (2009) Discovering statistics using SPSS, 3rd edn. Sage Publications Ltd
Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995
Hamilton L (1992) Regression with graphics: a second course in applied statistics. Duxbury Press
Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann
Huang SJ, Chiu NH (2006) Optimization of analogy weights by genetic algorithm for software effort estimation. Inf Softw Technol 48(11):1034–1045
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Jeffery R, Ruhe M, Wieczorek I (2000) A comparative study of two software development cost modeling techniques using multi-organizational and company-specific data. Inf Softw Technol 42(14):1009–1016
Jeffery R, Ruhe M, Wieczorek I (2001) Using public domain metrics to estimate software development effort. In: Proc 7th IEEE international software metrics symposium (METRICS ’01), pp 16–27
Jorgensen M, Shepperd MJ (2007) A systematic review of software development cost estimation studies. IEEE Trans Softw Eng 33(1):33–53
Keung JW, Kitchenham BA, Jeffery DR (2008) Analogy-X: providing statistical inference to analogy-based software cost estimation. IEEE Trans Softw Eng 34(4):471–484
Kirsopp C, Shepperd MJ (2002) Making inferences with small numbers of training sets. IEE Proc Softw 149(5):123–130
Kitchenham B, MacDonell S, Pickard L, Shepperd MJ (1999) Assessing prediction systems. The Information Science Discussion Paper Series, University of Otago
Kocaguneli E, Menzies T, Bener A, Keung J (2012) Exploiting the essential assumptions of analogy-based effort estimation. IEEE Trans Softw Eng 38(2):425–438
Kultur Y, Turhan B, Bener AB (2008) ENNA: software effort estimation using ensemble of neural networks with associative memory. In: Proc 16th ACM SIGSOFT international symposium on foundations of software engineering (FSE ’08), pp 330–338
Li YF, Xie M, Goh TN (2009) A study of project selection and feature weighting for analogy based software cost estimation. J Syst Softw 82(2):241–252
Liu Q, Qin W, Mintram R, Ross M (2008) Evaluation of preliminary data analysis framework in software cost estimation based on ISBSG R9 data. Softw Q J 16(3):411–458
Lokan C, Mendes E (2006) Cross-company and single-company effort models using the ISBSG database: a further replicated study. In: Proc 2006 ACM/IEEE international symposium on empirical software engineering (ISESE ’06), pp 75–84
MacDonell SG, Shepperd MJ (2003) Combining techniques to optimize effort predictions in software project management. J Syst Softw 66(2):91–98
Mair C, Shepperd MJ (2005) The consistency of empirical comparisons of regression and analogy-based software project cost prediction. In: Proc 2005 ACM/IEEE international symposium on empirical software engineering (ISESE ’05), pp 509–518
Maxwell KD (2002) Applied statistics for software managers. Prentice Hall
Mendes E, Lokan C (2008) Replicating studies on cross- vs single-company effort models using the ISBSG database. Empir Software Eng 13(1):3–37
Mendes M, Pala A (2003) Type I error rate and power of three normality tests. Pakistan J Inf Technol 2(2):135–139
Menzies T, Jalali O, Hihn J, Baker D, Lum K (2010) Stable rankings for different effort models. Autom Softw Eng 17(4):409–437
Menzies T, Butcher A, Marcus A, Zimmermann T, Cok DR (2011) Local vs. global models for effort estimation and defect prediction. In: Proc 26th IEEE/ACM international conference on automated software engineering (ASE ’11), pp 343–351
Mittas N, Angelis L (2008) Combining regression and estimation by analogy in a semi-parametric model for software cost estimation. In: Proc second ACM-IEEE international symposium on empirical software engineering and measurement (ESEM ’08), pp 70–79
Miyazaki Y, Takanou A, Nozaki H, Nakagawa N, Okada K (1991) Method to estimate parameter values in software prediction models. Inf Softw Technol 33(3):239–243
Miyazaki Y, Terakado M, Ozaki K, Nozaki H (1994) Robust regression for developing software estimation models. J Syst Softw 27(1):3–16
Myrtveit I, Stensrud E, Shepperd MJ (2005) Reliability and validity in comparative studies of software prediction models. IEEE Trans Softw Eng 31(5):380–391
Ott RL, Longnecker MT (2008) An introduction to statistical methods and data analysis, 6th edn. Duxbury Press
Pendharkar P, Subramanian G, Rodger J (2005) A probabilistic model for predicting software development effort. IEEE Trans Softw Eng 31(7):615–624
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20(1):53–65
Rousseeuw P, van Driessen K (2006) Computing LTS regression for large data sets. Data Min Knowl Discovery 12(1):29–45
Seo YS, Yoon KA, Bae DH (2008) An empirical analysis of software effort estimation with outlier elimination. In: Proc 4th international workshop on predictor models in software engineering (PROMISE ’08), pp 25–32
Seo YS, Yoon KA, Bae DH (2009) Improving the accuracy of software effort estimation based on multiple least square regression models by estimation error-based data partitioning. In: Proc 2009 16th Asia–Pacific software engineering conference (APSEC ’09), pp 3–10
Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples). Biometrika 52(3–4):591–611
Shepperd MJ, Kadoda G (2001) Comparing software prediction techniques using simulation. IEEE Trans Softw Eng 27(11):1014–1022
Shepperd MJ, Schofield C (1997) Estimating software project effort using analogies. IEEE Trans Softw Eng 23(11):736–743
Strike K, El Emam K, Madhavji N (2001) Software cost estimation with incomplete data. IEEE Trans Softw Eng 27(10):890–908
Van Hulse J, Khoshgoftaar TM (2008) A comprehensive empirical evaluation of missing value imputation in noisy software measurement data. J Syst Softw 81(5):691–708
Wen J, Li S, Tang L (2009) Improve analogy-based software effort estimation using principal components analysis and correlation weighting. In: Proc 2009 16th Asia–Pacific software engineering conference (APSEC ’09), pp 179–186
Metadata
Title
On the value of outlier elimination on software effort estimation research
Authors
Yeong-Seok Seo
Doo-Hwan Bae
Publication date
01.08.2013
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 4/2013
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-012-9207-y