Published in: Software Quality Journal 4/2008

01-12-2008

Imputation techniques for multivariate missingness in software measurement data

Authors: Taghi M. Khoshgoftaar, Jason Van Hulse


Abstract

The problem of missing values in software measurement data used in empirical analysis has led to the proposal of numerous potential solutions. Imputation procedures, for example, have been proposed to "fill in" the missing values with plausible alternatives. We present a comprehensive study of imputation techniques using real-world software measurement datasets. Two datasets with dramatically different properties were used in this study, with missing values injected according to three different missingness mechanisms (MCAR, MAR, and NI). We consider the occurrence of missing values in multiple attributes and compare three procedures: Bayesian multiple imputation, k Nearest Neighbor imputation, and Mean imputation. We also examine the relationship between noise in the dataset and the performance of the imputation techniques, which has not been addressed previously. Our comprehensive experiments demonstrate conclusively that Bayesian multiple imputation is an extremely effective imputation technique.
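To make the compared procedures concrete, the following is a minimal sketch of MCAR injection, Mean imputation, and k Nearest Neighbor imputation. It is not the authors' implementation: the function names, the distance normalization over shared attributes, and the fallback to the column mean are all assumptions made for illustration, and Bayesian multiple imputation is omitted because the abstract does not describe its configuration.

```python
import math
import random


def inject_mcar(rows, rate, rng):
    """Copy `rows`, blanking each cell with probability `rate`.

    Every cell has the same chance of going missing regardless of its
    value, which is what "missing completely at random" (MCAR) means.
    """
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Replace each missing cell with the mean of its column's observed values."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: a column with no observed values falls back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=3):
    """Replace each missing cell with the mean of that attribute over the
    k nearest rows that have it observed.

    Assumption: distance is the root mean squared difference over the
    attributes observed in both rows.
    """
    fallback = mean_impute(rows)

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = []
    for i, row in enumerate(rows):
        filled = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where attribute j is observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            donors = [(d, v) for d, v in donors if d != math.inf][:k]
            if donors:
                filled[j] = sum(v for _, v in donors) / len(donors)
            else:
                # No comparable donor row: use the column mean instead.
                filled[j] = fallback[i][j]
        result.append(filled)
    return result


# Hypothetical toy data: the second attribute is twice the first.
incomplete = [[1.0, 2.0], [2.0, None], [3.0, 6.0], [10.0, 20.0]]
print(mean_impute(incomplete)[1][1])      # column mean of 2, 6 and 20
print(knn_impute(incomplete, k=2)[1][1])  # 4.0, the mean of the two nearest donors
```

On this toy data the outlying row pulls the column mean well above the true value, while the two nearest neighbors recover it. This only illustrates how the two procedures differ and says nothing about the results reported in the study.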


Footnotes
1. The reasons for not using 40 and 50% missingness for MAR and NI missingness, which relate to constraints on the dataset size, are discussed in Sect. 3.4.
2. Some works (Song et al. 2005), however, use an evaluation methodology similar to the one we present.
Literature
Allison, P. D. (2000). Missing Data 07-136. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA.
Bremaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Springer.
Cartwright, M. H., Shepperd, M. J., & Song, Q. (2003). Dealing with missing software project data. 9th IEEE Intl. Software Metrics Symposium, pp. 154–165.
Conover, W. J. (1971). Practical nonparametric statistics, 2nd edn. Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
Emam, K. E., & Birk, A. (2000). Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Transactions on Software Engineering, 26(6), 541–566.
Fenton, N. E., & Pfleeger, S. L. (1997). Software metrics: A rigorous and practical approach, 2nd edn. ITP, Boston, MA: PWS Publishing Company.
Jönsson, P., & Wohlin, C. (2004). An evaluation of k-nearest neighbour imputation using likert data. 10th IEEE Intl. Symposium on Software Metrics (METRICS’04), pp. 108–118.
Khoshgoftaar, T. M., & Seliya, N. (2004). Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering Journal, 9(2), 229–257.
Khoshgoftaar, T. M., & Van Hulse, J. (2005a). Identifying noisy features with the pairwise attribute noise detection algorithm. Intelligent Data Analysis: An International Journal, 9(6), 589–602.
Khoshgoftaar, T. M., & Van Hulse, J. (2005b, August). Empirical case studies in attribute noise detection. In Proceedings of the IEEE International Conference Information Reuse and Integration (pp. 211–216). Las Vegas, NV.
Khoshgoftaar, T. M., & Van Hulse, J. (2006, July). Multiple imputation of software measurement data: A case study. In International Conference on Software Engineering and Knowledge Engineering (SEKE’2006), pp. 220–226.
Khoshgoftaar, T. M., Zhong, S., & Joshi, V. (2005). Enhancing software quality estimation using ensemble-classifier based noise filtering. Intelligent Data Analysis: An International Journal, 9(1), 3–27.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data, 2nd edn. Hoboken, NJ: Wiley.
Myrtveit, I., Stensrud, E., & Olsson, U. (2001). Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods. IEEE Transactions on Software Engineering, 27(11), 999–1013.
Rahm, E., & Do, H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4), 3–13.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley.
SAS Institute. (2004). SAS/STAT user’s guide. SAS Institute Inc.
Schafer, J. L. (2000). Analysis of incomplete multivariate data. Chapman and Hall/CRC.
Song, Q., Shepperd, M. J., & Cartwright, M. H. (2005). A short note on safest default missingness mechanism assumptions. Empirical Software Engineering, 10(2), 235–243.
Strike, K., Emam, K. E., & Madhavji, N. (2001). Software cost estimation with incomplete data. IEEE Transactions on Software Engineering, 27(10), 890–908.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–550.
Twala, B., & Cartwright, M. H. (2005). Ensemble imputation methods for missing software engineering data. In Proceedings of 11th IEEE Intl. Software Metrics Symposium, pp. 30–40.
Wohlin, C., Runeson, P., Host, M., Ohlsson, M. C., Regnell, B., & Wesslen, A. (2000). Experimentation in software engineering: An introduction. Boston, MA: Kluwer Academic Publishers.
Yuan, Y. C. (2000). Multiple imputation for missing data: Concepts and new development. In Proceedings of the 25th Annual SAS Users Group International Conference, SAS Institute Paper No 267.
Zhong, S., Khoshgoftaar, T. M., & Seliya, N. (2004, March). Analyzing software measurement data with clustering techniques. IEEE Intelligent Systems, pp. 22–29.
Zhu, X., & Wu, X. (2004). Class noise vs attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22(3–4), 177–210.
Metadata
Title: Imputation techniques for multivariate missingness in software measurement data
Authors: Taghi M. Khoshgoftaar, Jason Van Hulse
Publication date: 01-12-2008
Publisher: Springer US
Published in: Software Quality Journal, Issue 4/2008
Print ISSN: 0963-9314
Electronic ISSN: 1573-1367
DOI: https://doi.org/10.1007/s11219-008-9054-7
