Skip to main content
Erschienen in: Software Quality Journal 4/2008

01.12.2008

Imputation techniques for multivariate missingness in software measurement data

verfasst von: Taghi M. Khoshgoftaar, Jason Van Hulse

Erschienen in: Software Quality Journal | Ausgabe 4/2008

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The problem of missing values in software measurement data used in empirical analysis has led to the proposal of numerous potential solutions. Imputation procedures, for example, have been proposed to ‘fill-in’ the missing values with plausible alternatives. We present a comprehensive study of imputation techniques using real-world software measurement datasets. Two different datasets with dramatically different properties were utilized in this study, with the injection of missing values according to three different missingness mechanisms (MCAR, MAR, and NI). We consider the occurrence of missing values in multiple attributes, and compare three procedures, Bayesian multiple imputation, k Nearest Neighbor imputation, and Mean imputation. We also examine the relationship between noise in the dataset and the performance of the imputation techniques, which has not been addressed previously. Our comprehensive experiments demonstrate conclusively that Bayesian multiple imputation is an extremely effective imputation technique.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
The reasons for not using 40 and 50% missingness for MAR and NI missingness, which is related to constraints in the dataset size, are discussed in Sect. 3.4.
 
2
There are some works (Song et al. 2005), however, that use a similar evaluation methodology to the one we present.
 
Literatur
Zurück zum Zitat Allison, P. D. (2000). Missing Data 07-136. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA. Allison, P. D. (2000). Missing Data 07-136. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA.
Zurück zum Zitat Bremaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Springer. Bremaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Springer.
Zurück zum Zitat Cartwright, M. H., Shepperd, M. J., & Song, Q. (2003). Dealing with missing software project data. 9th IEEE Intl. Software Metrics Symposium, pp. 154–165. Cartwright, M. H., Shepperd, M. J., & Song, Q. (2003). Dealing with missing software project data. 9th IEEE Intl. Software Metrics Symposium, pp. 154–165.
Zurück zum Zitat Conover, W. J. (1971). Practical nonparametric statistics, 2nd edn. Wiley. Conover, W. J. (1971). Practical nonparametric statistics, 2nd edn. Wiley.
Zurück zum Zitat Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.MATHMathSciNet Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.MATHMathSciNet
Zurück zum Zitat Emam, K. E., & Birk, A. (2000). Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Transactions on Software Engineering, 26(6), 541–566.CrossRef Emam, K. E., & Birk, A. (2000). Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Transactions on Software Engineering, 26(6), 541–566.CrossRef
Zurück zum Zitat Fenton, N. E., & Pfleeger, S. L. (1997). Software metrics: A rigorous and practical approach, 2nd edn. ITP, Boston, MA: PWS Publishing Company. Fenton, N. E., & Pfleeger, S. L. (1997). Software metrics: A rigorous and practical approach, 2nd edn. ITP, Boston, MA: PWS Publishing Company.
Zurück zum Zitat Jönsson, P., & Wohlin, C. (2004). An evaluation of k-nearest neighbour imputation using likert data. 10th IEEE Intl. Symposium on Software Metrics (METRICS’04), pp. 108–118. Jönsson, P., & Wohlin, C. (2004). An evaluation of k-nearest neighbour imputation using likert data. 10th IEEE Intl. Symposium on Software Metrics (METRICS’04), pp. 108–118.
Zurück zum Zitat Khoshgoftaar, T. M., & Seliya, N. (2004). Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering Journal, 9(2), 229–257.CrossRef Khoshgoftaar, T. M., & Seliya, N. (2004). Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering Journal, 9(2), 229–257.CrossRef
Zurück zum Zitat Khoshgoftaar, T. M., & Van Hulse, J. (2005a). Identifying noisy features with the pairwise attribute noise detection algorithm. Intelligent Data Analysis: An International Journal, 9(6), 589–602. Khoshgoftaar, T. M., & Van Hulse, J. (2005a). Identifying noisy features with the pairwise attribute noise detection algorithm. Intelligent Data Analysis: An International Journal, 9(6), 589–602.
Zurück zum Zitat Khoshgoftaar, T. M., & Van Hulse, J. (2005b, August). Empirical case studies in attribute noise detection. In Proceedings of the IEEE International Conference Information Reuse and Integration (pp. 211–216). Las Vegas, NV. Khoshgoftaar, T. M., & Van Hulse, J. (2005b, August). Empirical case studies in attribute noise detection. In Proceedings of the IEEE International Conference Information Reuse and Integration (pp. 211–216). Las Vegas, NV.
Zurück zum Zitat Khoshgoftaar, T. M., & Van Hulse, J. (2006, July). Multiple imputation of software measurement data: A case study. In International Conference on Software Engineering and Knowledge Engineering (SEKE’2006), pp. 220–226. Khoshgoftaar, T. M., & Van Hulse, J. (2006, July). Multiple imputation of software measurement data: A case study. In International Conference on Software Engineering and Knowledge Engineering (SEKE’2006), pp. 220–226.
Zurück zum Zitat Khoshgoftaar, T. M., Zhong, S., & Joshi, V. (2005). Enhancing software quality estimation using ensemble-classifier based noise filtering. Intelligent Data Analysis: An International Journal, 9(1), 3–27. Khoshgoftaar, T. M., Zhong, S., & Joshi, V. (2005). Enhancing software quality estimation using ensemble-classifier based noise filtering. Intelligent Data Analysis: An International Journal, 9(1), 3–27.
Zurück zum Zitat Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data, 2nd edn. Hoboken, NJ: Wiley.MATH Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data, 2nd edn. Hoboken, NJ: Wiley.MATH
Zurück zum Zitat Myrtveit, I., Stensrud, E., & Olsson, U. (2001). Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods. IEEE Transactions on Software Engineering, 27(11), 999–1013.CrossRef Myrtveit, I., Stensrud, E., & Olsson, U. (2001). Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods. IEEE Transactions on Software Engineering, 27(11), 999–1013.CrossRef
Zurück zum Zitat Rahm, E., & Do, H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering 23(4), 3–13. Rahm, E., & Do, H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering 23(4), 3–13.
Zurück zum Zitat Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley. Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley.
Zurück zum Zitat SAS Institute. (2004). SAS/STAT user’s guide. SAS Institute Inc. SAS Institute. (2004). SAS/STAT user’s guide. SAS Institute Inc.
Zurück zum Zitat Schafer, J. L. (2000). Analysis of incomplete multivariate data. Chapman and Hall/CRC. Schafer, J. L. (2000). Analysis of incomplete multivariate data. Chapman and Hall/CRC.
Zurück zum Zitat Song, Q., Shepperd, M. J., & Cartwright, M. H. (2005). A short note on safest default missingness mechanism assumptions. Empirical Software Engineering, 10(2), 235–243.CrossRef Song, Q., Shepperd, M. J., & Cartwright, M. H. (2005). A short note on safest default missingness mechanism assumptions. Empirical Software Engineering, 10(2), 235–243.CrossRef
Zurück zum Zitat Strike, K., Emam, K. E., & Madhavji, N. (2001). Software cost estimation with incomplete data. IEEE Transactions on Software Engineering, 27(10), 890–908.CrossRef Strike, K., Emam, K. E., & Madhavji, N. (2001). Software cost estimation with incomplete data. IEEE Transactions on Software Engineering, 27(10), 890–908.CrossRef
Zurück zum Zitat Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Society, 82, 528–550.MATHMathSciNet Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Society, 82, 528–550.MATHMathSciNet
Zurück zum Zitat Twala, B., & Cartwright, M. H. (2005). Ensemble imputation methods for missing software engineering data. In Proceedings of 11th IEEE Intl. Software Metrics Symposium, pp. 30–40. Twala, B., & Cartwright, M. H. (2005). Ensemble imputation methods for missing software engineering data. In Proceedings of 11th IEEE Intl. Software Metrics Symposium, pp. 30–40.
Zurück zum Zitat Wohlin, C., Runeson, P., Host, M., Ohlsson, M. C., Regnell, B., & Wesslen, A. (2000). Experimentation in software engineering: An introduction. Boston, MA: Kluwer Academic Publishers.MATH Wohlin, C., Runeson, P., Host, M., Ohlsson, M. C., Regnell, B., & Wesslen, A. (2000). Experimentation in software engineering: An introduction. Boston, MA: Kluwer Academic Publishers.MATH
Zurück zum Zitat Yuan, Y. C. (2000). Multiple imputation for missing data: Concepts and new development. In Proceedings of the 25th Annual SAS Users Group International Conference, SAS Institute Paper No 267. Yuan, Y. C. (2000). Multiple imputation for missing data: Concepts and new development. In Proceedings of the 25th Annual SAS Users Group International Conference, SAS Institute Paper No 267.
Zurück zum Zitat Zhong, S., Khoshgoftaar, T. M., & Seliya, N. (2004, March). Analyzing software measurement data with clustering techniques. IEEE Intelligent Systems, pp. 22–29. Zhong, S., Khoshgoftaar, T. M., & Seliya, N. (2004, March). Analyzing software measurement data with clustering techniques. IEEE Intelligent Systems, pp. 22–29.
Zurück zum Zitat Zhu, X., & Wu, X. (2004). Class noise vs attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22(3–4), 177–210.MATHCrossRef Zhu, X., & Wu, X. (2004). Class noise vs attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22(3–4), 177–210.MATHCrossRef
Metadaten
Titel
Imputation techniques for multivariate missingness in software measurement data
verfasst von
Taghi M. Khoshgoftaar
Jason Van Hulse
Publikationsdatum
01.12.2008
Verlag
Springer US
Erschienen in
Software Quality Journal / Ausgabe 4/2008
Print ISSN: 0963-9314
Elektronische ISSN: 1573-1367
DOI
https://doi.org/10.1007/s11219-008-9054-7

Weitere Artikel der Ausgabe 4/2008

Software Quality Journal 4/2008 Zur Ausgabe

In this issue

In this issue