Published in: Empirical Software Engineering 3/2006

1 September 2006

Benchmarking k-nearest neighbour imputation with homogeneous Likert data

Authors: Per Jönsson, Claes Wohlin

Abstract

Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we have evaluated the hot-deck k-Nearest Neighbour (k-NN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by benchmarking the method against four other imputation methods: Random Draw Substitution, Random Imputation, Median Imputation and Mode Imputation. By simulating both non-response and imputation, we obtain comparable performance measures for all methods. We discuss the performance of k-NN in the light of the other methods, but also for different values of k, different proportions of missing data, different neighbour selection strategies and different numbers of data attributes. Our results show that the k-NN method performs well, even when much data are missing, but has strong competition from both Median Imputation and Mode Imputation for our particular data. However, unlike these methods, k-NN has better performance with more data attributes. We suggest that a suitable value of k is approximately the square root of the number of complete cases, and that letting certain incomplete cases qualify as neighbours boosts the imputation ability of the method.
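
The k-NN idea summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_impute`, the Euclidean distance over the answered attributes, the use of the lower median of the donors' answers as the replacement value, and the restriction of donors to complete cases are all assumptions made for illustration. Only the default of k as roughly the square root of the number of complete cases is taken from the abstract.

```python
import math
from statistics import median_low


def knn_impute(cases, k=None):
    """Return a copy of `cases` with None entries filled from nearest donors.

    `cases` is a list of equally long rows of Likert answers (ints),
    with None marking a missing answer. Hypothetical helper, not the
    paper's code.
    """
    # Assumption: only complete cases may act as donors (neighbours).
    complete = [row for row in cases if None not in row]
    if not complete:
        raise ValueError("need at least one complete case")
    if k is None:
        # Rule of thumb from the abstract: k ~ sqrt(number of complete cases).
        k = max(1, round(math.sqrt(len(complete))))
    k = min(k, len(complete))

    imputed = []
    for row in cases:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]

        def distance(donor):
            # Assumption: Euclidean distance over the answered attributes.
            return math.sqrt(sum((donor[i] - row[i]) ** 2 for i in observed))

        neighbours = sorted(complete, key=distance)[:k]
        filled = list(row)
        for i, v in enumerate(row):
            if v is None:
                # Assumption: the lower median keeps the value on the scale.
                filled[i] = median_low([d[i] for d in neighbours])
        imputed.append(filled)
    return imputed


survey = [
    [1, 1, 1],
    [1, 2, 1],
    [5, 5, 5],
    [5, 4, 5],
    [1, 1, None],   # incomplete case
    [5, None, 5],   # incomplete case
]
completed = knn_impute(survey)
```

With four complete cases the default is k = 2, so each incomplete case borrows from its two closest complete cases. Allowing some incomplete cases to act as donors, which the abstract reports as beneficial, would require relaxing the `complete` filter and is not shown here.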


Metadata
Title
Benchmarking k-nearest neighbour imputation with homogeneous Likert data
Authors
Per Jönsson
Claes Wohlin
Publication date
1 September 2006
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 3/2006
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-006-9001-9
