Benchmarking k-nearest neighbour imputation with homogeneous Likert data

Authors: Per Jönsson, Claes Wohlin

Published in: Empirical Software Engineering | Issue 3/2006 | 1 September 2006

Abstract

Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we have evaluated the hot-deck k-Nearest Neighbour (k-NN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by benchmarking the method against four other imputation methods: Random Draw Substitution, Random Imputation, Median Imputation and Mode Imputation. By simulating both non-response and imputation, we obtain comparable performance measures for all methods. We discuss the performance of k-NN in the light of the other methods, but also for different values of k, different proportions of missing data, different neighbour selection strategies and different numbers of data attributes. Our results show that the k-NN method performs well, even when much data are missing, but has strong competition from both Median Imputation and Mode Imputation for our particular data. However, unlike these methods, k-NN has better performance with more data attributes. We suggest that a suitable value of k is approximately the square root of the number of complete cases, and that letting certain incomplete cases qualify as neighbours boosts the imputation ability of the method.
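
To make the idea of hot-deck k-NN imputation concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the distance measure or how the neighbours' answers are combined, so the sketch assumes Euclidean distance over the items a respondent did answer and takes the lower median of the k nearest complete cases as the replacement value. The default for k follows the abstract's suggestion of roughly the square root of the number of complete cases. The names `knn_impute`, `distance`, and `MISSING` are illustrative only.

```python
import math
from statistics import median_low

MISSING = None  # assumed marker for an unanswered Likert item


def distance(a, b, attrs):
    """Euclidean distance between two cases over the attribute indices in attrs."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in attrs))


def knn_impute(cases, k=None):
    """Return a copy of cases with missing answers filled from nearby complete cases.

    cases: list of equal-length lists of ints, with MISSING marking gaps.
    k: number of neighbours; defaults to round(sqrt(number of complete cases)).
    """
    # Only complete cases act as donors in this sketch.
    complete = [c for c in cases if MISSING not in c]
    if not complete:
        raise ValueError("need at least one complete case to act as donor")
    if k is None:
        k = max(1, round(math.sqrt(len(complete))))

    result = []
    for case in cases:
        observed = [i for i, v in enumerate(case) if v is not MISSING]
        if len(observed) == len(case):
            result.append(list(case))  # nothing to impute
            continue
        # Rank donors by distance on the items this respondent answered.
        donors = sorted(complete, key=lambda d: distance(case, d, observed))[:k]
        filled = list(case)
        for i, v in enumerate(case):
            if v is MISSING:
                # median_low returns a value that occurs among the donors,
                # so the imputed answer stays on the Likert scale.
                filled[i] = median_low([d[i] for d in donors])
        result.append(filled)
    return result


survey = [
    [1, 2, 1, 2],
    [2, 1, 2, 1],
    [5, 4, 5, 4],
    [4, 5, 4, 5],
    [5, 5, MISSING, 4],  # one unanswered item
]
imputed = knn_impute(survey)
```

With four complete cases the default is k = 2, so the missing item in the last row is filled from the two respondents whose other answers are closest. Allowing certain incomplete cases to serve as donors, as the abstract recommends, would require relaxing the `complete` filter; that variant is not shown here.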

Title: Benchmarking k-nearest neighbour imputation with homogeneous Likert data
Authors: Per Jönsson, Claes Wohlin
Publication date: 1 September 2006
Publisher: Springer US
Published in: Empirical Software Engineering, Issue 3/2006
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI: https://doi.org/10.1007/s10664-006-9001-9
