Published in: Knowledge and Information Systems 2/2016

1 February 2016 | Regular Paper

Missing value imputation using a fuzzy clustering-based EM approach

Authors: Md. Geaur Rahman, Md Zahidul Islam


Abstract

Data preprocessing and cleansing play a vital role in data mining by ensuring good data quality. Data-cleansing tasks include imputation of missing values, identification of outliers, and identification and correction of noisy data. In this paper, we present a novel technique called A Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation Framework for Data Pre-processing (FEMI). It imputes numerical and categorical missing values by making an educated guess based on records that are similar to the record with the missing value. To identify a group of similar records and to make a guess based on that group, it applies a fuzzy clustering approach and our novel fuzzy expectation maximization algorithm. We evaluate FEMI on eight publicly available natural data sets by comparing its performance with that of five high-quality existing techniques, namely EMI, GkNN, FKMI, SVR, and IBLLS. We use thirty-two types (patterns) of missing values for each data set and two evaluation criteria: root mean squared error (RMSE) and mean absolute error (MAE). Our experimental results indicate (according to a confidence interval and \(t\)-test analysis) that FEMI performs significantly better than EMI, GkNN, FKMI, SVR, and IBLLS.
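
To make the general idea concrete, the following is a minimal Python sketch of imputing a missing numerical value from fuzzy cluster memberships and of scoring the result with the two criteria named in the abstract. It is not the FEMI algorithm, whose details are not given on this page. It assumes that cluster centers are already known and uses the standard fuzzy c-means membership formula with fuzzifier `m`; the function names `fuzzy_memberships`, `impute`, `rmse`, and `mae` are hypothetical and chosen only for illustration.

```python
import math


def fuzzy_memberships(record, centers, observed, m=2.0):
    """Fuzzy c-means style memberships of one record in each cluster.

    Only the attribute indices listed in `observed` are used for distances.
    Assumption: `centers` were produced beforehand by some clustering step.
    """
    dists = [
        math.sqrt(sum((record[a] - c[a]) ** 2 for a in observed))
        for c in centers
    ]
    # A record that coincides with a center belongs fully to that cluster
    # (shared equally if it coincides with several centers).
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]


def impute(record, centers, m=2.0):
    """Replace each None in `record` with a membership-weighted center value."""
    observed = [a for a, v in enumerate(record) if v is not None]
    u = fuzzy_memberships(record, centers, observed, m)
    return [
        v if v is not None else sum(uj * c[a] for uj, c in zip(u, centers))
        for a, v in enumerate(record)
    ]


def rmse(actual, imputed):
    """Root mean squared error between true and imputed values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, imputed)) / n)


def mae(actual, imputed):
    """Mean absolute error between true and imputed values."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, imputed)) / n


# Illustrative data only: two cluster centers and one record whose
# second attribute is missing.
centers = [[0.0, 0.0], [4.0, 8.0]]
filled = impute([1.0, None], centers)
```

In this toy example the record lies three times closer to the first center than to the second on its observed attribute, so the first center dominates the imputed value. Lower RMSE and MAE indicate imputed values that are closer to the true ones.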


Metadata
Title
Missing value imputation using a fuzzy clustering-based EM approach
Authors
Md. Geaur Rahman
Md Zahidul Islam
Publication date
1 February 2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 2/2016
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-015-0822-y
