Knowledge and Information Systems

01-02-2016 | Regular Paper

Missing value imputation using a fuzzy clustering-based EM approach

Authors: Md. Geaur Rahman, Md Zahidul Islam

Published in: Knowledge and Information Systems | Issue 2/2016

Abstract

Data preprocessing and cleansing play a vital role in data mining by ensuring good data quality. Data-cleansing tasks include imputing missing values, identifying outliers, and identifying and correcting noisy data. In this paper, we present a novel technique called A Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation Framework for Data Pre-processing (FEMI). It imputes numerical and categorical missing values by making an educated guess based on records that are similar to the record with the missing value. To identify a group of similar records and make a guess based on that group, it applies a fuzzy clustering approach and our novel fuzzy expectation maximization algorithm. We evaluate FEMI on eight publicly available natural data sets by comparing its performance with that of five high-quality existing techniques, namely EMI, GkNN, FKMI, SVR, and IBLLS. We use thirty-two types (patterns) of missing values for each data set. We use two evaluation criteria: root mean squared error and mean absolute error. Our experimental results indicate (according to a confidence interval and \(t\)-test analysis) that FEMI performs significantly better than EMI, GkNN, FKMI, SVR, and IBLLS.
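The general idea of guessing a missing value from similar records, and then scoring the guess with the two error criteria, can be illustrated with a small sketch. This is not the FEMI algorithm, whose details the abstract does not give. It assumes cluster centers are already available, uses the standard fuzzy c-means membership formula, and fills a numerical gap with the membership-weighted mean of the centers. The function names (`fuzzy_memberships`, `impute`, `rmse`, `mae`) and the fuzzifier `m` are illustrative assumptions.

```python
import math


def fuzzy_memberships(record, centers, observed, m=2.0):
    """Fuzzy c-means style memberships of one record to each center.

    Distances use only the observed attribute indices (assumption:
    numerical attributes, Euclidean distance, fuzzifier m > 1).
    """
    dists = [
        math.sqrt(sum((record[j] - c[j]) ** 2 for j in observed))
        for c in centers
    ]
    # A record lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def impute(record, centers, missing, m=2.0):
    """Fill missing numerical attributes with the membership-weighted
    mean of the cluster centers (a simplification, not FEMI itself)."""
    observed = [j for j in range(len(record)) if j not in missing]
    u = fuzzy_memberships(record, centers, observed, m)
    filled = list(record)
    for j in missing:
        filled[j] = sum(u_k * c[j] for u_k, c in zip(u, centers))
    return filled


def rmse(actual, imputed):
    """Root mean squared error between true and imputed values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, imputed)) / n)


def mae(actual, imputed):
    """Mean absolute error between true and imputed values."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, imputed)) / n


if __name__ == "__main__":
    centers = [[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]]  # hypothetical centers
    record = [1.0, 1.0, None]                        # third attribute missing
    filled = impute(record, centers, missing={2})
    print(filled)
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

In this sketch a record close to one center takes nearly all of its imputed value from that center, while a record between centers receives a blend. Both error measures are computed only over the values that were artificially removed, and lower is better.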

Title: Missing value imputation using a fuzzy clustering-based EM approach
Authors: Md. Geaur Rahman, Md Zahidul Islam
Publication date: 01-02-2016
Publisher: Springer London
Published in: Knowledge and Information Systems / Issue 2/2016
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-015-0822-y
