2015 | OriginalPaper | Chapter

A Novel Algorithm for the Integration of the Imputation of Missing Values and Clustering

Authors: Roni Ben Ishay, Maya Herman

Published in: Machine Learning and Data Mining in Pattern Recognition

Publisher: Springer International Publishing

Abstract

In this article we present a new and efficient algorithm for handling missing values in databases used for data mining (DM). Missing values may distort the calculations of a clustering algorithm and lead to misleading results, so they must be treated before the DM step. Methods for handling missing values are commonly implemented as a process separate from the DM. This may cause long runtimes and redundant I/O accesses, making the entire DM process inefficient. We present a new algorithm (km-Impute) which integrates clustering and the imputation of missing values in a unified process. The algorithm was tested on real red wine quality measurements (from the UCI Machine Learning Repository). km-Impute succeeded both in imputing missing values and in building clusters within this single process. The structure and quality of the clusters produced by km-Impute were similar to those produced by k-means. In addition, the clusters were analyzed by a wine expert, who found that they represented different types of red wine quality. The success and accuracy of the imputation were validated on two further datasets from the UCI repository: White wine and Page blocks. The results were consistent with the tests on Red wine: the imputation success ratio was similar in all three datasets. Although the complexity of km-Impute is the same as that of k-means, in practice it was more efficient when applied to medium-sized databases: its runtime was significantly shorter than that of k-means and it required fewer iterations to converge. km-Impute also performed far fewer I/O accesses than k-means.
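The abstract does not list the steps of km-Impute, so the following Python sketch should not be read as the authors' algorithm. It only illustrates the general idea of combining clustering and imputation in one loop, using a generic variant of k-means chosen for this example: distances are computed over the observed attributes only, and a missing attribute is filled with the value of the assigned centroid. The names `cluster_and_impute`, `partial_sq_dist` and the `init` parameter are hypothetical.

```python
import random


def partial_sq_dist(row, centroid):
    """Squared Euclidean distance over the attributes observed in `row`."""
    return sum((x - c) ** 2 for x, c in zip(row, centroid) if x is not None)


def fill(rows, labels, centroids):
    """Replace each missing value (None) with the assigned centroid's value."""
    return [
        [centroids[lab][i] if x is None else x for i, x in enumerate(row)]
        for row, lab in zip(rows, labels)
    ]


def cluster_and_impute(rows, k, init=None, max_iter=100, seed=0):
    """Illustrative sketch only (an assumption, not the paper's km-Impute).

    `rows` is a list of equal-length numeric lists with None for missing
    values; `max_iter` is assumed to be at least 1.
    """
    if init is None:
        # Seed the centroids from rows that have no missing values.
        complete = [r for r in rows if None not in r]
        if len(complete) < k:
            raise ValueError("need at least k complete rows to seed centroids")
        init = random.Random(seed).sample(complete, k)
    centroids = [list(c) for c in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid on observed attributes only.
        new_labels = [
            min(range(k), key=lambda j: partial_sq_dist(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
        # Imputation step, done inside the clustering loop.
        filled = fill(rows, labels, centroids)
        # Update step: recompute each centroid from its filled members.
        for j in range(k):
            members = [f for f, lab in zip(filled, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids, fill(rows, labels, centroids)
```

A small usage example with two well-separated groups and two missing values:

```python
rows = [[1.0, 1.0], [1.2, 0.8], [0.9, None],
        [8.0, 8.0], [8.2, 7.9], [None, 8.1]]
labels, centroids, filled = cluster_and_impute(
    rows, k=2, init=[[1.0, 1.0], [8.0, 8.0]]
)
```

Because the missing values are filled during the clustering passes, no separate pre-processing pass over the data is needed, which is the kind of saving in runtime and I/O that the abstract attributes to a unified process.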


Metadata
Title: A Novel Algorithm for the Integration of the Imputation of Missing Values and Clustering
Authors: Roni Ben Ishay, Maya Herman
Copyright Year: 2015
DOI: https://doi.org/10.1007/978-3-319-21024-7_8
