Published in: The Journal of Supercomputing 8/2016

1 August 2016

Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud

Authors: Fanyu Bu, Zhikui Chen, Qingchen Zhang, Laurence T. Yang


Abstract

Incomplete data imputation plays an important role in big data analysis and smart computing. Existing algorithms are neither efficient nor effective at imputing incomplete high-dimensional data. This paper proposes an incomplete high-dimensional data imputation algorithm based on feature selection and cluster analysis (IHDIFC), which works in three steps. First, a hierarchical clustering-based feature subset selection algorithm reduces the dimensionality of the data set. Second, a parallel \(k\)-means algorithm based on partial distance clusters the selected data subset efficiently. Finally, the data objects in the same cluster as the target object are used to estimate its missing feature values. Extensive experiments compare IHDIFC with two representative missing data imputation algorithms, FIMUS and DMI. The results demonstrate that the proposed algorithm achieves better imputation accuracy and takes significantly less time than these algorithms when imputing high-dimensional data.
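
The second and third steps can be illustrated with a minimal, sequential sketch. It is not the authors' implementation: the abstract gives no formulas, so the code assumes the commonly used partial distance (squared differences over observed features only, rescaled by the ratio of total to observed features) and assumes that a missing value is filled with the mean of the observed values of that feature within the object's cluster. The feature selection step and the parallelization are omitted, and all function names are hypothetical.

```python
import math

def partial_distance(x, center):
    """Squared partial distance between a record and a cluster center.

    Only observed (non-None) features of x contribute; the sum is rescaled
    by total/observed so records with different gaps stay comparable.
    """
    observed = [(a, c) for a, c in zip(x, center) if a is not None]
    if not observed:
        return math.inf  # nothing observed: no basis for comparison
    scale = len(x) / len(observed)
    return scale * sum((a - c) ** 2 for a, c in observed)

def assign(data, centers):
    """Index of the nearest center (by partial distance) for each record."""
    return [
        min(range(len(centers)), key=lambda k: partial_distance(x, centers[k]))
        for x in data
    ]

def update_centers(data, labels, centers):
    """Recompute each center as the per-feature mean of observed values."""
    updated = []
    for k, old in enumerate(centers):
        members = [x for x, label in zip(data, labels) if label == k]
        center = []
        for j, old_value in enumerate(old):
            values = [x[j] for x in members if x[j] is not None]
            # Keep the previous coordinate if no member observes feature j.
            center.append(sum(values) / len(values) if values else old_value)
        updated.append(center)
    return updated

def kmeans_partial(data, centers, iterations=10):
    """Plain k-means loop using partial distance (fixed iteration count)."""
    labels = assign(data, centers)
    for _ in range(iterations):
        labels = assign(data, centers)
        centers = update_centers(data, labels, centers)
    return labels, centers

def impute(data, labels, centers):
    """Assumed fill rule: replace each gap with its cluster's feature mean."""
    return [
        [centers[label][j] if value is None else value
         for j, value in enumerate(x)]
        for x, label in zip(data, labels)
    ]

# Toy data: None marks a missing feature value.
data = [[1.0, 2.0], [1.2, None], [8.0, 9.0], [None, 9.4]]
labels, centers = kmeans_partial(data, centers=[[1.0, 2.0], [8.0, 9.0]])
filled = impute(data, labels, centers)
```

In this toy run the two incomplete records join the cluster of their nearest complete neighbour, and each gap is filled from that cluster alone, which is the idea behind estimating missing values from objects in the same cluster as the target.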


Metadata
Title
Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud
Authors
Fanyu Bu
Zhikui Chen
Qingchen Zhang
Laurence T. Yang
Publication date
1 August 2016
Publisher
Springer US
Published in
The Journal of Supercomputing, Issue 8/2016
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-015-1433-9
