
Published in: The Journal of Supercomputing

1 August 2016

Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud

Authors: Fanyu Bu, Zhikui Chen, Qingchen Zhang, Laurence T. Yang

Published in: The Journal of Supercomputing | Issue 8/2016


Abstract

Incomplete data imputation plays an important role in big data analysis and smart computing. Existing algorithms are neither efficient nor effective at imputing incomplete high-dimensional data. This paper proposes an incomplete high-dimensional data imputation algorithm based on feature selection and cluster analysis (IHDIFC), which works in three steps. First, a hierarchical clustering-based feature subset selection algorithm is designed to reduce the dimensionality of the data set. Second, a parallel \(k\)-means algorithm based on partial distance is derived to cluster the selected data subset efficiently. Finally, the data objects in the same cluster as the target object are used to estimate its missing feature values. Extensive experiments compare IHDIFC with two representative missing data imputation algorithms, FIMUS and DMI. The results demonstrate that the proposed algorithm achieves better imputation accuracy and takes significantly less time than the other algorithms when imputing high-dimensional data.
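The second and third steps of the abstract can be illustrated with a small sketch. It is not the authors' implementation: it runs serially rather than in parallel, it omits the feature selection step, and every function name is hypothetical. It assumes the commonly used partial distance (squared differences over the observed features only, rescaled by the ratio of total to observed features) and assumes that a missing value is estimated by the mean of the observed values of that feature within the object's cluster. The abstract does not specify either choice.

```python
import math

def partial_distance(x, center):
    """Squared distance over the observed features of x (None = missing),
    rescaled by total/observed feature count. Assumed formula."""
    observed = [j for j, v in enumerate(x) if v is not None]
    if not observed:
        return math.inf
    total = sum((x[j] - center[j]) ** 2 for j in observed)
    return total * len(x) / len(observed)

def assign(data, centers):
    """Index of the nearest center for every object, by partial distance."""
    return [
        min(range(len(centers)), key=lambda k: partial_distance(x, centers[k]))
        for x in data
    ]

def update_centers(data, labels, centers):
    """Per-feature mean of the observed values in each cluster.
    A feature with no observed value in a cluster keeps its old center value."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [x for x, lab in zip(data, labels) if lab == k]
        center = []
        for j in range(len(old)):
            vals = [x[j] for x in members if x[j] is not None]
            center.append(sum(vals) / len(vals) if vals else old[j])
        new_centers.append(center)
    return new_centers

def kmeans_partial(data, centers, iterations=10):
    """Plain (serial) k-means loop using the partial distance."""
    for _ in range(iterations):
        labels = assign(data, centers)
        centers = update_centers(data, labels, centers)
    return assign(data, centers), centers

def impute(data, labels, centers):
    """Fill each missing value with the cluster mean of that feature
    (an illustrative estimator, assumed here)."""
    return [
        [centers[lab][j] if v is None else v for j, v in enumerate(x)]
        for x, lab in zip(data, labels)
    ]

# Toy data: two well-separated groups, each with one incomplete object.
data = [[1.0, 2.0], [1.2, None], [8.0, 9.0], [None, 9.4]]
labels, centers = kmeans_partial(data, [[1.0, 2.0], [8.0, 9.0]])
filled = impute(data, labels, centers)
```

On this toy data the two incomplete objects join the cluster of their nearest complete neighbour, so `filled[1]` becomes `[1.2, 2.0]` and `filled[3]` becomes `[8.0, 9.4]`. Because the distance uses only observed features, incomplete objects can take part in clustering without being filled in beforehand.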



Title: Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud
Authors: Fanyu Bu, Zhikui Chen, Qingchen Zhang, Laurence T. Yang
Publication date: 1 August 2016
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 8/2016
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-015-1433-9
