Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud

The Journal of Supercomputing

Abstract

Incomplete data imputation plays an important role in big data analysis and smart computing. Existing algorithms are neither efficient nor accurate when imputing incomplete high-dimensional data. This paper proposes an incomplete high-dimensional data imputation algorithm based on feature selection and cluster analysis (IHDIFC), which works in three steps. First, a hierarchical clustering-based feature subset selection algorithm reduces the dimensionality of the data set. Second, a parallel \(k\)-means algorithm based on partial distance clusters the selected data subset efficiently. Finally, the data objects in the same cluster as the target object are used to estimate its missing feature values. Extensive experiments compare IHDIFC with two representative missing data imputation algorithms, FIMUS and DMI. The results demonstrate that the proposed algorithm achieves better imputation accuracy and takes significantly less time than the compared algorithms when imputing high-dimensional data.
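The second and third steps can be illustrated with a small serial sketch. It is not the authors' parallel cloud implementation, and the abstract does not give the exact formulas, so several choices here are assumptions: missing values are encoded as NaN, the partial distance is the commonly used form that sums squared differences over observed features and rescales by (total features / observed features), and a missing value is filled with the mean of the observed values for that feature within the object's cluster. The function names and the optional `init_centers` parameter are hypothetical, and every feature is assumed to have at least one observed value.

```python
import numpy as np

def partial_distance_sq(X, centers):
    """Squared partial distance from each row of X (NaN = missing) to each center."""
    observed = ~np.isnan(X)                            # (n, m) mask of known values
    X0 = np.where(observed, X, 0.0)                    # placeholder zeros, masked out below
    diff = (X0[:, None, :] - centers[None, :, :]) * observed[:, None, :]
    n_obs = np.maximum(observed.sum(axis=1), 1)        # avoid division by zero
    scale = X.shape[1] / n_obs                         # rescale for the skipped features
    return (diff ** 2).sum(axis=2) * scale[:, None]    # (n, k)

def kmeans_partial(X, k, n_iter=20, init_centers=None, seed=0):
    """Serial k-means on incomplete data using the partial distance (assumed form)."""
    observed = ~np.isnan(X)
    if init_centers is None:
        # Assumption: start from k random rows, filling their gaps with column means.
        rng = np.random.default_rng(seed)
        idx = rng.choice(X.shape[0], size=k, replace=False)
        centers = np.where(observed[idx], X[idx], np.nanmean(X, axis=0))
    else:
        centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        labels = partial_distance_sq(X, centers).argmin(axis=1)
        for c in range(k):
            members = labels == c
            if not members.any():
                continue                               # keep the old center for an empty cluster
            counts = observed[members].sum(axis=0)
            sums = np.where(observed[members], X[members], 0.0).sum(axis=0)
            # Each coordinate is the mean over members that observed that feature.
            centers[c] = np.where(counts > 0, sums / np.maximum(counts, 1), centers[c])
    labels = partial_distance_sq(X, centers).argmin(axis=1)
    return labels, centers

def impute_from_clusters(X, labels, centers):
    """Fill each missing value with the matching coordinate of the object's cluster center."""
    filled = X.copy()
    missing = np.isnan(X)
    filled[missing] = centers[labels][missing]
    return filled

# Two well-separated groups, each with one object missing its second feature.
X = np.array([[0.0, 0.0], [0.1, np.nan], [0.0, 0.2],
              [10.0, 10.0], [10.2, np.nan], [10.0, 9.8]])
labels, centers = kmeans_partial(X, k=2, init_centers=[[1.0, 1.0], [9.0, 9.0]])
filled = impute_from_clusters(X, labels, centers)
```

In this toy example the incomplete objects are assigned using only their observed feature, and each gap is filled from its own cluster rather than from the global mean. A parallel version would typically distribute the distance computation and the per-cluster sums across workers, since both are independent per data object.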

Author information

Corresponding author

Correspondence to Qingchen Zhang.

Additional information

This work was supported by Project U1301253 of NSFC and Project 201202032 of Liaoning Provincial Natural Science Foundation of China.

About this article

Cite this article

Bu, F., Chen, Z., Zhang, Q. et al. Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud. J Supercomput 72, 2977–2990 (2016). https://doi.org/10.1007/s11227-015-1433-9

  • DOI: https://doi.org/10.1007/s11227-015-1433-9