
Soft Computing

11 April 2019 | Methodologies and Application

A nifty collaborative analysis to predicting a novel tool (DRFLLS) for missing values estimation

Authors: Samaher Al-Janabi, Ayad F. Alkaim

Published in: Soft Computing | Issue 1/2020


Abstract

One of the important trends in intelligent data analysis is the growing importance of data processing. Data processing, however, faces problems similar to those of data mining (e.g., high-dimensional data, missing value imputation and data integration). One challenge for missing value estimation methods is how to select the optimal number of nearest neighbors of the missing values. This paper investigates the feasibility of building a novel tool, called developed random forest and local least squares (DRFLLS), to estimate missing values in various datasets. In developing the random forest algorithm, seven similarity measures were defined: the Pearson similarity coefficient, simple similarity, and five fuzzy similarity measures (M1, M2, M3, M4 and M5). These are sufficient to estimate the optimal number of neighbors of the missing values in this application. Local least squares (LLS) is then used to estimate the missing values. Imputation accuracy is measured in two ways: Pearson correlation (PC) and NRMSE. The optimal number of neighbors is the one associated with the highest PC and the smallest NRMSE. Experiments were carried out on six datasets from different disciplines. The results show that, for datasets with a low rate of missing values, the number of nearest neighbors is best estimated by DRFPC, followed by DRFFSM1 when r = 4, whereas for datasets with a high rate of missing values it is best estimated by DRFFSM5, followed by DRFFSM3. The missing values were then estimated by LLS, and the accuracy of the results was measured by NRMSE and Pearson correlation. For a given dataset, the smallest NRMSE identifies the DRF correlation function best suited to that dataset; likewise, the highest PC identifies the best-suited DRF correlation function.
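
The abstract's last two stages (LLS estimation, then scoring by NRMSE and PC) can be illustrated with a minimal Python sketch. It is not the authors' DRFLLS implementation. The function names `lls_impute_row`, `nrmse` and `pearson` are hypothetical, neighbors are ranked here by absolute Pearson correlation rather than by the paper's random forest similarity measures, and NRMSE is assumed to be the RMSE divided by the standard deviation of the true values, a definition the abstract does not state.

```python
import numpy as np

def nrmse(true_vals, est_vals):
    """RMSE divided by the std of the true values (assumed definition)."""
    true_vals = np.asarray(true_vals, dtype=float)
    est_vals = np.asarray(est_vals, dtype=float)
    rmse = np.sqrt(np.mean((est_vals - true_vals) ** 2))
    return rmse / np.std(true_vals)

def pearson(true_vals, est_vals):
    """Pearson correlation between true and estimated values."""
    return np.corrcoef(true_vals, est_vals)[0, 1]

def lls_impute_row(target, complete_rows, k):
    """Fill NaN entries of `target` using its k most similar complete rows.

    Similarity is absolute Pearson correlation on the observed columns
    (an illustrative choice, not the paper's DRF similarity measures).
    """
    target = np.asarray(target, dtype=float)
    complete_rows = np.asarray(complete_rows, dtype=float)
    missing = np.isnan(target)
    observed = ~missing

    # Rank candidate rows by similarity on the columns the target has.
    sims = np.array([
        abs(np.corrcoef(target[observed], row[observed])[0, 1])
        for row in complete_rows
    ])
    sims = np.nan_to_num(sims)  # constant rows give NaN; treat as dissimilar
    nearest = np.argsort(sims)[::-1][:k]

    a = complete_rows[nearest][:, observed]  # k x n_observed
    b = complete_rows[nearest][:, missing]   # k x n_missing

    # Least squares: express the target's observed part as a linear
    # combination of the neighbors' observed parts.
    coeffs, *_ = np.linalg.lstsq(a.T, target[observed], rcond=None)

    # Apply the same combination to the neighbors' values in the
    # columns where the target is missing.
    filled = target.copy()
    filled[missing] = b.T @ coeffs
    return filled

rows = [[1, 2, 3, 4, 5], [2, 1, 4, 3, 6], [5, 3, 1, 2, 0]]
target = [2, 4, 6, 8, np.nan]  # twice the first row, last entry masked
filled = lls_impute_row(target, rows, k=1)
```

In this toy case the masked entry is recovered as approximately 10, because the target is exactly twice the first row. In a real evaluation one would mask known entries, impute them for several values of `k`, and prefer the `k` that gives the highest `pearson` and the lowest `nrmse`.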



Title: A nifty collaborative analysis to predicting a novel tool (DRFLLS) for missing values estimation
Authors: Samaher Al-Janabi, Ayad F. Alkaim
Publication date: 11 April 2019
Publisher: Springer Berlin Heidelberg
Published in: Soft Computing / Issue 1/2020
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI: https://doi.org/10.1007/s00500-019-03972-x
