nach oben

Soft Computing

Erschienen in:

29.10.2019 | Methodologies and Application

Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms

verfasst von: C. K. Sarumathiy, K. Geetha, C. Rajan

Erschienen in: Soft Computing | Ausgabe 1/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Big Data has been a term used in datasets which are complex and large in such a way there are some traditional technologies of data processing which are not adequate. Big Data can revolutionize most aspects in society such as collection or management of data from Big Data which is challenging and also very complex. The Hadoop has been designed for processing a large amount of unstructured and complex data. It has provided with a large amount of storage for data along with the ability to be able to tackle unlimited and concurrent tasks or jobs. The selection of features is an extremely powerful technique in the reduction of dimensionality and is also the most important step in machine learning applications. In recent decades, data is getting larger in a progressive manner in terms of instances and numbers making it very hard to deal with the problem of feature selection. In order to cope with such an epoch of Big Data, there are some more new techniques that are required to address the problem in a more efficient manner. At the same time, the suitability of the algorithms currently used may not be applicable especially when the size of data is above hundreds of gigabytes. For the purpose of this work, the correlation-based feature selection along with mutual information-based methods of feature selection was used for improving the performance. The AdaBoost and the support vector machine based classifiers have been used for improving their accuracy. The results of the experiment prove that the method proposed was able to achieve better performance compared to that of the other methods.

Vorheriger Artikel Graph coloring: a novel heuristic based on trailing path—properties, perspective and applications in structured networks

Nächster Artikel A modified shuffled frog leaping algorithm for scientific workflow scheduling using clustering techniques

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Adhikari BK, Zuo WL, Maharjan R, Guo L (2018) Sensitive data detection using NN and KNN from big data. In: International conference on algorithms and architectures for parallel processing. Springer, Cham, pp 628–642CrossRef

Almasi M, Abadeh MS (2018) A new MapReduce associative classifier based on a new storage format for large-scale imbalanced data. Clust Comput 21(4):1821–1847CrossRef

Ayma VA, Ferreira RS, Happ P, Oliveira D, Feitosa R, Costa G, Gamba P (2015) Classification algorithms for big data analysis, a Map Reduce approach. Int Arch Photogramm Remote Sens Spat Inf Sci 40(3):17–21CrossRef

Bhardwaj P, Gupta A, Sharma M, Gupta M, Singhal S (2016) A survey on comparative analysis of big data tools. Int J Comput Sci Mob Comput 5(5):789–793

Doquire G, Verleysen M (2011) Feature selection with mutual information for uncertain data. In: International conference on data warehousing and knowledge discovery. Springer, Berlin, pp 330–341CrossRef

Hodge VJ, O’Keefe S, Austin J (2016) Hadoop neural network for parallel and distributed feature selection. Neural Netw 78:24–35CrossRef

Hossen J, Sayeed S (2018) Modifying cleaning method in big data analytics process using random forest classifier. In: 2018 7th international conference on computer and communication engineering (ICCCE). IEEE, pp 208–213

Kumar R, Verma R (2012) Classification algorithms for data mining: a survey. Int J Innov Eng Technol IJIET 1(2):7–14

Lakshmanaprabu SK, Shankar K, Ilayaraja M, Nasir AW, Vijayakumar V, Chilamkurti N (2019) Random forest for big data classification in the internet of things using optimal features. Int J Mach Learn Cybern 10(10):1–10CrossRef

Li D, Ryu KH, Batbaatar E, Park HW, Jeone SP, Ye Z (2018) An effective feature selection and classification model for high dimensional big data sets. Int J Des Anal Tools Integr Circuits Syst 7(1):38–43

Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowl Based Syst 117:3–15CrossRef

Palma-Mendoza RJ, de-Marcos L, Rodriguez D, Alonso-Betanzos A, (2018) Distributed correlation-based feature selection in spark. Inf Sci 496:287–299CrossRef

Priyadarshini A (2015) A map reduce based support vector machine for big data classification. Int J Database Theory Appl 8(5):77–98CrossRef

Ramírez-Gallego S, Mouriño-Talín H, Martínez-Rego D, Bolón-Canedo V, Benítez JM, Alonso-Betanzos A, Herrera F (2018a) An information theory-based feature selection framework for big data under apache spark. IEEE Trans Syst Man Cybern Syst 48(9):1441–1453CrossRef

Ramírez-Gallego S, García S, Xiong N, Herrera F (2018b) BELIEF: a distance-based redundancy-proof feature selection method for big data. arXiv preprint arXiv:1804.05774

Shabestari F, Rahmani AM, Navimipour NJ, Jabbehdari S (2019) A taxonomy of software-based and hardware-based approaches for energy efficiency management in the Hadoop. J Netw Comput Appl 126:162–177CrossRef

Thakor HR (2017) A survey paper on classification algorithms in big data. Int J Res Cult Soc 1(3):21–27

Triguero I, Peralta D, Bacardit J, García S, Herrera F (2015) MRPR: a MapReduce solution for prototype reduction in big data classification. Neurocomputing 150:331–345CrossRef

Vergara JR, Estévez PA (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24(1):175–186CrossRef

Von Kirby P, Gerardo BD, Medina RP (2017) Implementing enhanced AdaBoost algorithm for sales classification and prediction. Int J Trade Econ Finance 8(6):270–273CrossRef

Wang Y, Ke W, Tao X (2016) A feature selection method for large-scale network traffic classification based on spark. Information 7(1):6CrossRef

Win TZ, Kham NSM (2018) Mutual information-based feature selection approach to reduce high dimension of big data. In: Proceedings of the 2018 international conference on machine learning and machine intelligence. ACM, pp 3–7

You ZH, Yu JZ, Zhu L, Li S, Wen ZK (2014) A MapReduce based parallel SVM for large-scale predicting protein–protein interactions. Neurocomputing 145:37–43CrossRef

Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 856–863

Zakir J, Seymour T, Berg K (2015) Big data analytics. Issues Inf Syst 16(2):81–90

Zdravevski E, Lameski P, Kulakov A, Jakimovski B, Filiposka S, Trajanov D (2015) Feature ranking based on information gain for large classification problems with MapReduce. In: 2015 IEEE Trustcom/BigDataSE/ISPA. IEEE, vol 2, pp 186–191

Titel: Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms
verfasst von: C. K. Sarumathiy
K. Geetha
C. Rajan
Publikationsdatum: 29.10.2019
Verlag: Springer Berlin Heidelberg
Erschienen in: Soft Computing / Ausgabe 1/2020
Print ISSN: 1432-7643
Elektronische ISSN: 1433-7479
DOI: https://doi.org/10.1007/s00500-019-04453-x

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Weitere Artikel der Ausgabe 1/2020

Special issue on “Extensions to type-1 fuzzy logic: theory, algorithms and applications”

Computational properties of pentadiagonal and anti-pentadiagonal block band matrices with perturbed corners

Scheduling multi-component maintenance with a greedy heuristic local search algorithm

Toward a development of general type-2 fuzzy classifiers applied in diagnosis problems through embedded type-1 fuzzy classifiers

Incremental mechanism of attribute reduction based on discernible relations for dynamically increasing attribute

Intuitionistic fuzzy adaptive sliding mode control of nonlinear systems