Skip to main content
Erschienen in: Soft Computing 1/2020

29.10.2019 | Methodologies and Application

Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms

verfasst von: C. K. Sarumathiy, K. Geetha, C. Rajan

Erschienen in: Soft Computing | Ausgabe 1/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Big Data has been a term used in datasets which are complex and large in such a way there are some traditional technologies of data processing which are not adequate. Big Data can revolutionize most aspects in society such as collection or management of data from Big Data which is challenging and also very complex. The Hadoop has been designed for processing a large amount of unstructured and complex data. It has provided with a large amount of storage for data along with the ability to be able to tackle unlimited and concurrent tasks or jobs. The selection of features is an extremely powerful technique in the reduction of dimensionality and is also the most important step in machine learning applications. In recent decades, data is getting larger in a progressive manner in terms of instances and numbers making it very hard to deal with the problem of feature selection. In order to cope with such an epoch of Big Data, there are some more new techniques that are required to address the problem in a more efficient manner. At the same time, the suitability of the algorithms currently used may not be applicable especially when the size of data is above hundreds of gigabytes. For the purpose of this work, the correlation-based feature selection along with mutual information-based methods of feature selection was used for improving the performance. The AdaBoost and the support vector machine based classifiers have been used for improving their accuracy. The results of the experiment prove that the method proposed was able to achieve better performance compared to that of the other methods.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Adhikari BK, Zuo WL, Maharjan R, Guo L (2018) Sensitive data detection using NN and KNN from big data. In: International conference on algorithms and architectures for parallel processing. Springer, Cham, pp 628–642CrossRef Adhikari BK, Zuo WL, Maharjan R, Guo L (2018) Sensitive data detection using NN and KNN from big data. In: International conference on algorithms and architectures for parallel processing. Springer, Cham, pp 628–642CrossRef
Zurück zum Zitat Almasi M, Abadeh MS (2018) A new MapReduce associative classifier based on a new storage format for large-scale imbalanced data. Clust Comput 21(4):1821–1847CrossRef Almasi M, Abadeh MS (2018) A new MapReduce associative classifier based on a new storage format for large-scale imbalanced data. Clust Comput 21(4):1821–1847CrossRef
Zurück zum Zitat Ayma VA, Ferreira RS, Happ P, Oliveira D, Feitosa R, Costa G, Gamba P (2015) Classification algorithms for big data analysis, a Map Reduce approach. Int Arch Photogramm Remote Sens Spat Inf Sci 40(3):17–21CrossRef Ayma VA, Ferreira RS, Happ P, Oliveira D, Feitosa R, Costa G, Gamba P (2015) Classification algorithms for big data analysis, a Map Reduce approach. Int Arch Photogramm Remote Sens Spat Inf Sci 40(3):17–21CrossRef
Zurück zum Zitat Bhardwaj P, Gupta A, Sharma M, Gupta M, Singhal S (2016) A survey on comparative analysis of big data tools. Int J Comput Sci Mob Comput 5(5):789–793 Bhardwaj P, Gupta A, Sharma M, Gupta M, Singhal S (2016) A survey on comparative analysis of big data tools. Int J Comput Sci Mob Comput 5(5):789–793
Zurück zum Zitat Doquire G, Verleysen M (2011) Feature selection with mutual information for uncertain data. In: International conference on data warehousing and knowledge discovery. Springer, Berlin, pp 330–341CrossRef Doquire G, Verleysen M (2011) Feature selection with mutual information for uncertain data. In: International conference on data warehousing and knowledge discovery. Springer, Berlin, pp 330–341CrossRef
Zurück zum Zitat Hodge VJ, O’Keefe S, Austin J (2016) Hadoop neural network for parallel and distributed feature selection. Neural Netw 78:24–35CrossRef Hodge VJ, O’Keefe S, Austin J (2016) Hadoop neural network for parallel and distributed feature selection. Neural Netw 78:24–35CrossRef
Zurück zum Zitat Hossen J, Sayeed S (2018) Modifying cleaning method in big data analytics process using random forest classifier. In: 2018 7th international conference on computer and communication engineering (ICCCE). IEEE, pp 208–213 Hossen J, Sayeed S (2018) Modifying cleaning method in big data analytics process using random forest classifier. In: 2018 7th international conference on computer and communication engineering (ICCCE). IEEE, pp 208–213
Zurück zum Zitat Kumar R, Verma R (2012) Classification algorithms for data mining: a survey. Int J Innov Eng Technol IJIET 1(2):7–14 Kumar R, Verma R (2012) Classification algorithms for data mining: a survey. Int J Innov Eng Technol IJIET 1(2):7–14
Zurück zum Zitat Lakshmanaprabu SK, Shankar K, Ilayaraja M, Nasir AW, Vijayakumar V, Chilamkurti N (2019) Random forest for big data classification in the internet of things using optimal features. Int J Mach Learn Cybern 10(10):1–10CrossRef Lakshmanaprabu SK, Shankar K, Ilayaraja M, Nasir AW, Vijayakumar V, Chilamkurti N (2019) Random forest for big data classification in the internet of things using optimal features. Int J Mach Learn Cybern 10(10):1–10CrossRef
Zurück zum Zitat Li D, Ryu KH, Batbaatar E, Park HW, Jeone SP, Ye Z (2018) An effective feature selection and classification model for high dimensional big data sets. Int J Des Anal Tools Integr Circuits Syst 7(1):38–43 Li D, Ryu KH, Batbaatar E, Park HW, Jeone SP, Ye Z (2018) An effective feature selection and classification model for high dimensional big data sets. Int J Des Anal Tools Integr Circuits Syst 7(1):38–43
Zurück zum Zitat Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowl Based Syst 117:3–15CrossRef Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowl Based Syst 117:3–15CrossRef
Zurück zum Zitat Palma-Mendoza RJ, de-Marcos L, Rodriguez D, Alonso-Betanzos A, (2018) Distributed correlation-based feature selection in spark. Inf Sci 496:287–299CrossRef Palma-Mendoza RJ, de-Marcos L, Rodriguez D, Alonso-Betanzos A, (2018) Distributed correlation-based feature selection in spark. Inf Sci 496:287–299CrossRef
Zurück zum Zitat Priyadarshini A (2015) A map reduce based support vector machine for big data classification. Int J Database Theory Appl 8(5):77–98CrossRef Priyadarshini A (2015) A map reduce based support vector machine for big data classification. Int J Database Theory Appl 8(5):77–98CrossRef
Zurück zum Zitat Ramírez-Gallego S, Mouriño-Talín H, Martínez-Rego D, Bolón-Canedo V, Benítez JM, Alonso-Betanzos A, Herrera F (2018a) An information theory-based feature selection framework for big data under apache spark. IEEE Trans Syst Man Cybern Syst 48(9):1441–1453CrossRef Ramírez-Gallego S, Mouriño-Talín H, Martínez-Rego D, Bolón-Canedo V, Benítez JM, Alonso-Betanzos A, Herrera F (2018a) An information theory-based feature selection framework for big data under apache spark. IEEE Trans Syst Man Cybern Syst 48(9):1441–1453CrossRef
Zurück zum Zitat Ramírez-Gallego S, García S, Xiong N, Herrera F (2018b) BELIEF: a distance-based redundancy-proof feature selection method for big data. arXiv preprint arXiv:1804.05774 Ramírez-Gallego S, García S, Xiong N, Herrera F (2018b) BELIEF: a distance-based redundancy-proof feature selection method for big data. arXiv preprint arXiv:​1804.​05774
Zurück zum Zitat Shabestari F, Rahmani AM, Navimipour NJ, Jabbehdari S (2019) A taxonomy of software-based and hardware-based approaches for energy efficiency management in the Hadoop. J Netw Comput Appl 126:162–177CrossRef Shabestari F, Rahmani AM, Navimipour NJ, Jabbehdari S (2019) A taxonomy of software-based and hardware-based approaches for energy efficiency management in the Hadoop. J Netw Comput Appl 126:162–177CrossRef
Zurück zum Zitat Thakor HR (2017) A survey paper on classification algorithms in big data. Int J Res Cult Soc 1(3):21–27 Thakor HR (2017) A survey paper on classification algorithms in big data. Int J Res Cult Soc 1(3):21–27
Zurück zum Zitat Triguero I, Peralta D, Bacardit J, García S, Herrera F (2015) MRPR: a MapReduce solution for prototype reduction in big data classification. Neurocomputing 150:331–345CrossRef Triguero I, Peralta D, Bacardit J, García S, Herrera F (2015) MRPR: a MapReduce solution for prototype reduction in big data classification. Neurocomputing 150:331–345CrossRef
Zurück zum Zitat Vergara JR, Estévez PA (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24(1):175–186CrossRef Vergara JR, Estévez PA (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24(1):175–186CrossRef
Zurück zum Zitat Von Kirby P, Gerardo BD, Medina RP (2017) Implementing enhanced AdaBoost algorithm for sales classification and prediction. Int J Trade Econ Finance 8(6):270–273CrossRef Von Kirby P, Gerardo BD, Medina RP (2017) Implementing enhanced AdaBoost algorithm for sales classification and prediction. Int J Trade Econ Finance 8(6):270–273CrossRef
Zurück zum Zitat Wang Y, Ke W, Tao X (2016) A feature selection method for large-scale network traffic classification based on spark. Information 7(1):6CrossRef Wang Y, Ke W, Tao X (2016) A feature selection method for large-scale network traffic classification based on spark. Information 7(1):6CrossRef
Zurück zum Zitat Win TZ, Kham NSM (2018) Mutual information-based feature selection approach to reduce high dimension of big data. In: Proceedings of the 2018 international conference on machine learning and machine intelligence. ACM, pp 3–7 Win TZ, Kham NSM (2018) Mutual information-based feature selection approach to reduce high dimension of big data. In: Proceedings of the 2018 international conference on machine learning and machine intelligence. ACM, pp 3–7
Zurück zum Zitat You ZH, Yu JZ, Zhu L, Li S, Wen ZK (2014) A MapReduce based parallel SVM for large-scale predicting protein–protein interactions. Neurocomputing 145:37–43CrossRef You ZH, Yu JZ, Zhu L, Li S, Wen ZK (2014) A MapReduce based parallel SVM for large-scale predicting protein–protein interactions. Neurocomputing 145:37–43CrossRef
Zurück zum Zitat Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 856–863 Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 856–863
Zurück zum Zitat Zakir J, Seymour T, Berg K (2015) Big data analytics. Issues Inf Syst 16(2):81–90 Zakir J, Seymour T, Berg K (2015) Big data analytics. Issues Inf Syst 16(2):81–90
Zurück zum Zitat Zdravevski E, Lameski P, Kulakov A, Jakimovski B, Filiposka S, Trajanov D (2015) Feature ranking based on information gain for large classification problems with MapReduce. In: 2015 IEEE Trustcom/BigDataSE/ISPA. IEEE, vol 2, pp 186–191 Zdravevski E, Lameski P, Kulakov A, Jakimovski B, Filiposka S, Trajanov D (2015) Feature ranking based on information gain for large classification problems with MapReduce. In: 2015 IEEE Trustcom/BigDataSE/ISPA. IEEE, vol 2, pp 186–191
Metadaten
Titel
Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms
verfasst von
C. K. Sarumathiy
K. Geetha
C. Rajan
Publikationsdatum
29.10.2019
Verlag
Springer Berlin Heidelberg
Erschienen in
Soft Computing / Ausgabe 1/2020
Print ISSN: 1432-7643
Elektronische ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-019-04453-x

Weitere Artikel der Ausgabe 1/2020

Soft Computing 1/2020 Zur Ausgabe