Top

Published in:

2020 | OriginalPaper | Chapter

9. Big Data Software

Authors : Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera

Published in: Big Data Preprocessing

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

The advent of Big Data has created the necessity of new computing tools for processing huge amounts of data. Apache Hadoop was the first open-source framework that implemented the MapReduce paradigm. Apache Spark appeared a few years later improving the Hadoop Ecosystem. Similarly, Apache Flink appeared in the last years for tackling the Big Data streaming problem. However, as these frameworks were created for dealing with huge amounts of data, many practitioners will need machine learning algorithms for extracting the knowledge in the data. The success of a Big Data framework is going to be strongly related to its machine learning capability. This is the reason why nowadays these frameworks include a Big Data machine learning library, MLlib in the case of Spark, and FlinkML for Flink. In this chapter, we analyze in depth both MLlib and FlinkML Big Data libraries. We start with a description of Apache Spark MLlib and all of its components. We continue with a description of a Big Data library focused on data preprocessing for Apache Spark, named BigDaPSpark. Next, we provide an extensive analysis of FlinkML, and its included algorithms and utilities. Lastly, we finish with the description of a Big Data streaming library, focused on data preprocessing for Apache Flink, named BigDaPFlink.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

previous chapter Imbalanced Data Preprocessing for Big Data

next chapter Final Thoughts: From Big Data to Smart Data

MLlib and ml documentation: http://spark.apache.org/docs/latest/ml-guide.html.

FlinkML documentation: https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/ml/.

Al-Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., & Ayyash, M. (2015). Internet of things: A survey on enabling technologies, protocols, and applications. IEEE Communications Surveys Tutorials, 17(4), 2347–2376.CrossRef

Alcalde-Barros, A., García-Gil, D., García, S., & Herrera, F. (2019). DPASF: A Flink library for streaming data preprocessing. Big Data Analytics, 4(1), 4.CrossRef

Angiulli, F. (2007). Fast nearest neighbor condensation for large data sets classification. IEEE Transactions on Knowledge and Data Engineering, 19(11), 1450–1464.CrossRef

Apache Flink. (2019). Apache Flink. http://flink.apache.org/.

Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., et al. (2015). Spark SQL: Relational data processing in spark. In ACM SIGMOD International Conference on Management of Data, SIGMOD ’15 (pp. 1383–1394).

Arnaiz-González, Á., González-Rogel, A., Díez-Pastor, J.-F., & López-Nozal, C. (2017). MR-DIS: democratic instance selection for big data by MapReduce. Progress in Artificial Intelligence, 6(3), 211–219.CrossRef

Basgall, M. J., Hasperué, W., Naiouf, M., Fernández, A., & Herrera, F. (2018). SMOTE-BD: An exact and scalable oversampling method for imbalanced classification in big data. Journal of Computer Science and Technology, 18(03), e23.CrossRef

Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations Newsletter, 6(1), 20–29.CrossRef

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1–2), 245–271.MathSciNetMATHCrossRef

10.

Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., et al. (2013). API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning (pp. 108–122).

11.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.MATHCrossRef

12.

Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165–1188.CrossRef

13.

Dean, J., & Ghemawat, S. (2010). MapReduce: A flexible data processing tool. Communications of the ACM, 53(1), 72–77.CrossRef

14.

Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In IJCAI (pp. 1022–1029).

15.

Fernández, A., del Río, S., López, V., Bawakid, A., del Jesús, M. J., Benítez, J. M., et al. (2014). Big data with cloud computing: an insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380–409.

16.

Figueredo, G. P., Triguero, I., Mesgarpour, M., Guerra, A. M., Garibaldi, J. M., & John, R. I. (2017). An immune-inspired technique to identify heavy goods vehicles incident hot spots. IEEE Transactions on Emerging Topics in Computational Intelligence, 1(4), 248–258.CrossRef

17.

Gama, J., & Pinto, C. (2006). Discretization from data streams: Applications to histograms and data mining. In Proceedings of the 2006 ACM Symposium on Applied Computing (pp. 662–667). New York: ACM.CrossRef

18.

García, S., Cano, J. R., & Herrera, F. (2008). A memetic algorithm for evolutionary prototype selection: A scaling up approach. Pattern Recognition, 41(8), 2693–2709.MATHCrossRef

19.

García-Gil, D., Alcalde-Barros, A., Luengo, J., García, S., & Herrera, F. (2019). Big data preprocessing as the bridge between big data and smart data: BigDaPSpark and BigDaPFlink libraries. In Proceedings of the 4th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS (pp. 324–331). INSTICC, SciTePress.

20.

García-Gil, D., Luengo, J., García, S., & Herrera, F. (2019). Enabling smart data: Noise filtering in big data classification. Information Sciences, 479, 135–152.CrossRef

21.

García-Gil, D., Ramírez-Gallego, S., García, S., & Herrera, F. (2018). Principal components analysis random discretization ensemble for big data. Knowledge-Based Systems, 150, 166–174.CrossRef

22.

Gupta, P., Sharma, A., & Jindal, R. (2016). Scalable machine learning algorithms for big data analytics: A comprehensive review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(6), 194–214.

23.

Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2006). Feature extraction: Foundations and applications (Studies in fuzziness and soft computing). New York: Springer.MATHCrossRef

24.

Hadoop Distributed File System. (2019). Hadoop Distributed File System. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html.

25.

Janssens, J., Huszár, F., Postma, E. O., & van den Herik, H. J. (2012). Stochastic outlier selection. Technical Report, Technical report TiCC TR 2012–001, Tilburg University.

26.

Katakis, I., Tsoumakas, G., & Vlahavas, I. (2005). On the utility of incremental feature selection for the classification of textual data streams. In Panhellenic Conference on Informatics (pp. 338–348). Berlin: Springer.

27.

Marx, V. (2013). Biology: The big challenges of big data. Nature, 498(7453), 255–260.CrossRef

28.

Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., et al. (2016). Mllib: Machine learning in Apache spark. Journal of Machine Learning Research, 17(34), 1–7.MathSciNetMATH

29.

Philip-Chen, C. L., & Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 275(10), 314–347.CrossRef

30.

Ramírez-Gallego, S., Fernández, A., García, S., Chen, M., & Herrera, F. (2018). Big data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce. Information Fusion, 42, 51–61.CrossRef

31.

Ramírez-Gallego, S., García, S., Benítez, J. M., & Herrera, F. (2018). A distributed evolutionary multivariate discretizer for big data processing on Apache spark. Swarm and Evolutionary Computation, 38, 240–250.CrossRef

32.

Ramírez-Gallego, S., García, S., & Herrera, F. (2018). Online entropy-based discretization for data streaming classification. Future Generation Computer Systems, 86, 59–70.CrossRef

33.

Ramírez-Gallego, S., García, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., et al. (2016). Data discretization: Taxonomy and big data challenge. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(1), 5–21.

34.

Ramírez-Gallego, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Benítez, J. M., Alonso-Betanzos, A., et al. (2018). An information theory-based feature selection framework for big data under Apache spark. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 48(9), 1441–1453.CrossRef

35.

Sánchez, J. S., Barandela, R., Marqués, A. I., Alejo, R., & Badenas, J. (2003). Analysis of new techniques to obtain quality training sets. Pattern Recognition Letters, 24(7), 1015–1022.CrossRef

36.

Sánchez, J. S., Pla, F., & Ferri, F. J. (1997). Prototype selection for the nearest neighbour rule through proximity graphs. Pattern Recognition Letters, 18(6), 507–513.CrossRef

37.

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., et al. (2014). Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).

38.

Skalak, D. B. (1994). Prototype and feature selection by sampling and random mutation hill climbing algorithms. In Machine Learning Proceedings 1994 (pp. 293–301). Amsterdam: Elsevier.CrossRef

39.

Snir, M., & Otto, S. (1998). MPI-The complete reference: The MPI core. Cambridge, MA: MIT Press.

40.

Takane, Y., Young, F. W., & De Leeuw, J. (1977). Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika, 42(1), 7–67.MATHCrossRef

41.

Tomek, I. (1976). An experiment with the edited nearest-neighbor rule. IEEE Transactions on systems, Man, and Cybernetics, SMC-6(6), 448–452.MathSciNetMATHCrossRef

42.

Triguero, I., García, S., & Herrera, F. (2011). Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification. Pattern Recognition, 44(4), 901–916.CrossRef

43.

Triguero, I., García-Gil, D., Maillo, J., Luengo, J., García, S., & Herrera, F. (2019). Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(2), e1289.

44.

Triguero, I., Peralta, D., Bacardit, J., García, S., & Herrera, F. (2015). MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing, 150, 331–345.CrossRef

45.

Wang, J., Zhao, P., Hoi, S. C. H., & Jin, R. (2014). Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering, 26(3), 698–710.CrossRef

46.

Webb, G. I. (2014). Contrary to popular belief incremental discretization can be sound, computationally efficient and extremely useful for streaming data. In 2014 IEEE International Conference on Data Mining (pp. 1031–1036).

47.

White, T. (2012). Hadoop: The definitive guide (3rd ed.). Sebastopol, CA: O’Reilly Media.

48.

Wilson, D. L. (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(3), 408–421.MathSciNetMATHCrossRef

49.

Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning (ICML-03) (pp. 856–863).

50.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., et al. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (pp. 1–14).

51.

Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel collaborative filtering for the Netflix prize. In R. Fleischer & J. Xu (Eds.), Algorithmic aspects in information and management (pp. 337–348). Berlin/Heidelberg: Springer.CrossRef

Title: Big Data Software
Authors: Julián Luengo
Diego García-Gil
Sergio Ramírez-Gallego
Salvador García
Francisco Herrera
Publisher: Springer International Publishing
Book: Big Data Preprocessing
Print ISBN: 978-3-030-39104-1

Electronic ISBN: 978-3-030-39105-8

Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-39105-8_9

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner