
2020 | OriginalPaper | Book chapter

2. Big Data: Technologies and Tools

Written by: Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera

Published in: Big Data Preprocessing

Publisher: Springer International Publishing


Abstract

The fast-evolving Big Data environment has given rise to a myriad of tools, paradigms, and techniques that tackle different use cases in industry and science. Because so many tools exist, however, practitioners and experts often find it difficult to analyze them and select the right one for their problems. In this chapter we present an introductory summary of the wide Big Data environment, with the aim of giving algorithm designers the knowledge they need to develop scalable and efficient machine learning solutions. We start by discussing the common technical concepts, paradigms, and technologies that form the foundation of frameworks like Spark and Hadoop. We then analyze in depth the most popular Big Data frameworks and their main components. Next, we discuss other novel platforms for high-speed stream processing that are gaining importance in industry. Finally, we compare two of the most relevant large-scale processing platforms today: Spark and Flink.





Title: Big Data: Technologies and Tools
Written by: Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
Publisher: Springer International Publishing
Book: Big Data Preprocessing
Print ISBN: 978-3-030-39104-1
Electronic ISBN: 978-3-030-39105-8
Copyright year: 2020
DOI: https://doi.org/10.1007/978-3-030-39105-8_2
