Published in: International Journal of Data Science and Analytics 3-4/2016

1 November 2016 | Review

Big data analytics on Apache Spark

Authors: Salman Salloum, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng, Joshua Zhexue Huang

Abstract

Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. Because Spark is a rapidly evolving open source project with an increasing number of contributors from both academia and industry, it is difficult for researchers, especially beginners in this area, to comprehend the full body of development and research behind it. In this paper, we present a technical review of big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark offers for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.

Literatur
1.
Zurück zum Zitat Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., Stoica, I.: Blinkdb: queries with bounded errors and bounded response times on very large data. In: Proceedings of the 8th ACM European Conference on Computer Systems. ACM, New York, pp 29–42 (2013). doi:10.1145/2465351.2465355 Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., Stoica, I.: Blinkdb: queries with bounded errors and bounded response times on very large data. In: Proceedings of the 8th ACM European Conference on Computer Systems. ACM, New York, pp 29–42 (2013). doi:10.​1145/​2465351.​2465355
4.
Zurück zum Zitat Andrew, G., Gao, J.: Scalable training of l1-regularized log-linear models. In: International Conference on Machine Learning (2007) Andrew, G., Gao, J.: Scalable training of l1-regularized log-linear models. In: International Conference on Machine Learning (2007)
5.
Zurück zum Zitat Apiletti, D., Garza, P., Pulvirenti, F.: New Trends in databases and information systems: ADBIS 2015 Short Papers and Workshops, BigDap, DCSA, GID, MEBIS, OAIS, SW4CH, WISARD, Poitiers, France, September 8–11, 2015. Proceedings, Springer International Publishing, Cham, chap A Review of Scalable Approaches for Frequent Itemset Mining, pp. 243–247 (2015) Apiletti, D., Garza, P., Pulvirenti, F.: New Trends in databases and information systems: ADBIS 2015 Short Papers and Workshops, BigDap, DCSA, GID, MEBIS, OAIS, SW4CH, WISARD, Poitiers, France, September 8–11, 2015. Proceedings, Springer International Publishing, Cham, chap A Review of Scalable Approaches for Frequent Itemset Mining, pp. 243–247 (2015)
10.
Zurück zum Zitat Awan, A.J., Brorsson, M., Vlassov, V., Ayguadé, E.: How data volume affects spark based data analytics on a scale-up server. CoRR arxiv:1507.08340 (2015) Awan, A.J., Brorsson, M., Vlassov, V., Ayguadé, E.: How data volume affects spark based data analytics on a scale-up server. CoRR arxiv:​1507.​08340 (2015)
11.
Zurück zum Zitat Awan, A.J., Brorsson, M., Vlassov, V., Ayguadé, E.: Architectural impact on performance of in-memory data analytics: Apache spark case study. CoRR arXiv:1604.08484 (2016) Awan, A.J., Brorsson, M., Vlassov, V., Ayguadé, E.: Architectural impact on performance of in-memory data analytics: Apache spark case study. CoRR arXiv:​1604.​08484 (2016)
12.
Zurück zum Zitat Boehm, M., Tatikonda, S., Reinwald, B., Sen, P., Tian, Y., Burdick, D.R., Vaithyanathan, S.: Hybrid parallelization strategies for large-scale machine learning in systemML. Proc. VLDB Endow. 7(7), 553–564 (2014). doi:10.14778/2732286.2732292 CrossRef Boehm, M., Tatikonda, S., Reinwald, B., Sen, P., Tian, Y., Burdick, D.R., Vaithyanathan, S.: Hybrid parallelization strategies for large-scale machine learning in systemML. Proc. VLDB Endow. 7(7), 553–564 (2014). doi:10.​14778/​2732286.​2732292 CrossRef
16.
Zurück zum Zitat Capotă, M, Hegeman, T., Iosup, A., Prat-Pérez, A., Erling, O., Boncz, P.: Graphalytics: a big data benchmark for graph-processing platforms. In: Proceedings of the GRADES’15, ACM, New York, NY, USA, GRADES’15, pp. 7:1–7:6. doi:10.1145/2764947.2764954 (2015) Capotă, M, Hegeman, T., Iosup, A., Prat-Pérez, A., Erling, O., Boncz, P.: Graphalytics: a big data benchmark for graph-processing platforms. In: Proceedings of the GRADES’15, ACM, New York, NY, USA, GRADES’15, pp. 7:1–7:6. doi:10.​1145/​2764947.​2764954 (2015)
17.
Zurück zum Zitat Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-mat: a recursive model for graph mining. In: In Fourth SIAM International Conference on Data Mining (2004) Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-mat: a recursive model for graph mining. In: In Fourth SIAM International Conference on Data Mining (2004)
20.
Zurück zum Zitat Crankshaw, D., Bailis, P., Gonzalez, J.E., Li, H., Zhang, Z., Franklin, M.J., Ghodsi, A., Jordan, M.I.: The missing piece in complex analytics: low latency, scalable model management and serving with velox. CoRR arxiv:1409.3809 (2014) Crankshaw, D., Bailis, P., Gonzalez, J.E., Li, H., Zhang, Z., Franklin, M.J., Ghodsi, A., Jordan, M.I.: The missing piece in complex analytics: low latency, scalable model management and serving with velox. CoRR arxiv:​1409.​3809 (2014)
25.
Zurück zum Zitat Dave, A., Jindal, A., Li, L.E., Xin, R., Gonzalez, J., Zaharia, M.: Graphframes: an integrated api for mixing graph and relational queries. In: Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, ACM, New York, NY, USA, GRADES ’16, pp. 2:1–2:8. doi:10.1145/2960414.2960416 (2016) Dave, A., Jindal, A., Li, L.E., Xin, R., Gonzalez, J., Zaharia, M.: Graphframes: an integrated api for mixing graph and relational queries. In: Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, ACM, New York, NY, USA, GRADES ’16, pp. 2:1–2:8. doi:10.​1145/​2960414.​2960416 (2016)
26.
Zurück zum Zitat Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: A runtime for iterative mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, New York, NY, USA, HPDC ’10, pp. 810–818. doi:10.1145/1851476.1851593 (2010) Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: A runtime for iterative mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, New York, NY, USA, HPDC ’10, pp. 810–818. doi:10.​1145/​1851476.​1851593 (2010)
27.
Zurück zum Zitat Fernndez, A., del Ro, S., Lpez, V., Bawakid, A., del Jesus, M.J., Bentez, J.M., Herrera, F.: Big data with cloud computing: an insight on the computing environment, mapreduce, and programming frameworks. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 4(5), 380–409 (2014). doi:10.1002/widm.1134 CrossRef Fernndez, A., del Ro, S., Lpez, V., Bawakid, A., del Jesus, M.J., Bentez, J.M., Herrera, F.: Big data with cloud computing: an insight on the computing environment, mapreduce, and programming frameworks. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 4(5), 380–409 (2014). doi:10.​1002/​widm.​1134 CrossRef
31.
Zurück zum Zitat Ganelin, l: Spark: Big Data Cluster Computing in Production. Wiley, New York (2016)CrossRef Ganelin, l: Spark: Big Data Cluster Computing in Production. Wiley, New York (2016)CrossRef
32.
Zurück zum Zitat Ghoting, A., Krishnamurthy, R., Pednault, E.P.D., Reinwald, B., Sindhwani, V., Tatikonda, S., Tian, Y., Vaithyanathan, S.: Systemml: Declarative machine learning on mapreduce. In: Abiteboul, S., Böhm, K., Koch, C., Tan, K. (eds.) Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11–16, 2011, Hannover, Germany, IEEE Computer Society, pp. 231–242. doi:10.1109/ICDE.2011.5767930 (2011) Ghoting, A., Krishnamurthy, R., Pednault, E.P.D., Reinwald, B., Sindhwani, V., Tatikonda, S., Tian, Y., Vaithyanathan, S.: Systemml: Declarative machine learning on mapreduce. In: Abiteboul, S., Böhm, K., Koch, C., Tan, K. (eds.) Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11–16, 2011, Hannover, Germany, IEEE Computer Society, pp. 231–242. doi:10.​1109/​ICDE.​2011.​5767930 (2011)
33.
Zurück zum Zitat Gonzalez, J.E.: From graphs to tables the design of scalable systems for graph analytics. In: 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic of Korea, April 7–11, 2014, Companion Volume, pp. 1149–1150. doi:10.1145/2567948.2580059 (2014) Gonzalez, J.E.: From graphs to tables the design of scalable systems for graph analytics. In: 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic of Korea, April 7–11, 2014, Companion Volume, pp. 1149–1150. doi:10.​1145/​2567948.​2580059 (2014)
35.
Zurück zum Zitat Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: Graphx: Graph processing in a distributed dataflow framework. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, Berkeley, CA, USA, OSDI’14, pp. 599–613. http://dl.acm.org/citation.cfm?id=2685048.2685096 (2014) Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: Graphx: Graph processing in a distributed dataflow framework. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, Berkeley, CA, USA, OSDI’14, pp. 599–613. http://​dl.​acm.​org/​citation.​cfm?​id=​2685048.​2685096 (2014)
36.
Zurück zum Zitat Gopalani, S., Arora, R.: Article: Comparing apache spark and map reduce with performance analysis using k-means. Int. J. Comput. Appl. 113(1), 8–11 (2015). (full text available) Gopalani, S., Arora, R.: Article: Comparing apache spark and map reduce with performance analysis using k-means. Int. J. Comput. Appl. 113(1), 8–11 (2015). (full text available)
38.
Zurück zum Zitat Gulzar, M.A., Interlandi, M., Yoo, S., Tetali, S.D., Condie, T., Millstein, T., Kim, M.: Bigdebug: debugging primitives for interactive big data processing in spark. In: Proceedings of 38th IEEE/ACM International Conference on Software Engineering, ICSE’ 16 (2016) Gulzar, M.A., Interlandi, M., Yoo, S., Tetali, S.D., Condie, T., Millstein, T., Kim, M.: Bigdebug: debugging primitives for interactive big data processing in spark. In: Proceedings of 38th IEEE/ACM International Conference on Software Engineering, ICSE’ 16 (2016)
39.
Zurück zum Zitat Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, Berkeley, CA, USA, NSDI’11, pp. 295–308. http://dl.acm.org/citation.cfm?id=1972457.1972488 (2011) Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, Berkeley, CA, USA, NSDI’11, pp. 295–308. http://​dl.​acm.​org/​citation.​cfm?​id=​1972457.​1972488 (2011)
43.
Zurück zum Zitat Iyer, A.P., Li, L.E., Das, T., Stoica, I.: Time-evolving graph processing at scale. In: Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, ACM, New York, NY, USA, GRADES ’16, pp. 5:1–5:6. doi:10.1145/2960414.2960419 (2016) Iyer, A.P., Li, L.E., Das, T., Stoica, I.: Time-evolving graph processing at scale. In: Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, ACM, New York, NY, USA, GRADES ’16, pp. 5:1–5:6. doi:10.​1145/​2960414.​2960419 (2016)
44.
Zurück zum Zitat Jarrah, M., Al-Quraan, M., Jararweh, Y., Al-Ayyoub, M.: Medgraph: a graph-based representation and computation to handle large sets of images. Multimedia Tools and Applications, pp. 1–17. doi:10.1007/s11042-016-3262-0 (2016) Jarrah, M., Al-Quraan, M., Jararweh, Y., Al-Ayyoub, M.: Medgraph: a graph-based representation and computation to handle large sets of images. Multimedia Tools and Applications, pp. 1–17. doi:10.​1007/​s11042-016-3262-0 (2016)
45.
Zurück zum Zitat Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: Lightning-Fast Big Data Analytics, 1st edn. O’Reilly Media, Inc, Sebastopol (2015) Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: Lightning-Fast Big Data Analytics, 1st edn. O’Reilly Media, Inc, Sebastopol (2015)
46.
Zurück zum Zitat Kim, H., Park, J., Jang, J., Yoon, S.: Deepspark: Spark-based deep learning supporting asynchronous updates and caffe compatibility. CoRR arXiv:1602.08191 (2016) Kim, H., Park, J., Jang, J., Yoon, S.: Deepspark: Spark-based deep learning supporting asynchronous updates and caffe compatibility. CoRR arXiv:​1602.​08191 (2016)
47.
Zurück zum Zitat Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Li, Y., Liu, B., Sarawagi, S. (eds.) Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24–27, 2008, ACM, pp. 426–434. doi:10.1145/1401890.1401944 (2008) Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Li, Y., Liu, B., Sarawagi, S. (eds.) Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24–27, 2008, ACM, pp. 426–434. doi:10.​1145/​1401890.​1401944 (2008)
49.
Zurück zum Zitat Krishnan, D.R., Quoc, D.L., Bhatotia, P., Fetzer, C., Rodrigues, R.: Incapprox: A data analytics system for incremental approximate computing. In: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, pp. 1133–1144 (2016) Krishnan, D.R., Quoc, D.L., Bhatotia, P., Fetzer, C., Rodrigues, R.: Incapprox: A data analytics system for incremental approximate computing. In: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, pp. 1133–1144 (2016)
51.
Zurück zum Zitat Landset, S., Khoshgoftaar, T.M., Richter, A.N., Hasanin, T.: A survey of open source tools for machine learning with big data in the hadoop ecosystem. J. Big Data 2(1), 1–36 (2015). doi:10.1186/s40537-015-0032-1 CrossRef Landset, S., Khoshgoftaar, T.M., Richter, A.N., Hasanin, T.: A survey of open source tools for machine learning with big data in the hadoop ecosystem. J. Big Data 2(1), 1–36 (2015). doi:10.​1186/​s40537-015-0032-1 CrossRef
52.
Zurück zum Zitat Li, H., Ghodsi, A., Zaharia, M., Shenker, S., Stoica, I.: Tachyon: Reliable, memory speed storage for cluster computing frameworks. In: Proceedings of the ACM Symposium on Cloud Computing, ACM, pp. 1–15 (2014) Li, H., Ghodsi, A., Zaharia, M., Shenker, S., Stoica, I.: Tachyon: Reliable, memory speed storage for cluster computing frameworks. In: Proceedings of the ACM Symposium on Cloud Computing, ACM, pp. 1–15 (2014)
53.
Zurück zum Zitat Li, M., Tan, J., Wang, Y., Zhang, L., Salapura, V.: SparkBench. In: Proceedings of the 12th ACM International Conference on Computing Frontiers—CF ’15, ACM Press, New York, New York, USA, pp. 1–8. doi:10.1145/2742854.2747283 (2015) Li, M., Tan, J., Wang, Y., Zhang, L., Salapura, V.: SparkBench. In: Proceedings of the 12th ACM International Conference on Computing Frontiers—CF ’15, ACM Press, New York, New York, USA, pp. 1–8. doi:10.​1145/​2742854.​2747283 (2015)
54.
Zurück zum Zitat Li, P., Luo, Y., Zhang, N., Cao, Y.: Heterospark: A heterogeneous cpu/gpu spark platform for machine learning algorithms. In: 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), pp. 347–348. doi:10.1109/NAS.2015.7255222 (2015) Li, P., Luo, Y., Zhang, N., Cao, Y.: Heterospark: A heterogeneous cpu/gpu spark platform for machine learning algorithms. In: 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), pp. 347–348. doi:10.​1109/​NAS.​2015.​7255222 (2015)
55.
Zurück zum Zitat Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: GraphLab: A New Framework for Parallel Machine Learning, pp. 8–11. arxiv:1006.4990 (2010) Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: GraphLab: A New Framework for Parallel Machine Learning, pp. 8–11. arxiv:​1006.​4990 (2010)
56.
Zurück zum Zitat Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J.M.: Distributed graphlab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5(8), 716–727 (2012). doi:10.14778/2212351.2212354 CrossRef Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J.M.: Distributed graphlab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5(8), 716–727 (2012). doi:10.​14778/​2212351.​2212354 CrossRef
58.
Zurück zum Zitat Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, SIGMOD ’10, pp. 135–146. doi:10.1145/1807167.1807184 (2010) Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, SIGMOD ’10, pp. 135–146. doi:10.​1145/​1807167.​1807184 (2010)
59.
Zurück zum Zitat Marcu, O.C., Costan, A., Antoniu, G., Pérez, M.S.: Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks. In: Cluster 2016—The IEEE 2016 International Conference on Cluster Computing, Taipei, Taiwan. https://hal.inria.fr/hal-01347638 (2016) Marcu, O.C., Costan, A., Antoniu, G., Pérez, M.S.: Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks. In: Cluster 2016—The IEEE 2016 International Conference on Cluster Computing, Taipei, Taiwan. https://​hal.​inria.​fr/​hal-01347638 (2016)
60.
Zurück zum Zitat Massie, M., Nothaft, F., Hartl, C., Kozanitis, C., Schumacher, A., Joseph, A.D., Patterson, D.A.: Adam: Genomics formats and processing patterns for butt scale computing. Tech. Rep. UCB/EECS-2013-207, EECS Department, University of California, Berkeley (2013) Massie, M., Nothaft, F., Hartl, C., Kozanitis, C., Schumacher, A., Joseph, A.D., Patterson, D.A.: Adam: Genomics formats and processing patterns for butt scale computing. Tech. Rep. UCB/EECS-2013-207, EECS Department, University of California, Berkeley (2013)
61.
Zurück zum Zitat Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., Talwalkar, A.: Mllib: Machine learning in apache spark arXiv:1505.06807 (2015) Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., Talwalkar, A.: Mllib: Machine learning in apache spark arXiv:​1505.​06807 (2015)
64.
65.
Zurück zum Zitat O’Brien, A.R., Saunders, N.F.W., Guo, Y., Buske, F.A., Scott, R.J., Bauer, D.C.: Variantspark: population scale clustering of genotype information. BMC Genom. 16(1), 1–9 (2015). doi:10.1186/s12864-015-2269-7 CrossRef O’Brien, A.R., Saunders, N.F.W., Guo, Y., Buske, F.A., Scott, R.J., Bauer, D.C.: Variantspark: population scale clustering of genotype information. BMC Genom. 16(1), 1–9 (2015). doi:10.​1186/​s12864-015-2269-7 CrossRef
66.
Zurück zum Zitat Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.G.: Making sense of performance in data analytics frameworks. In: Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, Berkeley, CA, USA, NSDI’15, pp. 293–307. http://dl.acm.org/citation.cfm?id=2789770.2789791 (2015) Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.G.: Making sense of performance in data analytics frameworks. In: Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, Berkeley, CA, USA, NSDI’15, pp. 293–307. http://​dl.​acm.​org/​citation.​cfm?​id=​2789770.​2789791 (2015)
67.
Zurück zum Zitat Palamuttam, R., Mogrovejo, R.M., Mattmann, C., Wilson, B., Whitehall, K., Verma, R., McGibbney, L.J., Ramirez, P.M.: Scispark: applying in-memory distributed computing to weather event detection and tracking. In: 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29-November 1, 2015, IEEE, pp. 2020–2026. doi:10.1109/BigData.2015.7363983 (2015) Palamuttam, R., Mogrovejo, R.M., Mattmann, C., Wilson, B., Whitehall, K., Verma, R., McGibbney, L.J., Ramirez, P.M.: Scispark: applying in-memory distributed computing to weather event detection and tracking. In: 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29-November 1, 2015, IEEE, pp. 2020–2026. doi:10.​1109/​BigData.​2015.​7363983 (2015)
68.
Zurück zum Zitat Ramrez-Gallego, S., Garca, S., Mourio-Taln, H., Martnez-Rego, D., Boln-Canedo, V., Alonso-Betanzos, A., Bentez, J.M., Herrera, F.: Data discretization: taxonomy and big data challenge. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 6(1), 5–21 (2016). doi:10.1002/widm.1173 CrossRef Ramrez-Gallego, S., Garca, S., Mourio-Taln, H., Martnez-Rego, D., Boln-Canedo, V., Alonso-Betanzos, A., Bentez, J.M., Herrera, F.: Data discretization: taxonomy and big data challenge. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 6(1), 5–21 (2016). doi:10.​1002/​widm.​1173 CrossRef
69.
Zurück zum Zitat Richter, A.N., Khoshgoftaar, T.M., Landset, S., Hasanin, T.: A multi-dimensional comparison of toolkits for machine learning with big data. In: 2015 IEEE International Conference on Information Reuse and Integration, IRI 2015, San Francisco, CA, USA, August 13–15, 2015, IEEE, pp. 1–8. doi:10.1109/IRI.2015.12 (2015) Richter, A.N., Khoshgoftaar, T.M., Landset, S., Hasanin, T.: A multi-dimensional comparison of toolkits for machine learning with big data. In: 2015 IEEE International Conference on Information Reuse and Integration, IRI 2015, San Francisco, CA, USA, August 13–15, 2015, IEEE, pp. 1–8. doi:10.​1109/​IRI.​2015.​12 (2015)
71.
Zurück zum Zitat Salperwyck, C., Maby, S., Cubillé, J., Lagacherie, M.: Courbospark: Decision tree for time-series on spark. In: Proceedings of the 1st International Workshop on Advanced Analytics and Learning on Temporal Data, AALTD 2015, co-located with The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2015), Porto, Portugal, September 11, 2015. http://ceur-ws.org/Vol-1425/paper15.pdf (2015) Salperwyck, C., Maby, S., Cubillé, J., Lagacherie, M.: Courbospark: Decision tree for time-series on spark. In: Proceedings of the 1st International Workshop on Advanced Analytics and Learning on Temporal Data, AALTD 2015, co-located with The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2015), Porto, Portugal, September 11, 2015. http://​ceur-ws.​org/​Vol-1425/​paper15.​pdf (2015)
72.
Zurück zum Zitat Shi, J., Qiu, Y., Minhas, U.F., Jiao, L., Wang, C., Reinwald, B., Özcan, F.: Clash of the titans: Mapreduce vs. spark for large scale data analytics. Proc. VLDB Endow. 8(13), 2110–2121 (2015). doi:10.14778/2831360.2831365 CrossRef Shi, J., Qiu, Y., Minhas, U.F., Jiao, L., Wang, C., Reinwald, B., Özcan, F.: Clash of the titans: Mapreduce vs. spark for large scale data analytics. Proc. VLDB Endow. 8(13), 2110–2121 (2015). doi:10.​14778/​2831360.​2831365 CrossRef
74.
Zurück zum Zitat Sparks, E.R., Talwalkar, A., Franklin, M.J., Jordan, M.I., Kraska, T.: Tupaq: An efficient planner for large-scale predictive analytic queries. CoRR arXiv:1502.00068 (2015) Sparks, E.R., Talwalkar, A., Franklin, M.J., Jordan, M.I., Kraska, T.: Tupaq: An efficient planner for large-scale predictive analytic queries. CoRR arXiv:​1502.​00068 (2015)
75.
Zurück zum Zitat Sparks, E.R., Talwalkar, A., Haas, D., Franklin, M.J., Jordan, M.I., Kraska, T.: Automating model search for large scale machine learning. In: Proceedings of the Sixth ACM Symposium on Cloud Computing, ACM, New York, NY, USA, SoCC ’15, pp. 368–380. doi:10.1145/2806777.2806945 (2015) Sparks, E.R., Talwalkar, A., Haas, D., Franklin, M.J., Jordan, M.I., Kraska, T.: Automating model search for large scale machine learning. In: Proceedings of the Sixth ACM Symposium on Cloud Computing, ACM, New York, NY, USA, SoCC ’15, pp. 368–380. doi:10.​1145/​2806777.​2806945 (2015)
76.
Zurück zum Zitat Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, ACM, New York, NY, USA, SOCC ’13, pp. 5:1–5:16. doi:10.1145/2523616.2523633 (2013) Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, ACM, New York, NY, USA, SOCC ’13, pp. 5:1–5:16. doi:10.​1145/​2523616.​2523633 (2013)
77.
Zurück zum Zitat Venkataraman, S., Yang, Z., Liu, D., Liang, E., Falaki, H., Meng, X., Xin, R., Ghodsi, A., Franklin, M., Stoica, I., Zaharia, M.: Sparkr: Scaling r programs with spark. In: Proceedings of the 2016 International Conference on Management of Data, ACM, New York, NY, USA, SIGMOD ’16, pp. 1099–1104. doi:10.1145/2882903.2903740 (2016) Venkataraman, S., Yang, Z., Liu, D., Liang, E., Falaki, H., Meng, X., Xin, R., Ghodsi, A., Franklin, M., Stoica, I., Zaharia, M.: Sparkr: Scaling r programs with spark. In: Proceedings of the 2016 International Conference on Management of Data, ACM, New York, NY, USA, SIGMOD ’16, pp. 1099–1104. doi:10.​1145/​2882903.​2903740 (2016)
78.
Zurück zum Zitat Wang, K., Khan, M.M.H.: Performance prediction for apache spark platform. In: 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conferen on Embedded Software and Systems (ICESS), pp. 166–173. doi:10.1109/HPCC-CSS-ICESS.2015.246 (2015) Wang, K., Khan, M.M.H.: Performance prediction for apache spark platform. In: 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conferen on Embedded Software and Systems (ICESS), pp. 166–173. doi:10.​1109/​HPCC-CSS-ICESS.​2015.​246 (2015)
83.
Zurück zum Zitat Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: Graphx: a resilient distributed graph system on spark. In: First International Workshop on Graph Data Management Experiences and Systems, GRADES 2013, co-loated with SIGMOD/PODS 2013, New York, NY, USA, June 24, 2013, p 2. http://event.cwi.nl/grades2013/02-xin.pdf (2013) Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: Graphx: a resilient distributed graph system on spark. In: First International Workshop on Graph Data Management Experiences and Systems, GRADES 2013, co-loated with SIGMOD/PODS 2013, New York, NY, USA, June 24, 2013, p 2. http://​event.​cwi.​nl/​grades2013/​02-xin.​pdf (2013)
84.
Zurück zum Zitat Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I.: Shark: Sql and rich analytics at scale. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, SIGMOD ’13, pp. 13–24. doi:10.1145/2463676.2465288 (2013) Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I.: Shark: Sql and rich analytics at scale. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, SIGMOD ’13, pp. 13–24. doi:10.​1145/​2463676.​2465288 (2013)
85.
Zurück zum Zitat Xin, R.S., Crankshaw, D., Dave, A., Gonzalez, J.E., Franklin, M.J., Stoica, I.: Graphx: Unifying data-parallel and graph-parallel analytics. CoRR arxiv:1402.2394 (2014) Xin, R.S., Crankshaw, D., Dave, A., Gonzalez, J.E., Franklin, M.J., Stoica, I.: Graphx: Unifying data-parallel and graph-parallel analytics. CoRR arxiv:​1402.​2394 (2014)
86.
Zurück zum Zitat Yan, D., Cheng, J., Ozsu, M.T., Yang, F., Lu, Y., Lui, J.C.S., Zhang, Q., Ng,W.: A general-purpose query-centric framework for querying big graphs. Proc. VLDB Endow. 9(7), 564–575 (2016). doi:10.14778/2904483.2904488 Yan, D., Cheng, J., Ozsu, M.T., Yang, F., Lu, Y., Lui, J.C.S., Zhang, Q., Ng,W.: A general-purpose query-centric framework for querying big graphs. Proc. VLDB Endow. 9(7), 564–575 (2016). doi:10.​14778/​2904483.​2904488
88.
Zurück zum Zitat Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P.K., Currey, J.: Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, Berkeley, CA, USA, OSDI’08, pp. 1–14. http://dl.acm.org/citation.cfm?id=1855741.1855742 (2008) Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P.K., Currey, J.: Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, Berkeley, CA, USA, OSDI’08, pp. 1–14. http://​dl.​acm.​org/​citation.​cfm?​id=​1855741.​1855742 (2008)
89.
Zadeh, R.B., Meng, X., Yavuz, B., Staple, A., Pu, L., Venkataraman, S., Sparks, E., Ulanov, A., Zaharia, M.: linalg: Matrix computations in Apache Spark. arXiv:1509.02256 (2015)
90.
Zaharia, M.: An Architecture for Fast and General Data Processing on Large Clusters. Association for Computing Machinery, New York, NY, USA (2016)
94.
Zaharia, M., Chowdhury, M., Das, T., Dave, A.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. NSDI’12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation pp. 2–2. doi:10.1111/j.1095-8649.2005.00662.x (2012)
95.
Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ACM, New York, NY, USA, SOSP ’13, pp. 423–438. doi:10.1145/2517349.2522737 (2013)
96.
Zhang, Y., Jordan, M.I.: Splash: User-friendly programming interface for parallelizing stochastic algorithms. CoRR arXiv:1506.07552 (2015)
97.
Zhao, G., Ling, C., Sun, D.: SparkSW: Scalable distributed computing system for large-scale biological sequence alignment. In: 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2015, Shenzhen, China, May 4–7, 2015, IEEE Computer Society, pp. 845–852. doi:10.1109/CCGrid.2015.55 (2015)
98.
Zhu, B., Mara, A., Mozo, A.: New Trends in Databases and Information Systems: ADBIS 2015 Short Papers and Workshops, BigDap, DCSA, GID, MEBIS, OAIS, SW4CH, WISARD, Poitiers, France, September 8–11, 2015. Proceedings, Springer International Publishing, Cham, chap CLUS: Parallel Subspace Clustering Algorithm on Spark, pp. 175–185 (2015)
Metadata
Title
Big data analytics on Apache Spark
Authors
Salman Salloum
Ruslan Dautov
Xiaojun Chen
Patrick Xiaogang Peng
Joshua Zhexue Huang
Publication date
1 November 2016
Publisher
Springer International Publishing
Published in
International Journal of Data Science and Analytics / Issue 3-4/2016
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI
https://doi.org/10.1007/s41060-016-0027-9
