International Journal of Data Science and Analytics

1 November 2016 | Review

Big data analytics on Apache Spark

Authors: Salman Salloum, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng, Joshua Zhexue Huang

Published in: International Journal of Data Science and Analytics | Issue 3-4/2016

Abstract

Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. Because Spark is a rapidly evolving open source project with a growing number of contributors from both academia and industry, researchers, especially beginners in this area, find it difficult to comprehend the full body of development and research behind it. In this paper, we present a technical review of big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark offers for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.

Title: Big data analytics on Apache Spark
Authors: Salman Salloum, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng, Joshua Zhexue Huang
Publication date: 1 November 2016
Publisher: Springer International Publishing
Published in: International Journal of Data Science and Analytics / Issue 3-4/2016
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI: https://doi.org/10.1007/s41060-016-0027-9
