International Journal of Parallel Programming

Published: 1 October 2014

Parallel Programming Paradigms and Frameworks in Big Data Era

Written by: Ciprian Dobre, Fatos Xhafa

Published in: International Journal of Parallel Programming | Issue 5/2014

Abstract

With Cloud Computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. We have entered the Era of Big Data. The explosion and profusion of available data in a wide range of application domains raise new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically of a heterogeneous nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and standard techniques. Our society is already data-rich, but the question remains whether or not we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges towards a truly fourth-generation data-intensive science.

To understand the complexity of working with such amounts of data, consider what would happen if someone accidentally pushed the Print button and 1 ZettaByte of data were printed on paper. This amount of printed information would weigh about \(10^{16}\) pounds or \(5 \times 10^{10}\) tonnes. One ZettaByte of equivalent books would fill 10 billion trucks or 500,000 aircraft carriers and, if equally distributed, would amount to 10,000 books for each person living on the planet today. Just making the paper to print on would require 3 times the number of trees in the world today [4].

Various experts estimate that the World Wide Web may already contain 1 ZettaByte of information.
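
The "10,000 books for each person" figure can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration: it assumes a decimal ZettaByte (10^21 bytes) and a world population of roughly 7 billion, neither of which is stated in the source, and the function name `implied_bytes_per_book` is hypothetical.

```python
# Back-of-the-envelope check of the "10,000 books per person" figure.
# Assumptions (not from the source): decimal zettabyte, ~7 billion people.
ZETTABYTE = 10**21             # bytes in 1 ZB (decimal SI definition)
WORLD_POPULATION = 7 * 10**9   # assumed, roughly the early-2010s population
BOOKS_PER_PERSON = 10_000      # figure quoted in the text


def implied_bytes_per_book(total_bytes, population, books_per_person):
    """Return the average book size implied by spreading total_bytes evenly."""
    total_books = population * books_per_person
    return total_bytes / total_books


size = implied_bytes_per_book(ZETTABYTE, WORLD_POPULATION, BOOKS_PER_PERSON)
print(f"Implied size per book: {size / 10**6:.1f} MB")
```

Under these assumptions each "book" corresponds to a little over 14 MB of data, so the comparison is plausible only for fairly large or illustrated books; a different population estimate or a binary definition of the ZettaByte would shift the result.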

1. Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.: Aurora: a new model and architecture for data stream management. VLDB J. Int. J. Very Large Data Bases 12(2), 120–139 (2003)

2. Beckhusen, R.: So it begins: Darpa sets out to make computers that can teach themselves. http://www.wired.com/dangerroom/2013/03/darpa-machine-learning-2/all/1 (2013). Accessed 18 Apr 2013

3. Bell, G., Hey, T., Szalay, A.: Beyond the data deluge. Science 323(5919), 1297–1298 (2009)

4. Berkan, R.: Big Data: a blessing and a curse. http://www.searchenginejournal.com/big-data-blessing/53528/ (2012). Accessed 15 Apr 2013

5. Cisco: Cisco visual networking index: Global mobile data traffic forecast update, 2011–2016. http://www.cisco.com/ (2012). Accessed 16 Apr 2013

6. Cortes, C., Fisher, K., Pregibon, D., Rogers, A.: Hancock: a language for extracting signatures from data streams. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 9–17. ACM (2000)

7. Darema, F.: The spmd model: past, present and future. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 1–1. Springer, Berlin (2001)

8. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

9. Dorier, M., Antoniu, G., Cappello, F., Snir, M., Orf, L.: Damaris: how to efficiently leverage multicore parallelism to achieve scalable, jitter-free i/o. In: 2012 IEEE International Conference on Cluster Computing (CLUSTER), pp. 155–163. IEEE (2012)

10. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 810–818. ACM (2010)

11. Ekanayake, J., Pallickara, S., Fox, G.: Mapreduce for data intensive scientific analyses. In: IEEE Fourth International Conference on eScience 2008 (eScience’08), pp. 277–284. IEEE (2008)

12. Fox, G., Bae, S.H., Ekanayake, J., Qiu, X., Yuan, H.: Parallel data mining from multicore to cloudy grids. In: High Performance Computing Workshop, vol. 18, pp. 311–340 (2009)

13. Frank, C.: Forbes: Improving Decision Making in the World of Big Data. http://www.forbes.com/sites/christopherfrank/2012/03/25/improving-decision-making-in-the-world-of-big-data/ (2012). Accessed 15 Apr 2013

14. Gainaru, A., Cappello, F., Kramer, W.: Taming of the shrew: modeling the normal and faulty behaviour of large-scale hpc systems. In: 2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), pp. 1168–1179. IEEE (2012)

15. Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. In: ACM SIGOPS Operating Systems Review, vol. 37, pp. 29–43. ACM (2003)

16. Hayler, A.: ‘big data’ applications bring new database choices, challenges. http://www.computerweekly.com/feature/Big-data-applications-bring-new-database-choices-challenges (2012). Accessed 15 Apr 2013

17. Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22. USENIX Association (2011)

18. Hindman, B., Konwinski, A., Zaharia, M., Stoica, I.: A common substrate for cluster computing. In: Workshop on Hot Topics in Cloud Computing (HotCloud), vol. 2009 (2009)

19. IBM Omnibond, X.: Big Data implementation: Hadoop and beyond. http://www.datanami.com/whitepapers/ (2013). Accessed 15 June 2013

20. Inc., G.: Bigquery, Official Website. https://developers.google.com/bigquery/ (2013). Accessed 15 June 2013

21. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper. Syst. Rev. 41(3), 59–72 (2007)

22. Krishnan, S.: Programming Windows Azure. O’Reilly (2010)

23. Lämmel, R.: Google's mapreduce programming model revisited. Sci. Comput. Program. 70(1), 1–30 (2008)

24. Markoff, J.: Google cars drive themselves, in traffic. N.Y. Times 10, A1 (2010)

25. Metz, C.: Meet the Data Brains Behind the Rise of Facebook. http://www.wired.com/wiredenterprise/2013/02/facebook-data-team/ (2013). Accessed 14 July 2013

26. Neumeyer, L., Robbins, B., Nair, A., Kesari, A.: S4: Distributed stream computing platform. In: 2010 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 170–177. IEEE (2010)

27. Noseworthy, G.: Infographic: Managing the Big Flood of Big Data in Digital Marketing. http://analyzingmedia.com/2012/infographic-big-flood-of-big-data-in-digital-marketing/ (2012). Accessed 14 Apr 2013

28. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099–1110. ACM (2008)

29. Paskaleva, K.A.: Enabling the smart city: the progress of city e-governance in europe. Int. J. Innov. Reg. Dev. 1(4), 405–422 (2009)

30. Patterson, D.A.: The data center is the computer. Commun. ACM 51(1), 105–105 (2008)

31. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 165–178. ACM (2009)

32. Pierre, G., Stratan, C.: Conpaas: a platform for hosting elastic cloud applications. IEEE Internet Comput. 16(5), 88–92 (2012)

33. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with sawzall. Sci. Program. 13(4), 277–298 (2005)

34. Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: OSDI, pp. 293–306 (2010)

35. Raicu, I., Foster, I.T., Zhao, Y.: Many-task computing for grids and supercomputers. In: Workshop on Many-Task Computing on Grids and Supercomputers, 2008 (MTAGS 2008), pp. 1–11. IEEE (2008)

36. Roush, W.: Facebook Doesn't have Big Data. It has Ginormous Data. http://www.xconomy.com/san-francisco/2013/02/14/how-facebook-uses-ginormous-data-to-grow-its-business/2/ (2013). Accessed 14 July 2013

37. Schatz, M.C.: Blastreduce: High Performance Short Read Mapping with Mapreduce. University of Maryland. http://cgis.cs.umd.edu/Grad/scholarlypapers/papers/MichaelSchatz.pdf

38. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive - a petabyte scale data warehouse using Hadoop. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE), pp. 996–1005. IEEE (2010)

39. Tudoran, R., Costan, A., Antoniu, G.: Mapiterativereduce: a framework for reduction-intensive data processing on azure clouds. In: Proceedings of Third International Workshop on MapReduce and Its Applications Date, pp. 9–16. ACM (2012)

40. Vrbić, R.: Data mining and cloud computing. JITA - J. Inf. Technol. Appl. (Banja Luka)-APEIRON 4(2), 75–87 (2012)

41. Waas, F.M.: Beyond conventional data warehousing: massively parallel data processing with greenplum database. In: Business Intelligence for the Real-Time Enterprise, pp. 89–96. Springer, Berlin (2009)

42. Wampler, D.: Programming trends to watch: logic and probabilistic programming. http://thinkbiganalytics.com/programming-trends-to-watch-logic-and-probabilistic-programming/ (2013). Accessed 18 Apr 2013

43. Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, p. 8. ACM (2009)

44. Yang, H.c., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 1029–1040. ACM (2007)

45. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P.K., Currey, J.: Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol. 8, pp. 1–14 (2008)

46. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10 (2010)

47. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving mapreduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42 (2008)

Title: Parallel Programming Paradigms and Frameworks in Big Data Era
Written by: Ciprian Dobre, Fatos Xhafa
Publication date: 1 October 2014
Publisher: Springer US
Published in: International Journal of Parallel Programming | Issue 5/2014
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-013-0272-7
