Published in: International Journal of Parallel Programming 5/2014

1 October 2014

Parallel Programming Paradigms and Frameworks in Big Data Era

Authors: Ciprian Dobre, Fatos Xhafa

Abstract

With Cloud Computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and deploy their programs. We have entered the Era of Big Data. The explosion and profusion of available data in a wide range of application domains raise new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically heterogeneous in nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and standard techniques. Our society is already data-rich, but the question remains whether or not we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges on the way towards a truly fourth-generation data-intensive science.

Footnotes
1
To understand the complexity of working with such amounts of data, think of what would happen if someone accidentally pushed the Print button and 1 ZettaByte of data were printed on paper. This amount of printed information would weigh about \(10^{16}\) pounds or \(5 \times 10^{10}\) tonnes. One ZettaByte of equivalent books would fill 10 billion trucks or 500,000 aircraft carriers and, if equally distributed, would amount to 10,000 books for each person living on the planet today. Making just the paper to print on would require 3 times the number of trees in the world today [4].
 
2
Various experts estimate that the World Wide Web might already contain 1 ZettaByte of information.
 
References
1.
Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.: Aurora: a new model and architecture for data stream management. VLDB J. Int. J. Very Large Data Bases 12(2), 120–139 (2003)
3.
Bell, G., Hey, T., Szalay, A.: Beyond the data deluge. Science 323(5919), 1297–1298 (2009)
6.
Cortes, C., Fisher, K., Pregibon, D., Rogers, A.: Hancock: a language for extracting signatures from data streams. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 9–17. ACM (2000)
7.
Darema, F.: The spmd model: past, present and future. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 1–1. Springer, Berlin (2001)
8.
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
9.
Dorier, M., Antoniu, G., Cappello, F., Snir, M., Orf, L.: Damaris: how to efficiently leverage multicore parallelism to achieve scalable, jitter-free i/o. In: 2012 IEEE International Conference on Cluster Computing (CLUSTER), pp. 155–163. IEEE (2012)
10.
Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 810–818. ACM (2010)
11.
Ekanayake, J., Pallickara, S., Fox, G.: Mapreduce for data intensive scientific analyses. In: IEEE Fourth International Conference on eScience 2008 (eScience’08), pp. 277–284. IEEE (2008)
12.
Fox, G., Bae, S.H., Ekanayake, J., Qiu, X., Yuan, H.: Parallel data mining from multicore to cloudy grids. In: High Performance Computing Workshop, vol. 18, pp. 311–340 (2009)
14.
Gainaru, A., Cappello, F., Kramer, W.: Taming of the shrew: modeling the normal and faulty behaviour of large-scale hpc systems. In: 2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), pp. 1168–1179. IEEE (2012)
15.
Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. In: ACM SIGOPS Operating Systems Review, vol. 37, pp. 29–43. ACM (2003)
17.
Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22. USENIX Association (2011)
18.
Hindman, B., Konwinski, A., Zaharia, M., Stoica, I.: A common substrate for cluster computing. In: Workshop on Hot Topics in Cloud Computing (HotCloud), vol. 2009 (2009)
21.
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper. Syst. Rev. 41(3), 59–72 (2007)
22.
Krishnan, S.: Programming Windows Azure. O’Reilly (2010)
23.
Lämmel, R.: Google's mapreduce programming model - revisited. Sci. Comput. Program. 70(1), 1–30 (2008)
24.
Markoff, J.: Google cars drive themselves, in traffic. N.Y. Times 10, A1 (2010)
26.
Neumeyer, L., Robbins, B., Nair, A., Kesari, A.: S4: Distributed stream computing platform. In: 2010 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 170–177. IEEE (2010)
28.
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099–1110. ACM (2008)
29.
Paskaleva, K.A.: Enabling the smart city: the progress of city e-governance in europe. Int. J. Innov. Reg. Dev. 1(4), 405–422 (2009)
30.
Patterson, D.A.: The data center is the computer. Commun. ACM 51(1), 105–105 (2008)
31.
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 165–178. ACM (2009)
32.
Pierre, G., Stratan, C.: Conpaas: a platform for hosting elastic cloud applications. IEEE Internet Comput. 16(5), 88–92 (2012)
33.
Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with sawzall. Sci. Program. 13(4), 277–298 (2005)
34.
Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: OSDI, pp. 293–306 (2010)
35.
Raicu, I., Foster, I.T., Zhao, Y.: Many-task computing for grids and supercomputers. In: Workshop on Many-Task Computing on Grids and Supercomputers, 2008 (MTAGS 2008), pp. 1–11. IEEE (2008)
38.
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive-a petabyte scale data warehouse using Hadoop. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE), pp. 996–1005. IEEE (2010)
39.
Tudoran, R., Costan, A., Antoniu, G.: Mapiterativereduce: a framework for reduction-intensive data processing on azure clouds. In: Proceedings of Third International Workshop on MapReduce and Its Applications Date, pp. 9–16. ACM (2012)
40.
Vrbić, R.: Data mining and cloud computing. JITA—J. Inf. Technol. Appl. (Banja Luka)-APEIRON 4(2), 75–87 (2012)
41.
Waas, F.M.: Beyond conventional data warehousing - massively parallel data processing with greenplum database. In: Business Intelligence for the Real-Time Enterprise, pp. 89–96. Springer, Berlin (2009)
43.
Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, p. 8. ACM (2009)
44.
Yang, H.c., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 1029–1040. ACM (2007)
45.
Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P.K., Currey, J.: Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol. 8, pp. 1–14 (2008)
46.
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10 (2010)
47.
Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving mapreduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42 (2008)
Metadata
Title
Parallel Programming Paradigms and Frameworks in Big Data Era
Authors
Ciprian Dobre
Fatos Xhafa
Publication date
1 October 2014
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 5/2014
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-013-0272-7
