Skip to main content

2019 | OriginalPaper | Buchkapitel

SLOPE: Structural Locality-Aware Programming Model for Composing Array Data Analysis

verfasst von : Bin Dong, Kesheng Wu, Suren Byna, Houjun Tang

Erschienen in: High Performance Computing

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

MapReduce brought on the Big Data revolution. However, its impact on scientific data analyses has been limited because of fundamental limitations in its data and programming models. Scientific data is typically stored as multidimensional arrays, while MapReduce is based on key-value (KV) pairs. Applying MapReduce to analyze array-based scientific data requires a conversion of arrays to KV pairs. This conversion incurs a large storage overhead and loses structural information embedded in the array. For example, analysis operations, such as convolution, are defined on the neighbors of an array element. Accessing these neighbors is straightforward using array indexes, but requires complex and expensive operations like self-join in the KV data model. In this work, we introduce a novel ‘structural locality’-aware programming model (SLOPE) to compose data analysis directly on multidimensional arrays. We also develop a parallel execution engine for SLOPE to transparently partition the data, to cache intermediate results, to support in-place modification, and to recover from failures. Our evaluations with real applications show that SLOPE is over ninety thousand times faster than Apache Spark and is \(38\%\) faster than TensorFlow.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Abadi, M., et al.: Tensorflow: a system for large-scale machine learning. In: OSDI 2016 (2016) Abadi, M., et al.: Tensorflow: a system for large-scale machine learning. In: OSDI 2016 (2016)
2.
Zurück zum Zitat Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., Widmann, N.: The multidimensional database system RasDaMan. SIGMOD Rec. 27(2), 575–577 (1998)CrossRef Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., Widmann, N.: The multidimensional database system RasDaMan. SIGMOD Rec. 27(2), 575–577 (1998)CrossRef
3.
Zurück zum Zitat Blanas, S., Wu, K., Byna, S., Dong, B., Shoshani, A.: Parallel data analysis directly on scientific file formats. In: SIGMOD 2014 (2014) Blanas, S., Wu, K., Byna, S., Dong, B., Shoshani, A.: Parallel data analysis directly on scientific file formats. In: SIGMOD 2014 (2014)
4.
Zurück zum Zitat Bloom, J.S., Richards, J.W., et al.: Automating discovery and classification of transients and variable stars in the synoptic survey era. PASP 124(921) (2012)CrossRef Bloom, J.S., Richards, J.W., et al.: Automating discovery and classification of transients and variable stars in the synoptic survey era. PASP 124(921) (2012)CrossRef
5.
Zurück zum Zitat Brown, P.G.: Overview of SciDB: large scale array storage, processing and analysis. In: SIGMOD (2010) Brown, P.G.: Overview of SciDB: large scale array storage, processing and analysis. In: SIGMOD (2010)
6.
Zurück zum Zitat Brown, P.G.: Convolution is a database problem (2017) Brown, P.G.: Convolution is a database problem (2017)
7.
Zurück zum Zitat Buck, J.B., Watkins, N., et al.: SciHadoop: array-based query processing in Hadoop. In: Supercomputing Conference (SC) (2011) Buck, J.B., Watkins, N., et al.: SciHadoop: array-based query processing in Hadoop. In: Supercomputing Conference (SC) (2011)
8.
Zurück zum Zitat Byna, S., Chou, J., Rübel, O., Prabhat, Karimabadi, H., et al.: Parallel I/O, analysis, and visualization of a trillion particle simulation. In: SC (2012) Byna, S., Chou, J., Rübel, O., Prabhat, Karimabadi, H., et al.: Parallel I/O, analysis, and visualization of a trillion particle simulation. In: SC (2012)
9.
Zurück zum Zitat Chaimov, N., Malony, A., Canon, S., Iancu, C., et al.: Scaling spark on HPC systems. In: HPDC 2016 (2016) Chaimov, N., Malony, A., Canon, S., Iancu, C., et al.: Scaling spark on HPC systems. In: HPDC 2016 (2016)
10.
Zurück zum Zitat Cornford, S.L., et al.: Adaptive mesh, finite volume modeling of marine ice sheets. J. Comput. Phys. (2013) Cornford, S.L., et al.: Adaptive mesh, finite volume modeling of marine ice sheets. J. Comput. Phys. (2013)
11.
Zurück zum Zitat Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef
12.
Zurück zum Zitat Denniston, T., Kamil, S., Amarasinghe, S.: Distributed halide. SIGPLAN Not. 51(8), 5:1–5:12 (2016)CrossRef Denniston, T., Kamil, S., Amarasinghe, S.: Distributed halide. SIGPLAN Not. 51(8), 5:1–5:12 (2016)CrossRef
13.
Zurück zum Zitat Dong, B., Wu, K., Byna, S., Liu, J., Zhao, W., Rusu, F.: ArrayUDF: user-defined scientific data analysis on arrays. In: HPDC (2017) Dong, B., Wu, K., Byna, S., Liu, J., Zhao, W., Rusu, F.: ArrayUDF: user-defined scientific data analysis on arrays. In: HPDC (2017)
14.
Zurück zum Zitat Durlofsky, L.J., Engquist, B., Osher, S.: Triangle based adaptive stencils for the solution of hyperbolic conservation laws. J. Comput. Phys. 98(1), 64–73 (1992)CrossRef Durlofsky, L.J., Engquist, B., Osher, S.: Triangle based adaptive stencils for the solution of hyperbolic conservation laws. J. Comput. Phys. 98(1), 64–73 (1992)CrossRef
16.
Zurück zum Zitat Gropp, W., Lusk, E., Doss, N., Skjellum, A.: A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput. 22(6), 789–828 (1996)CrossRef Gropp, W., Lusk, E., Doss, N., Skjellum, A.: A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput. 22(6), 789–828 (1996)CrossRef
17.
Zurück zum Zitat Gysi, T., Osuna, C., Fuhrer, O., Bianco, M., Schulthess, T.C.: STELLA: a domain-specific tool for structured grid methods in weather and climate models. In: SC 2015 (2015) Gysi, T., Osuna, C., Fuhrer, O., Bianco, M., Schulthess, T.C.: STELLA: a domain-specific tool for structured grid methods in weather and climate models. In: SC 2015 (2015)
18.
Zurück zum Zitat Laoide-Kemp, C.: Investigating MPI streams as an alternative to halo exchange. Technical report, The University of Edinburgh (2014) Laoide-Kemp, C.: Investigating MPI streams as an alternative to halo exchange. Technical report, The University of Edinburgh (2014)
19.
Zurück zum Zitat Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)CrossRef Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)CrossRef
20.
Zurück zum Zitat Li, D., Vetter, J.S., Yu, W.: Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool. In: SC (2012) Li, D., Vetter, J.S., Yu, W.: Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool. In: SC (2012)
21.
Zurück zum Zitat Li, J., Liao, W.-K., Choudhary, A., et al.: Parallel netCDF: a high-performance scientific I/O interface. In: SC 2003, p. 39. ACM, New York (2003) Li, J., Liao, W.-K., Choudhary, A., et al.: Parallel netCDF: a high-performance scientific I/O interface. In: SC 2003, p. 39. ACM, New York (2003)
22.
Zurück zum Zitat Li, X., Guo, F., Li, H., Birn, J.: The roles of fluid compression and shear in electron energization during magnetic reconnection (2018)CrossRef Li, X., Guo, F., Li, H., Birn, J.: The roles of fluid compression and shear in electron energization during magnetic reconnection (2018)CrossRef
23.
Zurück zum Zitat Liu, J., Racah, E., Koziol, Q., et al.: H5Spark: bridging the I/O gap between spark and scientific data formats on HPC systems. In: Cray User Group (2016) Liu, J., Racah, E., Koziol, Q., et al.: H5Spark: bridging the I/O gap between spark and scientific data formats on HPC systems. In: Cray User Group (2016)
24.
Zurück zum Zitat Marathe, A.P., Salem, K.: A language for manipulating arrays. In: VLDB (1997) Marathe, A.P., Salem, K.: A language for manipulating arrays. In: VLDB (1997)
25.
Zurück zum Zitat Maruyama, N., et al.: Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers. In: SC 2011 (2011) Maruyama, N., et al.: Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers. In: SC 2011 (2011)
26.
Zurück zum Zitat Momjian, B.: PostgreSQL: Introduction and Concepts. Addison-Wesley Longman Publishing Co., Inc., Boston (2001) Momjian, B.: PostgreSQL: Introduction and Concepts. Addison-Wesley Longman Publishing Co., Inc., Boston (2001)
27.
Zurück zum Zitat Racah, E., Beckham, C., Maharaj, T., Kahou, S.E., Prabhat, M., Pal, C.: Extremeweather: a large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In: NIPS (2017) Racah, E., Beckham, C., Maharaj, T., Kahou, S.E., Prabhat, M., Pal, C.: Extremeweather: a large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In: NIPS (2017)
28.
Zurück zum Zitat Racah, E., et al.: Extremeweather: a large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In: NIPS (2017) Racah, E., et al.: Extremeweather: a large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In: NIPS (2017)
29.
Zurück zum Zitat Shi, R., et al.: Designing efficient small message transfer mechanism for inter-node MPI communication on infiniband GPU clusters. In: HiPC (2014) Shi, R., et al.: Designing efficient small message transfer mechanism for inter-node MPI communication on infiniband GPU clusters. In: HiPC (2014)
30.
Zurück zum Zitat Shi, R., et al.: HAND: a hybrid approach to accelerate non-contiguous data movement using MPI datatypes on GPU clusters. In: ICPP (2014) Shi, R., et al.: HAND: a hybrid approach to accelerate non-contiguous data movement using MPI datatypes on GPU clusters. In: ICPP (2014)
31.
Zurück zum Zitat Soroush, E., Balazinska, M., Wang, D.: ArrayStore: a storage manager for complex parallel array processing. In: SIGMOD 202011. ACM (2011) Soroush, E., Balazinska, M., Wang, D.: ArrayStore: a storage manager for complex parallel array processing. In: SIGMOD 202011. ACM (2011)
32.
Zurück zum Zitat Sousa, M., Dillig, I., Vytiniotis, D., Dillig, T., Gkantsidis, C.: Consolidation of queries with user-defined functions. SIGPLAN Not. 49(6), 554–564 (2014)CrossRef Sousa, M., Dillig, I., Vytiniotis, D., Dillig, T., Gkantsidis, C.: Consolidation of queries with user-defined functions. SIGPLAN Not. 49(6), 554–564 (2014)CrossRef
33.
Zurück zum Zitat Stonebraker, M., et al.: Requirements for science data bases and SciDB. CIDR 7, 173–184 (2009) Stonebraker, M., et al.: Requirements for science data bases and SciDB. CIDR 7, 173–184 (2009)
34.
Zurück zum Zitat Suzuki, K., Horiba, I., Sugie, N.: Linear-time connected-component labeling based on sequential local operations. Comput. Vis. Image Underst. 89(1), 1–23 (2003)CrossRef Suzuki, K., Horiba, I., Sugie, N.: Linear-time connected-component labeling based on sequential local operations. Comput. Vis. Image Underst. 89(1), 1–23 (2003)CrossRef
35.
Zurück zum Zitat Tang, H., Byna, S., et al.: In situ storage layout optimization for AMR spatio-temporal read accesses. In: ICPP (2016) Tang, H., Byna, S., et al.: In situ storage layout optimization for AMR spatio-temporal read accesses. In: ICPP (2016)
36.
Zurück zum Zitat Tang, H., et al.: SoMeta: scalable object-centric metadata management for high performance computing. In: CLUSTER 2017, pp. 359–369. IEEE (2017) Tang, H., et al.: SoMeta: scalable object-centric metadata management for high performance computing. In: CLUSTER 2017, pp. 359–369. IEEE (2017)
37.
Zurück zum Zitat Tang, H., et al.: Toward scalable and asynchronous object-centric data management for HPC. In: CCGRID 2018, pp. 113–122. IEEE (2018) Tang, H., et al.: Toward scalable and asynchronous object-centric data management for HPC. In: CCGRID 2018, pp. 113–122. IEEE (2018)
38.
Zurück zum Zitat The HDF Group. HDF5 User Guide (2010) The HDF Group. HDF5 User Guide (2010)
39.
Zurück zum Zitat Wang, Y., Nandi, A., Agrawal, G.: SAGA: array storage as a DB with support for structural aggregations. In: SSDBM 2014. ACM, New York (2014) Wang, Y., Nandi, A., Agrawal, G.: SAGA: array storage as a DB with support for structural aggregations. In: SSDBM 2014. ACM, New York (2014)
40.
Zurück zum Zitat Wehner, M., Prabhat, et al.: Resolution dependence of future tropical cyclone projections of CAM5.1 in the U.S. CLIVAR hurricane working group idealized configurations. JCLI (2015) Wehner, M., Prabhat, et al.: Resolution dependence of future tropical cyclone projections of CAM5.1 in the U.S. CLIVAR hurricane working group idealized configurations. JCLI (2015)
41.
Zurück zum Zitat Widenius, M., Axmark, D.: MySQL Reference Manual. O’Reilly & Associates Inc., Sebastopol (2002) Widenius, M., Axmark, D.: MySQL Reference Manual. O’Reilly & Associates Inc., Sebastopol (2002)
42.
Zurück zum Zitat Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI 2012 (2012) Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI 2012 (2012)
43.
Zurück zum Zitat Zhang, W., et al.: Exploring memory hierarchy to improve scientific data read performance. In: CLUSTER 2015, pp. 66–69. IEEE (2015) Zhang, W., et al.: Exploring memory hierarchy to improve scientific data read performance. In: CLUSTER 2015, pp. 66–69. IEEE (2015)
44.
Zurück zum Zitat Zou, X., et al.: Parallel in situ detection of connected components in adaptive mesh refinement data. In: CCGrid 2015 (2015) Zou, X., et al.: Parallel in situ detection of connected components in adaptive mesh refinement data. In: CCGrid 2015 (2015)
Metadaten
Titel
SLOPE: Structural Locality-Aware Programming Model for Composing Array Data Analysis
verfasst von
Bin Dong
Kesheng Wu
Suren Byna
Houjun Tang
Copyright-Jahr
2019
DOI
https://doi.org/10.1007/978-3-030-20656-7_4

Premium Partner