ABSTRACT
Whether in business or science, multi-dimensional arrays are a common abstraction in data analytics and many systems exist for efficiently processing arrays. As dataset grow in size, it is becoming increasingly important to process these arrays in parallel. In this paper, we discuss different types of array operations and review how they can be processed in parallel using two different existing techniques. The first technique, which we call merge, consists in partitioning an array, processing the partitions in parallel, then merging the results to reconcile computations that span partition boundaries. The second technique, which we call overlap, consists in partitioning an array into subarrays that overlap by a given number of cells along each dimension. Thanks to this overlap, the array partitions can be processed in parallel without any merge phase. We discuss when each technique can be applied to an array operation. We show that even for a single array operation, a different approach may yield the best performance for different regions of an array. Following this observation, we introduce a new parallel array processing technique that combines the merge and overlap approaches. Our technique enables a parallel array processing system to mix-and-match the merge and overlap techniques within a single operation on an array. Through experiments on real, scientific data, we show that this hybrid approach outperforms the other two techniques.
- http://mahout.apache.org/.Google Scholar
- Ballegooij et. al. Distribution rules for array database queries. In 16th. DEXA Conf., pages 55--64, 2005. Google ScholarDigital Library
- G. Berti. A calculus for stencils on arbitrary grids with applications to parallel PDE solution. In Proc. of GAMM Workshop "Discrete Modelling and discrete Algorithms in Continuum Mechanics", pages 37--46, 2001.Google Scholar
- Chang et. al. T2: a customizable parallel database for multi-dimensional data. SIGMOD Record, 27(1):58--66, 1998. Google ScholarDigital Library
- Cohen et. al. Mad skills: new analysis practices for big data. vldbj, 2(2):1481--1492, 2009. Google ScholarDigital Library
- P. Cudre-Mauroux, H. Kimura, K-T. Lim, J. Rogers, S. Madden, M. Stonebraker, S. Zdonik, and P. Brown. SS-DB: A standard science DBMS benchmark. Under submission.Google Scholar
- Cudre-Mauroux et. al. A demonstration of scidb: a science-oriented dbms. In Proc. of the 35th VLDB Conf., pages 1534--1537, 2009. Google ScholarDigital Library
- M. Davis, G. Efstathiou, C. S. Frenk, and S. D. M. White. The evolution of large-scale structure in a universe dominated by cold dark matter. Astroph J, 292:371--394, May 1985.Google ScholarCross Ref
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. In Proc. of the 6th OSDI Symp., 2004. Google ScholarDigital Library
- Baumann et. al. The multidimensional database system RasDaMan. In Proc. of the SIGMOD Conf., pages 575--577, 1998. Google ScholarDigital Library
- S. Goil and A. Choudhary. Parsimony: An infrastructure for parallel multidimensional analysis and data mining. Journal of Parallel and Distributed Computing, pages 285--321, 2001. Google ScholarDigital Library
- J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. dmkd, 1(1):29--53, 1997. Google ScholarDigital Library
- Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In Proc. of SOCC Symp., June 2010. Google ScholarDigital Library
- Y. Kwon, D. Nunley, J. P. Gardner, M. Balazinska, B. Howe, and S. Loebman. Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. In Proc of 22nd SSDBM, 2010. Google ScholarDigital Library
- S. Loebman, D. Nunley, Y. Kwon, B. Howe, B. Balazinska, and J. P. Gardner. Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In Proceedings of the Workshop on Interfaces and Architectures for Scientific Data Storage, 2009.Google ScholarCross Ref
- LSST data management: DC3b processing flow. http://dev.lsstcorp.org/trac/wiki/DC3bProcessingFlow.Google Scholar
- J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global arrays: a nonuniform memory access programming model for high-performance computers. J. Supercomput., June 1996. Google ScholarDigital Library
- R. W. Numrich and J Reid. Co-array fortran for parallel programming. SIGPLAN Fortran Forum, August 1998. Google ScholarDigital Library
- Pedersen et. al. Multidimensional database technology. IEEE Computer, 34(12):40--46, 2001. Google ScholarDigital Library
- Reed et. al. Evolution of the mass function of dark matter haloes. Monthly Notices of the Royal Astronomical Society, 346:565--572, December 2003.Google ScholarCross Ref
- J. Rogers, R. Simakov, E. Soroush, P. Velikhov, M. Balazinska, D. DeWitt, B. Heath, D. Maier, S. Madden, J. Patel, M. Stonebraker, S. Zdonik, A. Smirnov, K. Knizhnik, and Paul G. Brown. Overview of SciDB: Large scale array storage, processing and analysis. In Proc. of the SIGMOD Conf., 2010. Google ScholarDigital Library
- Sarawagi et. al. Efficient organization of large multidimensional arrays. In Proc. of the 10th ICDE Conf., pages 328--336, 1994. Google ScholarDigital Library
- Seamons et. al. Physical schemas for large multidimensional arrays in scientific computing applications. In Proc of 7th SSDBM, pages 218--227, 1994. Google ScholarDigital Library
- E. Soroush, M. Balazinska, and D. Wang. Arraystore: A storage manager for complex parallel array processing. In Proc. of the SIGMOD Conf., 2011. Google ScholarDigital Library
- Springel et. al. Simulations of the formation, evolution and clustering of galaxies and quasars. NATURE, 435:629--636, June 2005.Google ScholarCross Ref
- J. G. Stadel. Cosmological N-body simulations and their analysis. PhD thesis, University of Washington, 2001.Google Scholar
- Stonebraker et. al. Requirements for science data bases and SciDB. In Fourth CIDR Conf. (perspectives), 2009.Google Scholar
- Tsuji et. al. An extendible multidimensional array system for MOLAP. In Proc. of the 21st SAC Symp., pages 503--510, 2006. Google ScholarDigital Library
- http://upc.lbl.gov/.Google Scholar
- Zhang et. al. RIOT: I/O-efficient numerical computing without SQL. In Proc. of the Fourth CIDR Conf., 2009.Google Scholar
Index Terms
- Hybrid merge/overlap execution technique for parallel array processing
Recommendations
ArrayStore: a storage manager for complex parallel array processing
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of dataWe present the design, implementation, and evaluation of ArrayStore, a new storage manager for complex, parallel array processing. ArrayStore builds on prior work in the area of multidimensional data storage, but considers the new problem of supporting ...
Array processing on an array processor
Proceedings of the conference on Programming languages and compilers for parallel and vector machinesCentral memory is distributed across several processing elements on the ILLIAC-IV and similar array processors. This causes memory to appear two dimensional and raises special problems in the handling of arrays. Assignment of arrays to storage, and ...
Wavefield modeling and array processing .II. Algorithms
For pt.I see ibid. vol.42, no.10, p.2549, 1994. We present several novel algorithms for array processing, as well as extensions of existing methods based on this approach. We first consider the problem of localization of wideband sources via coherent ...
Comments