research-article

Hybrid merge/overlap execution technique for parallel array processing

Authors:
Emad Soroush

University of Washington, Seattle

University of Washington, Seattle
View Profile

,
Magdalena Balazinska

University of Washington, Seattle

University of Washington, Seattle
View Profile

AD '11: Proceedings of the EDBT/ICDT 2011 Workshop on Array DatabasesMarch 2011Pages 20–30https://doi.org/10.1145/1966895.1966898

Published:25 March 2011Publication History

AD '11: Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases

Pages 20–30

ABSTRACT

Whether in business or science, multi-dimensional arrays are a common abstraction in data analytics and many systems exist for efficiently processing arrays. As dataset grow in size, it is becoming increasingly important to process these arrays in parallel. In this paper, we discuss different types of array operations and review how they can be processed in parallel using two different existing techniques. The first technique, which we call merge, consists in partitioning an array, processing the partitions in parallel, then merging the results to reconcile computations that span partition boundaries. The second technique, which we call overlap, consists in partitioning an array into subarrays that overlap by a given number of cells along each dimension. Thanks to this overlap, the array partitions can be processed in parallel without any merge phase. We discuss when each technique can be applied to an array operation. We show that even for a single array operation, a different approach may yield the best performance for different regions of an array. Following this observation, we introduce a new parallel array processing technique that combines the merge and overlap approaches. Our technique enables a parallel array processing system to mix-and-match the merge and overlap techniques within a single operation on an array. Through experiments on real, scientific data, we show that this hybrid approach outperforms the other two techniques.

References

http://mahout.apache.org/.Google Scholar
Ballegooij et. al. Distribution rules for array database queries. In 16th. DEXA Conf., pages 55--64, 2005. Google ScholarDigital Library
G. Berti. A calculus for stencils on arbitrary grids with applications to parallel PDE solution. In Proc. of GAMM Workshop "Discrete Modelling and discrete Algorithms in Continuum Mechanics", pages 37--46, 2001.Google Scholar
Chang et. al. T2: a customizable parallel database for multi-dimensional data. SIGMOD Record, 27(1):58--66, 1998. Google ScholarDigital Library
Cohen et. al. Mad skills: new analysis practices for big data. vldbj, 2(2):1481--1492, 2009. Google ScholarDigital Library
P. Cudre-Mauroux, H. Kimura, K-T. Lim, J. Rogers, S. Madden, M. Stonebraker, S. Zdonik, and P. Brown. SS-DB: A standard science DBMS benchmark. Under submission.Google Scholar
Cudre-Mauroux et. al. A demonstration of scidb: a science-oriented dbms. In Proc. of the 35th VLDB Conf., pages 1534--1537, 2009. Google ScholarDigital Library
M. Davis, G. Efstathiou, C. S. Frenk, and S. D. M. White. The evolution of large-scale structure in a universe dominated by cold dark matter. Astroph J, 292:371--394, May 1985.Google ScholarCross Ref
Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. In Proc. of the 6th OSDI Symp., 2004. Google ScholarDigital Library
Baumann et. al. The multidimensional database system RasDaMan. In Proc. of the SIGMOD Conf., pages 575--577, 1998. Google ScholarDigital Library
S. Goil and A. Choudhary. Parsimony: An infrastructure for parallel multidimensional analysis and data mining. Journal of Parallel and Distributed Computing, pages 285--321, 2001. Google ScholarDigital Library
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. dmkd, 1(1):29--53, 1997. Google ScholarDigital Library
Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In Proc. of SOCC Symp., June 2010. Google ScholarDigital Library
Y. Kwon, D. Nunley, J. P. Gardner, M. Balazinska, B. Howe, and S. Loebman. Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. In Proc of 22nd SSDBM, 2010. Google ScholarDigital Library
S. Loebman, D. Nunley, Y. Kwon, B. Howe, B. Balazinska, and J. P. Gardner. Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In Proceedings of the Workshop on Interfaces and Architectures for Scientific Data Storage, 2009.Google ScholarCross Ref
LSST data management: DC3b processing flow. http://dev.lsstcorp.org/trac/wiki/DC3bProcessingFlow.Google Scholar
J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global arrays: a nonuniform memory access programming model for high-performance computers. J. Supercomput., June 1996. Google ScholarDigital Library
R. W. Numrich and J Reid. Co-array fortran for parallel programming. SIGPLAN Fortran Forum, August 1998. Google ScholarDigital Library
Pedersen et. al. Multidimensional database technology. IEEE Computer, 34(12):40--46, 2001. Google ScholarDigital Library
Reed et. al. Evolution of the mass function of dark matter haloes. Monthly Notices of the Royal Astronomical Society, 346:565--572, December 2003.Google ScholarCross Ref
J. Rogers, R. Simakov, E. Soroush, P. Velikhov, M. Balazinska, D. DeWitt, B. Heath, D. Maier, S. Madden, J. Patel, M. Stonebraker, S. Zdonik, A. Smirnov, K. Knizhnik, and Paul G. Brown. Overview of SciDB: Large scale array storage, processing and analysis. In Proc. of the SIGMOD Conf., 2010. Google ScholarDigital Library
Sarawagi et. al. Efficient organization of large multidimensional arrays. In Proc. of the 10th ICDE Conf., pages 328--336, 1994. Google ScholarDigital Library
Seamons et. al. Physical schemas for large multidimensional arrays in scientific computing applications. In Proc of 7th SSDBM, pages 218--227, 1994. Google ScholarDigital Library
E. Soroush, M. Balazinska, and D. Wang. Arraystore: A storage manager for complex parallel array processing. In Proc. of the SIGMOD Conf., 2011. Google ScholarDigital Library
Springel et. al. Simulations of the formation, evolution and clustering of galaxies and quasars. NATURE, 435:629--636, June 2005.Google ScholarCross Ref
J. G. Stadel. Cosmological N-body simulations and their analysis. PhD thesis, University of Washington, 2001.Google Scholar
Stonebraker et. al. Requirements for science data bases and SciDB. In Fourth CIDR Conf. (perspectives), 2009.Google Scholar
Tsuji et. al. An extendible multidimensional array system for MOLAP. In Proc. of the 21st SAC Symp., pages 503--510, 2006. Google ScholarDigital Library
http://upc.lbl.gov/.Google Scholar
Zhang et. al. RIOT: I/O-efficient numerical computing without SQL. In Proc. of the Fourth CIDR Conf., 2009.Google Scholar

Index Terms

Hybrid merge/overlap execution technique for parallel array processing
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information systems applications

Recommendations

ArrayStore: a storage manager for complex parallel array processing
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

We present the design, implementation, and evaluation of ArrayStore, a new storage manager for complex, parallel array processing. ArrayStore builds on prior work in the area of multidimensional data storage, but considers the new problem of supporting ...
Read More
Array processing on an array processor
Proceedings of the conference on Programming languages and compilers for parallel and vector machines

Central memory is distributed across several processing elements on the ILLIAC-IV and similar array processors. This causes memory to appear two dimensional and raises special problems in the handling of arrays. Assignment of arrays to storage, and ...
Read More
Wavefield modeling and array processing .II. Algorithms

For pt.I see ibid. vol.42, no.10, p.2549, 1994. We present several novel algorithms for array processing, as well as extensions of existing methods based on this approach. We first consider the problem of localization of wideband sources via coherent ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
AD '11: Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases
March 2011
53 pages
ISBN:9781450306140
DOI:10.1145/1966895
Editors:
Peter Baumann
Jacobs University Bremen, Germany
,
Bill Howe
University of Washington
,
Kjell Orsborn
Uppsala University, Sweden
,
Silvia Stefanova
Uppsala University, Sweden
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 March 2011
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
overlap execution strategy
parallel array processing
scientific databases
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 6
  Total Citations
  View Citations
- 165
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Hybrid merge/overlap execution technique for parallel array processing

AD '11: Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases

ABSTRACT

References

Cited By

Index Terms

Recommendations

ArrayStore: a storage manager for complex parallel array processing

Array processing on an array processor

Wavefield modeling and array processing .II. Algorithms