skip to main content
10.1145/1966895.1966898acmotherconferencesArticle/Chapter ViewAbstractPublication PagesadConference Proceedingsconference-collections
research-article

Hybrid merge/overlap execution technique for parallel array processing

Published:25 March 2011Publication History

ABSTRACT

Whether in business or science, multi-dimensional arrays are a common abstraction in data analytics and many systems exist for efficiently processing arrays. As dataset grow in size, it is becoming increasingly important to process these arrays in parallel. In this paper, we discuss different types of array operations and review how they can be processed in parallel using two different existing techniques. The first technique, which we call merge, consists in partitioning an array, processing the partitions in parallel, then merging the results to reconcile computations that span partition boundaries. The second technique, which we call overlap, consists in partitioning an array into subarrays that overlap by a given number of cells along each dimension. Thanks to this overlap, the array partitions can be processed in parallel without any merge phase. We discuss when each technique can be applied to an array operation. We show that even for a single array operation, a different approach may yield the best performance for different regions of an array. Following this observation, we introduce a new parallel array processing technique that combines the merge and overlap approaches. Our technique enables a parallel array processing system to mix-and-match the merge and overlap techniques within a single operation on an array. Through experiments on real, scientific data, we show that this hybrid approach outperforms the other two techniques.

References

  1. http://mahout.apache.org/.Google ScholarGoogle Scholar
  2. Ballegooij et. al. Distribution rules for array database queries. In 16th. DEXA Conf., pages 55--64, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. G. Berti. A calculus for stencils on arbitrary grids with applications to parallel PDE solution. In Proc. of GAMM Workshop "Discrete Modelling and discrete Algorithms in Continuum Mechanics", pages 37--46, 2001.Google ScholarGoogle Scholar
  4. Chang et. al. T2: a customizable parallel database for multi-dimensional data. SIGMOD Record, 27(1):58--66, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Cohen et. al. Mad skills: new analysis practices for big data. vldbj, 2(2):1481--1492, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Cudre-Mauroux, H. Kimura, K-T. Lim, J. Rogers, S. Madden, M. Stonebraker, S. Zdonik, and P. Brown. SS-DB: A standard science DBMS benchmark. Under submission.Google ScholarGoogle Scholar
  7. Cudre-Mauroux et. al. A demonstration of scidb: a science-oriented dbms. In Proc. of the 35th VLDB Conf., pages 1534--1537, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. Davis, G. Efstathiou, C. S. Frenk, and S. D. M. White. The evolution of large-scale structure in a universe dominated by cold dark matter. Astroph J, 292:371--394, May 1985.Google ScholarGoogle ScholarCross RefCross Ref
  9. Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. In Proc. of the 6th OSDI Symp., 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Baumann et. al. The multidimensional database system RasDaMan. In Proc. of the SIGMOD Conf., pages 575--577, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Goil and A. Choudhary. Parsimony: An infrastructure for parallel multidimensional analysis and data mining. Journal of Parallel and Distributed Computing, pages 285--321, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. dmkd, 1(1):29--53, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In Proc. of SOCC Symp., June 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Y. Kwon, D. Nunley, J. P. Gardner, M. Balazinska, B. Howe, and S. Loebman. Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. In Proc of 22nd SSDBM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Loebman, D. Nunley, Y. Kwon, B. Howe, B. Balazinska, and J. P. Gardner. Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In Proceedings of the Workshop on Interfaces and Architectures for Scientific Data Storage, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  16. LSST data management: DC3b processing flow. http://dev.lsstcorp.org/trac/wiki/DC3bProcessingFlow.Google ScholarGoogle Scholar
  17. J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global arrays: a nonuniform memory access programming model for high-performance computers. J. Supercomput., June 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. R. W. Numrich and J Reid. Co-array fortran for parallel programming. SIGPLAN Fortran Forum, August 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Pedersen et. al. Multidimensional database technology. IEEE Computer, 34(12):40--46, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Reed et. al. Evolution of the mass function of dark matter haloes. Monthly Notices of the Royal Astronomical Society, 346:565--572, December 2003.Google ScholarGoogle ScholarCross RefCross Ref
  21. J. Rogers, R. Simakov, E. Soroush, P. Velikhov, M. Balazinska, D. DeWitt, B. Heath, D. Maier, S. Madden, J. Patel, M. Stonebraker, S. Zdonik, A. Smirnov, K. Knizhnik, and Paul G. Brown. Overview of SciDB: Large scale array storage, processing and analysis. In Proc. of the SIGMOD Conf., 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Sarawagi et. al. Efficient organization of large multidimensional arrays. In Proc. of the 10th ICDE Conf., pages 328--336, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Seamons et. al. Physical schemas for large multidimensional arrays in scientific computing applications. In Proc of 7th SSDBM, pages 218--227, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. E. Soroush, M. Balazinska, and D. Wang. Arraystore: A storage manager for complex parallel array processing. In Proc. of the SIGMOD Conf., 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Springel et. al. Simulations of the formation, evolution and clustering of galaxies and quasars. NATURE, 435:629--636, June 2005.Google ScholarGoogle ScholarCross RefCross Ref
  26. J. G. Stadel. Cosmological N-body simulations and their analysis. PhD thesis, University of Washington, 2001.Google ScholarGoogle Scholar
  27. Stonebraker et. al. Requirements for science data bases and SciDB. In Fourth CIDR Conf. (perspectives), 2009.Google ScholarGoogle Scholar
  28. Tsuji et. al. An extendible multidimensional array system for MOLAP. In Proc. of the 21st SAC Symp., pages 503--510, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. http://upc.lbl.gov/.Google ScholarGoogle Scholar
  30. Zhang et. al. RIOT: I/O-efficient numerical computing without SQL. In Proc. of the Fourth CIDR Conf., 2009.Google ScholarGoogle Scholar

Index Terms

  1. Hybrid merge/overlap execution technique for parallel array processing

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        AD '11: Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases
        March 2011
        53 pages
        ISBN:9781450306140
        DOI:10.1145/1966895

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 25 March 2011

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader