skip to main content
10.1145/1807128.1807140acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Skew-resistant parallel processing of feature-extracting scientific user-defined functions

Published:10 June 2010Publication History

ABSTRACT

Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In this paper, we present SkewReduce, a new system implemented on top of Hadoop that enables users to easily express feature extraction analyses and execute them efficiently. At the heart of the SkewReduce system is an optimizer, parameterized by user-defined cost functions, that determines how best to partition the input data to minimize computational skew. Experiments on real data from two different science domains demonstrate that our approach can improve execution times by a factor of up to 8 compared to a naive implementation.

References

  1. Amazon Elastic Compute Cloud (Amazon EC2). http://www.amazon.com/gp/browse.html?node=201590011.Google ScholarGoogle Scholar
  2. Oceanic remote chemical analyzer (ORCA). http://armbrustlab.ocean.washington.edu/.Google ScholarGoogle Scholar
  3. M. Berger and S. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. Computers, IEEE Transactions on, C-36(5), 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. J. Boulos and K. Ono. Cost estimation of user-defined methods in object-relational database systems. SIGMOD Record, 28(3), 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. Davis, G. Efstathiou, C. S. Frenk, and S. D. M. White. The evolution of large-scale structure in a universe dominated by cold dark matter. Astroph. J., 292:371--394, May 1985.Google ScholarGoogle ScholarCross RefCross Ref
  6. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proc. of the 6th OSDI Symp., 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):139, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. K. Devine, E. Boman, R. Heapby, B. Hendrickson, and C. Vaughan. Zoltan data management service for parallel dynamic applications. Computing in Science and Engg., 4(2), 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. K. Devine, E. Boman, and G. Karypis. Partitioning and load balancing for emerging parallel applications and architectures, chapter 6. 2006.Google ScholarGoogle Scholar
  10. D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. CACM, 35(6), 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical Skew Handling in Parallel Joins. In Proc. of the 18th VLDB Conf., 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. DeWitt et. al. Clustera: an integrated computation and data management system. In Proc. of the 34th VLDB Conf., 2008.Google ScholarGoogle Scholar
  13. J. M. Gelb and E. Bertschinger. Cold dark matter. 1: The formation of dark halos. Astroph. J., 436:467--490, Dec. 1994.Google ScholarGoogle ScholarCross RefCross Ref
  14. R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416--429, 1969.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Hadoop. http://hadoop.apache.org/.Google ScholarGoogle Scholar
  16. Hive. http://hadoop.apache.org/hive/.Google ScholarGoogle Scholar
  17. B. Howe, D. Maier, and L. Bright. Smoothing the roi curve for scientific data management applications. In Proc. of the Third CIDR Conf., 2007.Google ScholarGoogle Scholar
  18. K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proc. of the 17th VLDB Conf., 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. of the EuroSys Conf., 2007. {20} S. R. Knollmann and A. Knebe. AHF: Amiga's Halo Finder. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Astroph. J. Suppl., 182:608--624, June 2009.Google ScholarGoogle ScholarCross RefCross Ref
  21. Kwon et. al. Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. Technical Report UW-CSE-09-06-01, Dept. of Comp. Sci., Univ. of Washington, 2009.Google ScholarGoogle Scholar
  22. J. Lin. The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce. In 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, volume i, 2009.Google ScholarGoogle Scholar
  23. R. P. Mount. The office of science data-management challenge. Technical report, Department of Energy, 2004.Google ScholarGoogle Scholar
  24. L. Oliker and R. Biswas. Plum: parallel load balancing for adaptive unstructured meshes. J. Parallel Distrib. Comput., 52(2), 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In Proc. of the SIGMOD Conf., 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Oracle. http://www.oracle.com/database/.Google ScholarGoogle Scholar
  27. M. Parashar, H. Liu, Z. Li, V. Matossian, C. Schmidt, G. Zhang, and S. Hariri. AutoMate: Enabling Autonomic Applications on the Grid. Cluster Computing, 9(2), 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Pavlo et. al. A comparison of approaches to large-scale data analysis. In Proc. of the SIGMOD Conf., 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. X. Qiu, J. Ekanayake, S. Beason, T. Gunarathne, G. Fox, R. Barga, and D. Gannon. Cloud technologies for bioinformatics applications. In MTAGS '09: Proceedings of the 2nd Workshop on Many--Task Computing on Grids and Supercomputers, pages 1--10, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. A. Shatdal and J. Naughton. Adaptive parallel aggregation algorithms. In Proc. of the SIGMOD Conf., 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Springel et. al. Simulations of the formation, evolution and clustering of galaxies and quasars. NATURE, 435: 629--636, June 2005.Google ScholarGoogle ScholarCross RefCross Ref
  32. Sql server. http://www.microsoft.com/sqlserver/.Google ScholarGoogle Scholar
  33. J. G. Stadel. Cosmological N-body simulations and their analysis. PhD thesis, University of Washington, 2001.Google ScholarGoogle Scholar
  34. A. Szalay and J. Gray. 2020 computing: Science in an exponential world. Nature, 440:413--414, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  35. C. B. Walton, A. G. Dale, and R. M. Jenevein. A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins. In Proc. of the 17th VLDB Conf., 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. R. S. Wei Li, D Gao. Skew handling techniques in sort-merge join. In Proc. of the SIGMOD Conf., 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. D. H. Weinberg, L. Hernquist, and N. Katz. Photoionization, Numerical Resolution, and Galaxy Formation. Astroph. J., 477:8--+, Mar. 1997.Google ScholarGoogle ScholarCross RefCross Ref
  38. Y. Xu and P. Kostamaa. Efficient outer join data skew handling in parallel DBMS. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared--nothing systems. In Proc. of the SIGMOD Conf., 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In Proc. of the 22nd SOSP Symp., 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Yu et. al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proc. of the 8th OSDI Symp., 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Skew-resistant parallel processing of feature-extracting scientific user-defined functions

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing
      June 2010
      264 pages
      ISBN:9781450300360
      DOI:10.1145/1807128

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 10 June 2010

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate169of722submissions,23%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader