ABSTRACT
Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In this paper, we present SkewReduce, a new system implemented on top of Hadoop that enables users to easily express feature extraction analyses and execute them efficiently. At the heart of the SkewReduce system is an optimizer, parameterized by user-defined cost functions, that determines how best to partition the input data to minimize computational skew. Experiments on real data from two different science domains demonstrate that our approach can improve execution times by a factor of up to 8 compared to a naive implementation.
- Amazon Elastic Compute Cloud (Amazon EC2). http://www.amazon.com/gp/browse.html?node=201590011.Google Scholar
- Oceanic remote chemical analyzer (ORCA). http://armbrustlab.ocean.washington.edu/.Google Scholar
- M. Berger and S. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. Computers, IEEE Transactions on, C-36(5), 1987. Google ScholarDigital Library
- J. Boulos and K. Ono. Cost estimation of user-defined methods in object-relational database systems. SIGMOD Record, 28(3), 1999. Google ScholarDigital Library
- M. Davis, G. Efstathiou, C. S. Frenk, and S. D. M. White. The evolution of large-scale structure in a universe dominated by cold dark matter. Astroph. J., 292:371--394, May 1985.Google ScholarCross Ref
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proc. of the 6th OSDI Symp., 2004. Google ScholarDigital Library
- A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):139, 2007. Google ScholarDigital Library
- K. Devine, E. Boman, R. Heapby, B. Hendrickson, and C. Vaughan. Zoltan data management service for parallel dynamic applications. Computing in Science and Engg., 4(2), 2002. Google ScholarDigital Library
- K. Devine, E. Boman, and G. Karypis. Partitioning and load balancing for emerging parallel applications and architectures, chapter 6. 2006.Google Scholar
- D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. CACM, 35(6), 1992. Google ScholarDigital Library
- D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical Skew Handling in Parallel Joins. In Proc. of the 18th VLDB Conf., 1992. Google ScholarDigital Library
- DeWitt et. al. Clustera: an integrated computation and data management system. In Proc. of the 34th VLDB Conf., 2008.Google Scholar
- J. M. Gelb and E. Bertschinger. Cold dark matter. 1: The formation of dark halos. Astroph. J., 436:467--490, Dec. 1994.Google ScholarCross Ref
- R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416--429, 1969.Google ScholarDigital Library
- Hadoop. http://hadoop.apache.org/.Google Scholar
- Hive. http://hadoop.apache.org/hive/.Google Scholar
- B. Howe, D. Maier, and L. Bright. Smoothing the roi curve for scientific data management applications. In Proc. of the Third CIDR Conf., 2007.Google Scholar
- K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proc. of the 17th VLDB Conf., 1991. Google ScholarDigital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. of the EuroSys Conf., 2007. {20} S. R. Knollmann and A. Knebe. AHF: Amiga's Halo Finder. Google ScholarDigital Library
- Astroph. J. Suppl., 182:608--624, June 2009.Google ScholarCross Ref
- Kwon et. al. Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. Technical Report UW-CSE-09-06-01, Dept. of Comp. Sci., Univ. of Washington, 2009.Google Scholar
- J. Lin. The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce. In 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, volume i, 2009.Google Scholar
- R. P. Mount. The office of science data-management challenge. Technical report, Department of Energy, 2004.Google Scholar
- L. Oliker and R. Biswas. Plum: parallel load balancing for adaptive unstructured meshes. J. Parallel Distrib. Comput., 52(2), 1998. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In Proc. of the SIGMOD Conf., 2008. Google ScholarDigital Library
- Oracle. http://www.oracle.com/database/.Google Scholar
- M. Parashar, H. Liu, Z. Li, V. Matossian, C. Schmidt, G. Zhang, and S. Hariri. AutoMate: Enabling Autonomic Applications on the Grid. Cluster Computing, 9(2), 2006. Google ScholarDigital Library
- Pavlo et. al. A comparison of approaches to large-scale data analysis. In Proc. of the SIGMOD Conf., 2009. Google ScholarDigital Library
- X. Qiu, J. Ekanayake, S. Beason, T. Gunarathne, G. Fox, R. Barga, and D. Gannon. Cloud technologies for bioinformatics applications. In MTAGS '09: Proceedings of the 2nd Workshop on Many--Task Computing on Grids and Supercomputers, pages 1--10, 2009. Google ScholarDigital Library
- A. Shatdal and J. Naughton. Adaptive parallel aggregation algorithms. In Proc. of the SIGMOD Conf., 1995. Google ScholarDigital Library
- Springel et. al. Simulations of the formation, evolution and clustering of galaxies and quasars. NATURE, 435: 629--636, June 2005.Google ScholarCross Ref
- Sql server. http://www.microsoft.com/sqlserver/.Google Scholar
- J. G. Stadel. Cosmological N-body simulations and their analysis. PhD thesis, University of Washington, 2001.Google Scholar
- A. Szalay and J. Gray. 2020 computing: Science in an exponential world. Nature, 440:413--414, 2006.Google ScholarCross Ref
- C. B. Walton, A. G. Dale, and R. M. Jenevein. A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins. In Proc. of the 17th VLDB Conf., 1991. Google ScholarDigital Library
- R. S. Wei Li, D Gao. Skew handling techniques in sort-merge join. In Proc. of the SIGMOD Conf., 2002. Google ScholarDigital Library
- D. H. Weinberg, L. Hernquist, and N. Katz. Photoionization, Numerical Resolution, and Galaxy Formation. Astroph. J., 477:8--+, Mar. 1997.Google ScholarCross Ref
- Y. Xu and P. Kostamaa. Efficient outer join data skew handling in parallel DBMS. In VLDB, 2009. Google ScholarDigital Library
- Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared--nothing systems. In Proc. of the SIGMOD Conf., 2008. Google ScholarDigital Library
- Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In Proc. of the 22nd SOSP Symp., 2009. Google ScholarDigital Library
- Yu et. al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proc. of the 8th OSDI Symp., 2008. Google ScholarDigital Library
Index Terms
- Skew-resistant parallel processing of feature-extracting scientific user-defined functions
Recommendations
Data Skew Profiling using HPCC Systems
ICBDE '19: Proceedings of the 2019 International Conference on Big Data and EducationOver the last few decades, there has been a tremendous increase in the volume of data available for analysis in various domains. Although processing power has scaled up as well, it is well known that the rate of increase of data far supersedes the ...
Parallel processing of data from very large-scale wireless sensor networks
HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed ComputingIn this paper we explore the problems of storing and reasoning about data collected from very large-scale wireless sensor networks (WSNs). Potential worldwide deployment of WSNs for, e.g., environmental monitoring purposes could yield data in amounts of ...
Building Reliable Data Pipelines for Managing Community Data Using Scientific Workflows
E-SCIENCE '09: Proceedings of the 2009 Fifth IEEE International Conference on e-ScienceThe growing amount of scientific data from sensors and field observations is posing a challenge to “data valets” responsible for managing them in data repositories. These repositories built on commodity clusters need to reliably ingest data continuously ...
Comments