research-article

Skew-resistant parallel processing of feature-extracting scientific user-defined functions

Authors:
YongChul Kwon

University of Washington, Seattle, WA, USA

University of Washington, Seattle, WA, USA
View Profile

,
Magdalena Balazinska

University of Washington, Seattle, WA, USA

University of Washington, Seattle, WA, USA
View Profile

,
Bill Howe

University of Washington, Seattle, WA, USA

University of Washington, Seattle, WA, USA
View Profile

,
Jerome Rolia

HP Labs, Bristol, UNK, United Kingdom

HP Labs, Bristol, UNK, United Kingdom
View Profile

SoCC '10: Proceedings of the 1st ACM symposium on Cloud computingJune 2010Pages 75–86https://doi.org/10.1145/1807128.1807140

Published:10 June 2010Publication History

SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing

Pages 75–86

ABSTRACT

Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In this paper, we present SkewReduce, a new system implemented on top of Hadoop that enables users to easily express feature extraction analyses and execute them efficiently. At the heart of the SkewReduce system is an optimizer, parameterized by user-defined cost functions, that determines how best to partition the input data to minimize computational skew. Experiments on real data from two different science domains demonstrate that our approach can improve execution times by a factor of up to 8 compared to a naive implementation.

References

Amazon Elastic Compute Cloud (Amazon EC2). http://www.amazon.com/gp/browse.html?node=201590011.Google Scholar
Oceanic remote chemical analyzer (ORCA). http://armbrustlab.ocean.washington.edu/.Google Scholar
M. Berger and S. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. Computers, IEEE Transactions on, C-36(5), 1987. Google ScholarDigital Library
J. Boulos and K. Ono. Cost estimation of user-defined methods in object-relational database systems. SIGMOD Record, 28(3), 1999. Google ScholarDigital Library
M. Davis, G. Efstathiou, C. S. Frenk, and S. D. M. White. The evolution of large-scale structure in a universe dominated by cold dark matter. Astroph. J., 292:371--394, May 1985.Google ScholarCross Ref
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proc. of the 6th OSDI Symp., 2004. Google ScholarDigital Library
A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):139, 2007. Google ScholarDigital Library
K. Devine, E. Boman, R. Heapby, B. Hendrickson, and C. Vaughan. Zoltan data management service for parallel dynamic applications. Computing in Science and Engg., 4(2), 2002. Google ScholarDigital Library
K. Devine, E. Boman, and G. Karypis. Partitioning and load balancing for emerging parallel applications and architectures, chapter 6. 2006.Google Scholar
D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. CACM, 35(6), 1992. Google ScholarDigital Library
D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical Skew Handling in Parallel Joins. In Proc. of the 18th VLDB Conf., 1992. Google ScholarDigital Library
DeWitt et. al. Clustera: an integrated computation and data management system. In Proc. of the 34th VLDB Conf., 2008.Google Scholar
J. M. Gelb and E. Bertschinger. Cold dark matter. 1: The formation of dark halos. Astroph. J., 436:467--490, Dec. 1994.Google ScholarCross Ref
R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416--429, 1969.Google ScholarDigital Library
Hadoop. http://hadoop.apache.org/.Google Scholar
Hive. http://hadoop.apache.org/hive/.Google Scholar
B. Howe, D. Maier, and L. Bright. Smoothing the roi curve for scientific data management applications. In Proc. of the Third CIDR Conf., 2007.Google Scholar
K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proc. of the 17th VLDB Conf., 1991. Google ScholarDigital Library
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. of the EuroSys Conf., 2007. {20} S. R. Knollmann and A. Knebe. AHF: Amiga's Halo Finder. Google ScholarDigital Library
Astroph. J. Suppl., 182:608--624, June 2009.Google ScholarCross Ref
Kwon et. al. Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. Technical Report UW-CSE-09-06-01, Dept. of Comp. Sci., Univ. of Washington, 2009.Google Scholar
J. Lin. The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce. In 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, volume i, 2009.Google Scholar
R. P. Mount. The office of science data-management challenge. Technical report, Department of Energy, 2004.Google Scholar
L. Oliker and R. Biswas. Plum: parallel load balancing for adaptive unstructured meshes. J. Parallel Distrib. Comput., 52(2), 1998. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In Proc. of the SIGMOD Conf., 2008. Google ScholarDigital Library
Oracle. http://www.oracle.com/database/.Google Scholar
M. Parashar, H. Liu, Z. Li, V. Matossian, C. Schmidt, G. Zhang, and S. Hariri. AutoMate: Enabling Autonomic Applications on the Grid. Cluster Computing, 9(2), 2006. Google ScholarDigital Library
Pavlo et. al. A comparison of approaches to large-scale data analysis. In Proc. of the SIGMOD Conf., 2009. Google ScholarDigital Library
X. Qiu, J. Ekanayake, S. Beason, T. Gunarathne, G. Fox, R. Barga, and D. Gannon. Cloud technologies for bioinformatics applications. In MTAGS '09: Proceedings of the 2nd Workshop on Many--Task Computing on Grids and Supercomputers, pages 1--10, 2009. Google ScholarDigital Library
A. Shatdal and J. Naughton. Adaptive parallel aggregation algorithms. In Proc. of the SIGMOD Conf., 1995. Google ScholarDigital Library
Springel et. al. Simulations of the formation, evolution and clustering of galaxies and quasars. NATURE, 435: 629--636, June 2005.Google ScholarCross Ref
Sql server. http://www.microsoft.com/sqlserver/.Google Scholar
J. G. Stadel. Cosmological N-body simulations and their analysis. PhD thesis, University of Washington, 2001.Google Scholar
A. Szalay and J. Gray. 2020 computing: Science in an exponential world. Nature, 440:413--414, 2006.Google ScholarCross Ref
C. B. Walton, A. G. Dale, and R. M. Jenevein. A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins. In Proc. of the 17th VLDB Conf., 1991. Google ScholarDigital Library
R. S. Wei Li, D Gao. Skew handling techniques in sort-merge join. In Proc. of the SIGMOD Conf., 2002. Google ScholarDigital Library
D. H. Weinberg, L. Hernquist, and N. Katz. Photoionization, Numerical Resolution, and Galaxy Formation. Astroph. J., 477:8--+, Mar. 1997.Google ScholarCross Ref
Y. Xu and P. Kostamaa. Efficient outer join data skew handling in parallel DBMS. In VLDB, 2009. Google ScholarDigital Library
Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared--nothing systems. In Proc. of the SIGMOD Conf., 2008. Google ScholarDigital Library
Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In Proc. of the 22nd SOSP Symp., 2009. Google ScholarDigital Library
Yu et. al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proc. of the 8th OSDI Symp., 2008. Google ScholarDigital Library

Index Terms

Skew-resistant parallel processing of feature-extracting scientific user-defined functions
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Recommendations

Data Skew Profiling using HPCC Systems
ICBDE '19: Proceedings of the 2019 International Conference on Big Data and Education

Over the last few decades, there has been a tremendous increase in the volume of data available for analysis in various domains. Although processing power has scaled up as well, it is well known that the rate of increase of data far supersedes the ...
Read More
Parallel processing of data from very large-scale wireless sensor networks
HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

In this paper we explore the problems of storing and reasoning about data collected from very large-scale wireless sensor networks (WSNs). Potential worldwide deployment of WSNs for, e.g., environmental monitoring purposes could yield data in amounts of ...
Read More
Building Reliable Data Pipelines for Managing Community Data Using Scientific Workflows
E-SCIENCE '09: Proceedings of the 2009 Fifth IEEE International Conference on e-Science

The growing amount of scientific data from sensors and field observations is posing a challenge to “data valets” responsible for managing them in data repositories. These repositories built on commodity clusters need to reliably ingest data continuously ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing
June 2010
264 pages
ISBN:9781450300360
DOI:10.1145/1807128
General Chair:
Joseph M. Hellerstein
University of California, Berkeley, USA
,
Program Chairs:
Surajit Chaudhuri
Microsoft Research, USA
,
Mendel Rosenblum
Stanford University, USA
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 June 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
mapreduce
parallel processing
partition
scientific data management
skew
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate169of722submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 110
  Total Citations
  View Citations
- 899
  Total Downloads
- Downloads (Last 12 months)13
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Skew-resistant parallel processing of feature-extracting scientific user-defined functions

SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Data Skew Profiling using HPCC Systems

Parallel processing of data from very large-scale wireless sensor networks

Building Reliable Data Pipelines for Managing Community Data Using Scientific Workflows