research-article

Scalable and adaptive online joins

Authors:
Mohammed Elseidy

École Polytechnique Fédérale de Lausanne

École Polytechnique Fédérale de Lausanne
View Profile

,
Abdallah Elguindy

École Polytechnique Fédérale de Lausanne

École Polytechnique Fédérale de Lausanne
View Profile

,
Aleksandar Vitorovic

École Polytechnique Fédérale de Lausanne

École Polytechnique Fédérale de Lausanne
View Profile

,
Christoph Koch

École Polytechnique Fédérale de Lausanne

École Polytechnique Fédérale de Lausanne
View Profile

Proceedings of the VLDB Endowment Volume 7 Issue 6pp 441–452https://doi.org/10.14778/2732279.2732281

Published:01 February 2014Publication History

Proceedings of the VLDB Endowment

Abstract

Scalable join processing in a parallel shared-nothing environment requires a partitioning policy that evenly distributes the processing load while minimizing the size of state maintained and number of messages communicated. Previous research proposes static partitioning schemes that require statistics beforehand. In an online or streaming environment in which no statistics about the workload are known, traditional static approaches perform poorly.

This paper presents a novel parallel online dataflow join operator that supports arbitrary join predicates. The proposed operator continuously adjusts itself to the data dynamics through adaptive dataflow routing and state repartitioning. The operator is resilient to data skew, maintains high throughput rates, avoids blocking behavior during state repartitioning, takes an eventual consistency approach for maintaining its local state, and behaves strongly consistently as a black-box dataflow operator. We prove that the operator ensures a constant competitive ratio 3:75 in data distribution optimality and that the cost of processing an input tuple is amortized constant, taking into account adaptivity costs. Our evaluation demonstrates that our operator outperforms the state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time.

References

The TPC-H benchmark. http://www.tpc.org/tpch/.Google Scholar
D. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The design of the Borealis stream processing engine. In CIDR, pages 277--289, 2005.Google Scholar
F. Afrati and J. Ullman. Optimizing joins in a MapReduce environment. In EDBT, pages 99--110, 2010. Google ScholarDigital Library
A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. Technical report, Stanford InfoLab, 2004.Google Scholar
A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. Maskey, E. Ryvkina, M. Stonebraker, R. Tibbetts. Linear road: a stream data management benchmark. In VLDB, pages 480--491, 2004. Google ScholarDigital Library
R. Avnur and J. Hellerstein. Eddies: continuously adaptive query processing. In SIGMOD, pages 261--272, 2000. Google ScholarDigital Library
S. Babu and P. Bizarro. Adaptive query processing in the looking glass. In CIDR, pages 238--249, 2005.Google Scholar
S. Blanas, J. Patel, V. Ercegovac, J. Rao, E. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In SIGMOD, pages 975--986, 2010. Google ScholarDigital Library
R. Fernandez, M. Migliavacca, E. Kalyvianaki and P. Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In SIGMOD, pages 725--736, 2013. Google ScholarDigital Library
S. Chaudhuri and V. Narasayya. TPC-D data generation with skew.Google Scholar
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 10--10, 2004. Google ScholarDigital Library
A. Deshpande and J. Hellerstein. Lifting the burden of history from adaptive query processing. In VLDB, pages 948--959, 2004. Google ScholarDigital Library
A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1--140, 2007. Google ScholarDigital Library
J. Dittrich, B. Seeger, D. Taylor, and P. Widmayer. Progressive merge join: a generic and non-blocking sort-based join algorithm. In VLDB, pages 299--310, 2002. Google ScholarDigital Library
M. Elseidy, A. Elguindy, A. Vitorovic, and C. Koch. Scalable and adaptive online joins. EPFL-REPORT 190035 Technical Report, 2013.Google Scholar
A. Gounaris, N. Paton, A. Fernandes, and R. Sakellariou. Adaptive query processing: A survey. In British National Conference on Databases, pages 11--25, 2002. Google ScholarDigital Library
A. Gounaris, E. Tsamoura, and Y. Manolopoulos. Adaptive query processing in distributed settings. Advanced Query Processing, 36(1):211--236, 2012.Google Scholar
G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73--169, 1993. Google ScholarDigital Library
X. Gu, P. Yu, and H. Wang. Adaptive load diffusion for multiway windowed stream joins. In ICDE, pages 146--155, 2007.Google ScholarCross Ref
P. Haas and J. Hellerstein. Ripple joins for online aggregation. In SIGMOD, pages 287--298, 1999. Google ScholarDigital Library
J. Hellerstein, M. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. Shah. Adaptive query processing: Technology in evolution. IEEE Data Engineering Bulletin, 23(2), 2000.Google Scholar
J. Hellerstein, P. Haas, and H. Wang. Online aggregation. In SIGMOD, pages 171--182, 1997. Google ScholarDigital Library
Y. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In SIGMOD, pages 268--277, 1991. Google ScholarDigital Library
B. Liu, M. Jbantova, and E. Rundensteiner. Optimizing state-intensive non-blocking queries using run-time adaptation. In ICDE Workshop, page 614--623, 2007. Google ScholarDigital Library
M. Mokbel, M. Lu, and W. Aref. Hash-Merge join: A non-blocking join algorithm for producing fast and early join results. In ICDE, pages 251--262, 2004. Google ScholarDigital Library
A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, pages 949--960, 2011. Google ScholarDigital Library
M. Olson, K. Bostic, and M. Seltzer. Berkeley DB. In Annual Technical Conference, USENIX, pages 43--43, 1999. Google ScholarDigital Library
C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic optimization of parallel dataflow programs. In Annual Technical Conference, USENIX, pages 267--273, 2008. Google ScholarDigital Library
N. Paton, J. Buenabad, M. Chen, V. Raman, G. Swart, I. Narang, D. Yellin, and A. Fernandes. Autonomic query parallelization using non-dedicated computers: an evaluation of adaptivity options. VLDBJ, 18(1):119--140, 2009. Google ScholarDigital Library
D. Schneider and D. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In SIGMOD, pages 110--121, 1989. Google ScholarDigital Library
M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In ICDE, pages 25--36, 2002.Google Scholar
J. Stamos and H. Young. A symmetric fragment and replicate algorithm for distributed joins. Transactions on Parallel and Distributed Systems, 4(12):1345--1354, 1993. Google ScholarDigital Library
M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO - DB2's learning optimizer. In VLDB, pages 19--28, 2001. Google ScholarDigital Library
Y. Tao, M. L. Yiu, D. Papadias, M. Hadjieleftheriou, and N. Mamoulis. RPJ: producing fast join results on streams through rate-based optimization. In SIGMOD, pages 371--382, 2005. Google ScholarDigital Library
F. Tian and D. DeWitt. Tuple routing strategies for distributed eddies. In VLDB, pages 333--344, 2003. Google ScholarDigital Library
P. Upadhyaya, Y. Kwon, and M. Balazinska. A latency and fault-tolerance optimizer for online parallel query plans. In SIGMOD, pages 241--252, 2011. Google ScholarDigital Library
T. Urhan and M. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin, 23(2):27--33, 2000.Google Scholar
S. Wang and E. Rundensteiner. Scalable stream join processing with expensive predicates: workload distribution and adaptation by time-slicing. In EDBT, pages 299--310, 2009. Google ScholarDigital Library
A. Wilschut and P. Apers. Dataflow query execution in a parallel main-memory environment. In Parallel and Distributed Information Systems, pages 68--77, 1991. Google ScholarDigital Library
Y. Xing, S. Zdonik, and J. Hwang. Dynamic load distribution in the Borealis stream processor. In ICDE, pages 791--802, 2005. Google ScholarDigital Library
H. Yang, A. Dasdan, R. Hsiao, and D. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD, pages 1029--1040, 2007. Google ScholarDigital Library
X. Zhang, L. Chen, and M. Wang. Efficient multi-way theta-join processing using MapReduce. VLDBJ, 5(11):1184--1195, 2012. Google ScholarDigital Library
Y. Zhou, B. Ooi, and K. Tan. Dynamic load management for distributed continuous query systems. In ICDE, 2005. Google ScholarDigital Library

Index Terms

Scalable and adaptive online joins
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Recommendations

Adaptive load diffusion for stream joins
Middleware '05: Proceedings of the ACM/IFIP/USENIX 2005 International Conference on Middleware

Data stream processing has become increasingly important as many emerging applications call for sophisticated realtime processing over data streams, such as stock trading surveillance, network traffic monitoring, and sensor data analysis. Stream joins ...
Read More
Adaptive load diffusion for stream joins
Middleware'05: Proceedings of the ACM/IFIP/USENIX 6th international conference on Middleware

Data stream processing has become increasingly important as many emerging applications call for sophisticated realtime processing over data streams, such as stock trading surveillance, network traffic monitoring, and sensor data analysis. Stream joins ...
Read More
Adaptive load shedding for windowed stream joins
CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management

We present an adaptive load shedding approach for windowed stream joins. In contrast to the conventional approach of dropping tuples from the input streams, we explore the concept of selective processing for load shedding. We allow stream tuples to be ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 7, Issue 6
February 2014
64 pages
ISSN:2150-8097
Editors:
H. V. Jagadish
University of Michigan
,
Aoying Zhou
East Normal University, China
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 February 2014
Published in pvldb Volume 7, Issue 6
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 29
  Total Citations
  View Citations
- 198
  Total Downloads
- Downloads (Last 12 months)10
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Scalable and adaptive online joins

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Adaptive load diffusion for stream joins

Adaptive load diffusion for stream joins

Adaptive load shedding for windowed stream joins

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Scalable and adaptive online joins

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Adaptive load diffusion for stream joins

Adaptive load diffusion for stream joins

Adaptive load shedding for windowed stream joins

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media