research-article

AJoin: ad-hoc stream joins at scale

Authors:
Jeyhun Karimov

DFKI GmbH

DFKI GmbH
View Profile

,
Tilmann Rabl

University of Potsdam

University of Potsdam
View Profile

,
Volker Markl

DFKI GmbH, TU Berlin

DFKI GmbH, TU Berlin
View Profile

Proceedings of the VLDB Endowment Volume 13 Issue 4pp 435–448https://doi.org/10.14778/3372716.3372718

Published:09 December 2019Publication History

Proceedings of the VLDB Endowment

Abstract

The processing model of state-of-the-art stream processing engines is designed to execute long-running queries one at a time. However, with the advance of cloud technologies and multi-tenant systems, multiple users share the same cloud for stream query processing. This results in many ad-hoc stream queries sharing common stream sources. Many of these queries include joins.

There are two main limitations that hinder performing ad-hoc stream join processing. The first limitation is missed optimization potential both in stream data processing and query optimization layers. The second limitation is the lack of dynamicity in query execution plans.

We present AJoin, a dynamic and incremental ad-hoc stream join framework. AJoin consists of an optimization layer and a stream data processing layer. The optimization layer periodically reoptimizes the query execution plan, performing join reordering and vertical and horizontal scaling at run-time without stopping the execution. The data processing layer implements pipelineparallel join architecture. This layer enables incremental and consistent query processing supporting all the actions triggered by the optimizer. We implement AJoin on top of Apache Flink, an open-source data processing framework. AJoin outperforms Flink not only at ad-hoc multi-query workloads but also at single-query workloads.

References

Apache Kafka. https://kafka.apache.org/, 2019. [Online: accessed 17-August-2019].Google Scholar
D. Aloise, A. Deshpande, P. Hansen, and P. Popat. Np-hardness of euclidean sum-of-squares clustering. Machine learning, 75(2):245--248, 2009.Google ScholarDigital Library
C. Anglano, M. Canonico, and M. Guazzone. Fc2q: exploiting fuzzy control in server consolidation for cloud applications with sla constraints. Concurrency and Computation: Practice and Experience, 27(17):4491--4514, 2015.Google ScholarDigital Library
M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica, and M. Zaharia. Structured streaming: A declarative api for real-time applications in apache spark. In Proceedings of the 2018 International Conference on Management of Data, pages 601--613. ACM, 2018.Google ScholarDigital Library
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD international conference on management of data, pages 1383--1394. ACM, 2015.Google ScholarDigital Library
S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez. The datapath system: a data-centric analytic processing engine for large data warehouses. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 519--530. ACM, 2010.Google ScholarDigital Library
R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In ACM SIGMOD record, volume 29, pages 261--272. ACM, 2000.Google Scholar
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 1--16. ACM, 2002.Google ScholarDigital Library
A. Beitch, B. Liu, T. Yung, R. Griffith, A. Fox, D. A. Patterson, et al. Rain: A workload generation toolkit for cloud computing applications. University of California, Tech. Rep. UCB/EECS-2010-14, 2010.Google Scholar
S. Bradshaw and P. Howard. Troops, trolls and troublemakers: A global inventory of organized social media manipulation. 2017.Google Scholar
L. Braun, T. Etter, G. Gasparis, M. Kaufmann, D. Kossmann, D. Widmer, A. Avitzur, A. Iliopoulos, E. Levy, and N. Liang. Analytics in motion: High performance event-processing and real-time analytics in the same database. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages251--264. ACM, 2015.Google ScholarDigital Library
G. Candea, N. Polyzotis, and R. Vingralek. A scalable, predictable join operator for highly concurrent data warehouses. PVLDB, 2(1):277--288, 2009.Google ScholarDigital Library
G. Candea, N. Polyzotis, and R. Vingralek. Predictable performance and high query concurrency for data analytics. The VLDB Journal, 20(2):227--248, 2011.Google ScholarDigital Library
P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas. State management in apache flink®: consistent stateful distributed stream processing. PVLDB, 10(12):1718--1729, 2017.Google ScholarDigital Library
P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.Google Scholar
V. Cardellini, M. Nardelli, and D. Luzi. Elastic stateful stream processing in storm. In High Performance Computing & Simulation (HPCS), 2016 International Conference on, pages 583--590. IEEE, 2016.Google ScholarCross Ref
Y. Diao, M. Altinel, M. J. Franklin, H. Zhang, and P. Fischer. Path sharing and predicate evaluation for high-performance xml filtering. ACM Transactions on Database Systems (TODS), 28(4):467--516, 2003.Google Scholar
T. Dokeroglu, S. Ozal, M. A. Bayir, M. S. Cinar, and A. Cosar. Improving the performance of hadoop hive by sharing scan and computation tasks. Journal of Cloud Computing, 3(1):12, 2014.Google ScholarDigital Library
R. Ebenstein, N. Kamat, and A. Nandi. Fluxquery: An execution framework for highly interactive query workloads. In Proceedings of the 2016 International Conference on Management of Data, pages 1333--1345. ACM, 2016.Google ScholarDigital Library
A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, and I. Stoica. Above the clouds: A berkeley view of cloud computing. Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS, 28(13):2009, 2009.Google Scholar
B. Gedik, S. Schneider, M. Hirzel, and K.-L. Wu. Elastic scaling for data stream processing. IEEE Transactions on Parallel & Distributed Systems, (1):1--1, 2014.Google ScholarDigital Library
G. Giannikis, G. Alonso, and D. Kossmann. SharedDB: killing one thousand queries with one stone. PVLDB, 5(6):526--537, 2012.Google ScholarDigital Library
G. Giannikis, D. Makreshanski, G. Alonso, and D. Kossmann. Workload optimization using shareddb. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1045--1048. ACM, 2013.Google ScholarDigital Library
G. Giannikis, D. Makreshanski, G. Alonso, and D. Kossmann. Shared workload optimization. PVLDB, 7(6):429--440, 2014.Google ScholarDigital Library
J. Giceva, G. Alonso, T. Roscoe, and T. Harris. Deployment of query plans on multicores. PVLDB, 8(3):233--244, 2014.Google ScholarDigital Library
M. A. Hammad, M. J. Franklin, W. G. Aref, and A. K. Elmagarmid. Scheduling for shared window joins over data streams. PVLDB, 29:297--308, 2003.Google Scholar
T. Heinze, Z. Jerzak, G. Hackenbroich, and C. Fetzer. Latency-aware elastic scaling for distributed data stream processing systems. In Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems, pages 13--22. ACM, 2014.Google ScholarDigital Library
T. Heinze, Y. Ji, L. Roediger, V. Pappalardo, A. Meister, Z. Jerzak, and C. Fetzer. Fugu: Elastic data stream processing with latency constraints. IEEE Data Eng. Bull., 38(4):73--81, 2015.Google Scholar
T. Heinze, L. Roediger, A. Meister, Y. Ji, Z. Jerzak, and C. Fetzer. Online parameter optimization for elastic data stream processing. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 276--287. ACM, 2015.Google ScholarDigital Library
N. R. Herbst, S. Kounev, et al. Modeling variations in load intensity over time. In Proceedings of the third international workshop on Large scale testing, pages 1--4. ACM, 2014.Google Scholar
T. Ibaraki and T. Kameda. On the optimal nesting order for computing n-relational joins. ACM Transactions on Database Systems (TODS), 9(3):482--502, 1984.Google Scholar
G. Jacques-Silva, R. Lei, L. Cheng, G. J. Chen, K. Ching, T. Hu, Y. Mei, K. Wilfong, R. Shetty, S. Yilmaz, et al. Providing streaming joins as a service at facebook. PVLDB, 11(12):1809--1821, 2018.Google ScholarDigital Library
J. Karimov. Stream Benchmarks, pages 1--6. Springer International Publishing, Cham, 2018.Google Scholar
J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl. Benchmarking distributed stream data processing systems. In IEEE 34th International Conference on Data Engineering (ICDE), pages 1507--1518. IEEE, 2018.Google ScholarCross Ref
J. Karimov, T. Rabl, and V. Markl. Astream: Ad-hoc shared stream processing. In SIGMOD 2019. ACM, 2019.Google ScholarDigital Library
M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. New Generation Computing, 1(1):63--74, 1983.Google ScholarDigital Library
D. Kossmann and K. Stocker. Iterative dynamic programming: a new class of query optimization algorithms. ACM Transactions on Database Systems (TODS), 25(1):43--82, 2000.Google Scholar
R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of nonrecursive queries. PVLDB, 86:128--137, 1986.Google Scholar
Q. Li, M. Shao, V. Markl, K. Beyer, L. Colby, and G. Lohman. Adaptively reordering joins during query execution. In IEEE 23rd International Conference on Data Engineering, 2007. ICDE 2007., pages 26--35. IEEE, 2007.Google ScholarCross Ref
L. Lu, X. Zhu, R. Griffith, P. Padala, A. Parikh, P. Shah, and E. Smirni. Application-driven dynamic vertical scaling of virtual machines in resource pools. In 2014 IEEE Network Operations and Management Symposium (NOMS), pages 1--9. IEEE, 2014.Google ScholarCross Ref
D. Makreshanski, G. Giannikis, G. Alonso, and D. Kossmann. Mqjoin: efficient shared execution of main-memory joins. PVLDB, 9(6):480--491, 2016.Google ScholarDigital Library
D. Makreshanski, J. Giceva, C. Barthels, and G. Alonso. Batchdb: Efficient isolated execution of hybrid oltp+ olap workloads for interactive applications. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 37--50. ACM, 2017.Google ScholarDigital Library
V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 659--670. ACM, 2004.Google ScholarDigital Library
G. Moerkotte and T. Neumann. Dynamic programming strikes back. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 539--552. ACM, 2008.Google ScholarDigital Library
W. Phillips. Meet the trolls. Index on Censorship, 40(2):68--76, 2011.Google ScholarCross Ref
P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data, pages 23--34. ACM, 1979.Google ScholarDigital Library
B. Suleiman, S. Sakr, R. Jeffery, and A. Liu. On understanding the economics and elasticity challenges of deploying business applications on public cloud infrastructure. Journal of Internet Services and Applications, 3(2):173--193, 2012.Google ScholarCross Ref
A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, et al. Storm@ twitter. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 147--156. ACM, 2014.Google ScholarDigital Library
I. Trummer and C. Koch. Solving the join ordering problem via mixed integer linear programming. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1025--1040. ACM, 2017.Google ScholarDigital Library
A. Turner, A. Fox, J. Payne, and H. S. Kim. C-mart: Benchmarking the cloud. IEEE Transactions on Parallel and Distributed Systems, 24(6):1256--1266, 2012.Google ScholarDigital Library
M. Turner, D. Budgen, and P. Brereton. Turning software into a service. Computer, 36(10):38--44, 2003.Google ScholarDigital Library
S. D. Viglas, J. F. Naughton, and J. Burger. Maximizing the output rate of multi-way join queries over streaming information sources. PVLDB, 29:285--296, 2003.Google Scholar
M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages423--438. ACM, 2013.Google ScholarDigital Library
M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Presented as part of the, 2012.Google Scholar
S. Zeuch, B. Del Monte, J. Karimov, C. Lutz, M. Renz, J. Traub, S. Breß, T. Rabl, and V. Markl. Analyzing efficient stream processing on modern hardware. PVLDB, 12(5):516--530, 2019.Google ScholarDigital Library

Index Terms

AJoin: ad-hoc stream joins at scale
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. Information retrieval
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Recommendations

Distributed stream join query processing with semijoins

This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing ...
Read More
Equivalence and minimization of conjunctive queries under combined semantics
ICDT '12: Proceedings of the 15th International Conference on Database Theory

The problems of query containment, equivalence, and minimization are fundamental problems in the context of query processing and optimization. In their classic work [2] published in 1977, Chandra and Merlin solved the three problems for the language of ...
Read More
Scalable and efficient processing of top-k multiple-type integrated queries
Abstract
In this paper, we define a new class of queries, the top-k multiple-type integrated query (simply, top-k MULTI query). It deals with multiple data types and finds the information in the order of relevance between the query and the object. Various ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 13, Issue 4
December 2019
167 pages
ISSN:2150-8097
Editors:
Magdalena Balazinska
University of Washington
,
Xiaofang Zhou
University of Queensland
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 9 December 2019
Published in pvldb Volume 13, Issue 4
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 5
  Total Citations
  View Citations
- 329
  Total Downloads
- Downloads (Last 12 months)51
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

AJoin: ad-hoc stream joins at scale

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Distributed stream join query processing with semijoins

Equivalence and minimization of conjunctive queries under combined semantics

Scalable and efficient processing of top-k multiple-type integrated queries

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

AJoin: ad-hoc stream joins at scale

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Distributed stream join query processing with semijoins

Equivalence and minimization of conjunctive queries under combined semantics

Scalable and efficient processing of top-k multiple-type integrated queries

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media