Abstract
The processing model of state-of-the-art stream processing engines is designed to execute long-running queries one at a time. However, with the advance of cloud technologies and multi-tenant systems, multiple users share the same cloud for stream query processing. This results in many ad-hoc stream queries sharing common stream sources. Many of these queries include joins.
There are two main limitations that hinder performing ad-hoc stream join processing. The first limitation is missed optimization potential both in stream data processing and query optimization layers. The second limitation is the lack of dynamicity in query execution plans.
We present AJoin, a dynamic and incremental ad-hoc stream join framework. AJoin consists of an optimization layer and a stream data processing layer. The optimization layer periodically reoptimizes the query execution plan, performing join reordering and vertical and horizontal scaling at run-time without stopping the execution. The data processing layer implements pipelineparallel join architecture. This layer enables incremental and consistent query processing supporting all the actions triggered by the optimizer. We implement AJoin on top of Apache Flink, an open-source data processing framework. AJoin outperforms Flink not only at ad-hoc multi-query workloads but also at single-query workloads.
- Apache Kafka. https://kafka.apache.org/, 2019. [Online: accessed 17-August-2019].Google Scholar
- D. Aloise, A. Deshpande, P. Hansen, and P. Popat. Np-hardness of euclidean sum-of-squares clustering. Machine learning, 75(2):245--248, 2009.Google ScholarDigital Library
- C. Anglano, M. Canonico, and M. Guazzone. Fc2q: exploiting fuzzy control in server consolidation for cloud applications with sla constraints. Concurrency and Computation: Practice and Experience, 27(17):4491--4514, 2015.Google ScholarDigital Library
- M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica, and M. Zaharia. Structured streaming: A declarative api for real-time applications in apache spark. In Proceedings of the 2018 International Conference on Management of Data, pages 601--613. ACM, 2018.Google ScholarDigital Library
- M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD international conference on management of data, pages 1383--1394. ACM, 2015.Google ScholarDigital Library
- S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez. The datapath system: a data-centric analytic processing engine for large data warehouses. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 519--530. ACM, 2010.Google ScholarDigital Library
- R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In ACM SIGMOD record, volume 29, pages 261--272. ACM, 2000.Google Scholar
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 1--16. ACM, 2002.Google ScholarDigital Library
- A. Beitch, B. Liu, T. Yung, R. Griffith, A. Fox, D. A. Patterson, et al. Rain: A workload generation toolkit for cloud computing applications. University of California, Tech. Rep. UCB/EECS-2010-14, 2010.Google Scholar
- S. Bradshaw and P. Howard. Troops, trolls and troublemakers: A global inventory of organized social media manipulation. 2017.Google Scholar
- L. Braun, T. Etter, G. Gasparis, M. Kaufmann, D. Kossmann, D. Widmer, A. Avitzur, A. Iliopoulos, E. Levy, and N. Liang. Analytics in motion: High performance event-processing and real-time analytics in the same database. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages251--264. ACM, 2015.Google ScholarDigital Library
- G. Candea, N. Polyzotis, and R. Vingralek. A scalable, predictable join operator for highly concurrent data warehouses. PVLDB, 2(1):277--288, 2009.Google ScholarDigital Library
- G. Candea, N. Polyzotis, and R. Vingralek. Predictable performance and high query concurrency for data analytics. The VLDB Journal, 20(2):227--248, 2011.Google ScholarDigital Library
- P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas. State management in apache flink®: consistent stateful distributed stream processing. PVLDB, 10(12):1718--1729, 2017.Google ScholarDigital Library
- P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.Google Scholar
- V. Cardellini, M. Nardelli, and D. Luzi. Elastic stateful stream processing in storm. In High Performance Computing & Simulation (HPCS), 2016 International Conference on, pages 583--590. IEEE, 2016.Google ScholarCross Ref
- Y. Diao, M. Altinel, M. J. Franklin, H. Zhang, and P. Fischer. Path sharing and predicate evaluation for high-performance xml filtering. ACM Transactions on Database Systems (TODS), 28(4):467--516, 2003.Google Scholar
- T. Dokeroglu, S. Ozal, M. A. Bayir, M. S. Cinar, and A. Cosar. Improving the performance of hadoop hive by sharing scan and computation tasks. Journal of Cloud Computing, 3(1):12, 2014.Google ScholarDigital Library
- R. Ebenstein, N. Kamat, and A. Nandi. Fluxquery: An execution framework for highly interactive query workloads. In Proceedings of the 2016 International Conference on Management of Data, pages 1333--1345. ACM, 2016.Google ScholarDigital Library
- A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, and I. Stoica. Above the clouds: A berkeley view of cloud computing. Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS, 28(13):2009, 2009.Google Scholar
- B. Gedik, S. Schneider, M. Hirzel, and K.-L. Wu. Elastic scaling for data stream processing. IEEE Transactions on Parallel & Distributed Systems, (1):1--1, 2014.Google ScholarDigital Library
- G. Giannikis, G. Alonso, and D. Kossmann. SharedDB: killing one thousand queries with one stone. PVLDB, 5(6):526--537, 2012.Google ScholarDigital Library
- G. Giannikis, D. Makreshanski, G. Alonso, and D. Kossmann. Workload optimization using shareddb. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1045--1048. ACM, 2013.Google ScholarDigital Library
- G. Giannikis, D. Makreshanski, G. Alonso, and D. Kossmann. Shared workload optimization. PVLDB, 7(6):429--440, 2014.Google ScholarDigital Library
- J. Giceva, G. Alonso, T. Roscoe, and T. Harris. Deployment of query plans on multicores. PVLDB, 8(3):233--244, 2014.Google ScholarDigital Library
- M. A. Hammad, M. J. Franklin, W. G. Aref, and A. K. Elmagarmid. Scheduling for shared window joins over data streams. PVLDB, 29:297--308, 2003.Google Scholar
- T. Heinze, Z. Jerzak, G. Hackenbroich, and C. Fetzer. Latency-aware elastic scaling for distributed data stream processing systems. In Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems, pages 13--22. ACM, 2014.Google ScholarDigital Library
- T. Heinze, Y. Ji, L. Roediger, V. Pappalardo, A. Meister, Z. Jerzak, and C. Fetzer. Fugu: Elastic data stream processing with latency constraints. IEEE Data Eng. Bull., 38(4):73--81, 2015.Google Scholar
- T. Heinze, L. Roediger, A. Meister, Y. Ji, Z. Jerzak, and C. Fetzer. Online parameter optimization for elastic data stream processing. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 276--287. ACM, 2015.Google ScholarDigital Library
- N. R. Herbst, S. Kounev, et al. Modeling variations in load intensity over time. In Proceedings of the third international workshop on Large scale testing, pages 1--4. ACM, 2014.Google Scholar
- T. Ibaraki and T. Kameda. On the optimal nesting order for computing n-relational joins. ACM Transactions on Database Systems (TODS), 9(3):482--502, 1984.Google Scholar
- G. Jacques-Silva, R. Lei, L. Cheng, G. J. Chen, K. Ching, T. Hu, Y. Mei, K. Wilfong, R. Shetty, S. Yilmaz, et al. Providing streaming joins as a service at facebook. PVLDB, 11(12):1809--1821, 2018.Google ScholarDigital Library
- J. Karimov. Stream Benchmarks, pages 1--6. Springer International Publishing, Cham, 2018.Google Scholar
- J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl. Benchmarking distributed stream data processing systems. In IEEE 34th International Conference on Data Engineering (ICDE), pages 1507--1518. IEEE, 2018.Google ScholarCross Ref
- J. Karimov, T. Rabl, and V. Markl. Astream: Ad-hoc shared stream processing. In SIGMOD 2019. ACM, 2019.Google ScholarDigital Library
- M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. New Generation Computing, 1(1):63--74, 1983.Google ScholarDigital Library
- D. Kossmann and K. Stocker. Iterative dynamic programming: a new class of query optimization algorithms. ACM Transactions on Database Systems (TODS), 25(1):43--82, 2000.Google Scholar
- R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of nonrecursive queries. PVLDB, 86:128--137, 1986.Google Scholar
- Q. Li, M. Shao, V. Markl, K. Beyer, L. Colby, and G. Lohman. Adaptively reordering joins during query execution. In IEEE 23rd International Conference on Data Engineering, 2007. ICDE 2007., pages 26--35. IEEE, 2007.Google ScholarCross Ref
- L. Lu, X. Zhu, R. Griffith, P. Padala, A. Parikh, P. Shah, and E. Smirni. Application-driven dynamic vertical scaling of virtual machines in resource pools. In 2014 IEEE Network Operations and Management Symposium (NOMS), pages 1--9. IEEE, 2014.Google ScholarCross Ref
- D. Makreshanski, G. Giannikis, G. Alonso, and D. Kossmann. Mqjoin: efficient shared execution of main-memory joins. PVLDB, 9(6):480--491, 2016.Google ScholarDigital Library
- D. Makreshanski, J. Giceva, C. Barthels, and G. Alonso. Batchdb: Efficient isolated execution of hybrid oltp+ olap workloads for interactive applications. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 37--50. ACM, 2017.Google ScholarDigital Library
- V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 659--670. ACM, 2004.Google ScholarDigital Library
- G. Moerkotte and T. Neumann. Dynamic programming strikes back. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 539--552. ACM, 2008.Google ScholarDigital Library
- W. Phillips. Meet the trolls. Index on Censorship, 40(2):68--76, 2011.Google ScholarCross Ref
- P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data, pages 23--34. ACM, 1979.Google ScholarDigital Library
- B. Suleiman, S. Sakr, R. Jeffery, and A. Liu. On understanding the economics and elasticity challenges of deploying business applications on public cloud infrastructure. Journal of Internet Services and Applications, 3(2):173--193, 2012.Google ScholarCross Ref
- A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, et al. Storm@ twitter. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 147--156. ACM, 2014.Google ScholarDigital Library
- I. Trummer and C. Koch. Solving the join ordering problem via mixed integer linear programming. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1025--1040. ACM, 2017.Google ScholarDigital Library
- A. Turner, A. Fox, J. Payne, and H. S. Kim. C-mart: Benchmarking the cloud. IEEE Transactions on Parallel and Distributed Systems, 24(6):1256--1266, 2012.Google ScholarDigital Library
- M. Turner, D. Budgen, and P. Brereton. Turning software into a service. Computer, 36(10):38--44, 2003.Google ScholarDigital Library
- S. D. Viglas, J. F. Naughton, and J. Burger. Maximizing the output rate of multi-way join queries over streaming information sources. PVLDB, 29:285--296, 2003.Google Scholar
- M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages423--438. ACM, 2013.Google ScholarDigital Library
- M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Presented as part of the, 2012.Google Scholar
- S. Zeuch, B. Del Monte, J. Karimov, C. Lutz, M. Renz, J. Traub, S. Breß, T. Rabl, and V. Markl. Analyzing efficient stream processing on modern hardware. PVLDB, 12(5):516--530, 2019.Google ScholarDigital Library
Index Terms
- AJoin: ad-hoc stream joins at scale
Recommendations
Distributed stream join query processing with semijoins
This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing ...
Equivalence and minimization of conjunctive queries under combined semantics
ICDT '12: Proceedings of the 15th International Conference on Database TheoryThe problems of query containment, equivalence, and minimization are fundamental problems in the context of query processing and optimization. In their classic work [2] published in 1977, Chandra and Merlin solved the three problems for the language of ...
Scalable and efficient processing of top-k multiple-type integrated queries
AbstractIn this paper, we define a new class of queries, the top-k multiple-type integrated query (simply, top-k MULTI query). It deals with multiple data types and finds the information in the order of relevance between the query and the object. Various ...
Comments