skip to main content
research-article

AJoin: ad-hoc stream joins at scale

Published:09 December 2019Publication History
Skip Abstract Section

Abstract

The processing model of state-of-the-art stream processing engines is designed to execute long-running queries one at a time. However, with the advance of cloud technologies and multi-tenant systems, multiple users share the same cloud for stream query processing. This results in many ad-hoc stream queries sharing common stream sources. Many of these queries include joins.

There are two main limitations that hinder performing ad-hoc stream join processing. The first limitation is missed optimization potential both in stream data processing and query optimization layers. The second limitation is the lack of dynamicity in query execution plans.

We present AJoin, a dynamic and incremental ad-hoc stream join framework. AJoin consists of an optimization layer and a stream data processing layer. The optimization layer periodically reoptimizes the query execution plan, performing join reordering and vertical and horizontal scaling at run-time without stopping the execution. The data processing layer implements pipelineparallel join architecture. This layer enables incremental and consistent query processing supporting all the actions triggered by the optimizer. We implement AJoin on top of Apache Flink, an open-source data processing framework. AJoin outperforms Flink not only at ad-hoc multi-query workloads but also at single-query workloads.

References

  1. Apache Kafka. https://kafka.apache.org/, 2019. [Online: accessed 17-August-2019].Google ScholarGoogle Scholar
  2. D. Aloise, A. Deshpande, P. Hansen, and P. Popat. Np-hardness of euclidean sum-of-squares clustering. Machine learning, 75(2):245--248, 2009.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C. Anglano, M. Canonico, and M. Guazzone. Fc2q: exploiting fuzzy control in server consolidation for cloud applications with sla constraints. Concurrency and Computation: Practice and Experience, 27(17):4491--4514, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica, and M. Zaharia. Structured streaming: A declarative api for real-time applications in apache spark. In Proceedings of the 2018 International Conference on Management of Data, pages 601--613. ACM, 2018.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD international conference on management of data, pages 1383--1394. ACM, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez. The datapath system: a data-centric analytic processing engine for large data warehouses. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 519--530. ACM, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In ACM SIGMOD record, volume 29, pages 261--272. ACM, 2000.Google ScholarGoogle Scholar
  8. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 1--16. ACM, 2002.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Beitch, B. Liu, T. Yung, R. Griffith, A. Fox, D. A. Patterson, et al. Rain: A workload generation toolkit for cloud computing applications. University of California, Tech. Rep. UCB/EECS-2010-14, 2010.Google ScholarGoogle Scholar
  10. S. Bradshaw and P. Howard. Troops, trolls and troublemakers: A global inventory of organized social media manipulation. 2017.Google ScholarGoogle Scholar
  11. L. Braun, T. Etter, G. Gasparis, M. Kaufmann, D. Kossmann, D. Widmer, A. Avitzur, A. Iliopoulos, E. Levy, and N. Liang. Analytics in motion: High performance event-processing and real-time analytics in the same database. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages251--264. ACM, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. G. Candea, N. Polyzotis, and R. Vingralek. A scalable, predictable join operator for highly concurrent data warehouses. PVLDB, 2(1):277--288, 2009.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. Candea, N. Polyzotis, and R. Vingralek. Predictable performance and high query concurrency for data analytics. The VLDB Journal, 20(2):227--248, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas. State management in apache flink®: consistent stateful distributed stream processing. PVLDB, 10(12):1718--1729, 2017.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.Google ScholarGoogle Scholar
  16. V. Cardellini, M. Nardelli, and D. Luzi. Elastic stateful stream processing in storm. In High Performance Computing & Simulation (HPCS), 2016 International Conference on, pages 583--590. IEEE, 2016.Google ScholarGoogle ScholarCross RefCross Ref
  17. Y. Diao, M. Altinel, M. J. Franklin, H. Zhang, and P. Fischer. Path sharing and predicate evaluation for high-performance xml filtering. ACM Transactions on Database Systems (TODS), 28(4):467--516, 2003.Google ScholarGoogle Scholar
  18. T. Dokeroglu, S. Ozal, M. A. Bayir, M. S. Cinar, and A. Cosar. Improving the performance of hadoop hive by sharing scan and computation tasks. Journal of Cloud Computing, 3(1):12, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. R. Ebenstein, N. Kamat, and A. Nandi. Fluxquery: An execution framework for highly interactive query workloads. In Proceedings of the 2016 International Conference on Management of Data, pages 1333--1345. ACM, 2016.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, and I. Stoica. Above the clouds: A berkeley view of cloud computing. Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS, 28(13):2009, 2009.Google ScholarGoogle Scholar
  21. B. Gedik, S. Schneider, M. Hirzel, and K.-L. Wu. Elastic scaling for data stream processing. IEEE Transactions on Parallel & Distributed Systems, (1):1--1, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. G. Giannikis, G. Alonso, and D. Kossmann. SharedDB: killing one thousand queries with one stone. PVLDB, 5(6):526--537, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. G. Giannikis, D. Makreshanski, G. Alonso, and D. Kossmann. Workload optimization using shareddb. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1045--1048. ACM, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. Giannikis, D. Makreshanski, G. Alonso, and D. Kossmann. Shared workload optimization. PVLDB, 7(6):429--440, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. Giceva, G. Alonso, T. Roscoe, and T. Harris. Deployment of query plans on multicores. PVLDB, 8(3):233--244, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. A. Hammad, M. J. Franklin, W. G. Aref, and A. K. Elmagarmid. Scheduling for shared window joins over data streams. PVLDB, 29:297--308, 2003.Google ScholarGoogle Scholar
  27. T. Heinze, Z. Jerzak, G. Hackenbroich, and C. Fetzer. Latency-aware elastic scaling for distributed data stream processing systems. In Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems, pages 13--22. ACM, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. T. Heinze, Y. Ji, L. Roediger, V. Pappalardo, A. Meister, Z. Jerzak, and C. Fetzer. Fugu: Elastic data stream processing with latency constraints. IEEE Data Eng. Bull., 38(4):73--81, 2015.Google ScholarGoogle Scholar
  29. T. Heinze, L. Roediger, A. Meister, Y. Ji, Z. Jerzak, and C. Fetzer. Online parameter optimization for elastic data stream processing. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 276--287. ACM, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. N. R. Herbst, S. Kounev, et al. Modeling variations in load intensity over time. In Proceedings of the third international workshop on Large scale testing, pages 1--4. ACM, 2014.Google ScholarGoogle Scholar
  31. T. Ibaraki and T. Kameda. On the optimal nesting order for computing n-relational joins. ACM Transactions on Database Systems (TODS), 9(3):482--502, 1984.Google ScholarGoogle Scholar
  32. G. Jacques-Silva, R. Lei, L. Cheng, G. J. Chen, K. Ching, T. Hu, Y. Mei, K. Wilfong, R. Shetty, S. Yilmaz, et al. Providing streaming joins as a service at facebook. PVLDB, 11(12):1809--1821, 2018.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Karimov. Stream Benchmarks, pages 1--6. Springer International Publishing, Cham, 2018.Google ScholarGoogle Scholar
  34. J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl. Benchmarking distributed stream data processing systems. In IEEE 34th International Conference on Data Engineering (ICDE), pages 1507--1518. IEEE, 2018.Google ScholarGoogle ScholarCross RefCross Ref
  35. J. Karimov, T. Rabl, and V. Markl. Astream: Ad-hoc shared stream processing. In SIGMOD 2019. ACM, 2019.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. New Generation Computing, 1(1):63--74, 1983.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. D. Kossmann and K. Stocker. Iterative dynamic programming: a new class of query optimization algorithms. ACM Transactions on Database Systems (TODS), 25(1):43--82, 2000.Google ScholarGoogle Scholar
  38. R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of nonrecursive queries. PVLDB, 86:128--137, 1986.Google ScholarGoogle Scholar
  39. Q. Li, M. Shao, V. Markl, K. Beyer, L. Colby, and G. Lohman. Adaptively reordering joins during query execution. In IEEE 23rd International Conference on Data Engineering, 2007. ICDE 2007., pages 26--35. IEEE, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  40. L. Lu, X. Zhu, R. Griffith, P. Padala, A. Parikh, P. Shah, and E. Smirni. Application-driven dynamic vertical scaling of virtual machines in resource pools. In 2014 IEEE Network Operations and Management Symposium (NOMS), pages 1--9. IEEE, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  41. D. Makreshanski, G. Giannikis, G. Alonso, and D. Kossmann. Mqjoin: efficient shared execution of main-memory joins. PVLDB, 9(6):480--491, 2016.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. D. Makreshanski, J. Giceva, C. Barthels, and G. Alonso. Batchdb: Efficient isolated execution of hybrid oltp+ olap workloads for interactive applications. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 37--50. ACM, 2017.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 659--670. ACM, 2004.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. G. Moerkotte and T. Neumann. Dynamic programming strikes back. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 539--552. ACM, 2008.Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. W. Phillips. Meet the trolls. Index on Censorship, 40(2):68--76, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  46. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data, pages 23--34. ACM, 1979.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. B. Suleiman, S. Sakr, R. Jeffery, and A. Liu. On understanding the economics and elasticity challenges of deploying business applications on public cloud infrastructure. Journal of Internet Services and Applications, 3(2):173--193, 2012.Google ScholarGoogle ScholarCross RefCross Ref
  48. A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, et al. Storm@ twitter. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 147--156. ACM, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. I. Trummer and C. Koch. Solving the join ordering problem via mixed integer linear programming. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1025--1040. ACM, 2017.Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. A. Turner, A. Fox, J. Payne, and H. S. Kim. C-mart: Benchmarking the cloud. IEEE Transactions on Parallel and Distributed Systems, 24(6):1256--1266, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. M. Turner, D. Budgen, and P. Brereton. Turning software into a service. Computer, 36(10):38--44, 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. S. D. Viglas, J. F. Naughton, and J. Burger. Maximizing the output rate of multi-way join queries over streaming information sources. PVLDB, 29:285--296, 2003.Google ScholarGoogle Scholar
  53. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages423--438. ACM, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Presented as part of the, 2012.Google ScholarGoogle Scholar
  55. S. Zeuch, B. Del Monte, J. Karimov, C. Lutz, M. Renz, J. Traub, S. Breß, T. Rabl, and V. Markl. Analyzing efficient stream processing on modern hardware. PVLDB, 12(5):516--530, 2019.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. AJoin: ad-hoc stream joins at scale
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the VLDB Endowment
          Proceedings of the VLDB Endowment  Volume 13, Issue 4
          December 2019
          167 pages
          ISSN:2150-8097
          Issue’s Table of Contents

          Publisher

          VLDB Endowment

          Publication History

          • Published: 9 December 2019
          Published in pvldb Volume 13, Issue 4

          Qualifiers

          • research-article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader