skip to main content
research-article

Scalable and adaptive online joins

Published:01 February 2014Publication History
Skip Abstract Section

Abstract

Scalable join processing in a parallel shared-nothing environment requires a partitioning policy that evenly distributes the processing load while minimizing the size of state maintained and number of messages communicated. Previous research proposes static partitioning schemes that require statistics beforehand. In an online or streaming environment in which no statistics about the workload are known, traditional static approaches perform poorly.

This paper presents a novel parallel online dataflow join operator that supports arbitrary join predicates. The proposed operator continuously adjusts itself to the data dynamics through adaptive dataflow routing and state repartitioning. The operator is resilient to data skew, maintains high throughput rates, avoids blocking behavior during state repartitioning, takes an eventual consistency approach for maintaining its local state, and behaves strongly consistently as a black-box dataflow operator. We prove that the operator ensures a constant competitive ratio 3:75 in data distribution optimality and that the cost of processing an input tuple is amortized constant, taking into account adaptivity costs. Our evaluation demonstrates that our operator outperforms the state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time.

References

  1. The TPC-H benchmark. http://www.tpc.org/tpch/.Google ScholarGoogle Scholar
  2. D. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The design of the Borealis stream processing engine. In CIDR, pages 277--289, 2005.Google ScholarGoogle Scholar
  3. F. Afrati and J. Ullman. Optimizing joins in a MapReduce environment. In EDBT, pages 99--110, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. Technical report, Stanford InfoLab, 2004.Google ScholarGoogle Scholar
  5. A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. Maskey, E. Ryvkina, M. Stonebraker, R. Tibbetts. Linear road: a stream data management benchmark. In VLDB, pages 480--491, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. R. Avnur and J. Hellerstein. Eddies: continuously adaptive query processing. In SIGMOD, pages 261--272, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Babu and P. Bizarro. Adaptive query processing in the looking glass. In CIDR, pages 238--249, 2005.Google ScholarGoogle Scholar
  8. S. Blanas, J. Patel, V. Ercegovac, J. Rao, E. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In SIGMOD, pages 975--986, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. R. Fernandez, M. Migliavacca, E. Kalyvianaki and P. Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In SIGMOD, pages 725--736, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Chaudhuri and V. Narasayya. TPC-D data generation with skew.Google ScholarGoogle Scholar
  11. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 10--10, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. Deshpande and J. Hellerstein. Lifting the burden of history from adaptive query processing. In VLDB, pages 948--959, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1--140, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Dittrich, B. Seeger, D. Taylor, and P. Widmayer. Progressive merge join: a generic and non-blocking sort-based join algorithm. In VLDB, pages 299--310, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. Elseidy, A. Elguindy, A. Vitorovic, and C. Koch. Scalable and adaptive online joins. EPFL-REPORT 190035 Technical Report, 2013.Google ScholarGoogle Scholar
  16. A. Gounaris, N. Paton, A. Fernandes, and R. Sakellariou. Adaptive query processing: A survey. In British National Conference on Databases, pages 11--25, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Gounaris, E. Tsamoura, and Y. Manolopoulos. Adaptive query processing in distributed settings. Advanced Query Processing, 36(1):211--236, 2012.Google ScholarGoogle Scholar
  18. G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73--169, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. X. Gu, P. Yu, and H. Wang. Adaptive load diffusion for multiway windowed stream joins. In ICDE, pages 146--155, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  20. P. Haas and J. Hellerstein. Ripple joins for online aggregation. In SIGMOD, pages 287--298, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Hellerstein, M. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. Shah. Adaptive query processing: Technology in evolution. IEEE Data Engineering Bulletin, 23(2), 2000.Google ScholarGoogle Scholar
  22. J. Hellerstein, P. Haas, and H. Wang. Online aggregation. In SIGMOD, pages 171--182, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Y. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In SIGMOD, pages 268--277, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. B. Liu, M. Jbantova, and E. Rundensteiner. Optimizing state-intensive non-blocking queries using run-time adaptation. In ICDE Workshop, page 614--623, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Mokbel, M. Lu, and W. Aref. Hash-Merge join: A non-blocking join algorithm for producing fast and early join results. In ICDE, pages 251--262, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, pages 949--960, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Olson, K. Bostic, and M. Seltzer. Berkeley DB. In Annual Technical Conference, USENIX, pages 43--43, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic optimization of parallel dataflow programs. In Annual Technical Conference, USENIX, pages 267--273, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. N. Paton, J. Buenabad, M. Chen, V. Raman, G. Swart, I. Narang, D. Yellin, and A. Fernandes. Autonomic query parallelization using non-dedicated computers: an evaluation of adaptivity options. VLDBJ, 18(1):119--140, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. D. Schneider and D. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In SIGMOD, pages 110--121, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In ICDE, pages 25--36, 2002.Google ScholarGoogle Scholar
  32. J. Stamos and H. Young. A symmetric fragment and replicate algorithm for distributed joins. Transactions on Parallel and Distributed Systems, 4(12):1345--1354, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO - DB2's learning optimizer. In VLDB, pages 19--28, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Y. Tao, M. L. Yiu, D. Papadias, M. Hadjieleftheriou, and N. Mamoulis. RPJ: producing fast join results on streams through rate-based optimization. In SIGMOD, pages 371--382, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. F. Tian and D. DeWitt. Tuple routing strategies for distributed eddies. In VLDB, pages 333--344, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. P. Upadhyaya, Y. Kwon, and M. Balazinska. A latency and fault-tolerance optimizer for online parallel query plans. In SIGMOD, pages 241--252, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. T. Urhan and M. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin, 23(2):27--33, 2000.Google ScholarGoogle Scholar
  38. S. Wang and E. Rundensteiner. Scalable stream join processing with expensive predicates: workload distribution and adaptation by time-slicing. In EDBT, pages 299--310, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. A. Wilschut and P. Apers. Dataflow query execution in a parallel main-memory environment. In Parallel and Distributed Information Systems, pages 68--77, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Y. Xing, S. Zdonik, and J. Hwang. Dynamic load distribution in the Borealis stream processor. In ICDE, pages 791--802, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. H. Yang, A. Dasdan, R. Hsiao, and D. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD, pages 1029--1040, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. X. Zhang, L. Chen, and M. Wang. Efficient multi-way theta-join processing using MapReduce. VLDBJ, 5(11):1184--1195, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Y. Zhou, B. Ooi, and K. Tan. Dynamic load management for distributed continuous query systems. In ICDE, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Scalable and adaptive online joins
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image Proceedings of the VLDB Endowment
        Proceedings of the VLDB Endowment  Volume 7, Issue 6
        February 2014
        64 pages
        ISSN:2150-8097
        Issue’s Table of Contents

        Publisher

        VLDB Endowment

        Publication History

        • Published: 1 February 2014
        Published in pvldb Volume 7, Issue 6

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader