skip to main content
10.1145/2588555.2610521acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Track join: distributed joins with minimal network traffic

Published:18 June 2014Publication History

ABSTRACT

Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs, but could avoid redundant transfers of tuples across the network. We introduce track join, a novel distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Our evaluation based on real and synthetic data shows that track join adapts to diverse cases and degrees of locality. Considering both network traffic and execution time, even with no locality, track join outperforms hash join on the most expensive queries of real workloads.

References

  1. F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99--110, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 5(10):1064--1075, June 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C. Balkesen et al. Multicore, main-memory joins: Sort vs. hash revisited. PVLDB, 7(1):85--96, Sept. 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. A. Bernstein and D.-M. W. Chiu. Using semi-joins to solve relational queries. J. ACM, 28(1):25--40, Jan. 1981. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. A. Bernstein et al. Query processing in a system for distributed databases. ACM Trans. Database Systems, 6:602--625, 1981. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Comm. ACM, 13(7):422--426, July 1970. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. J. DeWitt et al. Implementation techniques for main memory database systems. In SIGMOD, pages 1--8, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. D. J. DeWitt et al. The Gamma database machine project. IEEE Trans. Knowl. Data Engin., 2(1):44--62, Mar. 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. D. J. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Comm. ACM, 35:85--98, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. Epstein, M. Stonebraker, and E. Wong. Distributed query processing in a relational data base system. In SIGMOD, pages 169--180, 1978. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. P. W. Frey, R. Goncalves, M. Kersten, and J. Teubner. Spinning relations: high-speed networks for distributed join processing. In DaMoN, pages 27--33, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. Grund et al. HYRISE: A main memory hybrid storage engine. PVLDB, 4(2):105--116, Nov. 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. R. Johnson et al. Shore-MT: a scalable storage manager for the multicore era. In EDBT, pages 24--35, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. Kim et al. Sort vs. hash revisited: fast join implementation on modern multi-core CPUs. PVLDB, 2(2):1378--1389, Aug. 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. Kim et al. CloudRAMsort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster. In SIGMOD, pages 841--850, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. New Generation Computing, 1:63--74, 1983.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. D. Lemire et al. Decoding billions of integers per second through vectorization. Software: Pract. Exper., May 2013.Google ScholarGoogle Scholar
  19. Z. Li and K. A. Ross. PERF join: an alternative to two-way semijoin and Bloomjoin. In CIKM, pages 137--144, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. L. F. Mackert and G. M. Lohman. R* optimizer validation and performance evaluation for distributed queries. In VLDB, pages 149--159, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing database architecture for the new bottleneck: memory access. VLDB J., 9(3):231--246, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. K. Mullin. Optimal semijoins for distributed database systems. IEEE Trans. Software Engin., 16(5):558--560, May 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, pages 949--960, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165--178, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. O. Polychroniou and K. A. Ross. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11):1080--1091, Aug. 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. W. Roediger et al. Locality-sensitive operators for parallel main-memory database clusters. In ICDE, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  28. N. Roussopoulos and H. Kang. A pipeline n-way join algorithm based on the 2-way semijoin program. IEEE Trans. Knowl. Data Engin., 3(4):486--495, Dec. 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. N. Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD, pages 351--362, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. W. Stamos and H. C. Young. A symmetric fragment and replicate algorithm for distributed joins. IEEE Trans. Parallel Distributed Systems, 4(12):1345--1354, Dec. 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M. Stonebraker et al. C-store: a column-oriented DBMS. In VLDB, pages 553--564, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. M. Stonebraker et al. The end of an architectural era: (it's time for a complete rewrite). In VLDB, pages 1150--1160, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. T. Urhan and M. J. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engin. Bulletin, 23:27--33, June 2000.Google ScholarGoogle Scholar
  34. J. Wassenberg and P. Sanders. Engineering a multi core radix sort. In EuroPar, pages 160--169, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. T. Willhalm et al. SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. PVLDB, 2(1):385--394, Aug. 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Y. Ye, K. A. Ross, and N. Vesdapunt. Scalable aggregation on multicore processors. In DaMoN, pages 1--9, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. F. Yu et al. CS2: a new database synopsis for query estimation. In SIGMOD, pages 469--480, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Track join: distributed joins with minimal network traffic

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
      June 2014
      1645 pages
      ISBN:9781450323765
      DOI:10.1145/2588555

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 June 2014

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SIGMOD '14 Paper Acceptance Rate107of421submissions,25%Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader