research-article

Track join: distributed joins with minimal network traffic

Authors:
Orestis Polychroniou

Columbia University, New York, NY, USA

Columbia University, New York, NY, USA
View Profile

,
Rajkumar Sen

Oracle Corporation, Redwood Shores, CA, USA

Oracle Corporation, Redwood Shores, CA, USA
View Profile

,
Kenneth A. Ross

Columbia University, New York, NY, USA

Columbia University, New York, NY, USA
View Profile

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataJune 2014Pages 1483–1494https://doi.org/10.1145/2588555.2610521

Published:18 June 2014Publication History

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 1483–1494

ABSTRACT

Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs, but could avoid redundant transfers of tuples across the network. We introduce track join, a novel distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Our evaluation based on real and synthetic data shows that track join adapts to diverse cases and degrees of locality. Considering both network traffic and execution time, even with no locality, track join outperforms hash join on the most expensive queries of real workloads.

References

F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99--110, 2010. Google ScholarDigital Library
M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 5(10):1064--1075, June 2012. Google ScholarDigital Library
C. Balkesen et al. Multicore, main-memory joins: Sort vs. hash revisited. PVLDB, 7(1):85--96, Sept. 2013.Google ScholarDigital Library
P. A. Bernstein and D.-M. W. Chiu. Using semi-joins to solve relational queries. J. ACM, 28(1):25--40, Jan. 1981. Google ScholarDigital Library
P. A. Bernstein et al. Query processing in a system for distributed databases. ACM Trans. Database Systems, 6:602--625, 1981. Google ScholarDigital Library
B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Comm. ACM, 13(7):422--426, July 1970. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarDigital Library
D. J. DeWitt et al. Implementation techniques for main memory database systems. In SIGMOD, pages 1--8, 1984. Google ScholarDigital Library
D. J. DeWitt et al. The Gamma database machine project. IEEE Trans. Knowl. Data Engin., 2(1):44--62, Mar. 1990. Google ScholarDigital Library
D. J. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Comm. ACM, 35:85--98, 1992. Google ScholarDigital Library
R. Epstein, M. Stonebraker, and E. Wong. Distributed query processing in a relational data base system. In SIGMOD, pages 169--180, 1978. Google ScholarDigital Library
P. W. Frey, R. Goncalves, M. Kersten, and J. Teubner. Spinning relations: high-speed networks for distributed join processing. In DaMoN, pages 27--33, 2009. Google ScholarDigital Library
M. Grund et al. HYRISE: A main memory hybrid storage engine. PVLDB, 4(2):105--116, Nov. 2010. Google ScholarDigital Library
R. Johnson et al. Shore-MT: a scalable storage manager for the multicore era. In EDBT, pages 24--35, 2009. Google ScholarDigital Library
C. Kim et al. Sort vs. hash revisited: fast join implementation on modern multi-core CPUs. PVLDB, 2(2):1378--1389, Aug. 2009. Google ScholarDigital Library
C. Kim et al. CloudRAMsort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster. In SIGMOD, pages 841--850, 2012. Google ScholarDigital Library
M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. New Generation Computing, 1:63--74, 1983.Google ScholarDigital Library
D. Lemire et al. Decoding billions of integers per second through vectorization. Software: Pract. Exper., May 2013.Google Scholar
Z. Li and K. A. Ross. PERF join: an alternative to two-way semijoin and Bloomjoin. In CIKM, pages 137--144, 1995. Google ScholarDigital Library
L. F. Mackert and G. M. Lohman. R* optimizer validation and performance evaluation for distributed queries. In VLDB, pages 149--159, 1986. Google ScholarDigital Library
S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing database architecture for the new bottleneck: memory access. VLDB J., 9(3):231--246, 2000. Google ScholarDigital Library
J. K. Mullin. Optimal semijoins for distributed database systems. IEEE Trans. Software Engin., 16(5):558--560, May 1990. Google ScholarDigital Library
A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, pages 949--960, 2011. Google ScholarDigital Library
A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165--178, 2009. Google ScholarDigital Library
O. Polychroniou and K. A. Ross. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In SIGMOD, 2014. Google ScholarDigital Library
V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11):1080--1091, Aug. 2013. Google ScholarDigital Library
W. Roediger et al. Locality-sensitive operators for parallel main-memory database clusters. In ICDE, 2014.Google ScholarCross Ref
N. Roussopoulos and H. Kang. A pipeline n-way join algorithm based on the 2-way semijoin program. IEEE Trans. Knowl. Data Engin., 3(4):486--495, Dec. 1991. Google ScholarDigital Library
N. Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD, pages 351--362, 2010. Google ScholarDigital Library
J. W. Stamos and H. C. Young. A symmetric fragment and replicate algorithm for distributed joins. IEEE Trans. Parallel Distributed Systems, 4(12):1345--1354, Dec. 1993. Google ScholarDigital Library
M. Stonebraker et al. C-store: a column-oriented DBMS. In VLDB, pages 553--564, 2005. Google ScholarDigital Library
M. Stonebraker et al. The end of an architectural era: (it's time for a complete rewrite). In VLDB, pages 1150--1160, 2007. Google ScholarDigital Library
T. Urhan and M. J. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engin. Bulletin, 23:27--33, June 2000.Google Scholar
J. Wassenberg and P. Sanders. Engineering a multi core radix sort. In EuroPar, pages 160--169, 2011. Google ScholarDigital Library
T. Willhalm et al. SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. PVLDB, 2(1):385--394, Aug. 2009. Google ScholarDigital Library
Y. Ye, K. A. Ross, and N. Vesdapunt. Scalable aggregation on multicore processors. In DaMoN, pages 1--9, 2011. Google ScholarDigital Library
F. Yu et al. CS2: a new database synopsis for query estimation. In SIGMOD, pages 469--480, 2013. Google ScholarDigital Library

Index Terms

Track join: distributed joins with minimal network traffic
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Recommendations

Many-query join: efficient shared execution of relational joins on modern hardware

Database architectures typically process queries one at a time, executing concurrent queries in independent execution contexts. Often, such a design leads to unpredictable performance and poor scalability. One approach to circumvent the problem is to ...
Read More
Wander Join and XDB: Online Aggregation via Random Walks
Best of EDBT 2017, Best of SIGMOD 2016 and Regular Papers

Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in ...
Read More
Adaptive Join Operators for Result Rate Optimization on Streaming Inputs

Adaptive join algorithms have recently attracted a lot of attention in emerging applications where data are provided by autonomous data sources through heterogeneous network environments. Their main advantage over traditional join techniques is that ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN:9781450323765
DOI:10.1145/2588555
General Chairs:
Curtis Dyreson
Utah State University, USA
,
Feifei Li
University of Utah, USA
,
Program Chair:
M. Tamer Özsu
University of Waterloo, Canada
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 June 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
join
network optimization
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '14 Paper Acceptance Rate107of421submissions,25%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 102
  Total Citations
  View Citations
- 1,148
  Total Downloads
- Downloads (Last 12 months)39
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Track join: distributed joins with minimal network traffic

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Many-query join: efficient shared execution of relational joins on modern hardware

Wander Join and XDB: Online Aggregation via Random Walks

Adaptive Join Operators for Result Rate Optimization on Streaming Inputs