research-article

Public Access

Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems

Authors:
Feilong Liu

The Ohio State University

The Ohio State University
View Profile

,
Lingyan Yin

The Ohio State University

The Ohio State University
View Profile

,
Spyros Blanas

The Ohio State University

The Ohio State University
View Profile

EuroSys '17: Proceedings of the Twelfth European Conference on Computer SystemsApril 2017Pages 48–63https://doi.org/10.1145/3064176.3064202

Published:23 April 2017Publication History

EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems

Pages 48–63

ABSTRACT

The commoditization of high-performance networking has sparked research interest in the RDMA capability of this hardware. One-sided RDMA primitives, in particular, have generated substantial excitement due to the ability to directly access remote memory from within an application without involving the TCP/IP stack or the remote CPU. This paper considers how to leverage RDMA to improve the analytical performance of parallel database systems. To shuffle data efficiently using RDMA, one needs to consider a complex design space that includes (1) the number of open connections, (2) the contention for the shared network interface, (3) the RDMA transport function, and (4) how much memory should be reserved to exchange data between nodes during query processing. We contribute six designs that capture salient trade-offs in this design space. We comprehensively evaluate how transport-layer decisions impact the query performance of a database system for different generations of InfiniBand. We find that a shuffling operator that uses the RDMA Send/Receive transport function over the Unreliable Datagram transport service can transmit data up to 4× faster than an RDMA-capable MPI implementation in a 16-node cluster. The response time of TPC-H queries improves by as much as 2×.

References

http://www.accelio.org/.Google Scholar
C. Barthels, G. Alonso, T. Hoefler, T. Schneider, and I. Müller. Distributed join algorithms on thousands of cores. PVLDB, 10(5):517--528, 2017. Google ScholarDigital Library
C. Barthels, S. Loesing, G. Alonso, and D. Kossmann. Rack-scale in-memory join processing using RDMA. In SIGMOD '15, pages 1463--1475. ACM, 2015. Google ScholarDigital Library
P. A. Boncz, T. Neumann, and O. Erling. TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark. In Performance Characterization and Benchmarking - 5th TPC Technology Conference, TPCTC 2013, Trento, Italy, August 26, 2013, Revised Selected Papers, pages 61--76, 2013.Google Scholar
P. A. Boncz, M. Zukowski, and N. Nes. Monetdb/x100: Hyper-pipelining query execution. In CIDR, pages 225--237, 2005.Google Scholar
Y. Chen, X. Wei, J. Shi, R. Chen, and H. Chen. Fast and general distributed transactions using RDMA and HTM. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 26:1--26:17, New York, NY, USA, 2016. ACM. Google ScholarDigital Library
D. J. Dewitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. I. Hsiao, and R. Rasmussen. The gamma database machine project. IEEE Trans. on Knowl. and Data Eng., 2(1):44--62, Mar. 1990. Google ScholarDigital Library
A. Dragojević, D. Narayanan, O. Hodson, and M. Castro. Farm: Fast remote memory. NSDI'14, pages 401--414, 2014.Google Scholar
A. P. Foong, T. R. Huff, H. H. Hum, J. R. Patwardhan, and G. J. Regnier. TCP performance re-visited. ISPASS '03, pp. 70--79, 2003. Google ScholarCross Ref
P. W. Frey and G. Alonso. Minimizing the hidden cost of RDMA. ICDCS 2009, pages 553--560, 2009. Google ScholarDigital Library
P. W. Frey, R. Goncalves, M. Kersten, and J. Teubner. Spinning relations: High-speed networks for distributed join processing. DaMoN '09, pages 27--33, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
P. W. Frey, R. Goncalves, M. L. Kersten, and J. Teubner. A spinning join that does not get dizzy. ICDCS 2010, pages 283--292, 2010. Google ScholarDigital Library
G. Graefe. Volcano: An extensible and parallel query evaluation system. IEEE Trans. on Knowl. and Data Eng., 6(1):120--135, 1994. Google ScholarDigital Library
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda. High performance RDMA-based design of HDFS over InfiniBand. SC '12, 2012. Google ScholarDigital Library
A. Kalia, M. Kaminsky, and D. G. Andersen. Using RDMA efficiently for key-value services. SIGCOMM'14, pp. 295--306, 2014. Google ScholarDigital Library
A. Kalia, M. Kaminsky, and D. G. Andersen. Design guidelines for high performance rdma systems. In 2016 USENIX Annual Technical Conference (USENIX ATC 16), pages 437--450, Denver, CO, June 2016. USENIX Association.Google ScholarDigital Library
A. Kalia, M. Kaminsky, and D. G. Andersen. Fasst: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram rpcs. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), GA, Nov. 2016. USENIX Association.Google Scholar
A. Kesavan, R. Ricci, and R. Stutsman. To copy or not to copy: Making in-memory databases fast on modern nics. In ADMS-IMDM 2016, Nov. 2016.Google Scholar
M. J. Koop, T. Jones, and D. K. Panda. Reducing connection memory requirements of MPI for InfiniBand clusters: A message coalescing approach. CCGrid 2007, pages 495--504, 2007. Google ScholarDigital Library
V. Leis, P. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. SIGMOD '14, pages 743--754. ACM, 2014. Google ScholarDigital Library
F. Li, S. Das, M. Syamala, and V. R. Narasayya. Accelerating relational databases by leveraging remote memory and rdma. SIGMOD '16, pages 355--370. ACM, 2016.Google ScholarDigital Library
Y. Li, I. Pandis, R. Müller, V. Raman, and G. M. Lohman. Numa-aware algorithms: the case of data shuffling. In CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013, Online Proceedings, 2013.Google Scholar
J. Liu, W. Jiang, P. Wyckoff, D. K. Panda, D. Ashton, D. Buntinas, W. Gropp, and B. R. Toonen. Design and implementation of MPICH2 over InfiniBand with RDMA support. CoRR, cs.AR/0310059, 2003.Google Scholar
J. Liu, A. R. Mamidala, and D. K. Panda. Fast and scalable MPI-level broadcast using InfiniBand hardware multicast support. IPDPS, 2004.Google Scholar
J. Liu, J. Wu, and D. K. Panda. High performance RDMA-based MPI implementation over InfiniBand. Int. J. Parallel Program., 32(3):167--198, June 2004. Google ScholarDigital Library
X. Lu, N. S. Islam, M. Wasi-Ur-Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda. High-performance design of Hadoop RPC with RDMA over InfiniBand. ICPP '13, pages 641--650, 2013. Google ScholarDigital Library
P. MacArthur and R. D. Russell. A performance study to guide RDMA programming decisions. HPCC-ICESS, pp. 778--785, 2012. Google ScholarDigital Library
C. Mitchell, Y. Geng, and J. Li. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. USENIX ATC'13, pages 103--114, 2013.Google Scholar
http://www.mpi-forum.org/.Google Scholar
H. Mühleisen, R. Gonçalves, and M. Kersten. Peak performance: Remote memory revisited. DaMoN '13, pages 9:1--9:7, 2013.Google ScholarDigital Library
T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011. Google ScholarDigital Library
Ohio Supercomputer Center. Ruby Supercomputer. http://osc.edu/ark:/19495/hpc93fc8, 2015.Google Scholar
O. Polychroniou, R. Sen, and K. A. Ross. Track join: Distributed joins with minimal network traffic. SIGMOD'14, 1483-1494, 2014. Google ScholarDigital Library
https://code.osu.edu/pythia/core.Google Scholar
https://www.openfabrics.org/downloads/qperf/.Google Scholar
W. Rödiger, S. Idicula, A. Kemper, and T. Neumann. Flow-Join: Adaptive skew handling for distributed joins over high-speed networks. ICDE'16, 2016.Google ScholarCross Ref
W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann. High-speed query processing over high-speed networks. PVLDB, 9(4):228--239, 2015. Google ScholarDigital Library
W. Rödiger, T. Mühlbauer, P. Unterbrunner, A. Reiser, A. Kemper, and T. Neumann. Locality-sensitive operators for parallel main-memory database clusters. ICDE 2014, pages 592--603, 2014. Google ScholarCross Ref
https://linux.die.net/man/7/rsocket.Google Scholar
P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. SIGMOD '79, pages 23--34, 1979. Google ScholarDigital Library
A. Shatdal and J. F. Naughton. Adaptive parallel aggregation algorithms. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, SIGMOD '95, pages 104--114, New York, NY, USA, 1995. ACM. Google ScholarDigital Library
C. Tinnefeld, D. Kossmann, J. Böse, and H. Plattner. Parallel join executions in RAMCloud. In Workshops Proceedings of the 30th International Conference on Data Engineering, pp. 182--190, 2014. Google ScholarCross Ref
X. Wei, J. Shi, Y. Chen, R. Chen, and H. Chen. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 87--104, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
M. Wu, F. Yang, J. Xue, W. Xiao, Y. Miao, L. Wei, H. Lin, Y. Dai, and L. Zhou. GraM: Scaling graph computation to the trillions. SoCC '15, pages 408--421, 2015. Google ScholarDigital Library

Recommendations

Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems
Best of EDBT 2017, Best of EDBT 2018, Best of ICDT 2018 and Regular Papers

The commoditization of high-performance networking has sparked research interest in the RDMA capability of this hardware. One-sided RDMA primitives, in particular, have generated substantial excitement due to the ability to directly access remote memory ...
Read More
Content-Aware Bit Shuffling for Maximizing PCM Endurance

Recently, phase change memory (PCM) has been emerging as a strong replacement for DRAM owing to its many advantages such as nonvolatility, high capacity, low leakage power, and so on. However, PCM is still restricted for use as main memory because of ...
Read More
Locality aware management on NAND flash-based main memory for in-memory database systems
EDB '16: Proceedings of the Sixth International Conference on Emerging Databases: Technologies, Applications, and Theory

Conventional database systems manage all data on hard disks, but due to a hard disk's frequent I/O operations, this kind of management exposes critical problems when data is huge or operations are complex and frequent. As the size of the main memory ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in

EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems
April 2017
648 pages
ISBN:9781450349383
DOI:10.1145/3064176

Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 April 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
Overall Acceptance Rate241of1,308submissions,18%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 35
  Total Citations
  View Citations
- 872
  Total Downloads
- Downloads (Last 12 months)38
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems

EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems

ABSTRACT

References

Cited By

Recommendations

Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems

Content-Aware Bit Shuffling for Maximizing PCM Endurance

Locality aware management on NAND flash-based main memory for in-memory database systems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems

EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems

ABSTRACT

References

Cited By

Recommendations

Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems

Content-Aware Bit Shuffling for Maximizing PCM Endurance

Locality aware management on NAND flash-based main memory for in-memory database systems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media