ABSTRACT
The commoditization of high-performance networking has sparked research interest in the RDMA capability of this hardware. One-sided RDMA primitives, in particular, have generated substantial excitement due to the ability to directly access remote memory from within an application without involving the TCP/IP stack or the remote CPU. This paper considers how to leverage RDMA to improve the analytical performance of parallel database systems. To shuffle data efficiently using RDMA, one needs to consider a complex design space that includes (1) the number of open connections, (2) the contention for the shared network interface, (3) the RDMA transport function, and (4) how much memory should be reserved to exchange data between nodes during query processing. We contribute six designs that capture salient trade-offs in this design space. We comprehensively evaluate how transport-layer decisions impact the query performance of a database system for different generations of InfiniBand. We find that a shuffling operator that uses the RDMA Send/Receive transport function over the Unreliable Datagram transport service can transmit data up to 4× faster than an RDMA-capable MPI implementation in a 16-node cluster. The response time of TPC-H queries improves by as much as 2×.
- http://www.accelio.org/.Google Scholar
- C. Barthels, G. Alonso, T. Hoefler, T. Schneider, and I. Müller. Distributed join algorithms on thousands of cores. PVLDB, 10(5):517--528, 2017. Google ScholarDigital Library
- C. Barthels, S. Loesing, G. Alonso, and D. Kossmann. Rack-scale in-memory join processing using RDMA. In SIGMOD '15, pages 1463--1475. ACM, 2015. Google ScholarDigital Library
- P. A. Boncz, T. Neumann, and O. Erling. TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark. In Performance Characterization and Benchmarking - 5th TPC Technology Conference, TPCTC 2013, Trento, Italy, August 26, 2013, Revised Selected Papers, pages 61--76, 2013.Google Scholar
- P. A. Boncz, M. Zukowski, and N. Nes. Monetdb/x100: Hyper-pipelining query execution. In CIDR, pages 225--237, 2005.Google Scholar
- Y. Chen, X. Wei, J. Shi, R. Chen, and H. Chen. Fast and general distributed transactions using RDMA and HTM. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 26:1--26:17, New York, NY, USA, 2016. ACM. Google ScholarDigital Library
- D. J. Dewitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. I. Hsiao, and R. Rasmussen. The gamma database machine project. IEEE Trans. on Knowl. and Data Eng., 2(1):44--62, Mar. 1990. Google ScholarDigital Library
- A. Dragojević, D. Narayanan, O. Hodson, and M. Castro. Farm: Fast remote memory. NSDI'14, pages 401--414, 2014.Google Scholar
- A. P. Foong, T. R. Huff, H. H. Hum, J. R. Patwardhan, and G. J. Regnier. TCP performance re-visited. ISPASS '03, pp. 70--79, 2003. Google ScholarCross Ref
- P. W. Frey and G. Alonso. Minimizing the hidden cost of RDMA. ICDCS 2009, pages 553--560, 2009. Google ScholarDigital Library
- P. W. Frey, R. Goncalves, M. Kersten, and J. Teubner. Spinning relations: High-speed networks for distributed join processing. DaMoN '09, pages 27--33, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- P. W. Frey, R. Goncalves, M. L. Kersten, and J. Teubner. A spinning join that does not get dizzy. ICDCS 2010, pages 283--292, 2010. Google ScholarDigital Library
- G. Graefe. Volcano: An extensible and parallel query evaluation system. IEEE Trans. on Knowl. and Data Eng., 6(1):120--135, 1994. Google ScholarDigital Library
- N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda. High performance RDMA-based design of HDFS over InfiniBand. SC '12, 2012. Google ScholarDigital Library
- A. Kalia, M. Kaminsky, and D. G. Andersen. Using RDMA efficiently for key-value services. SIGCOMM'14, pp. 295--306, 2014. Google ScholarDigital Library
- A. Kalia, M. Kaminsky, and D. G. Andersen. Design guidelines for high performance rdma systems. In 2016 USENIX Annual Technical Conference (USENIX ATC 16), pages 437--450, Denver, CO, June 2016. USENIX Association.Google ScholarDigital Library
- A. Kalia, M. Kaminsky, and D. G. Andersen. Fasst: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram rpcs. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), GA, Nov. 2016. USENIX Association.Google Scholar
- A. Kesavan, R. Ricci, and R. Stutsman. To copy or not to copy: Making in-memory databases fast on modern nics. In ADMS-IMDM 2016, Nov. 2016.Google Scholar
- M. J. Koop, T. Jones, and D. K. Panda. Reducing connection memory requirements of MPI for InfiniBand clusters: A message coalescing approach. CCGrid 2007, pages 495--504, 2007. Google ScholarDigital Library
- V. Leis, P. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. SIGMOD '14, pages 743--754. ACM, 2014. Google ScholarDigital Library
- F. Li, S. Das, M. Syamala, and V. R. Narasayya. Accelerating relational databases by leveraging remote memory and rdma. SIGMOD '16, pages 355--370. ACM, 2016.Google ScholarDigital Library
- Y. Li, I. Pandis, R. Müller, V. Raman, and G. M. Lohman. Numa-aware algorithms: the case of data shuffling. In CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013, Online Proceedings, 2013.Google Scholar
- J. Liu, W. Jiang, P. Wyckoff, D. K. Panda, D. Ashton, D. Buntinas, W. Gropp, and B. R. Toonen. Design and implementation of MPICH2 over InfiniBand with RDMA support. CoRR, cs.AR/0310059, 2003.Google Scholar
- J. Liu, A. R. Mamidala, and D. K. Panda. Fast and scalable MPI-level broadcast using InfiniBand hardware multicast support. IPDPS, 2004.Google Scholar
- J. Liu, J. Wu, and D. K. Panda. High performance RDMA-based MPI implementation over InfiniBand. Int. J. Parallel Program., 32(3):167--198, June 2004. Google ScholarDigital Library
- X. Lu, N. S. Islam, M. Wasi-Ur-Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda. High-performance design of Hadoop RPC with RDMA over InfiniBand. ICPP '13, pages 641--650, 2013. Google ScholarDigital Library
- P. MacArthur and R. D. Russell. A performance study to guide RDMA programming decisions. HPCC-ICESS, pp. 778--785, 2012. Google ScholarDigital Library
- C. Mitchell, Y. Geng, and J. Li. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. USENIX ATC'13, pages 103--114, 2013.Google Scholar
- http://www.mpi-forum.org/.Google Scholar
- H. Mühleisen, R. Gonçalves, and M. Kersten. Peak performance: Remote memory revisited. DaMoN '13, pages 9:1--9:7, 2013.Google ScholarDigital Library
- T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011. Google ScholarDigital Library
- Ohio Supercomputer Center. Ruby Supercomputer. http://osc.edu/ark:/19495/hpc93fc8, 2015.Google Scholar
- O. Polychroniou, R. Sen, and K. A. Ross. Track join: Distributed joins with minimal network traffic. SIGMOD'14, 1483-1494, 2014. Google ScholarDigital Library
- https://code.osu.edu/pythia/core.Google Scholar
- https://www.openfabrics.org/downloads/qperf/.Google Scholar
- W. Rödiger, S. Idicula, A. Kemper, and T. Neumann. Flow-Join: Adaptive skew handling for distributed joins over high-speed networks. ICDE'16, 2016.Google ScholarCross Ref
- W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann. High-speed query processing over high-speed networks. PVLDB, 9(4):228--239, 2015. Google ScholarDigital Library
- W. Rödiger, T. Mühlbauer, P. Unterbrunner, A. Reiser, A. Kemper, and T. Neumann. Locality-sensitive operators for parallel main-memory database clusters. ICDE 2014, pages 592--603, 2014. Google ScholarCross Ref
- https://linux.die.net/man/7/rsocket.Google Scholar
- P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. SIGMOD '79, pages 23--34, 1979. Google ScholarDigital Library
- A. Shatdal and J. F. Naughton. Adaptive parallel aggregation algorithms. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, SIGMOD '95, pages 104--114, New York, NY, USA, 1995. ACM. Google ScholarDigital Library
- C. Tinnefeld, D. Kossmann, J. Böse, and H. Plattner. Parallel join executions in RAMCloud. In Workshops Proceedings of the 30th International Conference on Data Engineering, pp. 182--190, 2014. Google ScholarCross Ref
- X. Wei, J. Shi, Y. Chen, R. Chen, and H. Chen. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 87--104, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- M. Wu, F. Yang, J. Xue, W. Xiao, Y. Miao, L. Wei, H. Lin, Y. Dai, and L. Zhou. GraM: Scaling graph computation to the trillions. SoCC '15, pages 408--421, 2015. Google ScholarDigital Library
Recommendations
Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems
Best of EDBT 2017, Best of EDBT 2018, Best of ICDT 2018 and Regular PapersThe commoditization of high-performance networking has sparked research interest in the RDMA capability of this hardware. One-sided RDMA primitives, in particular, have generated substantial excitement due to the ability to directly access remote memory ...
Content-Aware Bit Shuffling for Maximizing PCM Endurance
Recently, phase change memory (PCM) has been emerging as a strong replacement for DRAM owing to its many advantages such as nonvolatility, high capacity, low leakage power, and so on. However, PCM is still restricted for use as main memory because of ...
Locality aware management on NAND flash-based main memory for in-memory database systems
EDB '16: Proceedings of the Sixth International Conference on Emerging Databases: Technologies, Applications, and TheoryConventional database systems manage all data on hard disks, but due to a hard disk's frequent I/O operations, this kind of management exposes critical problems when data is huge or operations are complex and frequent. As the size of the main memory ...
Comments