DOI: 10.1145/3064176.3064202
research-article

Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems

Published: 23 April 2017

ABSTRACT

The commoditization of high-performance networking has sparked research interest in the RDMA capability of this hardware. One-sided RDMA primitives, in particular, have generated substantial excitement due to the ability to directly access remote memory from within an application without involving the TCP/IP stack or the remote CPU. This paper considers how to leverage RDMA to improve the analytical performance of parallel database systems. To shuffle data efficiently using RDMA, one needs to consider a complex design space that includes (1) the number of open connections, (2) the contention for the shared network interface, (3) the RDMA transport function, and (4) how much memory should be reserved to exchange data between nodes during query processing. We contribute six designs that capture salient trade-offs in this design space. We comprehensively evaluate how transport-layer decisions impact the query performance of a database system for different generations of InfiniBand. We find that a shuffling operator that uses the RDMA Send/Receive transport function over the Unreliable Datagram transport service can transmit data up to 4× faster than an RDMA-capable MPI implementation in a 16-node cluster. The response time of TPC-H queries improves by as much as 2×.
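As a rough illustration of what a shuffling operator does, and of the fourth design dimension listed above (how much memory to reserve for exchanging data), the following minimal sketch hash-partitions rows across nodes and sends a batch whenever a destination's buffer fills. It is not one of the six designs from the paper, which this abstract does not describe in detail. The names `ShuffleSketch`, `buffer_rows`, and `send` are hypothetical, and an in-process callback stands in for the RDMA transport.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical row layout: (integer partitioning key, payload).
Row = Tuple[int, str]


class ShuffleSketch:
    """Toy hash-partitioning shuffle with one bounded buffer per destination.

    Assumption: `send` is a stand-in for the network transport; a real
    operator would post the batch to an RDMA queue pair instead.
    """

    def __init__(self, num_nodes: int, buffer_rows: int,
                 send: Callable[[int, List[Row]], None]) -> None:
        if num_nodes <= 0 or buffer_rows <= 0:
            raise ValueError("num_nodes and buffer_rows must be positive")
        self.num_nodes = num_nodes
        self.buffer_rows = buffer_rows  # memory reserved per destination, in rows
        self.send = send
        self.buffers: Dict[int, List[Row]] = defaultdict(list)

    def push(self, row: Row) -> None:
        # Route the row by its key, then flush once the buffer is full.
        dest = row[0] % self.num_nodes
        buf = self.buffers[dest]
        buf.append(row)
        if len(buf) >= self.buffer_rows:
            self._flush(dest)

    def _flush(self, dest: int) -> None:
        buf = self.buffers[dest]
        if buf:
            self.send(dest, list(buf))  # hand over a copy, then reuse the buffer
            buf.clear()

    def close(self) -> None:
        # Send the partially filled buffers that remain at end of input.
        for dest in sorted(self.buffers):
            self._flush(dest)


def run_demo() -> Tuple[Dict[int, List[Row]], List[Tuple[int, int]]]:
    received: Dict[int, List[Row]] = defaultdict(list)
    messages: List[Tuple[int, int]] = []  # (destination, rows in batch)

    def send(dest: int, batch: List[Row]) -> None:
        messages.append((dest, len(batch)))
        received[dest].extend(batch)

    op = ShuffleSketch(num_nodes=4, buffer_rows=2, send=send)
    for key in range(10):
        op.push((key, f"row-{key}"))
    op.close()
    return dict(received), messages
```

Calling `run_demo()` returns the rows grouped by destination node together with the list of batches that were sent. A smaller `buffer_rows` reserves less memory per destination but produces more, smaller messages, which is the kind of trade-off the abstract refers to.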


  • Published in

    EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems
    April 2017
    648 pages
    ISBN:9781450349383
    DOI:10.1145/3064176

    Copyright © 2017 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 241 of 1,308 submissions, 18%
