skip to main content
10.1145/3035918.3064043acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

Published:09 May 2017Publication History

ABSTRACT

Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.

References

  1. M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Ashkiani, A. A. Davidson, U. Meyer, and J. D. Owens. GPU Multisplit. CoRR, abs/1701.01189, January 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu. Multi-core, main-memory joins: Sort vs. hash revisited. PVLDB, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Baxter. Modern GPU. https://github.com/moderngpu/moderngpu, 2016.Google ScholarGoogle Scholar
  5. B. Chandramouli and J. Goldstein. Patience is a virtue: Revisiting merge and sort on modern processors. SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core simd cpu architecture. PVLDB, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: Graph processing at facebook-scale. PVLDB, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. Cho, D. Brand, R. Bordawekar, U. Finkler, V. Kulandaisamy, and R. Puri. Paradis: An efficient parallel algorithm for in-place radix sort. PVLDB, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Cisco visual networking index: Global mobile data traffic forecast update, 2015--2020 white paper. Technical report, Cisco, 2016.Google ScholarGoogle Scholar
  10. A. Davidson, D. Tarjan, M. Garland, and J. D. Owens. Efficient parallel merge sort for fixed and variable length keys. InPar 2012, 2012.Google ScholarGoogle Scholar
  11. F. Dehne and H. Zaboli. Deterministic sample sort for GPUs. Parallel Processing Letters, 2012.Google ScholarGoogle Scholar
  12. N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. Gputerasort: High performance graphics co-processor sorting for large database management. SIGMOD, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. Graefe. Implementing sorting in database systems. ACM Computing Surveys (CSUR), 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In ACM SIGMOD Record, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. O. Green, R. McColl, and D. A. Bader. Gpu merge path: a gpu merging algorithm. ICS 2012, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. L. Ha, J. Krüger, and C. T. Silva. Fast four-way parallel radix sorting on gpus. Computer Graphics Forum, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  17. M. Harris, S. Sengupta, and J. D. Owens. Gpu gems 3. Parallel Prefix Sum (Scan) with CUDA, 2007.Google ScholarGoogle Scholar
  18. B. He, N. K. Govindaraju, Q. Luo, and B. Smith. Efficient gather and scatter operations on graphics processors. SC '07, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. Herf. Radix tricks. http://stereopsis.com/radix.html, 2001.Google ScholarGoogle Scholar
  20. J. Hoberock and N. Bell. Thrust: A parallel template library. https://thrust.github.io, 2016.Google ScholarGoogle Scholar
  21. H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. Aa-sort: A new parallel sorting algorithm for multi-core simd processors. PACT '07, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. H. Inoue and K. Taura. Simd- and cache-friendly algorithm for sorting an array of structures. PVLDB, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, and P. Dubey. Sort vs. hash revisited: Fast join implementation on modern multi-core cpus. PVLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. C. Kim, J. Park, N. Satish, H. Lee, P. Dubey, and J. Chhugani. Cloudramsort: Fast and efficient large-scale distributed ram sort on shared-nothing cluster. SIGMOD, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. P. Kipfer and R. Westermann. Improved gpu sorting. GPU gems, 2:733--746, 2005.Google ScholarGoogle Scholar
  26. J. Krueger, M. Grund, I. Jaeckel, A. Zeier, and H. Plattner. Applicability of gpu computing for efficient merge in in-memory databases. In ADMS@ VLDB, 2011.Google ScholarGoogle Scholar
  27. N. Leischner, V. Osipov, and P. Sanders. Gpu sample sort. IPDPS, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  28. D. Merrill and A. Grimshaw. High performance and scalable radix sorting: A case study of implementing dynamic parallelism for gpu computing. Parallel Processing Letters, 2011.Google ScholarGoogle Scholar
  29. D. Merrill and NVIDIA Corporation. CUB. https://github.com/NVlabs/cub, 2016.Google ScholarGoogle Scholar
  30. M. Najafi, M. Sadoghi, and H. A. Jacobsen. Configurable hardware-based streaming architecture using online programmable-blocks. ICDE, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  31. NVIDIA GeForce GTX 980. Whitepaper. Technical report, NVIDIA, 2014.Google ScholarGoogle Scholar
  32. NVIDIA Tesla P100. Whitepaper. Technical report, NVIDIA, 2016.Google ScholarGoogle Scholar
  33. O. Polychroniou and K. A. Ross. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore gpus. IPDPS, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast sort on cpus and gpus: A case for bandwidth oblivious simd sort. SIGMOD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. A. Shahvarani and H.-A. Jacobsen. A hybrid b+-tree as solution for in-memory indexing on cpu-gpu heterogeneous computing platforms. SIGMOD, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. E. Sintorn and U. Assarsson. Fast parallel gpu-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. I. Tanasic, L. Vilanova, M. Jordà, J. Cabezas, I. Gelado, N. Navarro, and W.-m. Hwu. Comparison based sorting for systems with multiple gpus. GPGPU, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. K. Thearling and S. Smith. An improved supercomputer sorting benchmark. SC '92, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. J. Wassenberg and P. Sanders. Engineering a Multi-core Radix Sort. Euro-Par 2011. 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne. High performance comparison-based sorting algorithm on many-core gpus. IPDPS, 2010.Google ScholarGoogle Scholar

Index Terms

  1. A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
        May 2017
        1810 pages
        ISBN:9781450341974
        DOI:10.1145/3035918

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 9 May 2017

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader