research-article

A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

Authors:
Elias Stehle

Technical University of Munich (TUM), Munich, Germany

Technical University of Munich (TUM), Munich, Germany
View Profile

,
Hans-Arno Jacobsen

Technical University of Munich (TUM), Munich, Germany

Technical University of Munich (TUM), Munich, Germany
View Profile

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of DataMay 2017Pages 417–432https://doi.org/10.1145/3035918.3064043

Published:09 May 2017Publication History

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 417–432

ABSTRACT

Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.

References

M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 2012. Google ScholarDigital Library
S. Ashkiani, A. A. Davidson, U. Meyer, and J. D. Owens. GPU Multisplit. CoRR, abs/1701.01189, January 2017. Google ScholarDigital Library
C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu. Multi-core, main-memory joins: Sort vs. hash revisited. PVLDB, 2013. Google ScholarDigital Library
S. Baxter. Modern GPU. https://github.com/moderngpu/moderngpu, 2016.Google Scholar
B. Chandramouli and J. Goldstein. Patience is a virtue: Revisiting merge and sort on modern processors. SIGMOD, 2014. Google ScholarDigital Library
J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core simd cpu architecture. PVLDB, 2008. Google ScholarDigital Library
A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: Graph processing at facebook-scale. PVLDB, 2015. Google ScholarDigital Library
M. Cho, D. Brand, R. Bordawekar, U. Finkler, V. Kulandaisamy, and R. Puri. Paradis: An efficient parallel algorithm for in-place radix sort. PVLDB, 2015. Google ScholarDigital Library
Cisco visual networking index: Global mobile data traffic forecast update, 2015--2020 white paper. Technical report, Cisco, 2016.Google Scholar
A. Davidson, D. Tarjan, M. Garland, and J. D. Owens. Efficient parallel merge sort for fixed and variable length keys. InPar 2012, 2012.Google Scholar
F. Dehne and H. Zaboli. Deterministic sample sort for GPUs. Parallel Processing Letters, 2012.Google Scholar
N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. Gputerasort: High performance graphics co-processor sorting for large database management. SIGMOD, 2006. Google ScholarDigital Library
G. Graefe. Implementing sorting in database systems. ACM Computing Surveys (CSUR), 2006. Google ScholarDigital Library
J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In ACM SIGMOD Record, 1994. Google ScholarDigital Library
O. Green, R. McColl, and D. A. Bader. Gpu merge path: a gpu merging algorithm. ICS 2012, 2012. Google ScholarDigital Library
L. Ha, J. Krüger, and C. T. Silva. Fast four-way parallel radix sorting on gpus. Computer Graphics Forum, 2009.Google ScholarCross Ref
M. Harris, S. Sengupta, and J. D. Owens. Gpu gems 3. Parallel Prefix Sum (Scan) with CUDA, 2007.Google Scholar
B. He, N. K. Govindaraju, Q. Luo, and B. Smith. Efficient gather and scatter operations on graphics processors. SC '07, 2007. Google ScholarDigital Library
M. Herf. Radix tricks. http://stereopsis.com/radix.html, 2001.Google Scholar
J. Hoberock and N. Bell. Thrust: A parallel template library. https://thrust.github.io, 2016.Google Scholar
H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. Aa-sort: A new parallel sorting algorithm for multi-core simd processors. PACT '07, 2007. Google ScholarDigital Library
H. Inoue and K. Taura. Simd- and cache-friendly algorithm for sorting an array of structures. PVLDB, 2015. Google ScholarDigital Library
C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, and P. Dubey. Sort vs. hash revisited: Fast join implementation on modern multi-core cpus. PVLDB, 2009. Google ScholarDigital Library
C. Kim, J. Park, N. Satish, H. Lee, P. Dubey, and J. Chhugani. Cloudramsort: Fast and efficient large-scale distributed ram sort on shared-nothing cluster. SIGMOD, 2012. Google ScholarDigital Library
P. Kipfer and R. Westermann. Improved gpu sorting. GPU gems, 2:733--746, 2005.Google Scholar
J. Krueger, M. Grund, I. Jaeckel, A. Zeier, and H. Plattner. Applicability of gpu computing for efficient merge in in-memory databases. In ADMS@ VLDB, 2011.Google Scholar
N. Leischner, V. Osipov, and P. Sanders. Gpu sample sort. IPDPS, 2010.Google ScholarCross Ref
D. Merrill and A. Grimshaw. High performance and scalable radix sorting: A case study of implementing dynamic parallelism for gpu computing. Parallel Processing Letters, 2011.Google Scholar
D. Merrill and NVIDIA Corporation. CUB. https://github.com/NVlabs/cub, 2016.Google Scholar
M. Najafi, M. Sadoghi, and H. A. Jacobsen. Configurable hardware-based streaming architecture using online programmable-blocks. ICDE, 2015.Google ScholarCross Ref
NVIDIA GeForce GTX 980. Whitepaper. Technical report, NVIDIA, 2014.Google Scholar
NVIDIA Tesla P100. Whitepaper. Technical report, NVIDIA, 2016.Google Scholar
O. Polychroniou and K. A. Ross. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. SIGMOD, 2014. Google ScholarDigital Library
N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore gpus. IPDPS, 2009. Google ScholarDigital Library
N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast sort on cpus and gpus: A case for bandwidth oblivious simd sort. SIGMOD, 2010. Google ScholarDigital Library
A. Shahvarani and H.-A. Jacobsen. A hybrid b+-tree as solution for in-memory indexing on cpu-gpu heterogeneous computing platforms. SIGMOD, 2016. Google ScholarDigital Library
E. Sintorn and U. Assarsson. Fast parallel gpu-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 2008. Google ScholarDigital Library
I. Tanasic, L. Vilanova, M. Jordà, J. Cabezas, I. Gelado, N. Navarro, and W.-m. Hwu. Comparison based sorting for systems with multiple gpus. GPGPU, 2013. Google ScholarDigital Library
K. Thearling and S. Smith. An improved supercomputer sorting benchmark. SC '92, 1992. Google ScholarDigital Library
J. Wassenberg and P. Sanders. Engineering a Multi-core Radix Sort. Euro-Par 2011. 2011. Google ScholarDigital Library
X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne. High performance comparison-based sorting algorithm on many-core gpus. IPDPS, 2010.Google Scholar

Index Terms

A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs
1. Information systems
  1. Data management systems
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Sorting and searching

Recommendations

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and ...
Read More
Fast in-place sorting with CUDA based on bitonic sort
PPAM'09: Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I

State of the art graphics processors provide high processing power and furthermore, the high programmability of GPUs offered by frameworks like CUDA increases their usability as high-performance coprocessors for general-purpose computing. Sorting is ...
Read More
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN:9781450341974
DOI:10.1145/3035918
General Chairs:
Rada Chirkova
North Carolina State University, USA
,
Jun Yang
Duke University, USA
,
Program Chair:
Dan Suciu
University of Washington, USA
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 May 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
gpu
heterogeneous computing
parallel sorting
radix sort
sorting
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 38
  Total Citations
  View Citations
- 1,044
  Total Downloads
- Downloads (Last 12 months)74
- Downloads (Last 6 weeks)12
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Fast in-place sorting with CUDA based on bitonic sort

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing