research-article

Comparison based sorting for systems with multiple GPUs

Authors:
Ivan Tanasic

Barcelona Supercomputing Center

Barcelona Supercomputing Center
View Profile

,
Lluís Vilanova

Barcelona Supercomputing Center

Barcelona Supercomputing Center
View Profile

,
Marc Jordà

Barcelona Supercomputing Center

Barcelona Supercomputing Center
View Profile

,
Javier Cabezas

Barcelona Supercomputing Center

Barcelona Supercomputing Center
View Profile

,
Isaac Gelado

Barcelona Supercomputing Center

Barcelona Supercomputing Center
View Profile

,
Nacho Navarro

Barcelona Supercomputing Center and Universitat Politecnica de Catalunya

Barcelona Supercomputing Center and Universitat Politecnica de Catalunya
View Profile

,
Wen-mei Hwu

University of Illinois

University of Illinois
View Profile

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing UnitsMarch 2013Pages 1–11https://doi.org/10.1145/2458523.2458524

Published:16 March 2013Publication History

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

Pages 1–11

ABSTRACT

As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose compuing, researches have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system.

In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speed up of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.

References

P. Jetley, F. Gioachin, C. Mendes, L. Kale, and T. Quinn, "Massively parallel cosmological simulations with changa," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, april 2008, pp. 1--12.Google Scholar
N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09, 2009, pp. 18:1--18:11. Google ScholarDigital Library
C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, "Fast bvh construction on gpus," Computer Graphics Forum, vol. 28, no. 2, pp. 375--384, 2009.Google ScholarCross Ref
G. Graefe, "Implementing sorting in database systems," ACM Comput. Surv., vol. 38, no. 3, Sep. 2006. Google ScholarDigital Library
R. Wu, B. Zhang, and M. Hsu, "Clustering billions of data points using gpus," in Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, ser. UCHPC-MAW '09, 2009, pp. 1--6. Google ScholarDigital Library
B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: a mapreduce framework on graphics processors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, ser. PACT '08, 2008, pp. 260--269. Google ScholarDigital Library
J. Stuart and J. Owens, "Multi-gpu mapreduce on gpu clusters," in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, may 2011, pp. 1068--1079. Google ScholarDigital Library
J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, "Gpu computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879--899, may 2008.Google ScholarCross Ref
E. Sintorn and U. Assarsson, "Fast parallel gpu-sorting using a hybrid algorithm," Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1381--1388, 2008. Google ScholarDigital Library
N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore gpus," in Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, may 2009, pp. 1--10. Google ScholarDigital Library
X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne, "High performance comparison-based sorting algorithm on many-core gpus," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.Google Scholar
N. Leischner, V. Osipov, and P. Sanders, "Gpu sample sort," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.Google Scholar
D. Cederman and P. Tsigas, "Gpu-quicksort: A practical quicksort algorithm for graphics processors," J. Exp. Algorithmics, vol. 14, pp. 4:1.4--4:1.24, Jan. 2010. Google ScholarDigital Library
Y. Ye, Z. Du, and D. A. Bader, "Gpumemsort: A high performance graphic co-processors sorting algorithm for large scale in-memory data," 2010.Google Scholar
H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger, "Parallel external sorting for cuda-enabled gpus with load balancing and low transfer overhead," Parallel and Distributed Processing Workshops and PhD Forum, 2011 IEEE International Symposium on, vol. 0, pp. 1--8, 2010.Google Scholar
NVIDIA, CUDA C Programming Guide, 2012.Google Scholar
C. A. R. Hoare, "Quicksort," The Computer Journal, vol. 5, no. 1, pp. 10--16, January 1962.Google ScholarCross Ref
D. Knuth, The Art of Computer Programming. Addison-Wesley, 1998, vol. 3 Sorting and Searching, noted by the author in p. 158. Google ScholarDigital Library
P. Bakkum and K. Skadron, "Accelerating sql database operations on a gpu with cuda," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, ser. GPGPU '10, 2010, pp. 94--103. Google ScholarDigital Library
"Amd fusion whitepaper." {Online}. Available: http://sites.amd.com/us/Documents/AMD_fusion_Whitepaper.pdfGoogle Scholar
"Intel quickpath interconnect," 2012. {Online}. Available: http://www.intel.com/technology/quickpathGoogle Scholar
"Hypertransport interconnect," 2012. {Online}. Available: http://www.hypertransport.orgGoogle Scholar
"Heterogeneous system architecture," 2012. {Online}. Available: www.hsafoundation.comGoogle Scholar
I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," in Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ser. ASPLOS '10, 2010, pp. 347--358. Google ScholarDigital Library
V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, and W. mei Hwu, "Gpu clusters for high-performance computing," in Workshop on Parallel Programming on Accelerator Clusters. IEEE CLUSTER '09, 31 2009-sept. 4 2009, pp. 1--8.Google Scholar
NVIDIA, "Thrust parallel algorithms library," 2012. {Online}. Available: http://thrust.github.comGoogle Scholar
T. Hagerup and C. Rüb, "Optimal merging and sorting on the erew pram," Information Processing Letters, vol. 33, no. 4, pp. 181--185, 1989. Google ScholarDigital Library
J. Singler, P. Sanders, and F. Putze, "Mcstl: The multi-core standard template library," in Euro-Par 2007 Parallel Processing, ser. Lecture Notes in Computer Science, A.-M. Kermarrec, L. Bougé, and T. Priol, Eds. Springer Berlin/Heidelberg, 2007, vol. 4641, pp. 682--694. Google ScholarDigital Library
D. R. Musser, "Introspective sorting and selection algorithms," Software: Practice and Experience, vol. 27, no. 8, pp. 983--993, 1997. Google ScholarDigital Library
K. E. Batcher, "Sorting networks and their applications," in Proceedings of the April 30--May 2, 1968, spring joint computer conference, ser. AFIPS '68 (Spring), 1968, pp. 307--314. Google ScholarDigital Library
R. Cole, "Parallel merge sort," in Foundations of Computer Science, 1986., 27th Annual Symposium on, oct. 1986, pp. 511--516. Google ScholarDigital Library
G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, "An experimental analysis of parallel sorting algorithms," Theory of Computing Systems, vol. 31, pp. 135--167, 1998.Google ScholarCross Ref
E. Solomonik and L. Kale, "Highly scalable parallel sorting," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--12.Google Scholar
N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey, "Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort," in Proceedings of the 2010 international conference on Management of data, ser. SIGMOD '10, 2010, pp. 351--362. Google ScholarDigital Library
A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient parallel merge sort for fixed and variable length keys," in Proceedings of Innovative Parallel Computing (InPar '12), May 2012.Google Scholar
O. Green, R. McColl, and D. A. Bader, "Gpu merge path: a gpu merging algorithm," in Proceedings of the 26th ACM international conference on Supercomputing, ser. ICS '12, 2012, pp. 331--340. Google ScholarDigital Library
D. R. Helman, J. JáJá, and D. A. Bader, "A new deterministic parallel sorting algorithm with an experimental evaluation," J. Exp. Algorithmics, vol. 3, Sep. 1998. Google ScholarDigital Library

Index Terms

Comparison based sorting for systems with multiple GPUs
1. Computer systems organization
  1. Architectures
    1. Other architectures

Recommendations

Fast in-place sorting with CUDA based on bitonic sort
PPAM'09: Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I

State of the art graphics processors provide high processing power and furthermore, the high programmability of GPUs offered by frameworks like CUDA increases their usability as high-performance coprocessors for general-purpose computing. Sorting is ...
Read More
Electronic poster: a massively parallel lattice Monte Carlo algorithm in CUDA for thermal conduction simulations
SC '11 Companion: Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion

We present a highly parallel CUDA kernel based on the Lattice Monte Carlo (LMC) method for transient thermal conduction, which achieves a peak acceleration of more than 100x over a single-threaded Fortran version. A number of memory and branching ...
Read More
Multiphase LBM Distributed over Multiple GPUs
CLUSTER '11: Proceedings of the 2011 IEEE International Conference on Cluster Computing

A parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios is described in this paper. Validation runs studying the terminal velocity of a rising bubble under the effect of gravity show good ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
March 2013
156 pages
ISBN:9781450320177
DOI:10.1145/2458523
Editors:
John Cavazos
University of Delaware
,
Xiang Gong,
David Kaeli
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 March 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
CUDA
GPU
parallel
sorting
Qualifiers
- research-article
Conference

Acceptance Rates
GPGPU-6 Paper Acceptance Rate15of37submissions,41%Overall Acceptance Rate57of129submissions,44%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 12
  Total Citations
  View Citations
- 503
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Comparison based sorting for systems with multiple GPUs

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

ABSTRACT

References

Cited By

Index Terms

Recommendations

Fast in-place sorting with CUDA based on bitonic sort

Electronic poster: a massively parallel lattice Monte Carlo algorithm in CUDA for thermal conduction simulations

Multiphase LBM Distributed over Multiple GPUs