ABSTRACT
As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose compuing, researches have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system.
In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speed up of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.
- P. Jetley, F. Gioachin, C. Mendes, L. Kale, and T. Quinn, "Massively parallel cosmological simulations with changa," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, april 2008, pp. 1--12.Google Scholar
- N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09, 2009, pp. 18:1--18:11. Google ScholarDigital Library
- C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, "Fast bvh construction on gpus," Computer Graphics Forum, vol. 28, no. 2, pp. 375--384, 2009.Google ScholarCross Ref
- G. Graefe, "Implementing sorting in database systems," ACM Comput. Surv., vol. 38, no. 3, Sep. 2006. Google ScholarDigital Library
- R. Wu, B. Zhang, and M. Hsu, "Clustering billions of data points using gpus," in Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, ser. UCHPC-MAW '09, 2009, pp. 1--6. Google ScholarDigital Library
- B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: a mapreduce framework on graphics processors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, ser. PACT '08, 2008, pp. 260--269. Google ScholarDigital Library
- J. Stuart and J. Owens, "Multi-gpu mapreduce on gpu clusters," in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, may 2011, pp. 1068--1079. Google ScholarDigital Library
- J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, "Gpu computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879--899, may 2008.Google ScholarCross Ref
- E. Sintorn and U. Assarsson, "Fast parallel gpu-sorting using a hybrid algorithm," Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1381--1388, 2008. Google ScholarDigital Library
- N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore gpus," in Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, may 2009, pp. 1--10. Google ScholarDigital Library
- X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne, "High performance comparison-based sorting algorithm on many-core gpus," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.Google Scholar
- N. Leischner, V. Osipov, and P. Sanders, "Gpu sample sort," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.Google Scholar
- D. Cederman and P. Tsigas, "Gpu-quicksort: A practical quicksort algorithm for graphics processors," J. Exp. Algorithmics, vol. 14, pp. 4:1.4--4:1.24, Jan. 2010. Google ScholarDigital Library
- Y. Ye, Z. Du, and D. A. Bader, "Gpumemsort: A high performance graphic co-processors sorting algorithm for large scale in-memory data," 2010.Google Scholar
- H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger, "Parallel external sorting for cuda-enabled gpus with load balancing and low transfer overhead," Parallel and Distributed Processing Workshops and PhD Forum, 2011 IEEE International Symposium on, vol. 0, pp. 1--8, 2010.Google Scholar
- NVIDIA, CUDA C Programming Guide, 2012.Google Scholar
- C. A. R. Hoare, "Quicksort," The Computer Journal, vol. 5, no. 1, pp. 10--16, January 1962.Google ScholarCross Ref
- D. Knuth, The Art of Computer Programming. Addison-Wesley, 1998, vol. 3 Sorting and Searching, noted by the author in p. 158. Google ScholarDigital Library
- P. Bakkum and K. Skadron, "Accelerating sql database operations on a gpu with cuda," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, ser. GPGPU '10, 2010, pp. 94--103. Google ScholarDigital Library
- "Amd fusion whitepaper." {Online}. Available: http://sites.amd.com/us/Documents/AMD_fusion_Whitepaper.pdfGoogle Scholar
- "Intel quickpath interconnect," 2012. {Online}. Available: http://www.intel.com/technology/quickpathGoogle Scholar
- "Hypertransport interconnect," 2012. {Online}. Available: http://www.hypertransport.orgGoogle Scholar
- "Heterogeneous system architecture," 2012. {Online}. Available: www.hsafoundation.comGoogle Scholar
- I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," in Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ser. ASPLOS '10, 2010, pp. 347--358. Google ScholarDigital Library
- V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, and W. mei Hwu, "Gpu clusters for high-performance computing," in Workshop on Parallel Programming on Accelerator Clusters. IEEE CLUSTER '09, 31 2009-sept. 4 2009, pp. 1--8.Google Scholar
- NVIDIA, "Thrust parallel algorithms library," 2012. {Online}. Available: http://thrust.github.comGoogle Scholar
- T. Hagerup and C. Rüb, "Optimal merging and sorting on the erew pram," Information Processing Letters, vol. 33, no. 4, pp. 181--185, 1989. Google ScholarDigital Library
- J. Singler, P. Sanders, and F. Putze, "Mcstl: The multi-core standard template library," in Euro-Par 2007 Parallel Processing, ser. Lecture Notes in Computer Science, A.-M. Kermarrec, L. Bougé, and T. Priol, Eds. Springer Berlin/Heidelberg, 2007, vol. 4641, pp. 682--694. Google ScholarDigital Library
- D. R. Musser, "Introspective sorting and selection algorithms," Software: Practice and Experience, vol. 27, no. 8, pp. 983--993, 1997. Google ScholarDigital Library
- K. E. Batcher, "Sorting networks and their applications," in Proceedings of the April 30--May 2, 1968, spring joint computer conference, ser. AFIPS '68 (Spring), 1968, pp. 307--314. Google ScholarDigital Library
- R. Cole, "Parallel merge sort," in Foundations of Computer Science, 1986., 27th Annual Symposium on, oct. 1986, pp. 511--516. Google ScholarDigital Library
- G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, "An experimental analysis of parallel sorting algorithms," Theory of Computing Systems, vol. 31, pp. 135--167, 1998.Google ScholarCross Ref
- E. Solomonik and L. Kale, "Highly scalable parallel sorting," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--12.Google Scholar
- N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey, "Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort," in Proceedings of the 2010 international conference on Management of data, ser. SIGMOD '10, 2010, pp. 351--362. Google ScholarDigital Library
- A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient parallel merge sort for fixed and variable length keys," in Proceedings of Innovative Parallel Computing (InPar '12), May 2012.Google Scholar
- O. Green, R. McColl, and D. A. Bader, "Gpu merge path: a gpu merging algorithm," in Proceedings of the 26th ACM international conference on Supercomputing, ser. ICS '12, 2012, pp. 331--340. Google ScholarDigital Library
- D. R. Helman, J. JáJá, and D. A. Bader, "A new deterministic parallel sorting algorithm with an experimental evaluation," J. Exp. Algorithmics, vol. 3, Sep. 1998. Google ScholarDigital Library
Index Terms
- Comparison based sorting for systems with multiple GPUs
Recommendations
Fast in-place sorting with CUDA based on bitonic sort
PPAM'09: Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part IState of the art graphics processors provide high processing power and furthermore, the high programmability of GPUs offered by frameworks like CUDA increases their usability as high-performance coprocessors for general-purpose computing. Sorting is ...
Electronic poster: a massively parallel lattice Monte Carlo algorithm in CUDA for thermal conduction simulations
SC '11 Companion: Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis CompanionWe present a highly parallel CUDA kernel based on the Lattice Monte Carlo (LMC) method for transient thermal conduction, which achieves a peak acceleration of more than 100x over a single-threaded Fortran version. A number of memory and branching ...
Multiphase LBM Distributed over Multiple GPUs
CLUSTER '11: Proceedings of the 2011 IEEE International Conference on Cluster ComputingA parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios is described in this paper. Validation runs studying the terminal velocity of a rising bubble under the effect of gravity show good ...
Comments