skip to main content
10.1145/2458523.2458524acmotherconferencesArticle/Chapter ViewAbstractPublication PagesgpgpuConference Proceedingsconference-collections
research-article

Comparison based sorting for systems with multiple GPUs

Published:16 March 2013Publication History

ABSTRACT

As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose compuing, researches have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system.

In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speed up of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.

References

  1. P. Jetley, F. Gioachin, C. Mendes, L. Kale, and T. Quinn, "Massively parallel cosmological simulations with changa," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, april 2008, pp. 1--12.Google ScholarGoogle Scholar
  2. N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09, 2009, pp. 18:1--18:11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, "Fast bvh construction on gpus," Computer Graphics Forum, vol. 28, no. 2, pp. 375--384, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  4. G. Graefe, "Implementing sorting in database systems," ACM Comput. Surv., vol. 38, no. 3, Sep. 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. Wu, B. Zhang, and M. Hsu, "Clustering billions of data points using gpus," in Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, ser. UCHPC-MAW '09, 2009, pp. 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: a mapreduce framework on graphics processors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, ser. PACT '08, 2008, pp. 260--269. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Stuart and J. Owens, "Multi-gpu mapreduce on gpu clusters," in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, may 2011, pp. 1068--1079. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, "Gpu computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879--899, may 2008.Google ScholarGoogle ScholarCross RefCross Ref
  9. E. Sintorn and U. Assarsson, "Fast parallel gpu-sorting using a hybrid algorithm," Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1381--1388, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore gpus," in Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, may 2009, pp. 1--10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne, "High performance comparison-based sorting algorithm on many-core gpus," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.Google ScholarGoogle Scholar
  12. N. Leischner, V. Osipov, and P. Sanders, "Gpu sample sort," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.Google ScholarGoogle Scholar
  13. D. Cederman and P. Tsigas, "Gpu-quicksort: A practical quicksort algorithm for graphics processors," J. Exp. Algorithmics, vol. 14, pp. 4:1.4--4:1.24, Jan. 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Y. Ye, Z. Du, and D. A. Bader, "Gpumemsort: A high performance graphic co-processors sorting algorithm for large scale in-memory data," 2010.Google ScholarGoogle Scholar
  15. H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger, "Parallel external sorting for cuda-enabled gpus with load balancing and low transfer overhead," Parallel and Distributed Processing Workshops and PhD Forum, 2011 IEEE International Symposium on, vol. 0, pp. 1--8, 2010.Google ScholarGoogle Scholar
  16. NVIDIA, CUDA C Programming Guide, 2012.Google ScholarGoogle Scholar
  17. C. A. R. Hoare, "Quicksort," The Computer Journal, vol. 5, no. 1, pp. 10--16, January 1962.Google ScholarGoogle ScholarCross RefCross Ref
  18. D. Knuth, The Art of Computer Programming. Addison-Wesley, 1998, vol. 3 Sorting and Searching, noted by the author in p. 158. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. Bakkum and K. Skadron, "Accelerating sql database operations on a gpu with cuda," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, ser. GPGPU '10, 2010, pp. 94--103. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. "Amd fusion whitepaper." {Online}. Available: http://sites.amd.com/us/Documents/AMD_fusion_Whitepaper.pdfGoogle ScholarGoogle Scholar
  21. "Intel quickpath interconnect," 2012. {Online}. Available: http://www.intel.com/technology/quickpathGoogle ScholarGoogle Scholar
  22. "Hypertransport interconnect," 2012. {Online}. Available: http://www.hypertransport.orgGoogle ScholarGoogle Scholar
  23. "Heterogeneous system architecture," 2012. {Online}. Available: www.hsafoundation.comGoogle ScholarGoogle Scholar
  24. I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," in Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ser. ASPLOS '10, 2010, pp. 347--358. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, and W. mei Hwu, "Gpu clusters for high-performance computing," in Workshop on Parallel Programming on Accelerator Clusters. IEEE CLUSTER '09, 31 2009-sept. 4 2009, pp. 1--8.Google ScholarGoogle Scholar
  26. NVIDIA, "Thrust parallel algorithms library," 2012. {Online}. Available: http://thrust.github.comGoogle ScholarGoogle Scholar
  27. T. Hagerup and C. Rüb, "Optimal merging and sorting on the erew pram," Information Processing Letters, vol. 33, no. 4, pp. 181--185, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. J. Singler, P. Sanders, and F. Putze, "Mcstl: The multi-core standard template library," in Euro-Par 2007 Parallel Processing, ser. Lecture Notes in Computer Science, A.-M. Kermarrec, L. Bougé, and T. Priol, Eds. Springer Berlin/Heidelberg, 2007, vol. 4641, pp. 682--694. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. D. R. Musser, "Introspective sorting and selection algorithms," Software: Practice and Experience, vol. 27, no. 8, pp. 983--993, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. K. E. Batcher, "Sorting networks and their applications," in Proceedings of the April 30--May 2, 1968, spring joint computer conference, ser. AFIPS '68 (Spring), 1968, pp. 307--314. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. R. Cole, "Parallel merge sort," in Foundations of Computer Science, 1986., 27th Annual Symposium on, oct. 1986, pp. 511--516. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, "An experimental analysis of parallel sorting algorithms," Theory of Computing Systems, vol. 31, pp. 135--167, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  33. E. Solomonik and L. Kale, "Highly scalable parallel sorting," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--12.Google ScholarGoogle Scholar
  34. N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey, "Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort," in Proceedings of the 2010 international conference on Management of data, ser. SIGMOD '10, 2010, pp. 351--362. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient parallel merge sort for fixed and variable length keys," in Proceedings of Innovative Parallel Computing (InPar '12), May 2012.Google ScholarGoogle Scholar
  36. O. Green, R. McColl, and D. A. Bader, "Gpu merge path: a gpu merging algorithm," in Proceedings of the 26th ACM international conference on Supercomputing, ser. ICS '12, 2012, pp. 331--340. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. D. R. Helman, J. JáJá, and D. A. Bader, "A new deterministic parallel sorting algorithm with an experimental evaluation," J. Exp. Algorithmics, vol. 3, Sep. 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Comparison based sorting for systems with multiple GPUs

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Other conferences
      GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
      March 2013
      156 pages
      ISBN:9781450320177
      DOI:10.1145/2458523

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 March 2013

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      GPGPU-6 Paper Acceptance Rate15of37submissions,41%Overall Acceptance Rate57of129submissions,44%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader