skip to main content
research-article

Efficient implementation of sorting on multi-core SIMD CPU architecture

Published:01 August 2008Publication History
Skip Abstract Section

Abstract

Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times.

This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results.

Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than current SSE width of 128-bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.

References

  1. K. E. Batcher. Sorting networks and their applications. In Spring Joint Computer Conference, pages 307--314, 1968.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. G. Bilardi and A. Nicolau. Adaptive bitonic sorting: an optimal parallel algorithm for shared-memory machines. SIAM J. Comput., 18(2):216--228, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Intel 64 and IA-32 architectures optimization reference manual. http://www.intel.com/products/processor/manuals.Google ScholarGoogle Scholar
  4. M. V. de Wiel and H. Daer. Sort Performance Improvements in Oracle Database 10g Release2. An Oracle White Paper, 2005.Google ScholarGoogle Scholar
  5. R. Francis and I. Mathieson. A Benchmark Parallel Sort for Shared Memory Multiprocessors. IEEE Transactions on Computers, 37:1619--1626, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. R. S. Francis, I. D. Mathieson, and L. Pannan. A Fast, Simple Algorithm to Balance a Parallel Multiway Merge. In PARLE, pages 570--581, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. B. Gedik, R. R. Bordawekar, and P. S. Yu. CellSort: High Performance Sorting on the Cell Processor. In VLDB '07, pages 1286--1297, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In Proceedings of the ACM SIGMOD Conference, pages 325--336, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Gress and G. Zachmann. GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures. International Parallel and Distributed Processing Symposium, 2006, April 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Gschwind. Chip Multiprocessing and the Cell Broadband Engine. In CF '06: Proceedings of the 3rd conference on Computing frontiers, pages 1--8, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In PACT '07, pages 189--198, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Intel Advanced Vector Extensions Programming Reference. http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel- AVX-Programming-Reference-31943302.pdf/, 2008.Google ScholarGoogle Scholar
  13. D. E. Knuth. The Art of Computer Programming, Volume 3: (2nd ed.) Sorting and Searching. 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Lacey and R. Box. A Fast, Easy Sort. Byte Magazine, pages 315--320, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. W. A. Martin. Sorting. ACM Comp Surv., 3(4):147--174, 1971. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. T. Nakatani, S.-T. Huang, B. Arden, and S. Tripathi. K-way Bitonic Sort. IEEE Transactions on Computers, 38(2):283--288, Feb 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Parikh. Accelerating QuickSort on the Intel Pentium 4 Processor with Hyper-Threading Technology. softwarecommunity.intel.com/articles/eng/2422.htm/, 2008.Google ScholarGoogle Scholar
  18. T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In ACM SIGGRAPH 2005 Courses, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. N. R. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. Submitted to SuperComputing 2008.Google ScholarGoogle Scholar
  20. L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. Proceedings of SIGGRAPH, 27(3):to appear, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In ACM symposium on Graphics hardware, pages 97--106, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In Workshop on General Purpose Processing on Graphics Processing Units, 2007.Google ScholarGoogle Scholar
  23. P. Tsigas and Y. Zhang. A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000. pdp, 00:372, 2003.Google ScholarGoogle Scholar
  24. S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar. An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS. In Proceedings of Solid-State Circuits Conference, pages 98--589, 2007.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Efficient implementation of sorting on multi-core SIMD CPU architecture

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader