Abstract
Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times.
This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results.
Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than current SSE width of 128-bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.
- K. E. Batcher. Sorting networks and their applications. In Spring Joint Computer Conference, pages 307--314, 1968.Google ScholarDigital Library
- G. Bilardi and A. Nicolau. Adaptive bitonic sorting: an optimal parallel algorithm for shared-memory machines. SIAM J. Comput., 18(2):216--228, 1989. Google ScholarDigital Library
- Intel 64 and IA-32 architectures optimization reference manual. http://www.intel.com/products/processor/manuals.Google Scholar
- M. V. de Wiel and H. Daer. Sort Performance Improvements in Oracle Database 10g Release2. An Oracle White Paper, 2005.Google Scholar
- R. Francis and I. Mathieson. A Benchmark Parallel Sort for Shared Memory Multiprocessors. IEEE Transactions on Computers, 37:1619--1626, 1988. Google ScholarDigital Library
- R. S. Francis, I. D. Mathieson, and L. Pannan. A Fast, Simple Algorithm to Balance a Parallel Multiway Merge. In PARLE, pages 570--581, 1993. Google ScholarDigital Library
- B. Gedik, R. R. Bordawekar, and P. S. Yu. CellSort: High Performance Sorting on the Cell Processor. In VLDB '07, pages 1286--1297, 2007. Google ScholarDigital Library
- N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In Proceedings of the ACM SIGMOD Conference, pages 325--336, 2006. Google ScholarDigital Library
- A. Gress and G. Zachmann. GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures. International Parallel and Distributed Processing Symposium, 2006, April 2006. Google ScholarDigital Library
- M. Gschwind. Chip Multiprocessing and the Cell Broadband Engine. In CF '06: Proceedings of the 3rd conference on Computing frontiers, pages 1--8, 2006. Google ScholarDigital Library
- H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In PACT '07, pages 189--198, 2007. Google ScholarDigital Library
- Intel Advanced Vector Extensions Programming Reference. http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel- AVX-Programming-Reference-31943302.pdf/, 2008.Google Scholar
- D. E. Knuth. The Art of Computer Programming, Volume 3: (2nd ed.) Sorting and Searching. 1998. Google ScholarDigital Library
- S. Lacey and R. Box. A Fast, Easy Sort. Byte Magazine, pages 315--320, 1991. Google ScholarDigital Library
- W. A. Martin. Sorting. ACM Comp Surv., 3(4):147--174, 1971. Google ScholarDigital Library
- T. Nakatani, S.-T. Huang, B. Arden, and S. Tripathi. K-way Bitonic Sort. IEEE Transactions on Computers, 38(2):283--288, Feb 1989. Google ScholarDigital Library
- R. Parikh. Accelerating QuickSort on the Intel Pentium 4 Processor with Hyper-Threading Technology. softwarecommunity.intel.com/articles/eng/2422.htm/, 2008.Google Scholar
- T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In ACM SIGGRAPH 2005 Courses, 2005. Google ScholarDigital Library
- N. R. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. Submitted to SuperComputing 2008.Google Scholar
- L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. Proceedings of SIGGRAPH, 27(3):to appear, 2008. Google ScholarDigital Library
- S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In ACM symposium on Graphics hardware, pages 97--106, 2007. Google ScholarDigital Library
- E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In Workshop on General Purpose Processing on Graphics Processing Units, 2007.Google Scholar
- P. Tsigas and Y. Zhang. A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000. pdp, 00:372, 2003.Google Scholar
- S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar. An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS. In Proceedings of Solid-State Circuits Conference, pages 98--589, 2007.Google ScholarCross Ref
Index Terms
- Efficient implementation of sorting on multi-core SIMD CPU architecture
Recommendations
Efficient aerial image simulation on multi-core SIMD CPU
ICCAD '13: Proceedings of the International Conference on Computer-Aided DesignAerial image simulation is a fundamental problem in advanced lithography for chip fabrication. Since it requires a huge number of mathematical computations, an efficient yet accurate implementation becomes a necessity. In the literature, GPU or FPGA has ...
Optimized HPL for AMD GPU and multi-core CPU usage
The installation of the LOEWE-CSC ( http://csc.uni-frankfurt.de/csc/__ __51 ) supercomputer at the Goethe University in Frankfurt lead to the development of a Linpack which can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for ...
Efficient SIMD implementation for accelerating convolutional neural network
ICCIP '18: Proceedings of the 4th International Conference on Communication and Information ProcessingConvolutional Neural Network (CNN) has been used in a variety of fields such as computer vision, speech recognition, and natural language processing. Because the amount of computation has increased tremendously, CNN has lately been accelerated through ...
Comments