research-article

Efficient implementation of sorting on multi-core SIMD CPU architecture

Authors:
Jatin Chhugani

Intel Corporation

Intel Corporation
View Profile

,
Anthony D. Nguyen

Intel Corporation

Intel Corporation
View Profile

,
Victor W. Lee

Intel Corporation

Intel Corporation
View Profile

,
William Macy

Intel Corporation

Intel Corporation
View Profile

,
Mostafa Hagog

Intel Corporation

Intel Corporation
View Profile

,
Yen-Kuang Chen

Intel Corporation

Intel Corporation
View Profile

,
Akram Baransi

Intel Corporation

Intel Corporation
View Profile

,
Sanjeev Kumar

Intel Corporation

Intel Corporation
View Profile

,
Pradeep Dubey

Intel Corporation

Intel Corporation
View Profile

Proceedings of the VLDB Endowment Volume 1 Issue 2pp 1313–1324https://doi.org/10.14778/1454159.1454171

Published:01 August 2008Publication History

Proceedings of the VLDB Endowment

Abstract

Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times.

This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results.

Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than current SSE width of 128-bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.

References

K. E. Batcher. Sorting networks and their applications. In Spring Joint Computer Conference, pages 307--314, 1968.Google ScholarDigital Library
G. Bilardi and A. Nicolau. Adaptive bitonic sorting: an optimal parallel algorithm for shared-memory machines. SIAM J. Comput., 18(2):216--228, 1989. Google ScholarDigital Library
Intel 64 and IA-32 architectures optimization reference manual. http://www.intel.com/products/processor/manuals.Google Scholar
M. V. de Wiel and H. Daer. Sort Performance Improvements in Oracle Database 10g Release2. An Oracle White Paper, 2005.Google Scholar
R. Francis and I. Mathieson. A Benchmark Parallel Sort for Shared Memory Multiprocessors. IEEE Transactions on Computers, 37:1619--1626, 1988. Google ScholarDigital Library
R. S. Francis, I. D. Mathieson, and L. Pannan. A Fast, Simple Algorithm to Balance a Parallel Multiway Merge. In PARLE, pages 570--581, 1993. Google ScholarDigital Library
B. Gedik, R. R. Bordawekar, and P. S. Yu. CellSort: High Performance Sorting on the Cell Processor. In VLDB '07, pages 1286--1297, 2007. Google ScholarDigital Library
N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In Proceedings of the ACM SIGMOD Conference, pages 325--336, 2006. Google ScholarDigital Library
A. Gress and G. Zachmann. GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures. International Parallel and Distributed Processing Symposium, 2006, April 2006. Google ScholarDigital Library
M. Gschwind. Chip Multiprocessing and the Cell Broadband Engine. In CF '06: Proceedings of the 3rd conference on Computing frontiers, pages 1--8, 2006. Google ScholarDigital Library
H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In PACT '07, pages 189--198, 2007. Google ScholarDigital Library
Intel Advanced Vector Extensions Programming Reference. http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel- AVX-Programming-Reference-31943302.pdf/, 2008.Google Scholar
D. E. Knuth. The Art of Computer Programming, Volume 3: (2nd ed.) Sorting and Searching. 1998. Google ScholarDigital Library
S. Lacey and R. Box. A Fast, Easy Sort. Byte Magazine, pages 315--320, 1991. Google ScholarDigital Library
W. A. Martin. Sorting. ACM Comp Surv., 3(4):147--174, 1971. Google ScholarDigital Library
T. Nakatani, S.-T. Huang, B. Arden, and S. Tripathi. K-way Bitonic Sort. IEEE Transactions on Computers, 38(2):283--288, Feb 1989. Google ScholarDigital Library
R. Parikh. Accelerating QuickSort on the Intel Pentium 4 Processor with Hyper-Threading Technology. softwarecommunity.intel.com/articles/eng/2422.htm/, 2008.Google Scholar
T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In ACM SIGGRAPH 2005 Courses, 2005. Google ScholarDigital Library
N. R. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. Submitted to SuperComputing 2008.Google Scholar
L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. Proceedings of SIGGRAPH, 27(3):to appear, 2008. Google ScholarDigital Library
S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In ACM symposium on Graphics hardware, pages 97--106, 2007. Google ScholarDigital Library
E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In Workshop on General Purpose Processing on Graphics Processing Units, 2007.Google Scholar
P. Tsigas and Y. Zhang. A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000. pdp, 00:372, 2003.Google Scholar
S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar. An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS. In Proceedings of Solid-State Circuits Conference, pages 98--589, 2007.Google ScholarCross Ref

Index Terms

Efficient implementation of sorting on multi-core SIMD CPU architecture

Recommendations

Efficient aerial image simulation on multi-core SIMD CPU
ICCAD '13: Proceedings of the International Conference on Computer-Aided Design

Aerial image simulation is a fundamental problem in advanced lithography for chip fabrication. Since it requires a huge number of mathematical computations, an efficient yet accurate implementation becomes a necessity. In the literature, GPU or FPGA has ...
Read More
Optimized HPL for AMD GPU and multi-core CPU usage

The installation of the LOEWE-CSC ( http://csc.uni-frankfurt.de/csc/__ __51 ) supercomputer at the Goethe University in Frankfurt lead to the development of a Linpack which can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for ...
Read More
Efficient SIMD implementation for accelerating convolutional neural network
ICCIP '18: Proceedings of the 4th International Conference on Communication and Information Processing

Convolutional Neural Network (CNN) has been used in a variety of fields such as computer vision, speech recognition, and natural language processing. Because the amount of computation has increased tremendously, CNN has lately been accelerated through ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 1, Issue 2
August 2008
461 pages
ISSN:2150-8097
Editors:
Peter Buneman,
Beng Chin Ooi,
Kenneth Ross,
Gerald Weber
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2008
Published in pvldb Volume 1, Issue 2
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 86
  Total Citations
  View Citations
- 2,364
  Total Downloads
- Downloads (Last 12 months)104
- Downloads (Last 6 weeks)13
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient implementation of sorting on multi-core SIMD CPU architecture

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Efficient aerial image simulation on multi-core SIMD CPU

Optimized HPL for AMD GPU and multi-core CPU usage

Efficient SIMD implementation for accelerating convolutional neural network

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Efficient implementation of sorting on multi-core SIMD CPU architecture

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Efficient aerial image simulation on multi-core SIMD CPU

Optimized HPL for AMD GPU and multi-core CPU usage

Efficient SIMD implementation for accelerating convolutional neural network

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media