Article

GPUTeraSort: high performance graphics co-processor sorting for large database management

Authors:
Naga Govindaraju

UNC Chapel Hill

UNC Chapel Hill
View Profile

,
Jim Gray

Microsoft Research

Microsoft Research
View Profile

,
Ritesh Kumar

UNC Chapel Hill

UNC Chapel Hill
View Profile

,
Dinesh Manocha

UNC Chapel Hill

UNC Chapel Hill
View Profile

SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of dataJune 2006Pages 325–336https://doi.org/10.1145/1142473.1142511

Published:27 June 2006Publication History

SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data

Pages 325–336

ABSTRACT

We present a novel external sorting algorithm using graphics processors (GPUs) on large databases composed of billions of records and wide keys. Our algorithm uses the data parallelism within a GPU along with task parallelism by scheduling some of the memory-intensive and compute-intensive threads on the GPU. Our new sorting architecture provides multiple memory interfaces on the same PC -- a fast and dedicated memory interface on the GPU along with the main memory interface for CPU computations. As a result, we achieve higher memory bandwidth as compared to CPU-based algorithms running on commodity PCs. Our approach takes into account the limited communication bandwidth between the CPU and the GPU, and reduces the data communication between the two processors. Our algorithm also improves the performance of disk transfers and achieves close to peak I/O performance. We have tested the performance of our algorithm on the SortBenchmark and applied it to large databases composed of a few hundred Gigabytes of data. Our results on a 3 GHz Pentium IV PC with $300 NVIDIA 7800 GT GPU indicate a significant performance improvement over optimized CPU-based algorithms on high-end PCs with 3.6 GHz Dual Xeon processors. Our implementation is able to outperform the current high-end PennySort benchmark and results in a higher performance to price ratio. Overall, our results indicate that using a GPU as a co-processor can significantly improve the performance of sorting algorithms on large databases.

References

Clearspeed technology. (http://www.clearspeed.com), 2005.Google Scholar
Ramesh C. Agarwal. A super scalar sort algorithm for RISC processors. SIGMOD Record (ACM Special Interest Group on Management of Data), 25(2):240--246, June 1996. Google ScholarDigital Library
Alok Aggarwal and S. Vitter Jeffrey. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116--1127, 1988. Google ScholarDigital Library
A. Ailamaki. Database architectures for new hardware. VLDB Tutorial Notes, 2004.Google Scholar
Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and Marios Skounakis. Weaving relations for cache performance. In Proceedings of the Twenty-seventh International Conference on Very Large Data Bases, pages 169--180, 2001. Google ScholarDigital Library
Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and David A. Wood. DBMSs on a modern processor: Where does time go? In The VLDB Journal, pages 266--277, 1999. Google ScholarDigital Library
Anon. et al. A measure of transaction processing power. Datamation, page 112, 1 1985. Google ScholarDigital Library
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-performance sorting on networks of workstations. SIGMOD Record (ACM Special Interest Group on Management of Data), 26(2):243--254, June 1997. Google ScholarDigital Library
N. Bandi, C. Sun, D. Agrawal, and A. El Abbadi. Hardware acceleration in commercial databases: A case study of spatial operations. VLDB, 2004.Google ScholarDigital Library
K.E. Batcher. Sorting networks and their applications. In AFIPS Spring Joint Computer Conference, 1968.Google ScholarDigital Library
B. Baugsto, J. Griepsland, and J. Kamerbeck. Sorting large data files on poma. Proc. of COMPAR-90 VAPPIV, pages 536--547, 1990. Google ScholarDigital Library
Peter A. Boncz, Stefan Manegold, and Martin L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In Proceedings of the Twenty-fifth International Conference on Very Large Databases, pages 54--65, 1999. Google ScholarDigital Library
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, 2nd edition, 2001. Google ScholarDigital Library
D. Dewitt, J. Naughton, and D. Schneider. Parallel sorting on a shared-nothing architecture using probabilistic splitting. Proc. of International Conference on Parallel and Distributed Information Systems, 1991. Google ScholarDigital Library
Anonymous et al. A measure of transaction processing power. Datamation, 31(7):112--118, 1985. Google ScholarDigital Library
E. Friend. Sorting on electronic computer systems. Journal of the ACM, 3(3):134--168, 1956. Google ScholarDigital Library
P. Garcia and H. F. Korth. Multithreaded architectures and the sort benchmark. Proc. of First International Workshop on the Data Management on New Hardware, 2005. Google ScholarDigital Library
N. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. Proc. of ACM SIGMOD, 2004. Google ScholarDigital Library
N. Govindaraju and D. Manocha. Efficient relational database management on graphics processors. Proc. of ACM Workshop on Data Management on New Hardware, 2005. Google ScholarDigital Library
N. Govindaraju, N. Raghuvanshi, and D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. Proc. of ACM SIGMOD, 2005. Google ScholarDigital Library
Naga K. Govindaraju, Jim Gray, and Dinesh Manocha. Cache-efficient general purpose algorithms on GPUs. Technical report, Microsoft Research, December 2005.Google Scholar
P. Kipfer, M. Segal, and R. Westermann. Uber ow: A gpu-based particle engine. SIGGRAPH/Eurographics Workshop on Graphics Hardware, 2004. Google ScholarDigital Library
D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, 1973.Google Scholar
A. LaMarca and R. Ladner. The in uence of caches on the performance of sorting. Proc. of SODA, pages 370--379, 1997. Google ScholarDigital Library
X. Li, G. Linoff, S. Smith, C. Stanfill, and K. Thearling. A practical external ort for shared disk mpps. Proc. of SuperComputing, pages 666--675, 1993. Google ScholarDigital Library
Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. What happens during a join? Dissecting CPU and memory optimization effects. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, pages 339--350, 2000. Google ScholarDigital Library
Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. Generic database cost models for hierarchical memory systems. In VLDB 2002: proceedings of the Twenty-Eighth International Conference on Very Large Data Bases, pages 191--202, 2002. Google ScholarDigital Library
Shintaro Meki and Yahiko Kambayashi. Acceleration of relational database operations on vector processors. Systems and Computers in Japan, 31(8):79--88, August 2000.Google ScholarCross Ref
S. Muthukrishnan. Data streams: Algorithms and applications. Proc. of 14th ACM-SIAM Symposium on Discrete Algorithms, 2003. http://athos.rutgers.edu/ muthu/stream-1-1.ps. Google ScholarDigital Library
M. H. Nodine and J. S. Vitter. Greed sort: An optimal sorting algorithm for multiple disks. Journal of the ACM, 42(4):919--933, July 1995. Google ScholarDigital Library
Nsort: Fast parallel sorting. http://www.ordinal.com/.Google Scholar
C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. Alphasort: A risc machine sort. SIGMOD, pages 233--242, 1994. Google ScholarDigital Library
Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, and David B. Lomet. AlphaSort: A cache-sensitive parallel external sort. VLDB Journal: Very Large Data Bases, 4(4):603--627, 1995. Google ScholarDigital Library
J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell. A survey of general-purpose computation on graphics hardware. 2005.Google Scholar
T. Purcell, C. Donner, M. Cammarano, H. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. ACM SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 41--50, 2003. Google ScholarDigital Library
Jun Rao and Kenneth A. Ross. Cache conscious indexing for decision-support in main memory. In Proceedings of the Twenty-fifth International Conference on Very Large Databases, pages 78--89, 1999. Google ScholarDigital Library
Kenneth A. Ross. Conjunctive selection conditions in main memory. In ACM, editor, Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems: PODS 2002: Madison, Wisconsin, June 3-5, 2002, pages 109--120, New York, NY 10036, USA, 2002. ACM Press. ACM order number 475021. Google ScholarDigital Library
Betty Salzberg, Alex Tsukerman, Jim Gray, Michael Stuewart, Susan Uren, and Bonnie Vaughan. FastSort: A distributed single-input single-output external sort. SIGMOD Record (ACM Special Interest Group on Management of Data), 19(2):94--101, June 1990. Google ScholarDigital Library
Ambuj Shatdal, Chander Kant, and Jeffrey F. Naughton. Cache conscious algorithms for relational query processing. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago, Chile proceedings, pages 510--521, Los Altos, CA 94022, USA, 1994. Morgan Kaufmann Publishers. Google ScholarDigital Library
C. Sun, D. Agrawal, and A. El Abbadi. Hardware acceleration for spatial selections and joins. SIGMOD, 2003. Google ScholarDigital Library
S. Venkatasubramanian. The graphics card as a stream computer. Workshop on Management and Processing of Data Streams, 2003.Google Scholar
J. Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys, pages 209--271, 2001. Google ScholarDigital Library
J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory II: Hierarchical multilevel memories. Algorithmica, 12:148--169, 1994.Google ScholarDigital Library
M. Zagha and G. Blelloch. Radix sort for vector multiprocessors. Proc. of SuperComputing, 1991. Google ScholarDigital Library
Jingren Zhou and Kenneth A. Ross. Implementing database operations using SIMD instructions. In Michael Franklin, Bongki Moon, and Anastassia Ailamaki, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, June 3-6, 2002, Madison, WI, USA, pages 145--156, New York, NY 10036, USA, 2002. ACM Press. ACM order number 475020. Google ScholarDigital Library

Index Terms

GPUTeraSort: high performance graphics co-processor sorting for large database management
1. Information systems
  1. Data management systems

Recommendations

Evaluation of Rodinia Codes on Intel Xeon Phi
ISMS '13: Proceedings of the 2013 4th International Conference on Intelligent Systems, Modelling and Simulation

High performance computing (HPC) is a niche area where various parallel benchmarks are constantly used to explore and evaluate the performance of Heterogeneous computing systems on the horizon. The Rodinia benchmark suite, a collection of parallel ...
Read More
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
Read More
Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and Manycores

Achieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data
June 2006
830 pages
ISBN:1595934340
DOI:10.1145/1142473
General Chairs:
Clement Yu
University of Illinois at Chicago
,
Peter Scheuermann
Northwestern University
,
Program Chair:
Surajit Chaudhuri
Microsoft Research
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 June 2006
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 258
  Total Citations
  View Citations
- 1,910
  Total Downloads
- Downloads (Last 12 months)54
- Downloads (Last 6 weeks)11
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

GPUTeraSort: high performance graphics co-processor sorting for large database management

SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Evaluation of Rodinia Codes on Intel Xeon Phi

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing

Vectorizing Unstructured Mesh Computations for Many-core Architectures

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

GPUTeraSort: high performance graphics co-processor sorting for large database management

SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Evaluation of Rodinia Codes on Intel Xeon Phi

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing

Vectorizing Unstructured Mesh Computations for Many-core Architectures

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media