ABSTRACT
We present a novel external sorting algorithm using graphics processors (GPUs) on large databases composed of billions of records and wide keys. Our algorithm uses the data parallelism within a GPU along with task parallelism by scheduling some of the memory-intensive and compute-intensive threads on the GPU. Our new sorting architecture provides multiple memory interfaces on the same PC -- a fast and dedicated memory interface on the GPU along with the main memory interface for CPU computations. As a result, we achieve higher memory bandwidth as compared to CPU-based algorithms running on commodity PCs. Our approach takes into account the limited communication bandwidth between the CPU and the GPU, and reduces the data communication between the two processors. Our algorithm also improves the performance of disk transfers and achieves close to peak I/O performance. We have tested the performance of our algorithm on the SortBenchmark and applied it to large databases composed of a few hundred Gigabytes of data. Our results on a 3 GHz Pentium IV PC with $300 NVIDIA 7800 GT GPU indicate a significant performance improvement over optimized CPU-based algorithms on high-end PCs with 3.6 GHz Dual Xeon processors. Our implementation is able to outperform the current high-end PennySort benchmark and results in a higher performance to price ratio. Overall, our results indicate that using a GPU as a co-processor can significantly improve the performance of sorting algorithms on large databases.
- Clearspeed technology. (http://www.clearspeed.com), 2005.Google Scholar
- Ramesh C. Agarwal. A super scalar sort algorithm for RISC processors. SIGMOD Record (ACM Special Interest Group on Management of Data), 25(2):240--246, June 1996. Google ScholarDigital Library
- Alok Aggarwal and S. Vitter Jeffrey. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116--1127, 1988. Google ScholarDigital Library
- A. Ailamaki. Database architectures for new hardware. VLDB Tutorial Notes, 2004.Google Scholar
- Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and Marios Skounakis. Weaving relations for cache performance. In Proceedings of the Twenty-seventh International Conference on Very Large Data Bases, pages 169--180, 2001. Google ScholarDigital Library
- Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and David A. Wood. DBMSs on a modern processor: Where does time go? In The VLDB Journal, pages 266--277, 1999. Google ScholarDigital Library
- Anon. et al. A measure of transaction processing power. Datamation, page 112, 1 1985. Google ScholarDigital Library
- Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-performance sorting on networks of workstations. SIGMOD Record (ACM Special Interest Group on Management of Data), 26(2):243--254, June 1997. Google ScholarDigital Library
- N. Bandi, C. Sun, D. Agrawal, and A. El Abbadi. Hardware acceleration in commercial databases: A case study of spatial operations. VLDB, 2004.Google ScholarDigital Library
- K.E. Batcher. Sorting networks and their applications. In AFIPS Spring Joint Computer Conference, 1968.Google ScholarDigital Library
- B. Baugsto, J. Griepsland, and J. Kamerbeck. Sorting large data files on poma. Proc. of COMPAR-90 VAPPIV, pages 536--547, 1990. Google ScholarDigital Library
- Peter A. Boncz, Stefan Manegold, and Martin L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In Proceedings of the Twenty-fifth International Conference on Very Large Databases, pages 54--65, 1999. Google ScholarDigital Library
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, 2nd edition, 2001. Google ScholarDigital Library
- D. Dewitt, J. Naughton, and D. Schneider. Parallel sorting on a shared-nothing architecture using probabilistic splitting. Proc. of International Conference on Parallel and Distributed Information Systems, 1991. Google ScholarDigital Library
- Anonymous et al. A measure of transaction processing power. Datamation, 31(7):112--118, 1985. Google ScholarDigital Library
- E. Friend. Sorting on electronic computer systems. Journal of the ACM, 3(3):134--168, 1956. Google ScholarDigital Library
- P. Garcia and H. F. Korth. Multithreaded architectures and the sort benchmark. Proc. of First International Workshop on the Data Management on New Hardware, 2005. Google ScholarDigital Library
- N. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. Proc. of ACM SIGMOD, 2004. Google ScholarDigital Library
- N. Govindaraju and D. Manocha. Efficient relational database management on graphics processors. Proc. of ACM Workshop on Data Management on New Hardware, 2005. Google ScholarDigital Library
- N. Govindaraju, N. Raghuvanshi, and D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. Proc. of ACM SIGMOD, 2005. Google ScholarDigital Library
- Naga K. Govindaraju, Jim Gray, and Dinesh Manocha. Cache-efficient general purpose algorithms on GPUs. Technical report, Microsoft Research, December 2005.Google Scholar
- P. Kipfer, M. Segal, and R. Westermann. Uber ow: A gpu-based particle engine. SIGGRAPH/Eurographics Workshop on Graphics Hardware, 2004. Google ScholarDigital Library
- D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, 1973.Google Scholar
- A. LaMarca and R. Ladner. The in uence of caches on the performance of sorting. Proc. of SODA, pages 370--379, 1997. Google ScholarDigital Library
- X. Li, G. Linoff, S. Smith, C. Stanfill, and K. Thearling. A practical external ort for shared disk mpps. Proc. of SuperComputing, pages 666--675, 1993. Google ScholarDigital Library
- Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. What happens during a join? Dissecting CPU and memory optimization effects. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, pages 339--350, 2000. Google ScholarDigital Library
- Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. Generic database cost models for hierarchical memory systems. In VLDB 2002: proceedings of the Twenty-Eighth International Conference on Very Large Data Bases, pages 191--202, 2002. Google ScholarDigital Library
- Shintaro Meki and Yahiko Kambayashi. Acceleration of relational database operations on vector processors. Systems and Computers in Japan, 31(8):79--88, August 2000.Google ScholarCross Ref
- S. Muthukrishnan. Data streams: Algorithms and applications. Proc. of 14th ACM-SIAM Symposium on Discrete Algorithms, 2003. http://athos.rutgers.edu/ muthu/stream-1-1.ps. Google ScholarDigital Library
- M. H. Nodine and J. S. Vitter. Greed sort: An optimal sorting algorithm for multiple disks. Journal of the ACM, 42(4):919--933, July 1995. Google ScholarDigital Library
- Nsort: Fast parallel sorting. http://www.ordinal.com/.Google Scholar
- C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. Alphasort: A risc machine sort. SIGMOD, pages 233--242, 1994. Google ScholarDigital Library
- Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, and David B. Lomet. AlphaSort: A cache-sensitive parallel external sort. VLDB Journal: Very Large Data Bases, 4(4):603--627, 1995. Google ScholarDigital Library
- J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell. A survey of general-purpose computation on graphics hardware. 2005.Google Scholar
- T. Purcell, C. Donner, M. Cammarano, H. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. ACM SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 41--50, 2003. Google ScholarDigital Library
- Jun Rao and Kenneth A. Ross. Cache conscious indexing for decision-support in main memory. In Proceedings of the Twenty-fifth International Conference on Very Large Databases, pages 78--89, 1999. Google ScholarDigital Library
- Kenneth A. Ross. Conjunctive selection conditions in main memory. In ACM, editor, Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems: PODS 2002: Madison, Wisconsin, June 3-5, 2002, pages 109--120, New York, NY 10036, USA, 2002. ACM Press. ACM order number 475021. Google ScholarDigital Library
- Betty Salzberg, Alex Tsukerman, Jim Gray, Michael Stuewart, Susan Uren, and Bonnie Vaughan. FastSort: A distributed single-input single-output external sort. SIGMOD Record (ACM Special Interest Group on Management of Data), 19(2):94--101, June 1990. Google ScholarDigital Library
- Ambuj Shatdal, Chander Kant, and Jeffrey F. Naughton. Cache conscious algorithms for relational query processing. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago, Chile proceedings, pages 510--521, Los Altos, CA 94022, USA, 1994. Morgan Kaufmann Publishers. Google ScholarDigital Library
- C. Sun, D. Agrawal, and A. El Abbadi. Hardware acceleration for spatial selections and joins. SIGMOD, 2003. Google ScholarDigital Library
- S. Venkatasubramanian. The graphics card as a stream computer. Workshop on Management and Processing of Data Streams, 2003.Google Scholar
- J. Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys, pages 209--271, 2001. Google ScholarDigital Library
- J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory II: Hierarchical multilevel memories. Algorithmica, 12:148--169, 1994.Google ScholarDigital Library
- M. Zagha and G. Blelloch. Radix sort for vector multiprocessors. Proc. of SuperComputing, 1991. Google ScholarDigital Library
- Jingren Zhou and Kenneth A. Ross. Implementing database operations using SIMD instructions. In Michael Franklin, Bongki Moon, and Anastassia Ailamaki, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, June 3-6, 2002, Madison, WI, USA, pages 145--156, New York, NY 10036, USA, 2002. ACM Press. ACM order number 475020. Google ScholarDigital Library
Index Terms
- GPUTeraSort: high performance graphics co-processor sorting for large database management
Recommendations
Evaluation of Rodinia Codes on Intel Xeon Phi
ISMS '13: Proceedings of the 2013 4th International Conference on Intelligent Systems, Modelling and SimulationHigh performance computing (HPC) is a niche area where various parallel benchmarks are constantly used to explore and evaluate the performance of Heterogeneous computing systems on the horizon. The Rodinia benchmark suite, a collection of parallel ...
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance ComputingThe graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and ManycoresAchieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...
Comments