research-article

Relational query coprocessing on graphics processors

Authors:
Bingsheng He

Microsoft Research Asia

Microsoft Research Asia
View Profile

,
Mian Lu

Hong Kong University of Science and Technology

Hong Kong University of Science and Technology
View Profile

,
Ke Yang

Microsoft Corporation

Microsoft Corporation
View Profile

,
Rui Fang

Highbridge Capital Management LLC, USA

Highbridge Capital Management LLC, USA
View Profile

,
Naga K. Govindaraju

Microsoft Corporation

Microsoft Corporation
View Profile

,
Qiong Luo

Hong Kong University of Science and Technology

Hong Kong University of Science and Technology
View Profile

,
Pedro V. Sander

Hong Kong University of Science and Technology

Hong Kong University of Science and Technology
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 34 Issue 4Article No.: 21pp 1–39https://doi.org/10.1145/1620585.1620588

Published:14 December 2009Publication History

ACM Transactions on Database Systems

Abstract

Graphics processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power as well as memory bandwidth. Moreover, new-generation GPUs allow writes to random memory locations, provide efficient interprocessor communication through on-chip local memory, and support a general purpose parallel programming model. Nevertheless, many of the GPU features are specialized for graphics processing, including the massively multithreaded architecture, the Single-Instruction-Multiple-Data processing style, and the execution model of a single application at a time. Additionally, GPUs rely on a bus of limited bandwidth to transfer data to and from the CPU, do not allow dynamic memory allocation from GPU kernels, and have little hardware support for write conflicts. Therefore, a careful design and implementation is required to utilize the GPU for coprocessing database queries.

In this article, we present our design, implementation, and evaluation of an in-memory relational query coprocessing system, GDB, on the GPU. Taking advantage of the GPU hardware features, we design a set of highly optimized data-parallel primitives such as split and sort, and use these primitives to implement common relational query processing algorithms. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU, and use parallel computation and memory optimizations to effectively reduce memory stalls. Furthermore, we propose coprocessing techniques that take into account both the computation resources and the GPU-CPU data transfer cost so that each operator in a query can utilize suitable processors—the CPU, the GPU, or both—for an optimized overall performance. We have evaluated our GDB system on a machine with an Intel quad-core CPU and an NVIDIA GeForce 8800 GTX GPU. Our workloads include microbenchmark queries on memory-resident data as well as TPC-H queries that involve complex data types and multiple query operators on data sets larger than the GPU memory. Our results show that our GPU-based algorithms are 2--27x faster than their optimized CPU-based counterparts on in-memory data. Moreover, the performance of our coprocessing scheme is similar to, or better than, both the GPU-only and the CPU-only schemes.

References

Abadi, D., Madden, S., and Ferreira, M. 2006. Integrating compression and execution in column-oriented database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'06). ACM, New York, NY, 671--682. Google ScholarDigital Library
Ailamaki, A., DeWitt, D. J., Hill, M. D., and Wood, D. A. 1999. DBMSs on a modern processor: Where does time go&quest; In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, CA, 266--277. Google ScholarDigital Library
Ailamaki, A., Govindaraju, N., Harizopoulos, S., and Manocha, D. 2006. Query co-processing on commodity processors. In Proceedings of the International Conference on Very Large Data Bases (VLDB). Google ScholarDigital Library
Bandi, N., Sun, C., Agrawal, D., and Abbadi, A. E. 2004. Hardware acceleration in commercial databases: a case study of spatial operations. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 1021--1032. Google ScholarDigital Library
Blelloch, G. E. 1990. Prefix sums and their applications. Tech. rep. CMU-CS-90-190, Carnegie Mellan University.Google Scholar
Blythe, D. 2006. The direct3d 10 system. SIGGRAPH, ACM Press, NY. Google ScholarDigital Library
Boncz, P. A., Manegold, S., and Kersten, M. L. 1999. Database architecture optimized for the new bottleneck: Memory access. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, CA, 54--65. Google ScholarDigital Library
Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., and Hanrahan, P. 2004. Brook for gpus: Stream computing on graphics hardware. SIGGRAPH, ACM Press, NY. Google ScholarDigital Library
Cieslewicz, J., Berry, J., Hendrickson, B., and Ross, K. A. 2006. Realizing parallelism in database operations: insights from a massively multithreaded architecture. In Proceedings of the 2nd International Workshop on Data Management on New Hardware (DaMoN). ACM, New York. Google ScholarDigital Library
Cieslewicz, J. and Ross, K. A. 2007. Adaptive aggregation on chip multiprocessors. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 339--350. Google ScholarDigital Library
DeWitt, D. and Gray, J. 1992. Parallel database systems: the future of high performance database systems. Comm. ACM 35, 6, 85--98. Google ScholarDigital Library
Frigo, M., Leiserson, C. E., Prokop, H., and Ramachandran, S. 1999. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, Washington, DC, 285. Google ScholarDigital Library
Gedik, B., Bordawekar, R. R., and Yu, P. S. 2007. Cellsort: High performance sorting on the cell processor. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 1286--1297. Google ScholarDigital Library
Gedik, B., Yu, P. S., and Bordawekar, R. R. 2007. Executing stream joins on the cell processor. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 363--374. Google ScholarDigital Library
Gold, B., Ailamaki, A., Huston, L., and Falsafi, B. 2005. Accelerating database operators using a network processor. In Proceedings of the 1st International Workshop on Data Management on New Hardware (DaMoN). ACM, New York, NY, 1. Google ScholarDigital Library
Govindaraju, N., Gray, J., Kumar, R., and Manocha, D. 2006. Gputerasort: high performance graphics co-processor sorting for large database management. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM, New York, NY, 325--336. Google ScholarDigital Library
Govindaraju, N., Lloyd, B., Wang, W., Lin, M., and Manocha, D. 2004. Fast computation of database operations using graphics processors. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM, New York, NY, 215--226. Google ScholarDigital Library
Govindaraju, N. K., Raghuvanshi, N., and Manocha, D. 2005. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proceedings of the ACM SIGMOD International Conference on Management of Data. New York, NY, 611--622. Google ScholarDigital Library
Graefe, G. 1993. Query evaluation techniques for large databases. ACM Comput. Surv. 25, 2, 73--169. Google ScholarDigital Library
Harizopoulos, S., Abadi, D. J., Madden, S., and Stonebraker, M. 2008. OLTP through the looking glass, and what we found there. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 981--992. Google ScholarDigital Library
Harris, M., Owens, J., Sengupta, S., Zhang, Y., and Davidson, A. 2007. CUDPP: CUDA data parallel primitives library.Google Scholar
He, B., Govindaraju, N. K., Luo, Q., and Smith, B. 2007. Efficient gather and scatter operations on graphics processors. In Proceedings of the ACM/IEEE Conference on Supercomputing. ACM, New York, NY, 1--12. Google ScholarDigital Library
He, B. and Luo, Q. 2008. Cache-oblivious databases: Limitations and opportunities. ACM Trans. Datab. Syst. 33, 2, 1--42. Google ScholarDigital Library
He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N., Luo, Q., and Sander, P. 2008. Relational joins on graphics processors. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM, New York, NY, 511--524. Google ScholarDigital Library
Héman, S., Nes, N., Zukowski, M., and Boncz, P. 2007. Vectorized data processing on the cell broadband engine. In Proceedings of the 3rd International Workshop on Data Management on New Hardware (DaMoN). ACM, New York, NY, 1--6. Google ScholarDigital Library
Hong, W. and Stonebraker, M. 1991. Optimization of parallel query execution plans in xprs. In Proceedings of the 1st International Conference on Parallel and Distributed Information Systems (PDIS). IEEE Computer Society Press, Los Alamitos, CA, 218--225. Google ScholarDigital Library
Johnson, R., Harizopoulos, S., Hardavellas, N., Sabirli, K., Pandis, I., Ailamaki, A., Mancheril, N. G., and Falsafi, B. 2007. To share or not to share&quest; In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 351--362. Google ScholarDigital Library
LaMarca, A. and Ladner, R. E. 1997. The influence of caches on the performance of sorting. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). Society for Industrial and Applied Mathematics, Philadelphia, PA, 370--379. Google ScholarDigital Library
Lieberman, M. D., Sankaranarayanan, J., and Samet, H. 2008. A fast similarity join algorithm using graphics processing units. In Proceedings of the International Conference on Data Engineering (ICDE). Google ScholarDigital Library
Liu, B. and Rundensteiner, E. A. 2005. Revisiting pipelined parallelism in multi-join query processing. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 829--840. Google ScholarDigital Library
Lu, H., Tan, K.-L., and Sahn, M.-C. 1990. Hash-based join algorithms for multiprocessor computers with shared memory. In Proceedings of the 16th International Conference on Very Large Databases. Morgan Kaufmann Publishers Inc., San Francisco, CA, 198--209. Google ScholarDigital Library
Manegold, S., Boncz, P., and Kersten, M. 2002. Generic database cost models for hierarchical memory systems. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB). Google ScholarDigital Library
Nguyen, H. 2008. GPU gems 3. Addison-Wesley. Google ScholarDigital Library
Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A. E., and Purcell, T. J. 2007. A survey of general-purpose computation on graphics hardware. Comput. Graph. Forum 26.Google Scholar
Rao, J. and Ross, K. A. 1999. Cache conscious indexing for decision-support in main memory. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, CA, 78--89. Google ScholarDigital Library
Schneider, D. A. and DeWitt, D. J. 1989. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. SIGMOD Rec. 18, 2, 110--121. Google ScholarDigital Library
Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM Press, New York, NY, 23--34. Google ScholarDigital Library
Sengupta, S., Harris, M., Zhang, Y., and Owens, J. D. 2007. Scan primitives for gpu computing. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware. Google ScholarDigital Library
Shao, M., Ailamaki, A., and Falsafi, B. 2005. DBmbench: Fast and accurate database workload representation on modern microarchitecture. In Proceedings of the Conference of the Centre for Advanced Studies on Collaborative Research (CASCON). IBM Press, 254--267. Google ScholarDigital Library
Shatdal, A., Kant, C., and Naughton, J. F. 1994. Cache conscious algorithms for relational query processing. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, CA, 510--521. Google ScholarDigital Library
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., and Zdonik, S. 2005. C-store: A column-oriented dbms. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 553--564. Google ScholarDigital Library
Sun, C., Agrawal, D., and Abbadi, A. E. 2003. Hardware acceleration for spatial selections and joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 455--466. Google ScholarDigital Library
Tarditi, D., Puri, S., and Oglesby, J. 2006. Accelerator: using data parallelism to program GPUS for general-purpose uses. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems. Google ScholarDigital Library
Zhou, J., Cieslewicz, J., Ross, K. A., and Shah, M. 2005. Improving database performance on simultaneous multithreading processors. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 49--60. Google ScholarDigital Library
Zukowski, M., Heman, S., Nes, N., and Boncz, P. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE). IEEE Computer Society, Washington, DC, 59. Google ScholarDigital Library

Index Terms

Relational query coprocessing on graphics processors
1. Information systems
  1. Data management systems
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Relational joins on graphics processors
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data

We present a novel design and implementation of relational join algorithms for new-generation graphics processing units (GPUs). The most recent GPU features include support for writing to random memory locations, efficient inter-processor communication, ...
Read More
A performance study of general-purpose applications on graphics processors using CUDA

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...
Read More
Efficient gather and scatter operations on graphics processors
SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Gather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs).
...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 34, Issue 4
December 2009
271 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1620585
Issue’s Table of Contents

Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 December 2009
- Accepted: 1 August 2009
- Revised: 1 March 2009
- Received: 1 August 2008
Published in tods Volume 34, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Relational database
graphics processors
join
parallel processing
primitive
sort
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 205
  Total Citations
  View Citations
- 1,856
  Total Downloads
- Downloads (Last 12 months)69
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Relational query coprocessing on graphics processors

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Relational joins on graphics processors

A performance study of general-purpose applications on graphics processors using CUDA

Efficient gather and scatter operations on graphics processors

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Relational query coprocessing on graphics processors

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Relational joins on graphics processors

A performance study of general-purpose applications on graphics processors using CUDA

Efficient gather and scatter operations on graphics processors

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media