Abstract
Graphics processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power as well as memory bandwidth. Moreover, new-generation GPUs allow writes to random memory locations, provide efficient interprocessor communication through on-chip local memory, and support a general purpose parallel programming model. Nevertheless, many of the GPU features are specialized for graphics processing, including the massively multithreaded architecture, the Single-Instruction-Multiple-Data processing style, and the execution model of a single application at a time. Additionally, GPUs rely on a bus of limited bandwidth to transfer data to and from the CPU, do not allow dynamic memory allocation from GPU kernels, and have little hardware support for write conflicts. Therefore, a careful design and implementation is required to utilize the GPU for coprocessing database queries.
In this article, we present our design, implementation, and evaluation of an in-memory relational query coprocessing system, GDB, on the GPU. Taking advantage of the GPU hardware features, we design a set of highly optimized data-parallel primitives such as split and sort, and use these primitives to implement common relational query processing algorithms. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU, and use parallel computation and memory optimizations to effectively reduce memory stalls. Furthermore, we propose coprocessing techniques that take into account both the computation resources and the GPU-CPU data transfer cost so that each operator in a query can utilize suitable processors—the CPU, the GPU, or both—for an optimized overall performance. We have evaluated our GDB system on a machine with an Intel quad-core CPU and an NVIDIA GeForce 8800 GTX GPU. Our workloads include microbenchmark queries on memory-resident data as well as TPC-H queries that involve complex data types and multiple query operators on data sets larger than the GPU memory. Our results show that our GPU-based algorithms are 2--27x faster than their optimized CPU-based counterparts on in-memory data. Moreover, the performance of our coprocessing scheme is similar to, or better than, both the GPU-only and the CPU-only schemes.
- Abadi, D., Madden, S., and Ferreira, M. 2006. Integrating compression and execution in column-oriented database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'06). ACM, New York, NY, 671--682. Google ScholarDigital Library
- Ailamaki, A., DeWitt, D. J., Hill, M. D., and Wood, D. A. 1999. DBMSs on a modern processor: Where does time go? In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, CA, 266--277. Google ScholarDigital Library
- Ailamaki, A., Govindaraju, N., Harizopoulos, S., and Manocha, D. 2006. Query co-processing on commodity processors. In Proceedings of the International Conference on Very Large Data Bases (VLDB). Google ScholarDigital Library
- Bandi, N., Sun, C., Agrawal, D., and Abbadi, A. E. 2004. Hardware acceleration in commercial databases: a case study of spatial operations. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 1021--1032. Google ScholarDigital Library
- Blelloch, G. E. 1990. Prefix sums and their applications. Tech. rep. CMU-CS-90-190, Carnegie Mellan University.Google Scholar
- Blythe, D. 2006. The direct3d 10 system. SIGGRAPH, ACM Press, NY. Google ScholarDigital Library
- Boncz, P. A., Manegold, S., and Kersten, M. L. 1999. Database architecture optimized for the new bottleneck: Memory access. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, CA, 54--65. Google ScholarDigital Library
- Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., and Hanrahan, P. 2004. Brook for gpus: Stream computing on graphics hardware. SIGGRAPH, ACM Press, NY. Google ScholarDigital Library
- Cieslewicz, J., Berry, J., Hendrickson, B., and Ross, K. A. 2006. Realizing parallelism in database operations: insights from a massively multithreaded architecture. In Proceedings of the 2nd International Workshop on Data Management on New Hardware (DaMoN). ACM, New York. Google ScholarDigital Library
- Cieslewicz, J. and Ross, K. A. 2007. Adaptive aggregation on chip multiprocessors. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 339--350. Google ScholarDigital Library
- DeWitt, D. and Gray, J. 1992. Parallel database systems: the future of high performance database systems. Comm. ACM 35, 6, 85--98. Google ScholarDigital Library
- Frigo, M., Leiserson, C. E., Prokop, H., and Ramachandran, S. 1999. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, Washington, DC, 285. Google ScholarDigital Library
- Gedik, B., Bordawekar, R. R., and Yu, P. S. 2007. Cellsort: High performance sorting on the cell processor. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 1286--1297. Google ScholarDigital Library
- Gedik, B., Yu, P. S., and Bordawekar, R. R. 2007. Executing stream joins on the cell processor. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 363--374. Google ScholarDigital Library
- Gold, B., Ailamaki, A., Huston, L., and Falsafi, B. 2005. Accelerating database operators using a network processor. In Proceedings of the 1st International Workshop on Data Management on New Hardware (DaMoN). ACM, New York, NY, 1. Google ScholarDigital Library
- Govindaraju, N., Gray, J., Kumar, R., and Manocha, D. 2006. Gputerasort: high performance graphics co-processor sorting for large database management. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM, New York, NY, 325--336. Google ScholarDigital Library
- Govindaraju, N., Lloyd, B., Wang, W., Lin, M., and Manocha, D. 2004. Fast computation of database operations using graphics processors. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM, New York, NY, 215--226. Google ScholarDigital Library
- Govindaraju, N. K., Raghuvanshi, N., and Manocha, D. 2005. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proceedings of the ACM SIGMOD International Conference on Management of Data. New York, NY, 611--622. Google ScholarDigital Library
- Graefe, G. 1993. Query evaluation techniques for large databases. ACM Comput. Surv. 25, 2, 73--169. Google ScholarDigital Library
- Harizopoulos, S., Abadi, D. J., Madden, S., and Stonebraker, M. 2008. OLTP through the looking glass, and what we found there. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 981--992. Google ScholarDigital Library
- Harris, M., Owens, J., Sengupta, S., Zhang, Y., and Davidson, A. 2007. CUDPP: CUDA data parallel primitives library.Google Scholar
- He, B., Govindaraju, N. K., Luo, Q., and Smith, B. 2007. Efficient gather and scatter operations on graphics processors. In Proceedings of the ACM/IEEE Conference on Supercomputing. ACM, New York, NY, 1--12. Google ScholarDigital Library
- He, B. and Luo, Q. 2008. Cache-oblivious databases: Limitations and opportunities. ACM Trans. Datab. Syst. 33, 2, 1--42. Google ScholarDigital Library
- He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N., Luo, Q., and Sander, P. 2008. Relational joins on graphics processors. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM, New York, NY, 511--524. Google ScholarDigital Library
- Héman, S., Nes, N., Zukowski, M., and Boncz, P. 2007. Vectorized data processing on the cell broadband engine. In Proceedings of the 3rd International Workshop on Data Management on New Hardware (DaMoN). ACM, New York, NY, 1--6. Google ScholarDigital Library
- Hong, W. and Stonebraker, M. 1991. Optimization of parallel query execution plans in xprs. In Proceedings of the 1st International Conference on Parallel and Distributed Information Systems (PDIS). IEEE Computer Society Press, Los Alamitos, CA, 218--225. Google ScholarDigital Library
- Johnson, R., Harizopoulos, S., Hardavellas, N., Sabirli, K., Pandis, I., Ailamaki, A., Mancheril, N. G., and Falsafi, B. 2007. To share or not to share? In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 351--362. Google ScholarDigital Library
- LaMarca, A. and Ladner, R. E. 1997. The influence of caches on the performance of sorting. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). Society for Industrial and Applied Mathematics, Philadelphia, PA, 370--379. Google ScholarDigital Library
- Lieberman, M. D., Sankaranarayanan, J., and Samet, H. 2008. A fast similarity join algorithm using graphics processing units. In Proceedings of the International Conference on Data Engineering (ICDE). Google ScholarDigital Library
- Liu, B. and Rundensteiner, E. A. 2005. Revisiting pipelined parallelism in multi-join query processing. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 829--840. Google ScholarDigital Library
- Lu, H., Tan, K.-L., and Sahn, M.-C. 1990. Hash-based join algorithms for multiprocessor computers with shared memory. In Proceedings of the 16th International Conference on Very Large Databases. Morgan Kaufmann Publishers Inc., San Francisco, CA, 198--209. Google ScholarDigital Library
- Manegold, S., Boncz, P., and Kersten, M. 2002. Generic database cost models for hierarchical memory systems. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB). Google ScholarDigital Library
- Nguyen, H. 2008. GPU gems 3. Addison-Wesley. Google ScholarDigital Library
- Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A. E., and Purcell, T. J. 2007. A survey of general-purpose computation on graphics hardware. Comput. Graph. Forum 26.Google Scholar
- Rao, J. and Ross, K. A. 1999. Cache conscious indexing for decision-support in main memory. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, CA, 78--89. Google ScholarDigital Library
- Schneider, D. A. and DeWitt, D. J. 1989. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. SIGMOD Rec. 18, 2, 110--121. Google ScholarDigital Library
- Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM Press, New York, NY, 23--34. Google ScholarDigital Library
- Sengupta, S., Harris, M., Zhang, Y., and Owens, J. D. 2007. Scan primitives for gpu computing. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware. Google ScholarDigital Library
- Shao, M., Ailamaki, A., and Falsafi, B. 2005. DBmbench: Fast and accurate database workload representation on modern microarchitecture. In Proceedings of the Conference of the Centre for Advanced Studies on Collaborative Research (CASCON). IBM Press, 254--267. Google ScholarDigital Library
- Shatdal, A., Kant, C., and Naughton, J. F. 1994. Cache conscious algorithms for relational query processing. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, CA, 510--521. Google ScholarDigital Library
- Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., and Zdonik, S. 2005. C-store: A column-oriented dbms. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 553--564. Google ScholarDigital Library
- Sun, C., Agrawal, D., and Abbadi, A. E. 2003. Hardware acceleration for spatial selections and joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 455--466. Google ScholarDigital Library
- Tarditi, D., Puri, S., and Oglesby, J. 2006. Accelerator: using data parallelism to program GPUS for general-purpose uses. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems. Google ScholarDigital Library
- Zhou, J., Cieslewicz, J., Ross, K. A., and Shah, M. 2005. Improving database performance on simultaneous multithreading processors. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 49--60. Google ScholarDigital Library
- Zukowski, M., Heman, S., Nes, N., and Boncz, P. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE). IEEE Computer Society, Washington, DC, 59. Google ScholarDigital Library
Index Terms
- Relational query coprocessing on graphics processors
Recommendations
Relational joins on graphics processors
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of dataWe present a novel design and implementation of relational join algorithms for new-generation graphics processing units (GPUs). The most recent GPU features include support for writing to random memory locations, efficient inter-processor communication, ...
A performance study of general-purpose applications on graphics processors using CUDA
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...
Efficient gather and scatter operations on graphics processors
SC '07: Proceedings of the 2007 ACM/IEEE conference on SupercomputingGather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs).
...
Comments