Abstract
Hash join algorithms suffer from extensive CPU cache stalls. This article shows that the standard hash join algorithm for disk-oriented databases (i.e. GRACE) spends over 80% of its user time stalled on CPU cache misses, and explores the use of CPU cache prefetching to improve its cache performance. Applying prefetching to hash joins is complicated by the data dependencies, multiple code paths, and inherent randomness of hashing. We present two techniques, group prefetching and software-pipelined prefetching, that overcome these complications. These schemes achieve 1.29--4.04X speedups for the join phase and 1.37--3.49X speedups for the partition phase over GRACE and simple prefetching approaches. Moreover, compared with previous cache-aware approaches (i.e. cache partitioning), the schemes are at least 36% faster on large relations and do not require exclusive use of the CPU cache to be effective. Finally, comparing the elapsed real times when disk I/Os are in the picture, our cache prefetching schemes achieve 1.12--1.84X speedups for the join phase and 1.06--1.60X speedups for the partition phase over the GRACE hash join algorithm.
- Baer, J.-L. and Chen, T.-F. 1991. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing'91 (Albuquerque, NM). 176--186. Google ScholarDigital Library
- Boncz, P. A., Manegold, S., and Kersten, M. L. 1999. Database architecture optimized for the new bottleneck: memory access. In Proceedings of the 25th International Conference on Very Large Data Bases. Edinburgh, Scotland, UK, 54--65. Google ScholarDigital Library
- Chen, S. 2005. Redesigning database systems in light of CPU cache prefetching. Ph.D. thesis, Carnegie Mellon University. Google ScholarDigital Library
- Chen, S., Ailamaki, A., Gibbons, P. B., and Mowry, T. C. 2004. Improving hash join performance through prefetching. In Proceedings of the 20th International Conference on Data Engineering (Boston, MA). IEEE Computer Society Press, Los Alamitos, CA, 116--127. Google ScholarDigital Library
- Chen, S., Ailamaki, A., Gibbons, P. B., and Mowry, T. C. 2005. Inspector Joins. In Proceedings of the 31st International Conference on Very Large Data Bases (Trondheim, Norway). 817--828. Google ScholarDigital Library
- Chen, S., Gibbons, P. B., and Mowry, T. C. 2001. Improving Index Performance through Prefetching. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, CA). ACM, New York, 235--246. Google ScholarDigital Library
- Chen, S., Gibbons, P. B., Mowry, T. C., and Valentin, G. 2002. Fractal prefetching B+-trees: optimizing both cache and disk performance. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, (Madison, WI), ACM, New York, 157--168. Google ScholarDigital Library
- Gold, B. T., Ailamaki, A., Huston, L., and Falsafi, B. 2005. Accelerating database operations using a network processor. In Proceedings of the First International Workshop on Data Management on New Hardware (DaMoN 2005) (Baltimore, MD). Google ScholarDigital Library
- Graefe, G. 1993. Query evaluation techniques for large databases. ACM Comput. Surv. 25, 2, 73--170. Google ScholarDigital Library
- Hepkin, D. 2006. Guide to multiple page size support on AIX 5L version 5.3. http://www-03.ibm.com/servers/aix/whitepapers/multiple_page.pdf.Google Scholar
- IBM Corporation. 2004. IBM DB2 Universal Database Administration Guide Version 8.2.Google Scholar
- Intel Corporation. 2004. Intel Itanium 2 Processor Reference Manual For Software Development and Optimization. Intel Corporation. Order Number: 251110-003.Google Scholar
- Intel Corporation. 2006. IA-32 Intel Architecture Software Developer's Manual (Volumn 3A and 3B): System Programming Guide. Intel Corporation.Google Scholar
- Kahle, J. A., Day, M. N., Hofstee, H. P., Jonhns, C. R., Maeurer, T. R., and Shippy, D. 2005. Introduction to the Cell Multiprocessor. IBM J. Res. Devel. 49, 4/5 (July/Sept.), 589--604. Google ScholarDigital Library
- Kalla, R. N., Sinharoy, B., and Tendler, J. M. 2004. IBM Power5 chip: A dual-core multithreaded processor. IEEE Micro 24, 2 (Mar./Apr.), 40--47. Google ScholarDigital Library
- Kennedy, K. and McKinley, K. S. 1990. Loop distribution with arbitrary control flow. In Proceedings of Supercomputing'90 (New York, NY). 407--416. Google ScholarDigital Library
- Kitsuregawa, M., Tanaka, H., and Moto-Oka, T. 1983. Application of hash to data base machine and its architecture. New Generation Comput. 1, 1, 63--74.Google ScholarCross Ref
- Lam, M. S. 1987. A systolic array optimizing compiler. Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA. Google ScholarDigital Library
- Lindsay, B. 2002. Hash joins in DB2 UDB: The inside story. Carnegie Mellon DB Seminar.Google Scholar
- Luk, C.-K. and Mowry, T. C. 1996. Compiler-based prefetching for recursive data structures. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, MA). 222--233. Google ScholarDigital Library
- Luk, C.-K. and Mowry, T. C. 1999. Automatic compiler-inserted prefetching for pointer-based applications. IEEE Trans. Comput. 48, 2 (Feb.), 134--141. Google ScholarDigital Library
- Manegold, S., Boncz, P. A., and Kersten, M. L. 2000. What happens during a join? Dissecting CPU and memory optimization effects. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 339--350. Google ScholarDigital Library
- McDougall, R. 2004. Supporting multiple page sizes in the solaris operating system. http://www.sun.com/blueprints/0304/817-5917.pdf.Google Scholar
- McFarling, S. 1993. Combining branch predictors. Tech. Rep. WRL Technical Note TN-36, Digital Equipment Corporation (June).Google Scholar
- Mowry, T. C. 1994. Tolerating latency through software-controlled data prefetching. Ph.D. dissertation. Stanford University. Google ScholarDigital Library
- Mowry, T. C., Lam, M. S., and Gupta, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, MA). 62--73. Google ScholarDigital Library
- Nakayama, M., Kitsuregawa, M., and Takagi, M. 1988. Hash-partitioned join method using dynamic destaging strategy. In Proceedings of the 14th International Conference on Very Large Data Bases (Los Angeles, CA). 468--478. Google ScholarDigital Library
- Perfmon Project. http://www.hpl.hp.com/research/linux/perfmon/index.php4.Google Scholar
- Saulsbury, A., Dahlgren, F., and Stenström, P. 2000. Recency-based TLB preloading. In Proceedings of the 27th International Symposium on Computer Architecture (Vancouver, BC, Canada). 117--127. Google ScholarDigital Library
- Shapiro, L. D. 1986. Join processing in database systems with large main memories. ACM Trans. Datab. Syst. 11, 3, 239--264. Google ScholarDigital Library
- Shatdal, A., Kant, C., and Naughton, J. F. 1994. Cache conscious algorithms for relational query processing. In Proceedings of the 20th International Conference on Very Large Data Bases (Santiago de Chile). 510--521. Google ScholarDigital Library
- Sun Microsystems. UltraSPARC IV Processor Architecture Overview. Sun Microsystems. Technical Whitepaper, Version 1.0, Feb. 2004.Google Scholar
- TPC Benchmarks. http://www.tpc.org/.Google Scholar
- Zeller, H. and Gray, J. 1990. An adaptive hash join algorithm for multiuser environments. In Proceedings of the 16th International Conference on Very Large Data Bases. (Brisbane, Queensland, Australia). 186--197. Google ScholarDigital Library
- Zhou, J., Cieslewicz, J., Ross, K. A., and Shah, M. 2005. Improving database performance on simultaneous multithreading processors. In Proceedings of the 31st International Conference on Very Large Data Bases (Trondheim, Norway). 49--60. Google ScholarDigital Library
Index Terms
- Improving hash join performance through prefetching
Recommendations
Improving Hash Join Performance through Prefetching
ICDE '04: Proceedings of the 20th International Conference on Data EngineeringHash join algorithms suffer from extensive CPU cachestalls. This paper shows that the standard hash join algorithm for disk-oriented databases (i.e. GRACE) spends over73% of its user time stalled on CPU cache misses, and explores the use of prefetching ...
Stealth prefetching
Proceedings of the 2006 ASPLOS ConferencePrefetching in shared-memory multiprocessor systems is an increasingly difficult problem. As system designs grow to incorporate larger numbers of faster processors, memory latency and interconnect traffic increase. While aggressive prefetching ...
Comments