Abstract
In emerging and future high-end processor systems, tolerating increasing cache miss latency and properly managing memory bandwidth will be critical to achieving high performance. Prefetching, in both hardware and software, is among our most important available techniques for doing so; yet we claim that prefetching is perhaps also the least well understood.
Thus, the goal of this study is to develop a novel, foundational understanding of both the benefits and limitations of hardware and software prefetching. Our study includes source code-level analysis, to help in understanding the practical strengths and weaknesses of compiler- and software-based prefetching; a study of the synergistic and antagonistic effects between software and hardware prefetching; and an evaluation of hardware prefetching training policies in the presence of software prefetching requests. We use both simulation and measurement on real systems. We find, for instance, that although there are many opportunities for compilers to prefetch much more aggressively than they currently do, there is also a tangible risk of interference with the training of existing hardware prefetching mechanisms. Taken together, our observations suggest new research directions for cooperative hardware/software prefetching.
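As a concrete reference point for what a software prefetching request looks like at the source level, the following is a minimal sketch and is not taken from the study. It assumes a GCC- or Clang-compatible compiler that provides the `__builtin_prefetch` builtin. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical, and the distance of 16 elements is an arbitrary placeholder that would need tuning on a real machine.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in array elements. This value is an
   assumption for illustration only; it is not a figure from the study. */
#define PREFETCH_DISTANCE 16

/* Sum an array while requesting, on each iteration, the element that will
   be needed PREFETCH_DISTANCE iterations later. */
long sum_with_prefetch(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* GCC/Clang builtin: second argument 0 = prefetch for read,
               third argument 3 = high temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += a[i];
    }
    return total;
}
```

A sequential loop like this is also a pattern that a hardware stream or stride prefetcher is likely to detect by itself, so it is the kind of case where the software requests may overlap with, or interfere with the training of, the hardware mechanism. Whether the explicit prefetch helps here depends on the machine and has to be measured.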
Index Terms
- When Prefetching Works, When It Doesn’t, and Why