When Prefetching Works, When It Doesn’t, and Why

Published: 01 March 2012

Abstract

In emerging and future high-end processor systems, tolerating increasing cache miss latency and properly managing memory bandwidth will be critical to achieving high performance. Prefetching, in both hardware and software, is among our most important available techniques for doing so; yet, we claim that prefetching is perhaps also the least well-understood.

Thus, the goal of this study is to develop a novel, foundational understanding of both the benefits and limitations of hardware and software prefetching. Our study includes: source code-level analysis, to help in understanding the practical strengths and weaknesses of compiler- and software-based prefetching; a study of the synergistic and antagonistic effects between software and hardware prefetching; and an evaluation of hardware prefetching training policies in the presence of software prefetching requests. We use both simulation and measurement on real systems. We find, for instance, that although there are many opportunities for compilers to prefetch much more aggressively than they currently do, there is also a tangible risk of interference with training existing hardware prefetching mechanisms. Taken together, our observations suggest new research directions for cooperative hardware/software prefetching.

Published in

ACM Transactions on Architecture and Code Optimization, Volume 9, Issue 1
March 2012
176 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2133382

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 March 2012
• Accepted: 1 August 2011
• Revised: 1 July 2011
• Received: 1 November 2010

        Qualifiers

        • research-article
        • Research
        • Refereed
