DOI: 10.1145/3173162.3173211
Research article · ASPLOS Conference Proceedings

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

Published: 19 March 2018

ABSTRACT

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file design to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8X larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
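To make the interval-based prefetching idea concrete, below is a minimal Python sketch of the mechanism the abstract describes. It is illustrative only: the Instr/Interval classes, the branch-boundary interval split, and the toy timing model are assumptions made for this sketch, not the paper's actual compiler analysis, ISA, or hardware design.

```python
# Minimal sketch of LTRF's core idea (illustrative assumptions, not the paper's
# implementation): split a warp's instruction stream into intervals, estimate
# each interval's register working set statically, and prefetch that set into
# a small register file cache (RFC) before the interval executes.

from dataclasses import dataclass, field

@dataclass
class Instr:
    opcode: str
    regs: tuple               # register operands read/written by this instruction
    is_branch: bool = False   # interval boundaries are cut at branches in this sketch

@dataclass
class Interval:
    instrs: list = field(default_factory=list)

    def working_set(self):
        # Aggregate register working set of the interval (known before execution).
        return {r for ins in self.instrs for r in ins.regs}

def split_into_intervals(trace):
    """Cut the instruction stream at branch boundaries; a stand-in for the
    paper's compile-time interval analysis over the control-flow graph."""
    intervals, cur = [], Interval()
    for ins in trace:
        cur.instrs.append(ins)
        if ins.is_branch:
            intervals.append(cur)
            cur = Interval()
    if cur.instrs:
        intervals.append(cur)
    return intervals

def run_with_prefetch(trace, rfc_capacity=16, rfc_latency=1):
    """Toy timing model: at each interval entry, the estimated working set is
    prefetched into the RFC, so every register access within the interval hits
    the RFC. The prefetch itself is assumed to be overlapped with other warps'
    execution and therefore adds no cycles here."""
    cycles = 0
    for interval in split_into_intervals(trace):
        ws = interval.working_set()
        assert len(ws) <= rfc_capacity, "interval working set must fit in the RFC"
        for ins in interval.instrs:
            cycles += rfc_latency * len(ins.regs)  # all accesses hit the RFC
    return cycles

# Example: a three-instruction trace forming two intervals.
trace = [
    Instr("add", ("r1", "r2", "r3")),
    Instr("mul", ("r3", "r4", "r5"), is_branch=True),
    Instr("ld",  ("r6", "r1")),
]
print(run_with_prefetch(trace))  # -> 8 cycles with a 1-cycle RFC
```

The point the sketch mirrors is that each interval's working set is known before the interval starts executing, so the prefetch into the register file cache can be issued early and its latency hidden behind other warps, which is what lets the main register file tolerate a long access latency.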


Published in:
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, March 2018, 827 pages. ISBN: 9781450349116. DOI: 10.1145/3173162.
Also appears in: ACM SIGPLAN Notices, Volume 53, Issue 2 (ASPLOS '18), February 2018, 809 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/3296957.

        Copyright © 2018 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance rates: ASPLOS '18 accepted 56 of 319 submissions (18%); overall, the conference has accepted 535 of 2,713 submissions (20%).
