research-article

Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface

Published: 01 December 2013

Abstract

The performance of graphics processing unit (GPU) systems is improving rapidly to accommodate the increasing demands of graphics and high-performance computing applications. This performance improvement, however, comes with a dramatic increase in the power consumption of GPU systems. Up to 30% of the total power of a GPU system is consumed by the graphics memory itself. Reducing graphics memory power consumption is therefore critical to mitigating the power challenge.

In this article, we propose an energy-efficient reconfigurable 3D die-stacking graphics memory design that integrates wide-interface graphics DRAMs side-by-side with a GPU on a silicon interposer. The proposed architecture is a “3D+2.5D” system: the DRAM itself is 3D-stacked memory built with through-silicon vias (TSVs), whereas the DRAM and the GPU are integrated through the interposer (2.5D). Since the GPU computing units, memory controllers, and memory are all integrated in the same package, the number of memory I/Os is no longer constrained by the package’s pin count. We can therefore reduce memory power consumption by scaling down the supply voltage and frequency of the memory interface while maintaining the same or even higher peak memory bandwidth. In addition, we design a reconfigurable memory interface that can dynamically adapt to the requirements of various applications. We propose two reconfiguration mechanisms, one to optimize GPU system energy efficiency and one to optimize throughput, and thus benefit both memory-intensive and compute-intensive applications. The experimental results show that the proposed GPU memory architecture improves GPU system energy efficiency by 21% without reconfiguration. The reconfigurable memory interface can further improve system energy efficiency by 26%, and system throughput by 31%, under a capped system power budget of 240W.
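The trade-off described above (more I/Os at a lower voltage and frequency, with the same peak bandwidth) can be illustrated with a rough back-of-the-envelope sketch. It assumes the textbook first-order CMOS relation, in which dynamic power scales with the number of switching pins times V² times f, and it assumes equal per-pin capacitance for both interfaces. The function names and every number below are hypothetical values chosen for illustration; none are taken from the article or its measurements.

```python
def peak_bandwidth_gbs(io_count: int, rate_gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s: number of data pins x per-pin rate (Gb/s) / 8."""
    return io_count * rate_gbit_per_pin / 8.0


def relative_io_power(io_count: int, voltage: float, rate_gbit_per_pin: float) -> float:
    """Unitless dynamic-power proxy, P ~ N * V^2 * f.

    Assumes equal per-pin capacitance and that switching frequency is
    proportional to the per-pin data rate (both are simplifications).
    """
    return io_count * voltage ** 2 * rate_gbit_per_pin


# Hypothetical off-package interface: few pins, high voltage, high data rate.
narrow = {"io_count": 256, "voltage": 1.5, "rate_gbit_per_pin": 4.0}
# Hypothetical in-package interface: 4x the pins, lower voltage, 1/4 the rate.
wide = {"io_count": 1024, "voltage": 1.2, "rate_gbit_per_pin": 1.0}

bw_narrow = peak_bandwidth_gbs(narrow["io_count"], narrow["rate_gbit_per_pin"])
bw_wide = peak_bandwidth_gbs(wide["io_count"], wide["rate_gbit_per_pin"])
ratio = relative_io_power(**wide) / relative_io_power(**narrow)

print(f"narrow: {bw_narrow:.0f} GB/s, wide: {bw_wide:.0f} GB/s")
print(f"wide/narrow dynamic-power proxy: {ratio:.2f}")
```

With these made-up parameters both interfaces reach the same peak bandwidth (128 GB/s), while the power proxy of the wide interface is about 0.64 of the narrow one. The gain comes entirely from the V² term, since the pin count and frequency changes cancel each other. A real interface would also differ in per-pin capacitance, termination, and static power, which this sketch ignores.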

    • Published in

      ACM Transactions on Architecture and Code Optimization, Volume 10, Issue 4
      December 2013
      1046 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/2541228

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 December 2013
      • Accepted: 1 August 2013
      • Revised: 1 July 2013
      • Received: 1 April 2013

      Qualifiers

      • research-article
      • Research
      • Refereed
