Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface

Authors:
Jishen Zhao, Pennsylvania State University, CSE Department
Guangyu Sun, Peking University
Gabriel H. Loh, Advanced Micro Devices, Inc., AMD Research
Yuan Xie, Pennsylvania State University, CSE Department

ACM Transactions on Architecture and Code Optimization, Volume 10, Issue 4, Article No. 24, pp. 1–25. https://doi.org/10.1145/2541228.2541231

Published: 01 December 2013

Abstract

The performance of graphics processing unit (GPU) systems is improving rapidly to meet the increasing demands of graphics and high-performance computing applications. This performance improvement, however, has come with a dramatic increase in the power consumption of GPU systems. Up to 30% of the total power of a GPU system is consumed by the graphics memory itself. Reducing graphics memory power consumption is therefore critical to mitigating the power challenge.

In this article, we propose an energy-efficient reconfigurable 3D die-stacking graphics memory design that integrates wide-interface graphics DRAMs side-by-side with a GPU processor on a silicon interposer. The proposed architecture is a “3D+2.5D” system: the DRAM itself is 3D-stacked memory connected with through-silicon vias (TSVs), whereas the DRAM and the GPU processor are integrated through the interposer (2.5D). Because the GPU computing units, memory controllers, and memory are all integrated in the same package, the number of memory I/Os is no longer constrained by the package’s pin count. We can therefore reduce memory power consumption by scaling down the supply voltage and frequency of the memory interface while maintaining the same or even higher peak memory bandwidth. In addition, we design a reconfigurable memory interface that can dynamically adapt to the requirements of various applications. We propose two reconfiguration mechanisms, one to optimize GPU system energy efficiency and one to optimize system throughput, and thus benefit both memory-intensive and compute-intensive applications. Experimental results show that the proposed GPU memory architecture improves GPU system energy efficiency by 21% without reconfiguration. The reconfigurable memory interface further improves system energy efficiency by 26% and system throughput by 31% under a capped system power budget of 240 W.
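The trade-off between interface width, frequency, and supply voltage described in the abstract can be illustrated with a rough first-order sketch. It assumes the textbook CMOS dynamic-power relation (power proportional to switched capacitance times V^2 times f) with identical capacitance and activity on every I/O pin. The class name, function names, and all operating-point values below are hypothetical and chosen only for illustration; they are not taken from the article and do not reproduce its power model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryInterface:
    """Hypothetical memory-interface operating point (illustrative values only)."""
    width_bits: int        # number of data I/Os
    data_rate_gbps: float  # per-pin data rate in Gb/s
    vdd_volts: float       # I/O supply voltage

    def peak_bandwidth_gbs(self) -> float:
        """Peak bandwidth in GB/s: pins * per-pin rate / 8 bits per byte."""
        return self.width_bits * self.data_rate_gbps / 8.0

    def relative_dynamic_power(self) -> float:
        """Unitless first-order dynamic power: pins * V^2 * f.

        Assumes equal switched capacitance and activity per pin, which is a
        simplification; real interfaces also have static and termination power.
        """
        return self.width_bits * self.vdd_volts ** 2 * self.data_rate_gbps


def power_ratio(candidate: MemoryInterface, baseline: MemoryInterface) -> float:
    """Dynamic power of `candidate` relative to `baseline` under the model above."""
    return candidate.relative_dynamic_power() / baseline.relative_dynamic_power()


# Assumed operating points: a narrow, fast, off-package-style interface and a
# wide, slow, in-package-style interface with the same peak bandwidth.
narrow = MemoryInterface(width_bits=384, data_rate_gbps=4.0, vdd_volts=1.5)
wide = MemoryInterface(width_bits=3072, data_rate_gbps=0.5, vdd_volts=1.2)

if __name__ == "__main__":
    print(f"narrow: {narrow.peak_bandwidth_gbs():.0f} GB/s")
    print(f"wide:   {wide.peak_bandwidth_gbs():.0f} GB/s")
    print(f"wide/narrow dynamic power: {power_ratio(wide, narrow):.2f}")
```

With these assumed values both interfaces deliver 192 GB/s, and the wide interface draws about 0.64 of the narrow interface's dynamic power. In this simplified model the pin count and frequency cancel at equal bandwidth, so the saving comes entirely from the lower supply voltage that the slower interface permits.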

Index Terms

1. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published in

ACM Transactions on Architecture and Code Optimization Volume 10, Issue 4
December 2013
1046 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2541228

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 December 2013
- Accepted: 1 August 2013
- Revised: 1 July 2013
- Received: 1 April 2013
Author Tags
3D ICs
3D packaging
Graphics memory
reconfigurable interface
Qualifiers
- research-article
- Research
- Refereed