Abstract
The performance of graphics processing unit (GPU) systems is improving rapidly to meet the growing demands of graphics and high-performance computing applications. This performance improvement, however, comes with a dramatic increase in GPU system power consumption. Up to 30% of the total power of a GPU system is consumed by the graphics memory itself. Reducing graphics memory power consumption is therefore critical to mitigating the power challenge.
In this article, we propose an energy-efficient reconfigurable 3D die-stacking graphics memory design that integrates wide-interface graphics DRAMs side-by-side with a GPU processor on a silicon interposer. The proposed architecture is a “3D+2.5D” system: the DRAM itself is 3D-stacked memory connected with through-silicon vias (TSVs), whereas the DRAM and the GPU processor are integrated through the interposer (2.5D). Since GPU computing units, memory controllers, and memory are all integrated in the same package, the number of memory I/Os is no longer constrained by the package’s pin count. We can therefore reduce memory power consumption by scaling down the supply voltage and frequency of the memory interface while maintaining the same or even higher peak memory bandwidth. In addition, we design a reconfigurable memory interface that can dynamically adapt to the requirements of various applications. We propose two reconfiguration mechanisms, one to optimize GPU system energy efficiency and one to optimize throughput, and thus benefit both memory-intensive and compute-intensive applications. The experimental results show that the proposed GPU memory architecture improves GPU system energy efficiency by 21% without reconfiguration. The reconfigurable memory interface further improves system energy efficiency by 26% and system throughput by 31% under a capped system power budget of 240W.
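The trade-off between interface width, frequency, and supply voltage can be illustrated with a first-order model. The sketch below is not the evaluation methodology of the article: it assumes interface switching power scales as N × V² × f with equal per-pin capacitance, ignores DRAM core and static power, and uses hypothetical interface parameters (`off_package`, `in_package`) chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryInterface:
    """First-order model of a DRAM data interface (illustrative only)."""
    width_bits: int        # number of data I/Os
    data_rate_gbps: float  # per-pin data rate in Gb/s
    vdd: float             # I/O supply voltage in volts

    def peak_bandwidth_gbs(self) -> float:
        # Peak bandwidth in GB/s: pins * per-pin rate / 8 bits per byte.
        return self.width_bits * self.data_rate_gbps / 8.0

    def relative_io_power(self) -> float:
        # Dynamic switching power up to a constant factor: N * V^2 * f.
        # Assumes every pin has the same load capacitance and activity.
        return self.width_bits * self.vdd ** 2 * self.data_rate_gbps


def compare(baseline: MemoryInterface, candidate: MemoryInterface):
    """Return (bandwidth ratio, I/O power ratio) of candidate vs. baseline."""
    bw_ratio = candidate.peak_bandwidth_gbs() / baseline.peak_bandwidth_gbs()
    power_ratio = candidate.relative_io_power() / baseline.relative_io_power()
    return bw_ratio, power_ratio


# Hypothetical parameters, not taken from the article:
# a narrow, fast, higher-voltage interface limited by package pins ...
off_package = MemoryInterface(width_bits=384, data_rate_gbps=4.0, vdd=1.5)
# ... versus a 4x wider interface at a quarter of the rate and lower voltage.
in_package = MemoryInterface(width_bits=1536, data_rate_gbps=1.0, vdd=1.2)

bw_ratio, power_ratio = compare(off_package, in_package)
print(f"bandwidth ratio: {bw_ratio:.2f}, I/O power ratio: {power_ratio:.2f}")
```

Under these assumptions the peak bandwidth is unchanged, because width × rate is constant, while the modeled interface power falls with the square of the voltage ratio, to about 0.64 of the baseline. In this simple model all of the saving comes from the lower supply voltage, which the lower per-pin rate is what makes feasible.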
Index Terms
- Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface