Abstract
Power dissipation is increasingly important in CPUs, from those intended for mobile use all the way up to high-performance processors for high-end servers. Although the bulk of the power dissipated is dynamic switching power, leakage power is also becoming a concern. Chipmakers expect that in future chip generations, leakage's proportion of total chip power will increase significantly. This article examines methods for reducing leakage power within the cache memories of the CPU. Because caches account for much of a CPU chip's area and transistor count, they are reasonable targets for attacking leakage. We discuss policies and implementations for reducing cache leakage by invalidating and "turning off" cache lines when they hold data not likely to be reused. In particular, our approach targets the generational nature of cache line usage: cache lines typically see a flurry of frequent use when first brought into the cache, followed by a period of "dead time" before they are evicted. By devising effective, low-power ways of deducing dead time, our results show that in many cases we can reduce L1 cache leakage energy by 4x in SPEC2000 applications without affecting performance. Because our decay-based techniques are rooted in competitive online algorithms, their energy usage can be theoretically bounded to within a factor of two of the optimal oracle-based policy. We also examine adaptive decay-based policies that make energy-minimizing policy choices on a per-application basis by choosing appropriate decay intervals individually for each cache line. Our proposed adaptive policies reduce L1 cache leakage energy by 5x for the SPEC2000 applications with only negligible degradation in performance.
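The decay idea described above can be illustrated with a small trace-driven model. This is a minimal sketch under stated assumptions, not the article's implementation: the direct-mapped organization, the single fixed `decay_interval`, and the names `Line` and `simulate_decay` are all hypothetical choices made for illustration. A line is treated as switched off once it has gone `decay_interval` cycles without an access, and the sketch reports the fraction of line-cycles spent powered on as a rough proxy for leakage relative to an always-on cache. It ignores the energy cost of the decay counters and of the extra misses that decay can cause.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Line:
    tag: Optional[int] = None  # kept after decay only so decay-induced misses can be counted
    powered: bool = False      # True while the line is on and leaking
    on_since: int = 0          # cycle at which the line was last powered on
    last_access: int = 0       # cycle of the most recent access


def simulate_decay(trace: Iterable[Tuple[int, int]], end_cycle: int,
                   n_lines: int = 4, line_size: int = 64,
                   decay_interval: int = 100) -> dict:
    """Model a direct-mapped cache whose lines switch off after being idle.

    `trace` is a sequence of (cycle, byte_address) pairs sorted by cycle,
    and `end_cycle` must not precede the last access. All parameters are
    illustrative assumptions.
    """
    lines = [Line() for _ in range(n_lines)]
    hits = misses = decay_misses = on_cycles = 0

    for cycle, addr in trace:
        block = addr // line_size
        line = lines[block % n_lines]
        tag = block // n_lines

        # Apply decay lazily: if the line sat idle for a full interval,
        # it was switched off at last_access + decay_interval.
        if line.powered and cycle - line.last_access >= decay_interval:
            on_cycles += line.last_access + decay_interval - line.on_since
            line.powered = False

        if line.powered and line.tag == tag:
            hits += 1
        else:
            misses += 1
            if not line.powered:
                if line.tag == tag:
                    decay_misses += 1  # would have hit without decay
                line.powered = True
                line.on_since = cycle
            line.tag = tag
        line.last_access = cycle

    # Close out the lines that are still powered at the end of the run.
    for line in lines:
        if line.powered:
            off_at = min(end_cycle, line.last_access + decay_interval)
            on_cycles += off_at - line.on_since

    return {
        "hits": hits,
        "misses": misses,
        "decay_misses": decay_misses,
        "on_fraction": on_cycles / (n_lines * end_cycle),
    }
```

Calling `simulate_decay` on a trace with a burst of accesses to one address followed by a long gap shows the trade-off: a shorter `decay_interval` lowers `on_fraction` but raises `decay_misses`, which is the tension that a per-line adaptive interval would try to balance.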
Index Terms
- Let caches decay: reducing leakage energy via exploitation of cache generational behavior