Abstract
Because of stringent power constraints, aggressive latency-hiding approaches such as prefetching are absent in state-of-the-art embedded processors. Two main reasons make prefetching power inefficient. First, compiler-inserted prefetch instructions increase code size and can therefore increase I-cache power. Second, inaccurate prefetching (especially hardware prefetching) leads to high D-cache power consumption because of useless accesses. In this work, we show that power-efficient prefetching can be supported through bit-differential offset assignment. We target the prefetching of relocatable stack variables with a high degree of precision. By assigning the offsets of stack variables so that most consecutive addresses differ by one bit, we can prefetch them with compact prefetch instructions and save I-cache power. The compiler first builds an access graph of consecutive memory references and then attempts to lay out the memory locations in the smallest hypercube; each dimension of the hypercube represents a 1-bit differential address. The embedding is carried out in as compact a hypercube as possible in order to save memory space. Each load/store instruction carries a hint about the next memory reference by encoding its differential address with respect to the current one. To reduce D-cache power cost, we further assign offsets so that most consecutive accesses map to the same cache line. Prefetching is done through a one-entry line buffer [Wilson et al. 1996], so many D-cache look-ups reduce to incremental ones, which lowers D-cache activity and saves power. Our prefetcher requires both compiler and hardware support. In this paper, we provide an implementation on a processor model close to the ARM, with small modifications to the ISA.
We handle issues such as out-of-order commit, predication, and speculation through simple modifications to the processor pipeline on noncritical paths. Our goal is to boost performance while maintaining or lowering power consumption. Our results show a 12% speedup and a slight power reduction. The runtime virtual address space overhead for stack and static data is about 11.8%.
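The core idea of the abstract can be illustrated with a small sketch. The code below is a hypothetical, greatly simplified greedy version of bit-differential offset assignment (all names are illustrative; the paper embeds the full access graph in the smallest feasible hypercube): variables that are frequently referenced consecutively are placed at hypercube corners (offsets) that differ in exactly one bit, so the next access can be encoded as a compact 1-bit-flip prefetch hint.

```python
from collections import Counter

def assign_offsets(access_seq, dim=3):
    """Greedy bit-differential offset assignment (illustrative sketch)."""
    # Access graph: weight = how often two variables are referenced
    # consecutively in the access sequence.
    edges = Counter(zip(access_seq, access_seq[1:]))
    offsets, used = {}, set()
    # Place the hottest pairs first so they get corners one bit apart.
    for (a, b), _ in edges.most_common():
        for v, neighbor in ((a, b), (b, a)):
            if v in offsets:
                continue
            cand = None
            base = offsets.get(neighbor)
            if base is not None:
                # Try the dim hypercube corners one bit-flip from neighbor.
                for bit in range(dim):
                    o = base ^ (1 << bit)
                    if o not in used:
                        cand = o
                        break
            if cand is None:
                # Fall back to any free offset in the hypercube.
                cand = next(o for o in range(1 << dim) if o not in used)
            offsets[v] = cand
            used.add(cand)
    return offsets

offsets = assign_offsets(list("abcabcad"))
# Hot consecutive pairs such as (a, b) end up at offsets differing in
# one bit, i.e., a 1-bit differential address.
```

A real implementation would also weigh cache-line co-location (so consecutive accesses hit the same line buffer entry) and handle variables of different sizes; this sketch only shows the 1-bit-differential placement heuristic.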
- ARM Ltd. ARM7TDMI Data Sheet.
- ARM Ltd. ARM7500FE Data Sheet.
- Aho, A. V., Sethi, R., and Ullman, J. D. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA.
- Brooks, D., Tiwari, V., and Martonosi, M. 2000. Wattch: A framework for architectural-level power analysis and optimizations. In ISCA '00 (June).
- Basu, K., Choudhary, A., Pisharath, J., and Kandemir, M. 2002. Power protocol: Reducing power dissipation on off-chip data buses. In MICRO (Nov.).
- Burger, D. and Austin, T. M. 1997. The SimpleScalar tool set, version 2.0. Tech. Report 1342, Univ. of Wisconsin--Madison (May).
- Calder, B., Krintz, C., John, S., and Austin, T. 1998. Cache-conscious data placement. In Proceedings of Architectural Support for Programming Languages and Operating Systems (Oct.).
- Luk, C.-K. and Mowry, T. C. 1996. Compiler-based prefetching for recursive data structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII) (Oct.), 222--233.
- Cho, S., Yew, P. C., and Lee, G. 1999. Decoupling local variable accesses in a wide-issue superscalar processor. In ISCA (May).
- Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. 2001. MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization.
- Haber, G., Klausner, M., Eisenberg, V., Mendelson, B., and Gurevich, M. 2003. Optimization opportunities created by global data reordering. In Proceedings of the International Symposium on Code Generation and Optimization.
- Intel Corp. SA-110 Microprocessor Technical Reference Manual.
- Lee, H. S., Smelyanskiy, M., Newburn, C. J., and Tyson, G. S. 2001. Stack value file: Custom microarchitecture for the stack. In HPCA-7 (Jan.).
- Leupers, R. and Marwedel, P. 1996. Algorithm for address assignment in DSP code generation. In Proc. ICCAD.
- Liao, S., Devadas, S., Keutzer, K., Tjiang, S., and Wang, A. 1996. Storage assignment to decrease code size. ACM TOPLAS 18, 3 (May), 235--253.
- Nöth, W. and Kolla, R. 1999. Spanning tree-based state encoding for low-power dissipation. In DATE, 168--174.
- Lipasti, M. H., Schmidt, W. J., Kunkel, S. R., and Roediger, R. R. 1995. SPAID: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 28) (Nov.), 231--236.
- Ozawa, T., Kimura, Y., and Nishizaki, S. 1995. Cache miss heuristics and preloading techniques for general-purpose programs. In Proceedings of the 28th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 28) (Nov.).
- Panda, P. R. and Dutt, N. D. 1999. Low-power memory mapping through reducing address bus activity. IEEE Transactions on VLSI Systems 7, 3 (Sept.).
- Perez, D. G., Mouchard, G., and Temam, O. 2004. MicroLib: A case for the quantitative comparison of micro-architecture mechanisms. In MICRO 37.
- Pomerene, J., Puzak, T., Rechtschaffen, R., and Sparacio, F. 1989. Prefetching system for a cache having a second directory for sequentially accessed blocks. U.S. Patent 4,807,110 (Feb.).
- Rao, A. and Pande, S. 1999. Storage assignment optimizations to generate compact and efficient code on embedded DSPs. In ACM SIGPLAN PLDI, 128--138.
- Segars, S. 2001. Low power design techniques for microprocessors. In ISSCC (Feb.).
- Smith, A. J. 1982. Cache memories. ACM Computing Surveys 14, 3 (Sept.).
- Udayanarayanan, S. and Chakrabarti, C. 2001. Address code generation for DSPs. In DAC.
- Wagner, A. and Corneil, D. G. 1990. Embedding trees in a hypercube is NP-complete. SIAM J. Computing 19, 3, 570--590.
- Witchel, E., Larsen, S., Ananian, C. S., and Asanović, K. 2001. Direct addressed caches for reduced power consumption. In MICRO '01 (Dec.).
- Wilson, K. M., Olukotun, K., and Rosenblum, M. 1996. Increasing cache port efficiency for dynamic superscalar microprocessors. In ISCA.
- Wilton, S. and Jouppi, N. P. 1993. An enhanced access and cycle time model for on-chip caches. Tech. Report TN93/5, Compaq Western Research Lab.
- Zhuang, X., Lau, C., and Pande, S. 2003. Storage assignment optimizations through variable coalescence for embedded processors. In Proceedings of the ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '03).
- Zhuang, X. and Pande, S. 2004. Power-efficient prefetching via bit-differential offset assignment on embedded processors. In Proceedings of the ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '04) (June).
Index Terms
- Power-efficient prefetching for embedded processors