ABSTRACT
Effective use of the memory hierarchy is critical for achieving high performance on embedded systems. We focus on the class of streaming applications, which is increasingly prevalent in the embedded domain. We exploit the widespread parallelism and regular communication patterns in stream programs to formulate a set of cache aware optimizations that automatically improve instruction and data locality. Our work is in the context of the Synchronous Dataflow model, in which a program is described as a graph of independent actors that communicate over channels. The communication rates between actors are known at compile time, allowing the compiler to statically model the caching behavior. We present three cache aware optimizations: 1) execution scaling, which judiciously repeats actor executions to improve instruction locality, 2) cache aware fusion, which combines adjacent actors while respecting instruction cache constraints, and 3) scalar replacement, which converts certain data buffers into a sequence of scalar variables that can be register allocated. The optimizations are founded upon a simple and intuitive model that quantifies the temporal locality for a sequence of actor executions. Our implementation of cache aware optimizations in the StreamIt compiler yields a 249% average speedup (over unoptimized code) for our streaming benchmark suite on a StrongARM 1110 processor. The optimizations also yield a 154% speedup on a Pentium 3 and a 152% speedup on an Itanium 2.
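To make the scheduling ideas in the abstract concrete, the sketch below models a two-actor Synchronous Dataflow pipeline and shows how execution scaling groups consecutive firings of the same actor. The function names (`steady_state_reps`, `scaled_schedule`) and the two-actor restriction are illustrative assumptions, not the StreamIt compiler's actual API; a real SDF scheduler solves balance equations over the whole graph.

```python
from math import gcd

def steady_state_reps(push, pop):
    """Solve the balance equation for a two-actor pipeline A -> B:
    reps_A * push == reps_B * pop, with the smallest positive solution.
    Returns (reps_A, reps_B) for one steady-state iteration."""
    g = gcd(push, pop)
    return pop // g, push // g

def scaled_schedule(push, pop, scaling):
    """Execution scaling: run each actor `scaling` times its steady-state
    repetition count back to back, so an actor's code is reused from the
    instruction cache across consecutive firings instead of being evicted
    by the other actor."""
    reps_a, reps_b = steady_state_reps(push, pop)
    return ["A"] * (reps_a * scaling) + ["B"] * (reps_b * scaling)

# A pushes 2 items per firing; B pops 3 per firing.
# Steady state: 3 firings of A produce 6 items, 2 firings of B consume 6.
print(steady_state_reps(2, 3))   # -> (3, 2)
print(scaled_schedule(2, 3, 2))  # -> ['A','A','A','A','A','A','B','B','B','B']
```

The trade-off the paper's locality model captures is visible even here: a larger `scaling` improves instruction locality but grows the channel buffer between A and B (6 items per scaled round in this example), which can hurt data locality — hence "judiciously repeats."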