ABSTRACT
In this paper, we propose a data partitioning technique for a memory subsystem that consists of a multi-ported scratchpad memory (SPM) unit and a single-ported data cache in a coarse-grained reconfigurable array (CGRA) architecture. The embedded reconfigurable processor executes programs by switching between Non-VLIW and VLIW modes, depending on the type of code region, to achieve high performance. The VLIW mode targets code regions with high instruction-level parallelism (ILP), which require high memory bandwidth, and the Non-VLIW mode targets regions with low ILP, which require low memory latency. Our technique partitions data between the SPM and the data cache based on data interference graph reduction and profiling information. Given an SPM size, it finds the optimal data partitions by taking the VLIW instruction schedule into consideration. We evaluate the technique on the CGRA architecture with three representative multimedia applications.
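To make the idea of an interference graph concrete, the following is a minimal illustrative sketch and not the algorithm of the paper, whose details the abstract does not give. It assumes that two data objects interfere when one scheduled instruction word accesses both (so they would compete for the single cache port), weights each conflict by a profiled execution count, and then greedily moves the object with the highest conflict weight per byte into the SPM. The function names, the schedule representation, and the greedy heuristic are all assumptions made for illustration; a greedy choice does not guarantee an optimal partition.

```python
from collections import defaultdict
from itertools import combinations


def build_interference(schedule):
    """Build edge weights of a hypothetical data interference graph.

    schedule: list of (accessed_objects, exec_count) pairs, one per
    scheduled instruction word. Two objects interfere when the same
    word accesses both; the edge weight sums the profiled counts.
    """
    weights = defaultdict(int)
    for accessed, exec_count in schedule:
        for a, b in combinations(sorted(set(accessed)), 2):
            weights[(a, b)] += exec_count
    return dict(weights)


def partition(sizes, schedule, spm_size):
    """Greedy heuristic (an assumption, not the paper's method).

    sizes: object name -> size in bytes (must be positive).
    Returns (objects placed in the SPM, objects left in the cache).
    """
    weights = build_interference(schedule)
    spm, remaining, free = set(), set(sizes), spm_size
    while True:
        # Score each object by conflicts with objects still in the cache.
        score = {obj: 0 for obj in remaining}
        for (a, b), w in weights.items():
            if a in remaining and b in remaining:
                score[a] += w
                score[b] += w
        candidates = [o for o in remaining
                      if sizes[o] <= free and score[o] > 0]
        if not candidates:
            break
        # Highest conflict weight per byte; the name breaks ties.
        best = max(candidates, key=lambda o: (score[o] / sizes[o], o))
        spm.add(best)
        remaining.remove(best)
        free -= sizes[best]
    return spm, remaining


if __name__ == "__main__":
    sizes = {"A": 4, "B": 4, "C": 8}
    schedule = [(["A", "B"], 100), (["B", "C"], 10), (["C"], 50)]
    print(partition(sizes, schedule, spm_size=4))
```

In this toy input, moving the single object that participates in every conflict removes all remaining interference, so the loop stops even when SPM space is left over.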
An instruction-scheduling-aware data partitioning technique for coarse-grained reconfigurable architectures
LCTES '10